Generating, Indexing and Searching Embeddings (Experimental)

WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.

Installing the Package

For testing purposes, let’s install the latest development version:

[1]:
%cd ../../../
!python3 -m pip install --upgrade .
/home/gpadmin/GreenplumPython
Defaulting to user installation because normal site-packages is not writeable
Processing /home/gpadmin/GreenplumPython
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Requirement already satisfied, skipping upgrade: psycopg2-binary==2.9.5 in /home/gpadmin/.local/lib/python3.9/site-packages (from greenplum-python==1.0.1) (2.9.5)
Requirement already satisfied, skipping upgrade: dill==0.3.6 in /home/gpadmin/.local/lib/python3.9/site-packages (from greenplum-python==1.0.1) (0.3.6)
Building wheels for collected packages: greenplum-python
  Building wheel for greenplum-python (PEP 517) ... done
  Created wheel for greenplum-python: filename=greenplum_python-1.0.1-py3-none-any.whl size=71903 sha256=305b83c461fb90310fafe09821f5778ef5235a439aee59ee3c1f304e349188d6
  Stored in directory: /tmp/pip-ephem-wheel-cache-w_h4u4oe/wheels/bb/1f/99/ff8594e48ec11df99af6e0ee8611a5e560e9f44d1a3fefb351
Successfully built greenplum-python
Installing collected packages: greenplum-python
Successfully installed greenplum-python-1.0.1

Preparing Data

With GreenplumPython installed, let’s create a table with some sample text data:

[2]:
content = ["I have a dog.", "I like eating apples."]

import greenplumpython as gp

db = gp.database("postgresql://localhost:7000")
t = (
    db.create_dataframe(columns={"id": range(len(content)), "content": content})
    .save_as(
        table_name="text_sample",
        column_names=["id", "content"],
        distribution_key={"id"},
        distribution_type="hash",
        drop_if_exists=True,
        drop_cascade=True,
    )
    .check_unique(columns={"id"})
)

Generating and Indexing Embeddings

On the text sample table, we can now create an embedding index with the new embedding module:

[3]:
import greenplumpython.experimental.embedding

t = t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")
t
[3]:
id content
0 I have a dog.
1 I like eating apples.

This generates embeddings for the text data using the specified model and creates a vector index on the embeddings for fast k-NN search.

Generating Embeddings without Indexing

If we just want to generate the embeddings without creating a vector index, we can use the function create_embedding() from the embedding module:

[5]:
from greenplumpython.experimental.embedding import create_embedding

Since we’re not indexing the vectors, the dataframe doesn’t need to be stored as a table in the database. Nor do we need to specify a unique key, as long as the embeddings stay in the same dataframe as the text.

Furthermore, if we want to save the embeddings as a vector type with a fixed embedding dimension, so that we can index them later, we need to cast the result to type gp.type_("vector", modifier=<embedding_dimension>). For all-MiniLM-L6-v2, the embedding dimension is 384.

[6]:
db.create_dataframe(columns={"id": range(len(content)), "content": content}).assign(
    embedding_col=lambda t: (
        gp.type_("vector", modifier=384)(create_embedding(t["content"], "all-MiniLM-L6-v2"))
    ),
)
[6]:
id content embedding_col
0 I have a dog. [-0.03659846,-0.012087725,0.08805456,0.06115138,-0.043457743,-0.01559289,0.07047544,-0.002039723,0.08257614,-0.027372131,0.04414288,-0.03269939,0.013636172,0.041616578,0.01041031,-0.0015929971,-0.06705982,-0.04409838,-0.0057846354,-0.064064376,-0.065876596,0.07500749,0.012162328,-0.005788606,-0.10990597,0.027304182,-0.039163485,-0.05016219,0.0029829051,-0.03839179,-0.015229771,-0.055909265,-0.011802612,-0.004877074,-0.042732246,-0.041694522,0.0065653333,-0.013692751,0.10301593,0.080455385,0.04717231,0.014515034,0.06301975,-0.008371313,-0.0037640464,0.037010957,-0.08730185,-0.019860014,0.116165005,-0.00917515,-0.029422058,0.057609066,-0.017986145,0.030363863,-0.018659862,-0.02392018,0.0075364686,0.030293366,-0.0017754078,-0.02729239,0.010452815,0.06776974,0.009428493,0.0472378,0.00020276332,0.020744024,-0.06177006,0.06334934,-0.0663036,0.055175003,0.036360443,0.03362923,0.04471841,0.09541594,-0.03548978,-0.107487485,0.06386296,-0.030471282,0.1804235,0.07920901,-0.08705953,-0.06174665,-0.042777628,0.04772076,0.0404552,0.011489488,0.07283992,0.06658615,-0.117522426,0.011569878,-0.02257866,-0.049202636,-0.03411388,0.017634219,-0.0032649843,-0.01003333,-0.022944538,-0.03394829,-0.021662673,0.08960133,0.0084434915,0.028806066,0.071886115,0.045687664,0.09596071,0.02309955,-0.093928486,0.060846634,-0.010293208,0.0019619183,-0.01262796,0.00903265,-0.023953682,0.10200915,0.047290858,0.045499157,-0.075414516,-0.024221132,0.060803182,-0.09191944,0.011989032,0.021896891,-0.04434021,0.021226257,0.019848466,-0.05852586,0.03497773,-9.0260415e-33,-0.005345117,-0.02905326,0.014672193,0.04659996,-0.02827288,0.013217331,-0.038185596,0.030172179,-0.052595694,-0.016775005,0.0034630296,0.0005796092,-0.020373767,-0.034381017,-0.003368548,0.0013990616,0.051134117,0.018485619,0.08034309,-0.00014359028,-0.013998864,-0.021287004,0.039143343,0.017298104,-0.01783786,-0.012515478,-0.013980072,-0.08343152,-0.026559573,0.024582457,0.028264288,0.020893438,0.045632884,-0.041543033,-0.
10551889,-0.036366448,-0.053493455,-0.055436466,-0.04398083,0.052545346,0.08640961,-0.0042671273,0.017281737,-0.00034555266,0.004699934,-0.034805812,0.008263813,0.020119106,-0.09260096,0.014703147,0.011787516,-0.033072904,0.0042901696,-0.089319356,-0.029248364,-0.041016947,0.05976217,-0.00918999,0.019669672,0.08591935,0.022527535,0.0075523653,-0.030852487,0.0293062,0.051727384,-0.090517566,-0.095217556,-0.04174029,-0.0011758066,0.014292619,-0.024682239,-0.00352193,0.007736276,-0.017399697,0.071428835,-0.01235873,-0.005342253,-0.0033088347,-0.01875911,-0.07966146,0.019006373,0.0018609434,0.0070682094,0.05770639,0.07751444,0.05984159,-0.029955158,-0.0058063027,-0.023169449,0.0026583143,-0.0657157,-0.04399309,0.033948664,-0.027996266,0.040526785,5.238206e-33,0.010241496,0.03607309,0.046909038,0.01363597,-0.005335445,0.0016521378,-0.020371612,0.04564497,-0.08217566,0.06402261,-0.0017092424,0.04467231,0.10069538,0.00045676067,0.062299315,0.037693474,-0.039460365,-0.019606704,0.05026583,-0.05616923,-0.18455043,0.08040067,0.074261375,0.01932381,-0.026447851,0.040501554,-0.019648919,-0.023729222,-0.0589519,-0.08537439,-0.045682527,-0.12889874,-0.05590042,-0.068548314,-0.0058031343,0.06694754,-0.023167383,-0.14575258,-0.0123237,-0.05953811,0.036701642,-0.0021032344,0.048329204,0.078937724,0.01448631,0.029141147,0.014654006,-0.06743169,0.009763479,0.03308005,-0.026131291,-0.008976251,-0.02805068,-0.06251999,-0.003333123,-0.01415754,-0.07179516,-0.06783281,0.014238787,0.008521243,-0.03168489,0.0996435,-0.052023314,0.13799056,-0.01971767,-0.0868198,-0.007109497,-0.055724714,0.011921491,-0.073369175,-0.007965487,0.07029797,-0.031166457,-0.055607323,0.010831625,0.04010842,0.051589157,-0.0015768374,0.037868574,0.015498447,-0.06851171,-0.040853888,0.0092245145,-0.010765777,-0.0015251125,-0.037699535,-0.0050808727,0.05028556,-0.0018061057,0.047179475,-0.032873698,0.07862571,0.021928754,-0.055561442,0.0068104025,-1.601115e-08,-0.047843583,-0.0016648499,-0.0019612422,-0.002554689,0.05
1340975,0.03563475,0.008412877,-0.06416777,-0.031938273,-0.019677943,0.031404965,-0.017351918,-0.043358672,0.020338805,0.10461016,0.025110265,0.01756787,8.452355e-06,0.03481564,0.119492605,-0.07120706,0.014109293,0.07982084,-0.006870619,-0.0052823476,-0.029617291,0.0735672,0.06555545,-0.0973324,0.06841363,-0.03208407,0.109986424,-0.03169939,0.018973589,0.024622567,-0.06959749,0.07099971,-0.0502078,0.044230383,0.021497766,0.057419084,0.12532368,-0.08883316,-0.018113941,0.0011768519,0.06459078,-0.0014821336,-0.09094165,-0.0075864964,-0.00019048726,-0.12415704,-0.06488212,0.09381432,0.051018294,-0.020306533,-0.004231261,-0.01809832,-0.07439528,0.05670538,0.03697211,0.038794994,0.04458422,-0.080352895,-0.030577209]
1 I like eating apples. [0.021809125,-0.015531936,0.011607823,0.08773645,-0.060896702,-0.035311054,0.11097563,-0.05388054,0.015478587,0.025643231,0.034682132,-0.09349964,0.018253824,0.0032013033,0.04340516,-0.037074324,0.088959105,-0.0040924014,-0.010021067,0.005995189,-0.078318015,0.066143975,0.042326767,-0.027101004,0.017702203,0.047038272,0.069593005,-0.037545256,-0.08466898,-0.0149313845,-0.05919546,2.3295312e-05,0.013309427,0.012327688,-0.05439113,0.0081964955,0.1404407,-0.07974374,-0.04133351,-0.022248574,0.018386977,0.06675908,0.060005367,0.040904358,-0.057686336,-0.008572924,-0.00069316046,-0.017934205,0.09348524,0.04610809,0.042312037,0.004256497,-0.035399776,-0.031868283,0.055097736,0.030634014,0.017477207,0.007607817,0.0028514373,-0.00848901,0.07058604,-0.065969445,-0.0030018615,0.017515391,0.03681229,-0.051015034,-0.051681925,-0.007240641,-0.05672333,-0.00033159592,-0.016689943,0.050976675,0.09232235,0.04870195,-0.023326442,0.014425969,0.0944048,-0.084106356,-0.06532096,0.010295293,-0.060007896,-0.0066203177,0.018760895,0.006218678,-0.016821053,-0.05153683,-0.019194037,0.019247936,-0.05592109,0.07442912,0.0011268753,-0.01857252,-0.03386638,0.048263445,0.0018756357,0.021458386,0.02670066,-0.07195236,-0.035215963,0.09375799,0.009641698,0.03153929,-0.0065211398,0.059988208,0.029077088,0.006436115,-0.16888268,-0.012192926,0.008317657,-0.0010368789,0.020289453,-0.015101345,-0.036400657,-0.0053182426,0.016343204,0.04836311,0.052492023,0.0022888337,0.013867806,-0.011067135,-0.00632472,0.08962689,-0.056332756,-5.0709128e-05,0.00037433414,-0.043979205,0.030548107,-6.1121328e-33,-0.10044726,-0.04796987,0.050677963,-0.031848893,0.017650908,0.0055781733,0.035132997,0.095104724,0.091575645,-0.026064795,-0.0059387633,-0.023844877,-0.03789114,-0.0062694866,0.024072742,-0.06319934,-0.025684576,0.072659574,-0.04208775,-0.014134044,-0.017349942,-0.09240058,-0.0064091305,0.09291195,-0.027069137,-0.08738226,0.042585023,-0.12305703,0.062073898,0.017139783,0.043850746,-0.0055
54785,-0.03515969,-0.057963055,-0.0016850779,-0.029315367,0.07211059,0.049894214,-0.028748097,0.0011031141,-0.007046476,0.020515675,0.067191206,0.021492152,0.06486439,0.0060838815,0.025401684,0.0739729,-0.030965952,-0.00762098,-0.04577821,-0.048278432,0.09053183,0.03222762,-0.015725326,-0.0107247075,0.013521895,-0.0360384,-0.09246122,0.01310438,-0.07853673,0.049683314,0.008800179,-0.007872623,-0.11311235,0.11412768,-0.03581802,-0.047303308,0.014969714,0.02396507,-0.04279115,0.031482812,-0.022683978,0.0005804888,-0.11246332,-0.09786996,0.045210768,-0.03159177,-0.05506939,-0.023562698,0.052014776,-0.0024514296,0.003902688,-0.010034765,0.033652794,0.122117504,-0.06718425,-0.066750795,0.108197525,-0.015414996,0.00400915,0.021052254,0.016455496,0.019499239,-0.12814386,5.5857067e-33,-0.0018572184,-0.080794044,-0.013305333,0.01841107,-0.037682977,-0.067594446,-0.087071694,0.013579439,-0.02803438,-0.032445576,-0.026130464,-0.006865185,-0.022305872,-0.016416714,0.023153791,0.024428565,-0.011959914,0.093689434,-0.032577604,0.026465515,-0.046098784,0.008481788,-0.006716845,0.019120447,0.016167238,-0.02313292,-0.0042774454,0.043933924,-0.018111937,0.059962064,0.05109594,-0.07903502,-0.059705768,-0.13360032,0.04902079,0.035442233,-0.09378037,-0.056613892,-0.0022577408,0.030770848,0.015449528,0.0032539356,0.031303164,0.11281754,0.036288805,0.09346795,0.0313906,0.058778953,0.022154897,0.05777495,0.00097196305,-0.02609103,-0.06628839,0.015047393,0.03955508,0.0523623,0.0069718817,0.0009399279,-0.039598145,-0.07549803,-0.102647424,0.06432405,0.018766917,0.013961236,0.06031335,-0.02941947,-0.03033608,-0.053566907,-0.07672768,0.012401397,-0.009276499,-0.054574206,-0.0566019,-0.024081016,-0.039790105,-0.035410725,0.011844968,0.036265053,-0.08490442,0.058963377,-0.030408578,0.10739632,0.010045292,0.06581671,0.049952522,0.05613914,-0.018259415,0.023479586,-0.04595969,0.03890778,-0.005904789,-0.015094089,0.013457783,-0.03914847,0.011510677,-1.5212935e-08,-0.045827758,-0.029699294,0.0350302
4,-0.010878928,-0.003190462,0.07422462,-0.07662787,0.05413322,0.02137874,-0.040636785,0.062867135,0.08551578,-0.08906489,0.05611474,0.048328113,0.008293776,0.08469364,-0.027762407,-0.015386819,0.067916475,-0.0937729,0.018911839,-0.013140985,0.04376479,-0.018527055,0.021828363,0.0024259402,0.020919863,0.1057404,0.063920595,0.056231383,0.053664792,-0.08300249,0.068553776,-0.0059213005,-0.0768514,0.010081414,-0.011377745,-0.012504746,-0.10047471,-0.049601573,-0.002936166,0.015598577,-0.042786237,-0.0998226,0.022823302,0.063844405,0.011207117,0.020726835,0.08571722,0.041427787,0.026192738,0.09660777,0.08237022,0.036912948,-0.014799402,0.043485742,-0.07760759,0.015751759,0.07816933,0.117991626,0.058715604,0.021846006,-0.016581282]
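
The dimension passed as the modifier has to match the length of every embedding stored in the column. The following is a minimal sketch of that constraint in plain Python, independent of any database. The helper to_vector_literal is hypothetical and not part of GreenplumPython; the bracketed text form only mirrors the output shown above.

```python
EMBEDDING_DIMENSION = 384  # dimension used with all-MiniLM-L6-v2 in the cell above


def to_vector_literal(values, dimension=EMBEDDING_DIMENSION):
    """Format a sequence of numbers as a bracketed text literal.

    Hypothetical helper for illustration only; it is not part of
    GreenplumPython. It rejects inputs whose length differs from
    the declared dimension.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} dimensions, not {len(values)}")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# A toy 3-dimensional vector, so the result is easy to read.
print(to_vector_literal([0.5, -0.25, 1.0], dimension=3))
```

Checking the length before storing makes a model mismatch (for example, switching to a model with a different output size) fail early rather than at indexing time.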

Semantic Search by Embeddings

With the embedding index, we can search for contents based on semantic similarity:

[4]:
t.embedding().search(column="content", query="apple", top_k=1)
[4]:
id content
1 I like eating apples.

The search is efficient because it uses the vector index instead of scanning all the data.
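
For contrast, a search without an index has to score every row against the query. The sketch below shows such a brute-force top-k search in plain Python. The 3-dimensional vectors are made up for illustration and are not real model output, and the use of cosine similarity is an assumption rather than a statement about how the embedding module ranks results.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(rows, query_vector, top_k=1):
    """Score every row against the query and return the top_k contents."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(row["embedding"], query_vector),
        reverse=True,  # most similar first
    )
    return [row["content"] for row in ranked[:top_k]]


# Hypothetical toy embeddings; a real model would produce 384 dimensions.
rows = [
    {"content": "I have a dog.", "embedding": [0.9, 0.1, 0.0]},
    {"content": "I like eating apples.", "embedding": [0.1, 0.9, 0.2]},
]
query_vector = [0.0, 1.0, 0.1]  # stand-in for the embedding of "apple"

print(brute_force_search(rows, query_vector, top_k=1))
```

The cost of this approach grows linearly with the number of rows, which is the work a vector index is meant to avoid.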

Cleaning All at Once

To ease management, the dependencies between the embedding index and the base table are recorded in the database.

As a result, trying to drop the base table alone will fail:

[6]:
%reload_ext sql
%sql postgresql://localhost:7000
%sql DROP TABLE text_sample
 * postgresql://localhost:7000
(psycopg2.errors.DependentObjectsStillExist) cannot drop table text_sample because other objects depend on it
DETAIL:  table cte_32a769763ae94cd9b4036ceb590c4f0d depends on table text_sample
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

[SQL: DROP TABLE text_sample]
(Background on this error at: https://sqlalche.me/e/20/2j85)

To drop the base table, we need to also drop the embedding index. This can be achieved with CASCADE:

[7]:
%%sql
DROP TABLE text_sample CASCADE;

SELECT oid, relname
FROM gp_dist_random('pg_class')
WHERE relname = 'cte_32a769763ae94cd9b4036ceb590c4f0d';
 * postgresql://localhost:7000
Done.
0 rows affected.
[7]:
oid relname

As we can see, after DROP TABLE ... CASCADE, the embedding index is also dropped on all segments.
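
The drop behavior above can be pictured as a small dependency graph. The following is a toy model in plain Python, not how the database implements it. The catalog dictionary, the drop function, and the name embedding_index are all hypothetical stand-ins.

```python
class DependentObjectsStillExist(Exception):
    """Raised when an object is dropped while others still depend on it."""


def drop(catalog, name, cascade=False):
    """Drop `name` from `catalog`, a dict mapping each object to its dependents."""
    dependents = catalog.get(name, set())
    if dependents and not cascade:
        raise DependentObjectsStillExist(
            f"cannot drop {name} because {sorted(dependents)} depend on it"
        )
    # Drop dependents first, then the object itself.
    for dependent in list(dependents):
        drop(catalog, dependent, cascade=True)
    catalog.pop(name, None)
    # Forget the dropped object wherever it was listed as a dependent.
    for remaining in catalog.values():
        remaining.discard(name)


# Hypothetical catalog: the index depends on the base table.
catalog = {"text_sample": {"embedding_index"}, "embedding_index": set()}

try:
    drop(catalog, "text_sample")
except DependentObjectsStillExist as error:
    print(error)

drop(catalog, "text_sample", cascade=True)
print(catalog)
```

Recording the dependency is what lets a single cascading drop remove both objects, instead of leaving an orphaned index behind.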