Generating, Indexing and Searching Embeddings (Experimental)
WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.
Installing the Package
For testing purposes, let’s install the latest development version:
[1]:
%cd ../../../
!python3 -m pip install --upgrade .
/home/gpadmin/GreenplumPython
Defaulting to user installation because normal site-packages is not writeable
Processing /home/gpadmin/GreenplumPython
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Requirement already satisfied, skipping upgrade: psycopg2-binary==2.9.5 in /home/gpadmin/.local/lib/python3.9/site-packages (from greenplum-python==1.0.1) (2.9.5)
Requirement already satisfied, skipping upgrade: dill==0.3.6 in /home/gpadmin/.local/lib/python3.9/site-packages (from greenplum-python==1.0.1) (0.3.6)
Building wheels for collected packages: greenplum-python
Building wheel for greenplum-python (PEP 517) ... done
Created wheel for greenplum-python: filename=greenplum_python-1.0.1-py3-none-any.whl size=71903 sha256=305b83c461fb90310fafe09821f5778ef5235a439aee59ee3c1f304e349188d6
Stored in directory: /tmp/pip-ephem-wheel-cache-w_h4u4oe/wheels/bb/1f/99/ff8594e48ec11df99af6e0ee8611a5e560e9f44d1a3fefb351
Successfully built greenplum-python
Installing collected packages: greenplum-python
Successfully installed greenplum-python-1.0.1
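To confirm which version ended up installed, a small check with the standard library can help. This is only a sketch: the distribution name greenplum-python is taken from the pip output above, and the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "greenplum-python" is the distribution name shown in the pip output above.
print(installed_version("greenplum-python"))
```

If this prints None, the package is not visible to the Python interpreter that runs the notebook.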
Preparing Data
With GreenplumPython installed, let’s create a table with some sample text data:
[2]:
content = ["I have a dog.", "I like eating apples."]
import greenplumpython as gp
db = gp.database("postgresql://localhost:7000")
t = (
    db.create_dataframe(columns={"id": range(len(content)), "content": content})
    .save_as(
        table_name="text_sample",
        column_names=["id", "content"],
        distribution_key={"id"},
        distribution_type="hash",
        drop_if_exists=True,
        drop_cascade=True,
    )
    .check_unique(columns={"id"})
)
Generating and Indexing Embeddings
On the text sample table, we can now create an embedding index with the new embedding module:
[3]:
import greenplumpython.experimental.embedding
t = t.embedding().create_index(column="content", model_name="all-MiniLM-L6-v2")
t
[3]:
id | content |
---|---|
0 | I have a dog. |
1 | I like eating apples. |
This generates embeddings for the text data using the specified model and creates a vector index on the embeddings for fast k-NN search.
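To make "k-NN search" concrete, here is a minimal pure-Python sketch of the exact, brute-force version of the operation that a vector index is meant to speed up. It does not use GreenplumPython. The toy 3-dimensional vectors, the helper names and the choice of cosine similarity are all assumptions for illustration; this tutorial does not state which distance metric the module uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, rows, k):
    """rows is a list of (id, vector); return the ids of the k most similar rows."""
    ranked = sorted(rows, key=lambda row: cosine_similarity(query, row[1]), reverse=True)
    return [row_id for row_id, _ in ranked[:k]]


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
rows = [(0, [0.9, 0.1, 0.0]), (1, [0.1, 0.9, 0.2])]
nearest = knn([0.0, 1.0, 0.1], rows, k=1)
print(nearest)
```

This version compares the query against every row, which is the cost an index tries to avoid on large tables.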
Generating Embeddings without Indexing
If we just want to generate the embeddings without creating a vector index, we can use the function create_embedding() from the embedding module:
[5]:
from greenplumpython.experimental.embedding import create_embedding
Since we’re not indexing the vectors, the dataframe doesn’t need to be stored as a table in the database. We also don’t need to specify a unique key, as long as the embeddings are kept in the same dataframe as the text.
Furthermore, if we want to save the embeddings as a vector type with a fixed embedding dimension, so that we can index them later, we need to cast the result to the type gp.type_("vector", modifier=<embedding_dimension>).
[6]:
db.create_dataframe(columns={"id": range(len(content)), "content": content}).assign(
    embedding_col=lambda t: (
        gp.type_("vector", modifier=384)(create_embedding(t["content"], "all-MiniLM-L6-v2"))
    ),
)
[6]:
id | content | embedding_col |
---|---|---|
0 | I have a dog. | [-0.03659846,-0.012087725,0.08805456,0.06115138,-0.043457743,-0.01559289,0.07047544,-0.002039723,0.08257614,-0.027372131,0.04414288,-0.03269939,0.013636172,0.041616578,0.01041031,-0.0015929971,-0.06705982,-0.04409838,-0.0057846354,-0.064064376,-0.065876596,0.07500749,0.012162328,-0.005788606,-0.10990597,0.027304182,-0.039163485,-0.05016219,0.0029829051,-0.03839179,-0.015229771,-0.055909265,-0.011802612,-0.004877074,-0.042732246,-0.041694522,0.0065653333,-0.013692751,0.10301593,0.080455385,0.04717231,0.014515034,0.06301975,-0.008371313,-0.0037640464,0.037010957,-0.08730185,-0.019860014,0.116165005,-0.00917515,-0.029422058,0.057609066,-0.017986145,0.030363863,-0.018659862,-0.02392018,0.0075364686,0.030293366,-0.0017754078,-0.02729239,0.010452815,0.06776974,0.009428493,0.0472378,0.00020276332,0.020744024,-0.06177006,0.06334934,-0.0663036,0.055175003,0.036360443,0.03362923,0.04471841,0.09541594,-0.03548978,-0.107487485,0.06386296,-0.030471282,0.1804235,0.07920901,-0.08705953,-0.06174665,-0.042777628,0.04772076,0.0404552,0.011489488,0.07283992,0.06658615,-0.117522426,0.011569878,-0.02257866,-0.049202636,-0.03411388,0.017634219,-0.0032649843,-0.01003333,-0.022944538,-0.03394829,-0.021662673,0.08960133,0.0084434915,0.028806066,0.071886115,0.045687664,0.09596071,0.02309955,-0.093928486,0.060846634,-0.010293208,0.0019619183,-0.01262796,0.00903265,-0.023953682,0.10200915,0.047290858,0.045499157,-0.075414516,-0.024221132,0.060803182,-0.09191944,0.011989032,0.021896891,-0.04434021,0.021226257,0.019848466,-0.05852586,0.03497773,-9.0260415e-33,-0.005345117,-0.02905326,0.014672193,0.04659996,-0.02827288,0.013217331,-0.038185596,0.030172179,-0.052595694,-0.016775005,0.0034630296,0.0005796092,-0.020373767,-0.034381017,-0.003368548,0.0013990616,0.051134117,0.018485619,0.08034309,-0.00014359028,-0.013998864,-0.021287004,0.039143343,0.017298104,-0.01783786,-0.012515478,-0.013980072,-0.08343152,-0.026559573,0.024582457,0.028264288,0.020893438,0.045632884,-0.041543033
,-0.10551889,-0.036366448,-0.053493455,-0.055436466,-0.04398083,0.052545346,0.08640961,-0.0042671273,0.017281737,-0.00034555266,0.004699934,-0.034805812,0.008263813,0.020119106,-0.09260096,0.014703147,0.011787516,-0.033072904,0.0042901696,-0.089319356,-0.029248364,-0.041016947,0.05976217,-0.00918999,0.019669672,0.08591935,0.022527535,0.0075523653,-0.030852487,0.0293062,0.051727384,-0.090517566,-0.095217556,-0.04174029,-0.0011758066,0.014292619,-0.024682239,-0.00352193,0.007736276,-0.017399697,0.071428835,-0.01235873,-0.005342253,-0.0033088347,-0.01875911,-0.07966146,0.019006373,0.0018609434,0.0070682094,0.05770639,0.07751444,0.05984159,-0.029955158,-0.0058063027,-0.023169449,0.0026583143,-0.0657157,-0.04399309,0.033948664,-0.027996266,0.040526785,5.238206e-33,0.010241496,0.03607309,0.046909038,0.01363597,-0.005335445,0.0016521378,-0.020371612,0.04564497,-0.08217566,0.06402261,-0.0017092424,0.04467231,0.10069538,0.00045676067,0.062299315,0.037693474,-0.039460365,-0.019606704,0.05026583,-0.05616923,-0.18455043,0.08040067,0.074261375,0.01932381,-0.026447851,0.040501554,-0.019648919,-0.023729222,-0.0589519,-0.08537439,-0.045682527,-0.12889874,-0.05590042,-0.068548314,-0.0058031343,0.06694754,-0.023167383,-0.14575258,-0.0123237,-0.05953811,0.036701642,-0.0021032344,0.048329204,0.078937724,0.01448631,0.029141147,0.014654006,-0.06743169,0.009763479,0.03308005,-0.026131291,-0.008976251,-0.02805068,-0.06251999,-0.003333123,-0.01415754,-0.07179516,-0.06783281,0.014238787,0.008521243,-0.03168489,0.0996435,-0.052023314,0.13799056,-0.01971767,-0.0868198,-0.007109497,-0.055724714,0.011921491,-0.073369175,-0.007965487,0.07029797,-0.031166457,-0.055607323,0.010831625,0.04010842,0.051589157,-0.0015768374,0.037868574,0.015498447,-0.06851171,-0.040853888,0.0092245145,-0.010765777,-0.0015251125,-0.037699535,-0.0050808727,0.05028556,-0.0018061057,0.047179475,-0.032873698,0.07862571,0.021928754,-0.055561442,0.0068104025,-1.601115e-08,-0.047843583,-0.0016648499,-0.0019612422,-0.002554689,
0.051340975,0.03563475,0.008412877,-0.06416777,-0.031938273,-0.019677943,0.031404965,-0.017351918,-0.043358672,0.020338805,0.10461016,0.025110265,0.01756787,8.452355e-06,0.03481564,0.119492605,-0.07120706,0.014109293,0.07982084,-0.006870619,-0.0052823476,-0.029617291,0.0735672,0.06555545,-0.0973324,0.06841363,-0.03208407,0.109986424,-0.03169939,0.018973589,0.024622567,-0.06959749,0.07099971,-0.0502078,0.044230383,0.021497766,0.057419084,0.12532368,-0.08883316,-0.018113941,0.0011768519,0.06459078,-0.0014821336,-0.09094165,-0.0075864964,-0.00019048726,-0.12415704,-0.06488212,0.09381432,0.051018294,-0.020306533,-0.004231261,-0.01809832,-0.07439528,0.05670538,0.03697211,0.038794994,0.04458422,-0.080352895,-0.030577209] |
1 | I like eating apples. | [0.021809125,-0.015531936,0.011607823,0.08773645,-0.060896702,-0.035311054,0.11097563,-0.05388054,0.015478587,0.025643231,0.034682132,-0.09349964,0.018253824,0.0032013033,0.04340516,-0.037074324,0.088959105,-0.0040924014,-0.010021067,0.005995189,-0.078318015,0.066143975,0.042326767,-0.027101004,0.017702203,0.047038272,0.069593005,-0.037545256,-0.08466898,-0.0149313845,-0.05919546,2.3295312e-05,0.013309427,0.012327688,-0.05439113,0.0081964955,0.1404407,-0.07974374,-0.04133351,-0.022248574,0.018386977,0.06675908,0.060005367,0.040904358,-0.057686336,-0.008572924,-0.00069316046,-0.017934205,0.09348524,0.04610809,0.042312037,0.004256497,-0.035399776,-0.031868283,0.055097736,0.030634014,0.017477207,0.007607817,0.0028514373,-0.00848901,0.07058604,-0.065969445,-0.0030018615,0.017515391,0.03681229,-0.051015034,-0.051681925,-0.007240641,-0.05672333,-0.00033159592,-0.016689943,0.050976675,0.09232235,0.04870195,-0.023326442,0.014425969,0.0944048,-0.084106356,-0.06532096,0.010295293,-0.060007896,-0.0066203177,0.018760895,0.006218678,-0.016821053,-0.05153683,-0.019194037,0.019247936,-0.05592109,0.07442912,0.0011268753,-0.01857252,-0.03386638,0.048263445,0.0018756357,0.021458386,0.02670066,-0.07195236,-0.035215963,0.09375799,0.009641698,0.03153929,-0.0065211398,0.059988208,0.029077088,0.006436115,-0.16888268,-0.012192926,0.008317657,-0.0010368789,0.020289453,-0.015101345,-0.036400657,-0.0053182426,0.016343204,0.04836311,0.052492023,0.0022888337,0.013867806,-0.011067135,-0.00632472,0.08962689,-0.056332756,-5.0709128e-05,0.00037433414,-0.043979205,0.030548107,-6.1121328e-33,-0.10044726,-0.04796987,0.050677963,-0.031848893,0.017650908,0.0055781733,0.035132997,0.095104724,0.091575645,-0.026064795,-0.0059387633,-0.023844877,-0.03789114,-0.0062694866,0.024072742,-0.06319934,-0.025684576,0.072659574,-0.04208775,-0.014134044,-0.017349942,-0.09240058,-0.0064091305,0.09291195,-0.027069137,-0.08738226,0.042585023,-0.12305703,0.062073898,0.017139783,0.043850746,-0.
005554785,-0.03515969,-0.057963055,-0.0016850779,-0.029315367,0.07211059,0.049894214,-0.028748097,0.0011031141,-0.007046476,0.020515675,0.067191206,0.021492152,0.06486439,0.0060838815,0.025401684,0.0739729,-0.030965952,-0.00762098,-0.04577821,-0.048278432,0.09053183,0.03222762,-0.015725326,-0.0107247075,0.013521895,-0.0360384,-0.09246122,0.01310438,-0.07853673,0.049683314,0.008800179,-0.007872623,-0.11311235,0.11412768,-0.03581802,-0.047303308,0.014969714,0.02396507,-0.04279115,0.031482812,-0.022683978,0.0005804888,-0.11246332,-0.09786996,0.045210768,-0.03159177,-0.05506939,-0.023562698,0.052014776,-0.0024514296,0.003902688,-0.010034765,0.033652794,0.122117504,-0.06718425,-0.066750795,0.108197525,-0.015414996,0.00400915,0.021052254,0.016455496,0.019499239,-0.12814386,5.5857067e-33,-0.0018572184,-0.080794044,-0.013305333,0.01841107,-0.037682977,-0.067594446,-0.087071694,0.013579439,-0.02803438,-0.032445576,-0.026130464,-0.006865185,-0.022305872,-0.016416714,0.023153791,0.024428565,-0.011959914,0.093689434,-0.032577604,0.026465515,-0.046098784,0.008481788,-0.006716845,0.019120447,0.016167238,-0.02313292,-0.0042774454,0.043933924,-0.018111937,0.059962064,0.05109594,-0.07903502,-0.059705768,-0.13360032,0.04902079,0.035442233,-0.09378037,-0.056613892,-0.0022577408,0.030770848,0.015449528,0.0032539356,0.031303164,0.11281754,0.036288805,0.09346795,0.0313906,0.058778953,0.022154897,0.05777495,0.00097196305,-0.02609103,-0.06628839,0.015047393,0.03955508,0.0523623,0.0069718817,0.0009399279,-0.039598145,-0.07549803,-0.102647424,0.06432405,0.018766917,0.013961236,0.06031335,-0.02941947,-0.03033608,-0.053566907,-0.07672768,0.012401397,-0.009276499,-0.054574206,-0.0566019,-0.024081016,-0.039790105,-0.035410725,0.011844968,0.036265053,-0.08490442,0.058963377,-0.030408578,0.10739632,0.010045292,0.06581671,0.049952522,0.05613914,-0.018259415,0.023479586,-0.04595969,0.03890778,-0.005904789,-0.015094089,0.013457783,-0.03914847,0.011510677,-1.5212935e-08,-0.045827758,-0.029699294,0.035
03024,-0.010878928,-0.003190462,0.07422462,-0.07662787,0.05413322,0.02137874,-0.040636785,0.062867135,0.08551578,-0.08906489,0.05611474,0.048328113,0.008293776,0.08469364,-0.027762407,-0.015386819,0.067916475,-0.0937729,0.018911839,-0.013140985,0.04376479,-0.018527055,0.021828363,0.0024259402,0.020919863,0.1057404,0.063920595,0.056231383,0.053664792,-0.08300249,0.068553776,-0.0059213005,-0.0768514,0.010081414,-0.011377745,-0.012504746,-0.10047471,-0.049601573,-0.002936166,0.015598577,-0.042786237,-0.0998226,0.022823302,0.063844405,0.011207117,0.020726835,0.08571722,0.041427787,0.026192738,0.09660777,0.08237022,0.036912948,-0.014799402,0.043485742,-0.07760759,0.015751759,0.07816933,0.117991626,0.058715604,0.021846006,-0.016581282] |
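The cast to gp.type_("vector", modifier=384) above fixes the dimension of the column, so every stored embedding is expected to have exactly that many components (384 for all-MiniLM-L6-v2). As a rough illustration of that constraint, the sketch below checks the length of a Python list and renders it in the bracketed, comma-separated text form seen in the output above. The helper to_vector_literal is hypothetical and is not part of GreenplumPython.

```python
def to_vector_literal(values, dimension):
    """Format a list of numbers as '[v1,v2,...]' after checking its length."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} components, got {len(values)}")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# A 3-dimensional toy vector; real embeddings here would use dimension=384.
literal = to_vector_literal([0.5, -0.25, 1.0], dimension=3)
print(literal)
```

Passing a list of the wrong length raises ValueError, which mirrors the idea that a vector column with a fixed dimension should reject vectors of any other size.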
Semantic Search by Embeddings
With the embedding index, we can search for content based on semantic similarity:
[4]:
t.embedding().search(column="content", query="apple", top_k=1)
[4]:
id | content |
---|---|
1 | I like eating apples. |
This is efficient because the vector index lets the search avoid scanning all the data.
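One common way an index avoids a full scan is to partition the vectors into buckets and look only inside the bucket closest to the query. The sketch below shows that idea in pure Python. It is an illustration under assumptions, not a description of the index this module actually builds: the index type is not stated in this tutorial, and the data, centroids and function names are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, candidates):
    """Position of the candidate vector closest to point."""
    return min(range(len(candidates)), key=lambda i: squared_distance(point, candidates[i]))


def build_buckets(rows, centroids):
    """Assign each (id, vector) row to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for row in rows:
        buckets[nearest_index(row[1], centroids)].append(row)
    return buckets


def search(query, centroids, buckets):
    """Scan only the bucket closest to the query; return (best_id, rows_scanned)."""
    bucket = buckets[nearest_index(query, centroids)]
    best = min(bucket, key=lambda row: squared_distance(query, row[1]))
    return best[0], len(bucket)


# Hypothetical 2-dimensional data with two hand-picked centroids.
rows = [(0, [0.1, 0.0]), (1, [0.0, 0.2]), (2, [1.0, 0.9]), (3, [0.9, 1.1])]
centroids = [[0.0, 0.0], [1.0, 1.0]]
buckets = build_buckets(rows, centroids)
best_id, scanned = search([1.0, 0.95], centroids, buckets)
print(best_id, scanned)
```

Only two of the four rows are compared with the query here. The trade-off of this kind of scheme is that a true nearest neighbor sitting in a different bucket can be missed, so results are approximate.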
Cleaning All at Once
To ease management, the dependency between the embedding index and the base table is recorded in the database.
As a result, trying to drop the base table alone will fail:
[6]:
%reload_ext sql
%sql postgresql://localhost:7000
%sql DROP TABLE text_sample
* postgresql://localhost:7000
(psycopg2.errors.DependentObjectsStillExist) cannot drop table text_sample because other objects depend on it
DETAIL: table cte_32a769763ae94cd9b4036ceb590c4f0d depends on table text_sample
HINT: Use DROP ... CASCADE to drop the dependent objects too.
[SQL: DROP TABLE text_sample]
(Background on this error at: https://sqlalche.me/e/20/2j85)
To drop the base table, we also need to drop the embedding index. This can be achieved with CASCADE:
[7]:
%%sql
DROP TABLE text_sample CASCADE;
SELECT oid, relname
FROM gp_dist_random('pg_class')
WHERE relname = 'cte_32a769763ae94cd9b4036ceb590c4f0d';
* postgresql://localhost:7000
Done.
0 rows affected.
[7]:
oid | relname |
---|---|
As we can see, after DROP ... CASCADE, the embedding index is also dropped on all segments.
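The behavior above follows from dependency tracking: the database remembers that the generated embedding table depends on text_sample, refuses a plain drop, and removes both when CASCADE is given. The toy model below mimics that logic in pure Python. It is only a mental model under stated assumptions: the catalog dictionary, the drop function and the locally defined exception class are hypothetical, and "embedding_index" stands in for the generated cte_... table.

```python
class DependentObjectsStillExist(Exception):
    """Local stand-in for the error the database raises; defined here for the sketch."""


def drop(catalog, name, cascade=False):
    """catalog maps each object name to the set of objects that depend on it."""
    dependents = catalog.get(name, set())
    if dependents and not cascade:
        raise DependentObjectsStillExist(
            f"cannot drop {name}: {sorted(dependents)} depend on it"
        )
    # With cascade, remove every dependent object first, then the object itself.
    for dependent in list(dependents):
        drop(catalog, dependent, cascade=True)
    catalog.pop(name, None)
    for others in catalog.values():
        others.discard(name)


catalog = {"text_sample": {"embedding_index"}, "embedding_index": set()}
try:
    drop(catalog, "text_sample")
except DependentObjectsStillExist as error:
    print(error)
drop(catalog, "text_sample", cascade=True)
print(catalog)
```

After the cascading drop the catalog is empty, matching the empty query result shown above.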