Operators and Indexes
module:: greenplumpython
Indexes are essential for fast data searching.
With indexes, we can retrieve rows we want by scanning the index, rather than the entire dataframe. As a result, when the dataframe is large, the amount of data to be scanned is typically much smaller with an index.
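The difference between the two access paths can be sketched in plain Python. This is only a toy model, not how GreenplumPython or the database implements indexes: the names `rows`, `full_scan`, and `index_scan` are hypothetical, and the "index" is just a sorted list of keys searched with `bisect`.

```python
import bisect

# Toy data: each row is (id, payload). All names here are illustrative only.
rows = [(17, "q"), (3, "a"), (42, "z"), (8, "m"), (25, "k")]


def full_scan(rows, target_id):
    """Examine every row and keep the ones whose id matches."""
    return [row for row in rows if row[0] == target_id]


# A minimal "index": the keys in sorted order, each paired with its row position.
index = sorted((row[0], pos) for pos, row in enumerate(rows))
index_keys = [key for key, _ in index]


def index_scan(rows, target_id):
    """Binary-search the sorted keys, then fetch only the matching rows."""
    i = bisect.bisect_left(index_keys, target_id)
    matches = []
    while i < len(index_keys) and index_keys[i] == target_id:
        matches.append(rows[index[i][1]])
        i += 1
    return matches
```

Both functions return the same rows, but `index_scan` touches only a logarithmic number of keys plus the matches, while `full_scan` always visits every row.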
In pandas, a DataFrame or Series can have only one index. This means that an
index scan is possible on only one column, while for other columns we have to
go through all rows, which can be rather inefficient.
Backed by database systems, GreenplumPython overcomes this limitation by allowing the creation of

- multiple indexes, each for one set of columns, and
- multiple types of indexes for the same set of columns, each for one order.
In this way, we can search a GreenplumPython dataframe with index scans on more than one column set and in more than one order.
For example, a dataframe containing AI-generated embeddings may contain

- a column of IDs and
- a column of vectors.
We can search for embeddings by either IDs or vectors using index scan after creating an index on the ID column and another index on the vector column.
As another example, suppose we want to search for Approximate Nearest Neighbors for a given vector based on not only cosine similarity, but also L_2 distance. We can create two indexes on the vector column, each for one similarity metric.
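To see why one index per metric can be useful, the following sketch (plain Python, with made-up two-dimensional vectors and hypothetical helper names) shows that cosine distance and L_2 distance can pick different nearest neighbors for the same query. It computes the distances exhaustively and does not perform an approximate or indexed search.

```python
import math


def l2_distance(a, b):
    """Euclidean (L_2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up embeddings keyed by ID.
embeddings = {
    1: [10.0, 0.0],  # same direction as the query, but far away
    2: [1.0, 1.0],   # close to the query, but pointing elsewhere
}
query = [1.0, 0.0]

nearest_by_l2 = min(embeddings, key=lambda k: l2_distance(embeddings[k], query))
nearest_by_cosine = min(embeddings, key=lambda k: cosine_distance(embeddings[k], query))
```

Here the L_2 metric selects ID 2 while the cosine metric selects ID 1, so an ordering built for one metric cannot answer queries for the other.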
How a dataframe is searched with an index is defined by a set of operators on
the indexed columns. For example, scanning a B-tree index requires relational
operators, such as >, <, and =, for comparing two values. These operators are
encapsulated as an operator class.
Different data types have different operator classes for an index. For example,
integers and floats are compared in different ways. Even for the same data type,
we can change how two values are compared by changing the operator class.
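As a loose analogy only (this is plain Python, not database code), the effect of swapping the comparison rules for one data type can be sketched by ordering the same strings in two ways. The helper `contains` and the sample names are hypothetical. A binary search is only valid against keys ordered by the same comparison the search uses, which mirrors why an index is tied to its operator class.

```python
import bisect

names = ["bob", "Alice", "carol", "Dave"]

# Comparison rule 1: compare strings by code point (Python's default for str).
by_code_point = sorted(names)

# Comparison rule 2: compare the same strings case-insensitively.
by_casefold = sorted(names, key=str.casefold)
casefold_keys = [name.casefold() for name in by_casefold]


def contains(sorted_keys, key):
    """Binary search; valid only if sorted_keys is ordered by the rule used for key."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key
```

With the case-insensitive ordering, `contains(casefold_keys, "DAVE".casefold())` finds the entry, whereas the code-point ordering has no match for `"dave"`.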
Since indexes depend on operators to work, to use an index scan we need to
express the filtering predicate in the [] operator and where() with operators
when doing comparison or computing similarity. To ease the use of operators in
the database,

- Python’s built-in operators are mapped to the database operators of the same name, and
- for operators that do not have a built-in equivalent, we can use the operator() function to map a database operator to a Python Callable. Calling the Python function will apply the operator.
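The general idea of mapping Python's built-in operators to database operators can be sketched with operator overloading. The class `ToyExpr` below is hypothetical and is not GreenplumPython's implementation: it merely records SQL-like text, and it renders Python values with `repr()`, which is not proper SQL quoting.

```python
class ToyExpr:
    """A toy expression that records SQL-like text instead of computing a value."""

    def __init__(self, sql):
        self.sql = sql

    @staticmethod
    def _render(value):
        # Render another ToyExpr as-is; render plain Python values with repr().
        # Assumption for illustration only: repr() is not real SQL escaping.
        return value.sql if isinstance(value, ToyExpr) else repr(value)

    def _binary(self, op, other):
        return ToyExpr(f"({self.sql} {op} {self._render(other)})")

    # Each built-in comparison operator builds text using its database counterpart.
    def __gt__(self, other):
        return self._binary(">", other)

    def __lt__(self, other):
        return self._binary("<", other)

    def __eq__(self, other):
        return self._binary("=", other)


age = ToyExpr("age")
predicate = age > 18
```

Evaluating `age > 18` does not compare anything in Python; it produces an expression whose text, `(age > 18)`, could be handed to the database, where a matching index can then be considered.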
- class op.Operator
Bases: object
Represents an operator in database.
As a Python object, an Operator can be called like a function. This is because unlike SQL, Python does not support defining new operators. When an Operator is called, the corresponding operator in database will be applied.
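A minimal sketch of the "callable operator" idea, assuming nothing about the real class beyond what is stated above: `ToyOperator` is a hypothetical stand-in that implements `__call__` and returns SQL-like text rather than a GreenplumPython expression.

```python
class ToyOperator:
    """A toy stand-in for a database operator that is applied by calling it."""

    def __init__(self, name):
        self.name = name  # operator symbol as written in SQL, e.g. "~"

    def __call__(self, *operands):
        # One operand: treat as a prefix (unary) operator.
        if len(operands) == 1:
            return f"({self.name} {operands[0]})"
        # Two operands: treat as a binary operator.
        if len(operands) == 2:
            return f"({operands[0]} {self.name} {operands[1]})"
        raise TypeError("an operator takes one or two operands")


# Python has no "~" matching operator for strings, so the operator is called instead.
regex_match = ToyOperator("~")
expression = regex_match("name", "'a.*'")
```

Because Python cannot define a new infix symbol, the call syntax `regex_match(left, right)` takes the place of writing `left ~ right` as one would in SQL.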