DataFrame
DataFrame is the core data structure in GreenplumPython. Conceptually, a DataFrame is a two-dimensional structure containing data.

In the data science world, a DataFrame in GreenplumPython, referred to as gp.DataFrame, is similar to a DataFrame in pandas, except that:

- Data in a gp.DataFrame is evaluated lazily rather than eagerly. That is, it is computed only when it is observed. This can improve efficiency in many cases.
- Data in a gp.DataFrame is located and manipulated on a remote database system rather than locally. As a consequence:
  - Retrieving it from the database system can be expensive. Therefore, once the data of a gp.DataFrame is fetched from the database system, it is cached locally for later use.
  - It might be modified concurrently by other users of the database system. You might need to use refresh() to sync the updates if the data becomes stale.

In the database world, a gp.DataFrame is similar to a materialized view in a database system in that:

- Both result from a possibly complex query.
- Both hold data, as opposed to views.
- The data can become stale due to concurrent modification, and the refresh() method is similar to the REFRESH MATERIALIZED VIEW command in PostgreSQL for syncing updates.
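The lazy-evaluation and caching behavior described above can be illustrated with a plain-Python sketch. This is a conceptual analogy only, not GreenplumPython's implementation; the class and method names are made up:

```python
class LazyFrame:
    """Toy model of a lazily evaluated, locally cached dataframe."""

    def __init__(self, fetch):
        self._fetch = fetch   # how to compute/retrieve the data
        self._cache = None    # nothing is computed yet

    def rows(self):
        # Data is computed only when observed, then cached locally.
        if self._cache is None:
            self._cache = self._fetch()
        return self._cache

    def refresh(self):
        # Drop the stale cache so the next observation re-fetches.
        self._cache = None
        return self


source = [1, 2, 3]
df = LazyFrame(lambda: list(source))

print(df.rows())            # first observation triggers the fetch: [1, 2, 3]
source.append(4)            # concurrent modification elsewhere
print(df.rows())            # still the cached (stale) data: [1, 2, 3]
print(df.refresh().rows())  # refresh() syncs the update: [1, 2, 3, 4]
```

In gp.DataFrame the "fetch" is a SQL query sent to the remote database, and observation happens when the dataframe is displayed or iterated.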
- class dataframe.DataFrame
Bases:
object
Representation of a GreenplumPython DataFrame object.
- property is_saved: bool
Check whether the current dataframe is saved in the database.
- __getitem__(_) DataFrame
- __getitem__(column_names: List[str]) DataFrame
- __getitem__(predicate: Callable[[DataFrame], Expr]) DataFrame
- __getitem__(column_name: str) Expr
- __getitem__(rows: slice) DataFrame
Select parts of the DataFrame.

When you want to use a Column for computation rather than for observing data:
Args: key: string
Returns: a Column of the current DataFrame

id_col = dataframe["id"]

When you want to retrieve some columns of the DataFrame:
Args: key: list of column names
Returns: a new DataFrame with the subset of columns, a.k.a. targets

id_dataframe = dataframe[["id"]]

When you want to filter the DataFrame on Columns with conditions:
Args: key: Expr
Returns: a new DataFrame with the subset of rows for which the Expr holds

id_cond_dataframe = dataframe[lambda t: t["id"] == 0]

When you want to retrieve a portion of the DataFrame:
Args: key: slice
Returns: a new DataFrame with the portion of consecutive rows

slice_dataframe = dataframe[2:5]
- where(predicate)
Filter the DataFrame by applying the predicate.
- Parameters
predicate (Callable[[DataFrame], Expr]) – Expr: the where-condition statement.
- Returns
DataFrame filtered according to the Expr that is passed in as argument.
- Return type
Example
>>> rows = [(i,) for i in range(-10, 10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> series.where(lambda df: df["id"] > 0)
----
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
----
(9 rows)
- apply(func, expand=False, column_name=None)
Apply a dataframe function to the self DataFrame.
- Parameters
func (Callable[[DataFrame], FunctionExpr]) – A Python function that takes the self DataFrame as the only parameter, and returns the result of a dataframe function, which can be a NormalFunction, an AggregateFunction or a ColumnFunction.
expand (bool) – whether to expand the multi-valued result into columns of the resulting DataFrame.
column_name (Optional[str]) – name of the column of the return value in the resulting DataFrame.
- Returns
A DataFrame of the returned values of the function.
- Return type
Example
To compute the absolute value of a series of numbers:
>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> result = series.apply(lambda df: abs(df["id"]))
>>> result
-----
 abs
-----
  10
   9
   8
   7
   6
   5
   4
   3
   2
   1
-----
(10 rows)
To transform columns into other types, see the following example. Suppose the function label takes a str and an int, joins them into a string, and returns it:
>>> rows = [(i,) for i in range(10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> @gp.create_function
... def label(prefix: str, id: int) -> str:
...     prefix = "id"
...     return f"{prefix}_{id}"
>>> result = series.apply(lambda t: label("label", t["id"]),
...                       column_name="label")
>>> result
-------
 label
-------
 id_0
 id_1
 id_2
 id_3
 id_4
 id_5
 id_6
 id_7
 id_8
 id_9
-------
(10 rows)
- assign(**new_columns)
Add new columns to the current DataFrame. Existing columns cannot be reassigned.
- Parameters
new_columns (Callable[[DataFrame], Any]) – a dict whose keys are column names and whose values are Callables that return the column data when applied to the current DataFrame.
- Returns
a new DataFrame including the newly assigned columns.
- Return type
Example
>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> results = series.assign(abs=lambda nums: abs(nums["id"]))
>>> results
-----------
  id | abs
-----+-----
 -10 |  10
  -9 |   9
  -8 |   8
  -7 |   7
  -6 |   6
  -5 |   5
  -4 |   4
  -3 |   3
  -2 |   2
  -1 |   1
-----------
(10 rows)
- order_by(column_name, ascending=None, nulls_first=None, operator=None)
Sort the DataFrame based on the configuration.
- Parameters
column_name (str) – name of the column to order the dataframe by.
ascending (Optional[bool]) – Define the order direction: True = ASC / False = DESC.
nulls_first (Optional[bool]) – Define whether nulls are ordered first or last: True = First / False = Last.
operator (Optional[str]) – Define the ordering using an operator. Cannot be combined with ascending.
- Returns
Specification of the ordering of the current DataFrame.
- Return type
Example
>>> columns = {"id": [3, 1, 2], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> result = t.order_by("id")[:]
>>> result
--------
 id | b
----+---
  1 | 2
  2 | 3
  3 | 1
--------
(3 rows)
- join(other, how='', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Join the current DataFrame with another using the given arguments.
- Parameters
other (DataFrame) – the DataFrame to join with.
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) – How the two DataFrames are joined. The value can be one of:
"INNER": inner join,
"LEFT": left outer join,
"RIGHT": right outer join,
"FULL": full outer join, or
"CROSS": cross join, i.e. the Cartesian product.
The default value "" is equivalent to "INNER".
cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) – a Callable (lambda function) used as the join condition.
on (Optional[Union[str, Iterable[str]]]) – a list of column names that exist in both DataFrames to join on. cond and on cannot be used together.
self_columns (Union[Dict[str, Optional[str]], Set[str]]) – a dict whose keys are the column names of the current dataframe to be included in the resulting dataframe. The value, if not None, is used for renaming the corresponding key to avoid name conflicts. The asterisk "*" can be used as a key to indicate all columns.
other_columns (Union[Dict[str, Optional[str]], Set[str]]) – same as self_columns, but for the other DataFrame.
- Return type
Note
When using "*" as a key in self_columns or other_columns, please ensure that there will not be more than one column with the same name by applying proper renaming. Otherwise, an error will occur.
Example
>>> age_rows = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(
...     age_rows, column_names=["name", "age"], db=db)
>>> result = student.join(
...     student,
...     on="age",
...     self_columns={"*"},
...     other_columns={"name": "name_2"})
>>> result
----------------------
 name  | age | name_2
-------+-----+--------
 alice |  18 | alice
 bob   |  19 | carol
 bob   |  19 | bob
 carol |  19 | carol
 carol |  19 | bob
----------------------
(5 rows)
- inner_join(other, *, how='INNER', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Inner-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="INNER".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- left_join(other, *, how='LEFT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Left-outer-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="LEFT".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- right_join(other, *, how='RIGHT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Right-outer-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="RIGHT".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- full_join(other, *, how='FULL', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Full-outer-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="FULL".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- cross_join(other, *, how='CROSS', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Cross-joins the current DataFrame with another DataFrame, i.e. computes the Cartesian product.
Equivalent to calling join() with how="CROSS".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
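The different join variants map to the usual SQL join semantics. As a rough illustration of what inner, left, and cross joins produce on two row sets, consider this plain-Python sketch (a conceptual analogy only, not GreenplumPython code):

```python
left = [("alice", 18), ("bob", 19)]   # (name, age)
right = [(18, "cs"), (18, "math")]    # (age, dept)

# INNER: only the row combinations whose ages match on both sides.
inner = [(n, a, d) for (n, a) in left for (a2, d) in right if a == a2]

# LEFT: every left row is kept; unmatched rows are padded with None (SQL NULL).
left_outer = []
for (n, a) in left:
    matches = [(n, a, d) for (a2, d) in right if a == a2]
    left_outer.extend(matches if matches else [(n, a, None)])

# CROSS: the Cartesian product, ignoring any condition.
cross = [(n, a, a2, d) for (n, a) in left for (a2, d) in right]

print(inner)       # alice matched twice; bob dropped
print(left_outer)  # bob kept, padded with None
print(len(cross))  # 2 * 2 = 4 rows
```

A right join mirrors the left join with the roles of the two sides swapped, and a full join keeps unmatched rows from both sides.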
- refresh()
Refresh the local cache of the DataFrame.
After a dataframe has been displayed, its content is cached locally. Modifications made to the data in the database after the last caching are not reflected locally until refresh() is called.
The local cache is used to iterate over the DataFrame instance locally.
- Returns
self
- Return type
Example
>>> cursor.execute("DROP TABLE IF EXISTS t_refresh;")
>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_refresh", column_names=["num"], temp=False).order_by("num")[:]
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> cursor.execute("INSERT INTO t_refresh(num) VALUES (5);")
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> df.refresh()
-----
 num
-----
   0
   1
   2
   3
   4
   5
-----
(6 rows)
>>> cursor.execute("DROP TABLE t_refresh;")
Note
cursor is a predefined Psycopg cursor that connects to the same database in another session with auto-commit enabled.
- save_as(table_name=None, column_names=[], temp=False, storage_params={}, drop_if_exists=False, drop_cascade=False, schema=None, distribution_type=None, distribution_key=None)
Save the GreenplumPython DataFrame as a table in the database, and return a new instance of DataFrame that represents the newly saved table.
After running this function, if temp is False, you can also use create_dataframe(table_name) to create a new DataFrame next time.
- Parameters
table_name (Optional[str]) – name of the table in the database; required to be unique in the schema.
temp (bool) – whether the table is temporary. Temp tables will be dropped after the database connection is closed.
column_names (List[str]) – list of column names.
storage_params (dict[str, Any]) – storage parameters of gpdb; for reference, see https://docs.vmware.com/en/VMware-Tanzu-Greenplum/7/greenplum-database/GUID-ref_guide-sql_commands-CREATE_TABLE_AS.html
schema (Optional[str]) – schema of the table, for avoiding name conflicts.
distribution_type (Optional[Literal[None, 'randomly', 'replicated', 'hash']]) – type of distribution.
distribution_key (Optional[Set[str]]) – distribution key.
drop_if_exists (bool) – whether to drop the table if it already exists.
drop_cascade (bool) – whether to drop the table with CASCADE.
- Returns
DataFrame representing the newly saved table.
- Return type
Example
>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_saved", column_names=["num"], temp=True)
>>> df.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> t_saved = db.create_dataframe(table_name="t_saved")
>>> t_saved.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
- create_index(columns, method='btree', name=None)
Create an index on the current dataframe for fast searching.
The current dataframe is required to be saved before creating an index.
- Parameters
columns (Union[Set[str], Dict[str, str]]) – Set of columns of the current dataframe to create the index on. Optionally, an operator class can be specified for each column by passing a Dict with column names as keys and their operator class names as values.
method (str) – name of the index access method.
name (Optional[str]) – name of the index.
- Returns
Dataframe with the key columns indexed.
- Return type
- group_by(*column_names)
Group the current GreenplumPython DataFrame by column_names.
- distinct_on(*column_names)
Deduplicate the current DataFrame with respect to the given columns.
This function follows the DISTINCT ON syntax in PostgreSQL.
- Parameters
column_names (str) – names of the current DataFrame's columns.
- Returns
the DataFrame containing only the distinct values of the given columns.
- Return type
Example
>>> students = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
>>> student.distinct_on("age")[['age']]
-----
 age
-----
  18
  19
-----
(2 rows)
Note
Since both "bob" and "carol" have the same age 19, student.distinct_on("age") will pick one of them arbitrarily for the name column. Use [['age']] to make sure the result is stable.
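The DISTINCT ON behavior, keeping one arbitrary row per distinct key, can be sketched in plain Python. This is a conceptual illustration only; here dict insertion order makes the kept row deterministic, which the database does not guarantee:

```python
rows = [("alice", 18), ("bob", 19), ("carol", 19)]

# Keep the first row seen for each distinct age. In SQL, *which*
# row survives per key is unspecified unless an ORDER BY pins it down.
kept = {}
for name, age in rows:
    kept.setdefault(age, (name, age))

print(list(kept.values()))  # one row per distinct age
```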
- property unique_key: List[str]
Return the unique key of the dataframe.
- check_unique(columns)
Check whether a given set of columns, i.e. a key, is unique.
- Parameters
columns (set[str]) – set of column names to be checked.
- Returns
self, with the key checked.
- Return type
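Conceptually, a set of columns is a unique key when no two rows share the same combination of values for those columns. A plain-Python sketch makes this concrete (an illustration only, not the database-side check):

```python
rows = [{"name": "alice", "age": 18}, {"name": "bob", "age": 19},
        {"name": "carol", "age": 19}]

def is_unique(rows, columns):
    # The key is unique iff no value combination repeats.
    keys = [tuple(row[c] for c in sorted(columns)) for row in rows]
    return len(keys) == len(set(keys))

print(is_unique(rows, {"name"}))  # True: all names are distinct
print(is_unique(rows, {"age"}))   # False: two rows share age 19
```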
- classmethod from_table(table_name, db, schema=None)
Return a DataFrame which represents the given table in the Database.
- Parameters
table_name (str) – table name.
db (Database) – database of the table.
schema (Optional[str]) – schema of the table in the database.
- Return type
Example
df = gp.DataFrame.from_table("pg_class", db=db)
- classmethod from_rows(rows, db, column_names=None)
Return a DataFrame using a given list of values.
- Parameters
rows (Iterable[Union[Tuple[Any], Dict[str, Any]]]) –
Iterable[Tuple[Any]]: a list of row values.
Iterable[Dict[str, Any]]: a list of key-value pairs determining the columns and rows. The column names are decided by the keys of the first dictionary element if column_names is not specified.
db (Database) – Database which will be associated with the DataFrame.
column_names (Optional[List[str]]) – Iterable[str]: list of given column names.
- Returns
DataFrame generated from the given values.
- Return type
Example
>>> rows = [(1,), (2,), (3,)]
>>> df = gp.DataFrame.from_rows(rows, db=db, column_names=["id"])
>>> df
----
 id
----
  1
  2
  3
----
(3 rows)
>>> dict_list = [{"id": 1, "val": "11"}, {"id": 2, "val": "22"}]
>>> df = gp.DataFrame.from_rows(dict_list, db=db)
>>> df
----------
 id | val
----+-----
  1 | 11
  2 | 22
----------
(2 rows)
- classmethod from_columns(columns, db)
Return a DataFrame using the given lists of column values.
- Parameters
columns – a dict whose keys are column names and whose values are lists of column values.
db – database which will be associated with the DataFrame.
- Returns
the DataFrame generated with the given values.
- Return type
Example
>>> columns = {"a": [1, 2, 3], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> t
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)
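The relationship between the column-oriented input of from_columns and the row-oriented input of from_rows is a simple transpose, sketched here in plain Python (an illustration only, not GreenplumPython internals):

```python
columns = {"a": [1, 2, 3], "b": [1, 2, 3]}

# Transpose the dict of columns into the list of row tuples that
# from_rows would accept (all columns must have equal length).
column_names = list(columns)
rows = list(zip(*columns.values()))

print(column_names)  # ['a', 'b']
print(rows)          # [(1, 1), (2, 2), (3, 3)]
```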
- embedding()
Enable embedding-based similarity search on columns of the current DataFrame.
Example
See Generating, Indexing and Searching Embeddings (Experimental) for more details.
Warning
This function is currently experimental and the interface is subject to change.
- Return type
- classmethod from_files(files, parser, db)
Create a DataFrame with data read from files.
- Parameters
files (list[str]) – list of file paths. Each path ends with the path of the same file on the client, without links resolved.
parser (NormalFunction) – a UDF that parses the given files on the server. The UDF is required to take the file path as its only argument and to return a set of parsed records in the returned DataFrame.
db (Database) – Database that the DataFrame is to be created in.
- Returns
DataFrame containing the parsed data from the given files.
- Return type
Warning
This function is currently experimental and the interface is subject to change.