DataFrame
DataFrame is the core data structure in GreenplumPython. Conceptually, a DataFrame is a two-dimensional structure containing data.

In the data science world, a DataFrame in GreenplumPython, referred to as gp.DataFrame, is similar to a DataFrame in pandas, except that:

- Data in a gp.DataFrame is evaluated lazily rather than eagerly. That is, it is computed only when it is observed. This can improve efficiency in many cases.
- Data in a gp.DataFrame is located and manipulated on a remote database system rather than locally. As a consequence:
  - Retrieving it from the database system can be expensive. Therefore, once the data of a gp.DataFrame is fetched from the database system, it is cached locally for later use.
  - It might be modified concurrently by other users of the database system. You might need to use refresh() to sync the updates if the data becomes stale.

In the database world, a gp.DataFrame is similar to a materialized view in a database system in that:

- Both result from a possibly complex query.
- Both hold data, as opposed to views.
- The data can become stale due to concurrent modification, and the refresh() method is similar to the REFRESH MATERIALIZED VIEW command in PostgreSQL for syncing updates.
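The lazy-evaluation and caching behavior described above can be illustrated with a plain-Python sketch. This is a conceptual analogy only, not GreenplumPython's implementation; the class and method names are made up:

```python
class LazyFrame:
    """Toy model of a lazily evaluated, locally cached dataframe."""

    def __init__(self, fetch):
        self._fetch = fetch   # how to compute/retrieve the data
        self._cache = None    # nothing is computed yet

    def rows(self):
        # Data is computed only when observed, then cached locally.
        if self._cache is None:
            self._cache = self._fetch()
        return self._cache

    def refresh(self):
        # Drop the stale cache so the next observation re-fetches.
        self._cache = None
        return self


source = [1, 2, 3]
df = LazyFrame(lambda: list(source))

print(df.rows())            # first observation triggers the fetch: [1, 2, 3]
source.append(4)            # concurrent modification elsewhere
print(df.rows())            # still the cached (stale) data: [1, 2, 3]
print(df.refresh().rows())  # refresh() syncs the update: [1, 2, 3, 4]
```

In gp.DataFrame the "fetch" is a SQL query sent to the remote database, and observation happens when the dataframe is displayed or iterated.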
- class dataframe.DataFrame
Bases:
object
Representation of a GreenplumPython DataFrame object.
- property is_saved: bool
Check whether the current dataframe is saved in the database.
- __getitem__(_) DataFrame
- __getitem__(column_names: List[str]) DataFrame
- __getitem__(predicate: Callable[[DataFrame], Expr]) DataFrame
- __getitem__(column_name: str) Expr
- __getitem__(rows: slice) DataFrame
Select parts of the DataFrame.

When you want to use a Column for computation rather than for observing data:
Args: key: string
Returns: a Column of the current DataFrame

id_col = dataframe["id"]

When you want to retrieve some columns of the DataFrame:
Args: key: list of column names
Returns: a new DataFrame with the subset of columns, a.k.a. targets

id_dataframe = dataframe[["id"]]

When you want to filter the DataFrame on Columns with conditions:
Args: key: Expr
Returns: a new DataFrame with the subset of rows for which the Expr holds

id_cond_dataframe = dataframe[lambda t: t["id"] == 0]

When you want to retrieve a portion of the DataFrame:
Args: key: slice
Returns: a new DataFrame with the portion of consecutive rows

slice_dataframe = dataframe[2:5]
- where(predicate)
Filter the DataFrame by applying the predicate.
- Parameters
predicate (Callable[[DataFrame], Expr]) – Expr: the where-condition statement.
- Returns
DataFrame filtered according to the Expr that is passed in as argument.
- Return type
Example
>>> rows = [(i,) for i in range(-10, 10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> series.where(lambda df: df["id"] > 0)
----
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
----
(9 rows)
- apply(func, expand=False, column_name=None)
Apply a dataframe function to the self DataFrame.
- Parameters
func (Callable[[DataFrame], FunctionExpr]) – A Python function that takes the self DataFrame as the only parameter, and returns the result of a dataframe function, which can be a NormalFunction, an AggregateFunction or a ColumnFunction.
expand (bool) – whether to expand the multi-valued result into columns of the resulting DataFrame.
column_name (Optional[str]) – name of the column of the return value in the resulting DataFrame.
- Returns
A DataFrame of the returned values of the function.
- Return type
Example
To compute the absolute value of a series of numbers:
>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> result = series.apply(lambda df: abs(df["id"]))
>>> result
-----
 abs
-----
  10
   9
   8
   7
   6
   5
   4
   3
   2
   1
-----
(10 rows)
To transform columns into other types, see the following example. Suppose the function label takes a str and an int, joins them into a string, and returns it:
>>> rows = [(i,) for i in range(10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> @gp.create_function
... def label(prefix: str, id: int) -> str:
...     prefix = "id"
...     return f"{prefix}_{id}"
>>> result = series.apply(lambda t: label("label", t["id"]),
...                       column_name="label")
>>> result
-------
 label
-------
 id_0
 id_1
 id_2
 id_3
 id_4
 id_5
 id_6
 id_7
 id_8
 id_9
-------
(10 rows)
- assign(**new_columns)
Add new columns to the current DataFrame. Existing columns cannot be reassigned.
- Parameters
new_columns (Callable[[DataFrame], Any]) – a dict whose keys are column names and whose values are Callables that return the column data when applied to the current DataFrame.
- Returns
a new DataFrame including the newly assigned columns.
- Return type
Example
>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> results = series.assign(abs=lambda nums: abs(nums["id"]))
>>> results
-----------
  id | abs
-----+-----
 -10 |  10
  -9 |   9
  -8 |   8
  -7 |   7
  -6 |   6
  -5 |   5
  -4 |   4
  -3 |   3
  -2 |   2
  -1 |   1
-----------
(10 rows)
- order_by(column_name, ascending=None, nulls_first=None, operator=None)
Sort the DataFrame based on the configuration.
- Parameters
column_name (str) – name of the column to order the dataframe by.
ascending (Optional[bool]) – Define the order direction: True = ASC / False = DESC.
nulls_first (Optional[bool]) – Define whether nulls are ordered first or last: True = First / False = Last.
operator (Optional[str]) – Define the ordering using an operator. Cannot be combined with ascending.
- Returns
Specification of the ordering of the current DataFrame.
- Return type
Example
>>> columns = {"id": [3, 1, 2], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> result = t.order_by("id")[:]
>>> result
--------
 id | b
----+---
  1 | 2
  2 | 3
  3 | 1
--------
(3 rows)
- join(other, how='', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Join the current DataFrame with another using the given arguments.
- Parameters
other (DataFrame) – the DataFrame to join with.
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) – How the two DataFrames are joined. The value can be one of:
"INNER": inner join,
"LEFT": left outer join,
"RIGHT": right outer join,
"FULL": full outer join, or
"CROSS": cross join, i.e. the Cartesian product.
The default value "" is equivalent to "INNER".
cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) – a Callable (lambda function) used as the join condition.
on (Optional[Union[str, Iterable[str]]]) – a list of column names that exist in both DataFrames to join on. cond and on cannot be used together.
self_columns (Union[Dict[str, Optional[str]], Set[str]]) – a dict whose keys are the column names of the current dataframe to be included in the resulting dataframe. The value, if not None, is used for renaming the corresponding key to avoid name conflicts. The asterisk "*" can be used as a key to indicate all columns.
other_columns (Union[Dict[str, Optional[str]], Set[str]]) – same as self_columns, but for the other DataFrame.
- Return type
Note
When using "*" as a key in self_columns or other_columns, please ensure that there will not be more than one column with the same name by applying proper renaming. Otherwise, an error will occur.
Example
>>> age_rows = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(
...     age_rows, column_names=["name", "age"], db=db)
>>> result = student.join(
...     student,
...     on="age",
...     self_columns={"*"},
...     other_columns={"name": "name_2"})
>>> result
----------------------
 name  | age | name_2
-------+-----+--------
 alice |  18 | alice
 bob   |  19 | carol
 bob   |  19 | bob
 carol |  19 | carol
 carol |  19 | bob
----------------------
(5 rows)
- inner_join(other, *, how='INNER', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Inner-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="INNER".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- left_join(other, *, how='LEFT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Left-outer-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="LEFT".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- right_join(other, *, how='RIGHT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Right-outer-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="RIGHT".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- full_join(other, *, how='FULL', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Full-outer-joins the current DataFrame with another DataFrame.
Equivalent to calling join() with how="FULL".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- cross_join(other, *, how='CROSS', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Cross-joins the current DataFrame with another DataFrame, i.e. computes the Cartesian product.
Equivalent to calling join() with how="CROSS".
- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
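The different join variants map to the usual SQL join semantics. As a rough illustration of what inner, left, and cross joins produce on two row sets, consider this plain-Python sketch (a conceptual analogy only, not GreenplumPython code):

```python
left = [("alice", 18), ("bob", 19)]   # (name, age)
right = [(18, "cs"), (18, "math")]    # (age, dept)

# INNER: only the row combinations whose ages match on both sides.
inner = [(n, a, d) for (n, a) in left for (a2, d) in right if a == a2]

# LEFT: every left row is kept; unmatched rows are padded with None (SQL NULL).
left_outer = []
for (n, a) in left:
    matches = [(n, a, d) for (a2, d) in right if a == a2]
    left_outer.extend(matches if matches else [(n, a, None)])

# CROSS: the Cartesian product, ignoring any condition.
cross = [(n, a, a2, d) for (n, a) in left for (a2, d) in right]

print(inner)       # alice matched twice; bob dropped
print(left_outer)  # bob kept, padded with None
print(len(cross))  # 2 * 2 = 4 rows
```

A right join mirrors the left join with the roles of the two sides swapped, and a full join keeps unmatched rows from both sides.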
- refresh()
Refresh the local cache of the DataFrame.
After a dataframe has been displayed, its content is cached locally. Modifications made to the data in the database after the last caching are not reflected locally until refresh() is called.
The local cache is used to iterate over the DataFrame instance locally.
- Returns
self
- Return type
Example
>>> cursor.execute("DROP TABLE IF EXISTS t_refresh;")
>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_refresh", column_names=["num"], temp=False).order_by("num")[:]
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> cursor.execute("INSERT INTO t_refresh(num) VALUES (5);")
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> df.refresh()
-----
 num
-----
   0
   1
   2
   3
   4
   5
-----
(6 rows)
>>> cursor.execute("DROP TABLE t_refresh;")
Note
cursor is a predefined Psycopg cursor that connects to the same database in another session with auto-commit enabled.
- save_as(table_name=None, column_names=[], temp=False, storage_params={}, drop_if_exists=False, drop_cascade=False, schema=None, distribution_type=None, distribution_key=None)
Save the GreenplumPython DataFrame as a table in the database, and return a new instance of DataFrame that represents the newly saved table.
After running this function, if temp is False, you can also use create_dataframe(table_name) to create a new DataFrame next time.
- Parameters
table_name (Optional[str]) – name of the table in the database; required to be unique in the schema.
temp (bool) – whether the table is temporary. Temp tables will be dropped after the database connection is closed.
column_names (List[str]) – list of column names.
storage_params (dict[str, Any]) – storage parameters of gpdb; for reference, see https://docs.vmware.com/en/VMware-Tanzu-Greenplum/7/greenplum-database/GUID-ref_guide-sql_commands-CREATE_TABLE_AS.html
schema (Optional[str]) – schema of the table, for avoiding name conflicts.
distribution_type (Optional[Literal[None, 'randomly', 'replicated', 'hash']]) – type of distribution.
distribution_key (Optional[Set[str]]) – distribution key.
drop_if_exists (bool) – whether to drop the table if it already exists.
drop_cascade (bool) – whether to drop the table with CASCADE.
- Returns
DataFrame representing the newly saved table.
- Return type
Example
>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_saved", column_names=["num"], temp=True)
>>> df.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> t_saved = db.create_dataframe(table_name="t_saved")
>>> t_saved.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
- create_index(columns, method='btree', name=None)
Create an index on the current dataframe for fast searching.
The current dataframe is required to be saved before creating an index.
- Parameters
columns (Union[Set[str], Dict[str, str]]) – Set of columns of the current dataframe to create the index on. Optionally, an operator class can be specified for each column by passing a Dict with column names as keys and their operator class names as values.
method (str) – name of the index access method.
name (Optional[str]) – name of the index.
- Returns
Dataframe with the key columns indexed.
- Return type
- group_by(*column_names)
Group the current GreenplumPython DataFrame by column_names.
- distinct_on(*column_names)
Deduplicate the current DataFrame with respect to the given columns.
This function follows the DISTINCT ON syntax in PostgreSQL.
- Parameters
column_names (str) – names of the current DataFrame's columns.
- Returns
the DataFrame containing only the distinct values of the given columns.
- Return type
Example
>>> students = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
>>> student.distinct_on("age")[['age']]
-----
 age
-----
  18
  19
-----
(2 rows)
Note
Since both "bob" and "carol" have the same age 19, student.distinct_on("age") will pick one of them arbitrarily for the name column. Use [['age']] to make sure the result is stable.
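The DISTINCT ON behavior, keeping one arbitrary row per distinct key, can be sketched in plain Python. This is a conceptual illustration only; here dict insertion order makes the kept row deterministic, which the database does not guarantee:

```python
rows = [("alice", 18), ("bob", 19), ("carol", 19)]

# Keep the first row seen for each distinct age. In SQL, *which*
# row survives per key is unspecified unless an ORDER BY pins it down.
kept = {}
for name, age in rows:
    kept.setdefault(age, (name, age))

print(list(kept.values()))  # one row per distinct age
```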
- property unique_key: List[str]
Return the unique key of the dataframe.
- check_unique(columns)
Check whether a given set of columns, i.e. a key, is unique.
- Parameters
columns (set[str]) – set of column names to be checked.
- Returns
self, with the key checked.
- Return type
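Conceptually, a set of columns is a unique key when no two rows share the same combination of values for those columns. A plain-Python sketch makes this concrete (an illustration only, not the database-side check):

```python
rows = [{"name": "alice", "age": 18}, {"name": "bob", "age": 19},
        {"name": "carol", "age": 19}]

def is_unique(rows, columns):
    # The key is unique iff no value combination repeats.
    keys = [tuple(row[c] for c in sorted(columns)) for row in rows]
    return len(keys) == len(set(keys))

print(is_unique(rows, {"name"}))  # True: all names are distinct
print(is_unique(rows, {"age"}))   # False: two rows share age 19
```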
- classmethod from_table(table_name, db, schema=None)
Return a DataFrame which represents the given table in the Database.
- Parameters
table_name (str) – table name.
db (Database) – database of the table.
schema (Optional[str]) – schema of the table in the database.
- Return type
Example
df = gp.DataFrame.from_table("pg_class", db=db)
- classmethod from_rows(rows, db, column_names=None)
Return a DataFrame using a given list of values.
- Parameters
rows (Iterable[Union[Tuple[Any], Dict[str, Any]]]) –
Iterable[Tuple[Any]]: a list of row values.
Iterable[Dict[str, Any]]: a list of key-value pairs determining the columns and rows. The column names are decided by the keys of the first dictionary element if column_names is not specified.
db (Database) – Database which will be associated with the DataFrame.
column_names (Optional[List[str]]) – Iterable[str]: list of given column names.
- Returns
DataFrame generated from the given values.
- Return type
Example
>>> rows = [(1,), (2,), (3,)]
>>> df = gp.DataFrame.from_rows(rows, db=db, column_names=["id"])
>>> df
----
 id
----
  1
  2
  3
----
(3 rows)
>>> dict_list = [{"id": 1, "val": "11"}, {"id": 2, "val": "22"}]
>>> df = gp.DataFrame.from_rows(dict_list, db=db)
>>> df
----------
 id | val
----+-----
  1 | 11
  2 | 22
----------
(2 rows)
- classmethod from_columns(columns, db)
Return a DataFrame using the given lists of column values.
- Parameters
columns – a dict whose keys are column names and whose values are lists of column values.
db – database which will be associated with the DataFrame.
- Returns
the DataFrame generated with the given values.
- Return type
Example
>>> columns = {"a": [1, 2, 3], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> t
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)
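The relationship between the column-oriented input of from_columns and the row-oriented input of from_rows is a simple transpose, sketched here in plain Python (an illustration only, not GreenplumPython internals):

```python
columns = {"a": [1, 2, 3], "b": [1, 2, 3]}

# Transpose the dict of columns into the list of row tuples that
# from_rows would accept (all columns must have equal length).
column_names = list(columns)
rows = list(zip(*columns.values()))

print(column_names)  # ['a', 'b']
print(rows)          # [(1, 1), (2, 2), (3, 3)]
```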
- embedding()
Enable embedding-based similarity search on columns of the current DataFrame.
Example
See Generating, Indexing and Searching Embeddings (Experimental) for more details.
Warning
This function is currently experimental and the interface is subject to change.
- Return type
- classmethod from_files(files, parser, db)
Create a DataFrame with data read from files.
- Parameters
files (list[str]) – list of file paths. Each path ends with the path of the same file on the client, without links resolved.
parser (NormalFunction) – a UDF that parses the given files on the server. The UDF is required to take the file path as its only argument and to return a set of parsed records in the returned DataFrame.
db (Database) – Database that the DataFrame is to be created in.
- Returns
DataFrame containing the parsed data from the given files.
- Return type
Warning
This function is currently experimental and the interface is subject to change.