DataFrame

DataFrame is the core data structure in GreenplumPython.

Conceptually, a DataFrame is a two-dimensional structure containing data.

In the data science world, a DataFrame in GreenplumPython, referred to as gp.DataFrame, is similar to a DataFrame in pandas, except that:

  • Data in a gp.DataFrame is evaluated lazily rather than eagerly. That is, it is computed only when it is observed. This can improve efficiency in many cases.

  • Data in a gp.DataFrame is located and manipulated on a remote database system rather than locally. As a consequence,

    • Retrieving it from the database system can be expensive. Therefore, once the data of a gp.DataFrame is fetched from the database system, it is cached locally for later use.

    • The data might be modified concurrently by other users of the database system. You might need to call refresh() to sync the updates if the data becomes stale.

In the database world, a gp.DataFrame is similar to a materialized view in a database system in that:

  • They both result from a possibly complex query.

  • They both hold data, as opposed to (non-materialized) views, which store only their defining queries.

  • The data can become stale due to concurrent modification, and the refresh() method is similar to the REFRESH MATERIALIZED VIEW command in PostgreSQL for syncing updates.
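
As a minimal sketch of this behavior (assuming, as in the examples below, that db is an existing gp.Database and that displaying a DataFrame is what triggers evaluation):

nums = db.create_dataframe(rows=[(i,) for i in range(10)], column_names=["num"])
large = nums.where(lambda t: t["num"] >= 5)  # builds a query; nothing is executed yet
print(large)                                 # observing the DataFrame runs the query and caches the rows locally
large.refresh()                              # re-fetch if the data may have been modified concurrently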

class dataframe.DataFrame

Bases: object

Representation of GreenplumPython DataFrame object.

property is_saved: bool

Check whether the current dataframe is saved as a table in the database.
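
A sketch of how this property can be used (it assumes that a DataFrame created from local rows is not table-backed until save_as() is called):

nums = db.create_dataframe(rows=[(i,) for i in range(3)], column_names=["num"])
assert not nums.is_saved                                        # assumed: derived from a query, not stored as a table
saved = nums.save_as("t_is_saved", column_names=["num"], temp=True)
assert saved.is_saved                                           # assumed: now backed by a (temporary) table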

__getitem__(_) DataFrame
__getitem__(column_names: List[str]) DataFrame
__getitem__(predicate: Callable[[DataFrame], Expr]) DataFrame
__getitem__(column_name: str) Expr
__getitem__(rows: slice) DataFrame

Select parts of the DataFrame.

Returns: Column or DataFrame

  • Returns: a Column of the current DataFrame

    When you want to use a Column for computation rather than for observing data:

    Args: key: string (a column name)

id_col = dataframe["id"]
  • Returns: a new DataFrame derived from the current DataFrame, depending on the type of key:

    • When you want to retrieve a subset of columns of the DataFrame:

      Args: key: list of column names

      Returns: DataFrame with the subset of columns, a.k.a. targets

    id_dataframe = dataframe[["id"]]

    • When you want to filter the rows of the DataFrame with a predicate:

      Args: key: Callable that takes the DataFrame and returns a boolean Expr

      Returns: DataFrame containing only the rows for which the predicate holds

    id_cond_dataframe = dataframe[lambda t: t["id"] == 0]

    • When you want to retrieve a consecutive portion of the DataFrame:

      Args: key: slice

      Returns: DataFrame with the portion of consecutive rows

    slice_dataframe = dataframe[2:5]
    
where(predicate)

Filter the DataFrame by applying the predicate.

Return the DataFrame filtered by Expr.

Parameters

predicate (Callable[[DataFrame], Expr]) – a lambda function that returns the filtering condition as an Expr, analogous to a SQL WHERE clause.

Returns

DataFrame containing only the rows for which the predicate holds.

Return type

DataFrame

Example

>>> rows = [(i,) for i in range(-10, 10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> series.where(lambda df: df["id"] > 0)
----
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
----
(9 rows)
apply(func, expand=False, column_name=None)

Apply a dataframe function to the current DataFrame.

Parameters
  • func (Callable[[DataFrame], FunctionExpr]) – the function to be applied to the current DataFrame; it takes the DataFrame and returns the expression to be computed, typically a call to a database function on one or more columns.

  • expand (bool) – whether to expand a composite return value of the function into multiple columns of the resulting DataFrame.

  • column_name (Optional[str]) – name of the result column in the returned DataFrame.

Returns

A DataFrame of returned values of the function.

Return type

DataFrame

Example

To compute the absolute value of a series of numbers:

>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> result = series.apply(lambda df: abs(df["id"]))
>>> result
-----
 abs
-----
  10
  9
  8
  7
  6
  5
  4
  3
  2
  1
-----
(10 rows)

To transform columns into other types, see the following example. Suppose the function label takes a str and an int, joins them into a string, and returns it:

>>> rows = [(i,) for i in range(10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> @gp.create_function
... def label(prefix: str, id: int) -> str:
...     prefix = "id"  # note: the passed-in prefix is overridden here, so every label starts with "id"
...     return f"{prefix}_{id}"

>>> result = series.apply(lambda t: label("label", t["id"]),
...                       column_name = "label")
>>> result
-------
 label
-------
 id_0
 id_1
 id_2
 id_3
 id_4
 id_5
 id_6
 id_7
 id_8
 id_9
-------
(10 rows)
assign(**new_columns)

Add new columns to the current DataFrame. Existing columns cannot be reassigned.

Parameters

new_columns (Callable[[DataFrame], Any]) – a dict whose keys are column names and whose values are Callables that return the column data when applied to the current DataFrame.

Returns

a new DataFrame including the newly assigned columns.

Return type

DataFrame

Example

>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> results = series.assign(abs=lambda nums: abs(nums["id"]))
>>> results
-----------
 id  | abs
-----+-----
 -10 |  10
 -9  |   9
 -8  |   8
 -7  |   7
 -6  |   6
 -5  |   5
 -4  |   4
 -3  |   3
 -2  |   2
 -1  |   1
-----------
(10 rows)
order_by(column_name, ascending=None, nulls_first=None, operator=None)

Sort DataFrame based on the configuration.

Parameters
  • column_name (str) – name of the column to order the dataframe by.

  • ascending (Optional[bool]) – whether to sort in ascending order: True = ASC / False = DESC.

  • nulls_first (Optional[bool]) – whether NULLs are ordered first or last: True = first / False = last.

  • operator (Optional[str]) – sort using the given comparison operator instead of ASC/DESC. Cannot be combined with ascending.

Returns

Specification on ordering of the current DataFrame.

Return type

DataFrameOrdering

Example

>>> columns = {"id": [3, 1, 2], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> result = t.order_by("id")[:]
>>> result
--------
 id | b
----+---
  1 | 2
  2 | 3
  3 | 1
--------
(3 rows)
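
The operator parameter sorts with a comparison operator instead of ASC/DESC; a sketch reusing t from the example above (assuming the standard ">" operator gives a descending order):

result = t.order_by("id", operator=">")[:]  # descending order via the ">" operator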
join(other, how='', cond=None, on=None, self_columns={'*'}, other_columns={'*'})

Join the current DataFrame with another using the given arguments.

Parameters
  • other (DataFrame) – DataFrame to join with

  • how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –

    How the two DataFrames are joined. The value can be one of:

    • "INNER": inner join,

    • "LEFT": left outer join,

    • "RIGHT": right outer join,

    • "FULL": full outer join, or

    • "CROSS": cross join, i.e. the Cartesian product

    The default value "" is equivalent to "INNER".

  • cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) – a lambda function that takes the two DataFrames and returns the join condition as an Expr.

  • on (Optional[Union[str, Iterable[str]]]) – one or more column names that exist in both DataFrames to join on. cond and on cannot be used together.

  • self_columns (Union[Dict[str, Optional[str]], Set[str]]) – A dict whose keys are the column names of the current dataframe to be included in the resulting dataframe. The value, if not None, is used for renaming the corresponding key to avoid name conflicts. Asterisk "*" can be used as a key to indicate all columns.

  • other_columns (Union[Dict[str, Optional[str]], Set[str]]) – Same as self_columns, but for the other DataFrame.

Return type

DataFrame

Note

When using "*" as key in self_columns or other_columns, please ensure that there will not be more than one column with the same name by applying proper renaming. Otherwise, there will be an error.

Example

>>> age_rows = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(
...     age_rows, column_names=["name", "age"], db=db)
>>> result = student.join(
...     student,
...     on="age",
...     self_columns={"*"},
...     other_columns={"name": "name_2"})
>>> result
----------------------
 name  | age | name_2
-------+-----+--------
 alice |  18 | alice
 bob   |  19 | carol
 bob   |  19 | bob
 carol |  19 | carol
 carol |  19 | bob
----------------------
(5 rows)
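
When the join condition is more complex than column equality, cond can be used instead of on. A sketch reusing the student DataFrame above (the column renames are illustrative and keep only one column from each side to avoid name conflicts):

pairs = student.join(
    student,
    cond=lambda s, o: s["age"] < o["age"],  # an arbitrary boolean Expr over both DataFrames
    self_columns={"name": "younger"},
    other_columns={"name": "older"},
)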
inner_join(other, *, how='INNER', cond=None, on=None, self_columns={'*'}, other_columns={'*'})

Inner joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="INNER".

Parameters
  • other (DataFrame) –

  • how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –

  • cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) –

  • on (Optional[Union[str, Iterable[str]]]) –

  • self_columns (Union[Dict[str, Optional[str]], Set[str]]) –

  • other_columns (Union[Dict[str, Optional[str]], Set[str]]) –

Return type

DataFrame

left_join(other, *, how='LEFT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})

Left-outer joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="LEFT".

Parameters
  • other (DataFrame) –

  • how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –

  • cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) –

  • on (Optional[Union[str, Iterable[str]]]) –

  • self_columns (Union[Dict[str, Optional[str]], Set[str]]) –

  • other_columns (Union[Dict[str, Optional[str]], Set[str]]) –

Return type

DataFrame
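
For example, a sketch with two small illustrative DataFrames; rows of the left DataFrame without a match are kept, with NULLs for the right side's columns:

ages = db.create_dataframe(rows=[("alice", 18), ("bob", 19)], column_names=["name", "age"])
cities = db.create_dataframe(rows=[("alice", "Paris")], column_names=["name", "city"])
result = ages.left_join(
    cities,
    on="name",
    other_columns={"city"},  # keep only "city" from the right side to avoid a duplicate "name" column
)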

right_join(other, *, how='RIGHT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})

Right-outer joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="RIGHT".

Parameters
  • other (DataFrame) –

  • how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –

  • cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) –

  • on (Optional[Union[str, Iterable[str]]]) –

  • self_columns (Union[Dict[str, Optional[str]], Set[str]]) –

  • other_columns (Union[Dict[str, Optional[str]], Set[str]]) –

Return type

DataFrame

full_join(other, *, how='FULL', cond=None, on=None, self_columns={'*'}, other_columns={'*'})

Full-outer joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="FULL".

Parameters
  • other (DataFrame) –

  • how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –

  • cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) –

  • on (Optional[Union[str, Iterable[str]]]) –

  • self_columns (Union[Dict[str, Optional[str]], Set[str]]) –

  • other_columns (Union[Dict[str, Optional[str]], Set[str]]) –

Return type

DataFrame

cross_join(other, *, how='CROSS', cond=None, on=None, self_columns={'*'}, other_columns={'*'})

Cross joins the current DataFrame with another DataFrame, i.e. the Cartesian product.

Equivalent to calling join() with how="CROSS".

Parameters
  • other (DataFrame) –

  • how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –

  • cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) –

  • on (Optional[Union[str, Iterable[str]]]) –

  • self_columns (Union[Dict[str, Optional[str]], Set[str]]) –

  • other_columns (Union[Dict[str, Optional[str]], Set[str]]) –

Return type

DataFrame
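
A sketch producing all combinations of two small illustrative DataFrames (no renaming is needed since the column names do not overlap):

sizes = db.create_dataframe(rows=[("S",), ("M",), ("L",)], column_names=["size"])
colors = db.create_dataframe(rows=[("red",), ("blue",)], column_names=["color"])
combos = sizes.cross_join(colors)  # Cartesian product: 3 x 2 = 6 rows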

refresh()

Refresh the local cache of the DataFrame.

After a DataFrame is displayed, its contents are cached locally. Modifications made in the database after the cache was populated are not reflected locally until the next refresh.

The local cache is used to iterate over the DataFrame instance locally.

Returns

self

Return type

DataFrame

Example

>>> cursor.execute("DROP TABLE IF EXISTS t_refresh;")
>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_refresh", column_names=["num"], temp=False).order_by("num")[:]
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> cursor.execute("INSERT INTO t_refresh(num) VALUES (5);")
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> df.refresh()
-----
 num
-----
   0
   1
   2
   3
   4
   5
-----
(6 rows)
>>> cursor.execute("DROP TABLE t_refresh;")

Note

cursor is a predefined Psycopg Cursor which connects to the same database in another session with auto-commit enabled.

save_as(table_name=None, column_names=[], temp=False, storage_params={}, drop_if_exists=False, drop_cascade=False, schema=None, distribution_type=None, distribution_key=None)

Save the GreenplumPython DataFrame as a table in the database.

Returns a new DataFrame instance that represents the newly saved table.

After running this function, if temp is False, you can use db.create_dataframe(table_name=...) to recreate the DataFrame from the saved table later.

Parameters
  • table_name (Optional[str]) – name of table in database, required to be unique in the schema.

  • temp (bool) – whether table is temporary. Temp tables will be dropped after the database connection is closed.

  • column_names (List[str]) – list of column names

  • storage_params (dict[str, Any]) – storage parameters of Greenplum Database; see https://docs.vmware.com/en/VMware-Tanzu-Greenplum/7/greenplum-database/GUID-ref_guide-sql_commands-CREATE_TABLE_AS.html for reference.

  • schema (Optional[str]) – schema of the table for avoiding name conflicts.

  • distribution_type (Optional[Literal[None, 'randomly', 'replicated', 'hash']]) – the distribution policy of the table.

  • distribution_key (Optional[Set[str]]) – set of columns to distribute the table by.

  • drop_if_exists (bool) – whether to drop the table first if it already exists.

  • drop_cascade (bool) – whether to also drop dependent objects (CASCADE) when dropping the existing table.

Returns

DataFrame represents the newly saved table

Return type

DataFrame

Example

>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_saved", column_names=["num"], temp=True)
>>> df.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> t_saved = db.create_dataframe(table_name="t_saved")
>>> t_saved.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
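
The distribution-related parameters can be specified as well; a sketch reusing nums from the example above (the table name and the hash distribution here are purely illustrative):

df = nums.save_as(
    "t_saved_hash",
    column_names=["num"],
    temp=True,
    distribution_type="hash",
    distribution_key={"num"},
    drop_if_exists=True,
)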
create_index(columns, method='btree', name=None)

Create an index for the current dataframe for fast searching.

The current dataframe is required to be saved before creating an index.

Parameters
  • columns (Union[Set[str], Dict[str, str]]) – Set of columns of the current dataframe to create index on. Optionally, an operator class can be specified for each column by passing a Dict with column names as keys and their operator class names as values.

  • method (str) – name of the index access method.

  • name (Optional[str]) – name of the index.

Returns

DataFrame with the given columns indexed.

Return type

DataFrame
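
A sketch (it assumes the dataframe has already been saved with save_as(), as required; the table name is illustrative):

nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
t = nums.save_as("t_indexed", column_names=["num"], temp=True)
t = t.create_index(columns={"num"})  # default B-tree index on "num"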

group_by(*column_names)

Group the current GreenplumPython DataFrame by column_names.

Parameters

column_names (str) – one or more column names of the DataFrame.

Returns

a set of groups of the current DataFrame. Each group is identified by a different set of values of the columns in the arguments.

Return type

DataFrameGroupingSet
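
A sketch of aggregating per group (it assumes the count aggregate obtained via gp.aggregate_function() and that the returned DataFrameGroupingSet supports apply() in the same way as DataFrame):

rows = [(i, i % 2 == 0) for i in range(10)]
numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even"])
count = gp.aggregate_function("count")
result = numbers.group_by("is_even").apply(lambda t: count(t["val"]))  # one count per group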

distinct_on(*column_names)

Deduplicate the current DataFrame with respect to the given columns.

This function follows the DISTINCT ON syntax in PostgreSQL.

Parameters

column_names (str) – names of the current DataFrame’s columns.

Returns

the DataFrame containing only the distinct values of the given columns.

Return type

DataFrame

Example

>>> students = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
>>> student.distinct_on("age")[['age']]
-----
 age
-----
  18
  19
-----
(2 rows)

Note

Since both "bob" and "carol" have the same age 19, student.distinct_on("age") will randomly pick one of them for the name column. Use [["age"]] to make sure the result is stable.

property unique_key: List[str]

Return the unique key, i.e. the list of column names that uniquely identify a row.

check_unique(columns)

Check whether a given set of columns, i.e. key, is unique.

Parameters

columns (set[str]) – set of column names to be checked

Returns

self, after the uniqueness check

Return type

DataFrame
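
A sketch combining check_unique() and unique_key (it assumes the column values are indeed unique and saves the dataframe first, which the check may require):

nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
t = nums.save_as("t_keyed", column_names=["num"], temp=True)
t = t.check_unique(columns={"num"})  # verifies uniqueness of "num" and records it as the key
print(t.unique_key)                  # the recorded unique-key columns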

classmethod from_table(table_name, db, schema=None)

Return a DataFrame which represents the given table in the Database.

Parameters
  • table_name (str) – table name

  • db (Database) – database of the table

  • schema (Optional[str]) – schema of table in database

Return type

DataFrame

df = gp.DataFrame.from_table("pg_class", db=db)
classmethod from_rows(rows, db, column_names=None)

Return a DataFrame using a given list of values.

Parameters
  • rows (Iterable[Union[Tuple[Any], Dict[str, Any]]]) –

    • Iterable[Tuple[Any]]: a list of row values.

    • Iterable[Dict[str, Any]]: a list of dictionaries that determine the columns and rows. The column names are taken from the keys of the first dictionary if column_names is not specified.

  • db (Database) – Database: database which will be associated with the DataFrame.

  • column_names (Optional[List[str]]) – Iterable[str]: list of given column names.

Returns

DataFrame generated with given values.

Return type

DataFrame

>>> rows = [(1,), (2,), (3,)]
>>> df = gp.DataFrame.from_rows(rows, db=db, column_names=["id"])
>>> df
 ----
  id
 ----
  1
  2
  3
 ----
 (3 rows)

>>> dict_list = [{"id": 1, "val": "11"}, {"id": 2, "val": "22"}]
>>> df = gp.DataFrame.from_rows(dict_list, db=db)
>>> df
 ----------
  id | val
 ----+-----
   1 | 11
   2 | 22
 ----------
 (2 rows)
classmethod from_columns(columns, db)

Return a DataFrame using the given lists of column values.

Parameters
  • columns (Dict[str, Iterable[Any]]) – a dict mapping each column name to the list of its values.

  • db (Database) – Database: database which will be associated with the DataFrame.

Returns

the DataFrame generated with given values.

Return type

DataFrame

Example

>>> columns = {"a": [1, 2, 3], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> t
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)
embedding()

Enable embedding-based similarity search on columns of the current DataFrame.

Example

See Generating, Indexing and Searching Embeddings (Experimental) for more details.

Warning

This function is currently experimental and the interface is subject to change.

Return type

Embedding

classmethod from_files(files, parser, db)

Create a DataFrame with data read from files.

Parameters
  • files (list[str]) – list of file paths. Each path ends with the path of the same file on client, without links resolved.

  • parser (NormalFunction) – a UDF that parses the given files on the server. The UDF is required to take the file path as its only argument and to return a set of parsed records, which form the resulting DataFrame.

  • db (Database) – Database in which the DataFrame is to be created.

Returns

DataFrame containing the parsed data from the given files.

Return type

DataFrame

Warning

This function is currently experimental and the interface is subject to change.
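
A sketch of the intended workflow (the parser below is purely illustrative; it assumes a UDF returning a list of values per file is accepted as a set-returning parser, and the file name is hypothetical):

from typing import List

@gp.create_function
def parse_names(path: str) -> List[str]:
    # Illustrative parser: one value per line of a JSON-lines file on the server.
    import json

    with open(path) as f:
        return [json.loads(line)["name"] for line in f]

df = gp.DataFrame.from_files(["students.jsonl"], parser=parse_names, db=db)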