DataFrame
DataFrame is the core data structure in GreenplumPython.
Conceptually, a DataFrame is a two-dimensional structure containing data.
In the data science world, a DataFrame in GreenplumPython, referred to as gp.DataFrame, is similar to a DataFrame in pandas, except that:
- Data in a gp.DataFrame is lazily evaluated rather than eagerly. That is, it is computed only when it is observed. This can improve efficiency in many cases.
- Data in a gp.DataFrame is located and manipulated on a remote database system rather than locally. As a consequence:
  - Retrieving it from the database system can be expensive. Therefore, once the data of a gp.DataFrame is fetched from the database system, it is cached locally for later use.
  - It might be modified concurrently by other users of the database system. You might need to use refresh() to sync the updates if the data becomes stale.
In the database world, a gp.DataFrame is similar to a materialized view in a database system
in that:
- They both result from a possibly complex query.
- They both hold data, as opposed to (non-materialized) views.
- The data can become stale due to concurrent modification, and the refresh() method is similar to the REFRESH MATERIALIZED VIEW command in PostgreSQL for syncing updates.
- class dataframe.DataFrame
Bases: object

Representation of the GreenplumPython DataFrame object.
- property is_saved: bool
Check whether the current dataframe is saved in the database.
- __getitem__(_) → DataFrame
- __getitem__(column_names: List[str]) → DataFrame
- __getitem__(predicate: Callable[[DataFrame], Expr]) → DataFrame
- __getitem__(column_name: str) → Expr
- __getitem__(rows: slice) → DataFrame
Select parts of the DataFrame.

Returns: a Column or a new DataFrame derived from the current DataFrame, per the type of key:

When you want to use a Column for computation rather than for observing data:

Args: key: string

Returns: a Column of the current DataFrame

id_col = dataframe["id"]

When you want to retrieve some columns of the DataFrame:

Args: key: list of columns

Returns: a DataFrame with the subset of columns, a.k.a. targets

id_dataframe = dataframe[["id"]]

When you want to filter the DataFrame on a Column with conditions:

Args: key: Expr

Returns: a DataFrame with the subset of rows satisfying the Expr

id_cond_dataframe = dataframe[lambda t: t["id"] == 0]

When you want to retrieve a portion of the DataFrame:

Args: key: slice

Returns: a DataFrame with the portion of consecutive rows

slice_dataframe = dataframe[2:5]
- where(predicate)
Filter the DataFrame by applying the predicate, keeping only the rows for which it holds.

- Parameters
  predicate (Callable[[DataFrame], Expr]) – the where condition statement.
- Returns
  DataFrame filtered according to the Expr passed as the argument.
- Return type
Example
>>> rows = [(i,) for i in range(-10, 10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> series.where(lambda df: df["id"] > 0)
----
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
----
(9 rows)
- apply(func, expand=False, column_name=None)
Apply a dataframe function to the self DataFrame.

- Parameters
  func (Callable[[DataFrame], FunctionExpr]) – A Python function that
  - takes the self DataFrame as the only parameter, and
  - returns the result of a dataframe function, which can be a NormalFunction, an AggregateFunction or a ColumnFunction.

  expand (bool) – whether to expand the multi-valued result into columns of the resulting DataFrame.

  column_name (Optional[str]) – name of the column of the return value in the resulting DataFrame.
- Returns
  A DataFrame of the returned values of the function.
- Return type
Example
To compute the absolute value of a series of numbers:
>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> result = series.apply(lambda df: abs(df["id"]))
>>> result
-----
 abs
-----
  10
   9
   8
   7
   6
   5
   4
   3
   2
   1
-----
(10 rows)
To transform columns into other types, see the following example. Suppose the label function takes a str and an int; it joins them into a string and returns the result:
>>> rows = [(i,) for i in range(10)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> @gp.create_function
... def label(prefix: str, id: int) -> str:
...     prefix = "id"
...     return f"{prefix}_{id}"
>>> result = series.apply(lambda t: label("label", t["id"]),
...                       column_name="label")
>>> result
-------
 label
-------
 id_0
 id_1
 id_2
 id_3
 id_4
 id_5
 id_6
 id_7
 id_8
 id_9
-------
(10 rows)
- assign(**new_columns)
Add new columns to the current DataFrame. Existing columns cannot be reassigned.

- Parameters
  new_columns (Callable[[DataFrame], Any]) – a dict whose keys are column names and whose values are Callables returning column data when applied to the current DataFrame.
- Returns
  a new DataFrame including the newly assigned columns.
- Return type
Example
>>> rows = [(i,) for i in range(-10, 0)]
>>> series = db.create_dataframe(rows=rows, column_names=["id"])
>>> abs = gp.function("abs")
>>> results = series.assign(abs=lambda nums: abs(nums["id"]))
>>> results
-----------
 id  | abs
-----+-----
 -10 |  10
  -9 |   9
  -8 |   8
  -7 |   7
  -6 |   6
  -5 |   5
  -4 |   4
  -3 |   3
  -2 |   2
  -1 |   1
-----------
(10 rows)
- order_by(column_name, ascending=None, nulls_first=None, operator=None)
Sort the DataFrame based on the configuration.

- Parameters
  column_name (str) – name of the column to order the dataframe by.
  ascending (Optional[bool]) – whether to sort in ascending order: True = ASC, False = DESC.
  nulls_first (Optional[bool]) – whether nulls are ordered first or last: True = first, False = last.
  operator (Optional[str]) – order using the given operator. Cannot be combined with ascending.
- Returns
  Specification of the ordering of the current DataFrame.
- Return type
Example
>>> columns = {"id": [3, 1, 2], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> result = t.order_by("id")[:]
>>> result
--------
 id | b
----+---
  1 | 2
  2 | 3
  3 | 1
--------
(3 rows)
- join(other, how='', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Join the current DataFrame with another using the given arguments.

- Parameters
  how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) – How the two DataFrames are joined. The value can be one of:
  - "INNER": inner join,
  - "LEFT": left outer join,
  - "RIGHT": right outer join,
  - "FULL": full outer join, or
  - "CROSS": cross join, i.e. the Cartesian product.

  The default value "" is equivalent to "INNER".
cond (Optional[Callable[[DataFrame, DataFrame], Expr]]) – a Callable lambda function as the join condition.

on (Optional[Union[str, Iterable[str]]]) – a list of column names that exist in both DataFrames to join on. cond and on cannot be used together.

self_columns (Union[Dict[str, Optional[str]], Set[str]]) – a dict whose keys are the column names of the current dataframe to be included in the resulting dataframe. The value, if not None, is used for renaming the corresponding key to avoid name conflicts. The asterisk "*" can be used as a key to indicate all columns.

other_columns (Union[Dict[str, Optional[str]], Set[str]]) – same as self_columns, but for the other DataFrame.
- Return type
Note
When using "*" as a key in self_columns or other_columns, please ensure that there will not be more than one column with the same name by applying proper renaming. Otherwise, an error will occur.

Example
>>> age_rows = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(
...     age_rows, column_names=["name", "age"], db=db)
>>> result = student.join(
...     student,
...     on="age",
...     self_columns={"*"},
...     other_columns={"name": "name_2"})
>>> result
----------------------
 name  | age | name_2
-------+-----+--------
 alice |  18 | alice
 bob   |  19 | carol
 bob   |  19 | bob
 carol |  19 | carol
 carol |  19 | bob
----------------------
(5 rows)
- inner_join(other, *, how='INNER', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Inner joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="INNER".

- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- left_join(other, *, how='LEFT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Left-outer joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="LEFT".

- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- right_join(other, *, how='RIGHT', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Right-outer joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="RIGHT".

- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- full_join(other, *, how='FULL', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Full-outer joins the current DataFrame with another DataFrame.

Equivalent to calling join() with how="FULL".

- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
- cross_join(other, *, how='CROSS', cond=None, on=None, self_columns={'*'}, other_columns={'*'})
Cross joins the current DataFrame with another DataFrame, i.e. computes the Cartesian product.

Equivalent to calling join() with how="CROSS".

- Parameters
other (DataFrame) –
how (Literal['', 'left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, Iterable[str]]]) –
self_columns (Union[Dict[str, Optional[str]], Set[str]]) –
other_columns (Union[Dict[str, Optional[str]], Set[str]]) –
- Return type
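The differences among these join variants come down to which row pairs survive the match. A plain-Python sketch of the semantics (the data and variable names are illustrative, not GreenplumPython API):

```python
# Two tiny "tables": (name, age) and (age, subject).
students = [("alice", 18), ("bob", 19)]
classes = [(18, "math"), (18, "art")]

# INNER: only rows whose join key (age) matches on both sides survive.
inner = [(n, a, s) for (n, a) in students for (a2, s) in classes if a == a2]

# LEFT: every left row survives; unmatched rows are padded with None.
left = []
for (n, a) in students:
    matches = [(n, a, s) for (a2, s) in classes if a == a2]
    left.extend(matches if matches else [(n, a, None)])

# CROSS: the full Cartesian product, with no join condition at all.
cross = [(n, s) for (n, _age) in students for (_key, s) in classes]

assert inner == [("alice", 18, "math"), ("alice", 18, "art")]
assert left == [("alice", 18, "math"), ("alice", 18, "art"), ("bob", 19, None)]
assert len(cross) == len(students) * len(classes)  # 4 rows
```

RIGHT mirrors LEFT with the sides swapped, and FULL pads unmatched rows on both sides.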
- refresh()
Refresh the local cache of the DataFrame.

After a dataframe has been displayed, its contents are cached locally. Modifications made to the database between the last caching and this refresh are not reflected in the local cache.

The local cache is used to iterate over the DataFrame instance locally.

- Returns
self
- Return type
Example
>>> cursor.execute("DROP TABLE IF EXISTS t_refresh;")
>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_refresh", column_names=["num"], temp=False).order_by("num")[:]
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> cursor.execute("INSERT INTO t_refresh(num) VALUES (5);")
>>> df
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> df.refresh()
-----
 num
-----
   0
   1
   2
   3
   4
   5
-----
(6 rows)
>>> cursor.execute("DROP TABLE t_refresh;")
Note
cursor is a predefined Psycopg Cursor which connects to the same database in another session with auto-commit enabled.
- save_as(table_name=None, column_names=[], temp=False, storage_params={}, drop_if_exists=False, drop_cascade=False, schema=None, distribution_type=None, distribution_key=None)
Save the GreenplumPython DataFrame as a table in the database, and return a new instance of DataFrame that represents the newly saved table.

After running this function, if temp is False, you can also use db.create_dataframe(table_name) to create a new DataFrame next time.

- Parameters
table_name (Optional[str]) – name of the table in the database; required to be unique in the schema.

temp (bool) – whether the table is temporary. Temp tables are dropped after the database connection is closed.

column_names (List[str]) – list of column names.

storage_params (dict[str, Any]) – storage parameters of Greenplum; see https://docs.vmware.com/en/VMware-Tanzu-Greenplum/7/greenplum-database/GUID-ref_guide-sql_commands-CREATE_TABLE_AS.html

schema (Optional[str]) – schema of the table, for avoiding name conflicts.

distribution_type (Optional[Literal[None, 'randomly', 'replicated', 'hash']]) – distribution type of the table.

distribution_key (Optional[Set[str]]) – distribution key.

drop_if_exists (bool) – whether to drop the table if it already exists.

drop_cascade (bool) – whether to drop dependent objects with CASCADE when dropping the table.
- Returns
DataFrame representing the newly saved table
- Return type
Example
>>> nums = db.create_dataframe(rows=[(i,) for i in range(5)], column_names=["num"])
>>> df = nums.save_as("t_saved", column_names=["num"], temp=True)
>>> df.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
>>> t_saved = db.create_dataframe(table_name="t_saved")
>>> t_saved.order_by("num")[:]
-----
 num
-----
   0
   1
   2
   3
   4
-----
(5 rows)
- create_index(columns, method='btree', name=None)
Create an index for the current dataframe for fast searching.
The current dataframe is required to be saved before creating an index.
- Parameters
columns (Union[Set[str], Dict[str, str]]) – Set of columns of the current dataframe to create the index on. Optionally, an operator class can be specified for each column by passing a Dict with column names as keys and their operator class names as values.

method (str) – name of the index access method.
name (Optional[str]) – name of the index.
- Returns
DataFrame with the key columns indexed.
- Return type
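Why an index makes searching fast can be sketched in plain Python: an index maps column values to row locations, so a lookup avoids scanning every row. This is a conceptual illustration only, not how GreenplumPython or the database implements btree indexes:

```python
rows = [("alice", 18), ("bob", 19), ("carol", 19)]  # (name, age)

# Build an "index" on the age column (position 1): value -> row positions.
index = {}
for pos, row in enumerate(rows):
    index.setdefault(row[1], []).append(pos)

# Search via the index: direct lookup instead of a full scan of rows.
matches = [rows[i] for i in index.get(19, [])]
assert matches == [("bob", 19), ("carol", 19)]
```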
- group_by(*column_names)
Group the current GreenplumPython DataFrame by column_names.
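Grouping collects rows that share the same values in the given columns so that aggregates can be computed per group. A plain-Python sketch of this GROUP BY semantics (illustrative data; not the GreenplumPython API):

```python
from collections import defaultdict

rows = [("alice", 18), ("bob", 19), ("carol", 19)]  # (name, age)

# Rows sharing the grouping key (age) fall into the same group.
groups = defaultdict(list)
for name, age in rows:
    groups[age].append(name)

# An aggregate such as count(*) is then computed over each group.
counts = {age: len(names) for age, names in groups.items()}
assert counts == {18: 1, 19: 2}
```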
- distinct_on(*column_names)
Deduplicate the current DataFrame with respect to the given columns.

This function follows the DISTINCT ON syntax in PostgreSQL.
- Parameters
column_names (str) – names of the current DataFrame's columns.
- Returns
  the DataFrame containing only the distinct values of the given columns.
- Return type
Example
>>> students = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> student = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
>>> student.distinct_on("age")[['age']]
-----
 age
-----
  18
  19
-----
(2 rows)
Note
Since both "bob" and "carol" have the same age 19, student.distinct_on("age") will pick one of them arbitrarily for the name column. Use [['age']] to make sure the result is stable.
- property unique_key: List[str]
Return unique key.
- check_unique(columns)
Check whether a given set of columns, i.e. a key, is unique.

- Parameters
  columns (set[str]) – set of column names to be checked.
- Returns
  self, after the check.
- Return type
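The uniqueness check described above amounts to verifying that no two rows share the same combination of values in the key columns. A plain-Python sketch (the helper name and data are illustrative, not the GreenplumPython API):

```python
def is_unique_key(rows, key_indices):
    """Return True iff no two rows agree on all columns in key_indices."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        if key in seen:
            return False
        seen.add(key)
    return True


rows = [("alice", 18), ("bob", 19), ("carol", 19)]  # (name, age)
assert is_unique_key(rows, [0])        # "name" is unique
assert not is_unique_key(rows, [1])    # "age" has a duplicate (19 twice)
assert is_unique_key(rows, [0, 1])     # the (name, age) pair is unique
```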
- classmethod from_table(table_name, db, schema=None)
Return a DataFrame that represents the given table in the Database.

- Parameters
table_name (str) – table name
db (Database) – database of the table
schema (Optional[str]) – schema of table in database
- Return type
df = gp.DataFrame.from_table("pg_class", db=db)
- classmethod from_rows(rows, db, column_names=None)
Return a DataFrame using a given list of values.

- Parameters
rows (Iterable[Union[Tuple[Any], Dict[str, Any]]]) –
Iterable[Tuple[Any]]: a list of row values.
Iterable[Dict[str, Any]]: a list of key value pairs to determine the columns and rows. The column names are decided by the keys of the first dictionary element if the column_names is not specified.
db (Database) – database which will be associated with the DataFrame.

column_names (Optional[List[str]]) – list of given column names.
- Returns
DataFrame generated with the given values.
- Return type
>>> rows = [(1,), (2,), (3,)]
>>> df = gp.DataFrame.from_rows(rows, db=db, column_names=["id"])
>>> df
----
 id
----
  1
  2
  3
----
(3 rows)
>>> dict_list = [{"id": 1, "val": "11"}, {"id": 2, "val": "22"}]
>>> df = gp.DataFrame.from_rows(dict_list, db=db)
>>> df
----------
 id | val
----+-----
  1 | 11
  2 | 22
----------
(2 rows)
- classmethod from_columns(columns, db)
Return a DataFrame using the given lists of column values.

- Parameters
- Returns
the DataFrame generated with the given values.
- Return type
Example
>>> columns = {"a": [1, 2, 3], "b": [1, 2, 3]}
>>> t = gp.DataFrame.from_columns(columns, db=db)
>>> t
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)
- embedding()
Enable embedding-based similarity search on columns of the current DataFrame.

Example
See Generating, Indexing and Searching Embeddings (Experimental) for more details.
Warning
This function is currently experimental and the interface is subject to change.
- Return type
- classmethod from_files(files, parser, db)
Create a DataFrame with data read from files.
- Parameters
files (list[str]) – list of file paths. Each path ends with the path of the same file on the client, without links resolved.

parser (NormalFunction) – a UDF that parses the given files on the server. The UDF is required to
- take the file path as its only argument, and
- return a set of parsed records in the returned DataFrame.

db (Database) – Database that the DataFrame is to be created in.
- Returns
DataFrame containing the parsed data from the given files.
- Return type
Warning
This function is currently experimental and the interface is subject to change.