Database

Manage connection to Greenplum/PostgreSQL database.

class db.Database

Bases: object

Representation of a database in which data is located and computation is performed.

Each Database object is tied to a connection to the remote database system.

close()

Close the database connection.

Return type

None

create_dataframe(table_name=None, schema=None, rows=None, columns=None, column_names=None, files=None, parser=None)

Create a DataFrame from a database table, or a set of data.

Parameters
  • table_name (Optional[str]) – str: name of table in Database

  • schema (Optional[str]) – str: name of schema in Database

  • rows (Optional[List[Union[Tuple[Any, ...], Dict[str, Any]]]]) – List[Union[Tuple[Any, …], Dict[str, Any]]]: a List of rows

  • columns (Optional[Dict[str, Iterable[Any]]]) – Dict[str, List[Any]]: a dict of columns

  • column_names (Optional[Iterable[str]]) – Iterable[str]: List of given column names

  • files (Optional[List[str]]) –

  • parser (Optional[NormalFunction]) –

Example

To create DataFrame from a database table:

>>> cursor.execute("DROP TABLE IF EXISTS one_column_table")
>>> cursor.execute(
...     "CREATE TABLE one_column_table AS SELECT 42 as id;")
>>> df_from_table = db.create_dataframe(table_name="one_column_table")
>>> df_from_table
----
 id
----
 42
----
(1 row)

To create DataFrame from a predefined set of data:

>>> rows = [(1,), (2,), (3,)]
>>> df_from_rows = db.create_dataframe(rows=rows, column_names=["id"])
>>> df_from_rows
----
 id
----
  1
  2
  3
----
(3 rows)
>>> columns = {"a": [1, 2, 3], "b": [1, 2, 3]}
>>> t_from_columns = db.create_dataframe(columns=columns)
>>> t_from_columns
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)
apply(func, expand=False, column_name=None)

Apply a function in database without depending on a DataFrame.

This is primarily for applying functions on adaptable Python objects as constants in database.

The arguments and return type is similar to apply() except that parameter func takes no argument.

Example

>>> @gp.create_function
... def add(a: int, b: int) -> int:
...     return a + b
>>> db.apply(lambda: add(1, 2), column_name="sum")
-----
 sum
-----
   3
-----
(1 row)
Parameters
  • func (Callable[[], FunctionExpr]) –

  • expand (bool) –

  • column_name (Optional[str]) –

Return type

DataFrame

assign(**new_columns)

Assign new columns by calling functions in database.

Parameters

new_columns (Callable[[], Any]) – a dict whose keys are column names and values are Callable returning column data when applied to constant value in database.

Returns

GreenplumPython DataFrame resulted with assigned columns

Return type

DataFrame

Example

>>> abs = gp.function("abs")
>>> db.assign(abs=lambda: abs(-42))
-----
 abs
-----
  42
-----
(1 row)
install_packages(requirements)

Install the required Python packages on the server host.

The packages will be installed to the currently activated Python environment.

Parameters

requirements (str) – specification of what packages are required to be installed. The format is the same as the rquirements file used by pip. It can be obtained by reading an existing requirements file in text mode.

Return type

None

Example

See Installing Python Packages on Server without Internet (Experimental) for more details.

Note

This function only installs packages on the server host that GreenplumPython directly connects to. If your database server spreads across multiple hosts, additional operations are required to make the packages available on all hosts.

One simple way to achieve this is to setup an NFS share on all hosts. Please refer to Installing Python Packages on Server without Internet (Experimental) for a simple working example.

Warning

This function is currently experimental and the interface is subject to change.

db.database(uri=None, params={})

Open a connection to database with connection URI or parameters.

Parameters
  • uri (Optional[str]) – connection URI to the database. Please refer to the libpq documentation on connection URI for detailed usage.

  • params (Dict[str, Optional[str]]) – connection parameters to the database. Please refer to the libpq documentation on parameter keywords for detailed usage. The parameter will be ignored and will not be passed to the remote database server if its value is None.

Return type

Database