Function

To create and call Greenplum/PostgreSQL UDFs or UDAs.

class func.AggregateFunction

Represent an aggregate function.

The function can be applied to:

a DataFrame with apply(), where the function will aggregate data in the entire dataframe;
a DataFrameGroupingSet with assign() or apply(), where the function will aggregate each group of data.

An AggregateFunction is mapped to a User-Defined Aggregate (UDA) function in database.

When called, the arguments of an AggregateFunction can be

Column of a DataFrame; or
constant values represented as Python objects.

And the AggregateFunction returns one value aggregating data in all rows of the DataFrame or a group in the DataFrameGroupingSet.

__call__(*args)

Call the dataframe function with the given arguments.

Parameters: args (Any) –
Return type: FunctionExpr

distinct(*args)

Apply the current aggregate function to each distinct set of the arguments.

Parameters: args (Any) – Argument of the aggregate function.
Returns: An expression represents the function call.
Return type: FunctionExpr

Example

>>> rows = [(1,), (2,), (2,), (3,), (3,), (4,)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val"])
>>> count = gp.aggregate_function("count")
>>> results = numbers.group_by().assign(
...     unique_numbers_count=lambda t: count.distinct(t["val"]))
>>> results
----------------------
 unique_numbers_count
----------------------
                    4
----------------------
(1 row)

property transition_function: NormalFunction: Return the transition function of the aggregate function.

class func.ArrayFunctionExpr

Inherited from FunctionExpr.

Specialized for an Array Function. It will array aggregate all the columns given by the user.

class func.ColumnFunction

Represent a dataframe column function.

a DataFrame with apply(), where the function will operate on columns in the entire dataframe;
a DataFrameGroupingSet with assign() or apply(), where the function will operate on columns of each group of data.

As NormalFunction, a ColumnFunction is mapped to a UDF in database.

The calling convention of a ColumnFunction is the same as a AggregateFunction. However, rather than operating on one row at a time, all rows of the entire column are aggregated into a list before passing to ColumnFunction as argument, except when the column is used as the grouping attribute in group_by().

A ColumnFunction returns

One value of the return type when applied to a DataFrame; or
One value for each group when applied to a DataFrameGroupingSet.

Note

The primary use case for column function is to implement complex analytics such as machine learning using your favorite Python packages.

Inside a column function, the user can operate on all the data, rather than only part of it. As a result, the operation does not to satisfy certain restrictions such as Additivity. This makes it possible to implement complex functions.

Warning

However, such good usability comes at the cost of scalability.

Gathering data into one place makes it hard to exploit inter-machine parallelism when backed by an MPP database system like Greenplum, especially when the number of groups is small. Fortunately, this can be alleviated because many Python packages are SIMD optimized.
When the backing database system is PostgreSQL-derived, such as Greenplum, the size of one value cannot be larger than 1 GB. This limits the size of problems column functions can solve. Currently, one way to mitigate this issue is to break a large DataFrame into smaller groups and somehow combine the results of the column function for all groups.

__call__(*args)

Call the dataframe function with the given arguments.

Parameters: args (Any) –
Return type: ArrayFunctionExpr

class func.FunctionExpr

Inherited from Expr.

A Function Expression object associated with a Greenplum/PostgreSQL function which can be called and applied to the data in the database.

class func.NormalFunction

Represent a (normal) dataframe function.

The function can be applied to:

a DataFrame with assign() or apply();
a Database when no DataFrame is involved with assign() or apply().

A NormalFunction is mapped to a User-Defined Function (UDF) in database.

When called, the arguments of an AggregateFunction can be

Column of a DataFrame; or
constant values represented as Python objects

and the AggregateFunction returns one value of the return type for each row of values in its arguments.

__call__(*args)

Call the dataframe function with the given arguments.

Parameters: args (Any) –
Return type: FunctionExpr

unwrap()

Get the wrapped Python function in the database function.

Return type: Callable[[…], Any]

func.aggregate_function(name, schema=None)

Get access to a predefined AggregateFunction from the database.

Parameters

name (str) – Name of the aggregate function.
schema (Optional[str]) – Schema (a.k.a namespace) of the aggregate function in database.

Returns

The AggregateFunction with the specified name and schema.

Return type

AggregateFunction

Example

>>> array_agg = gp.aggregate_function("array_agg")
>>> df = db.create_dataframe(columns={"i": range(3)})
>>> result = df.apply(lambda t: array_agg(t['i']), column_name="aggregate_result")
>>> result
------------------
 aggregate_result
------------------
 [0, 1, 2]
------------------
(1 row)

func.create_aggregate(transition_func=None, language_handler='plpython3u')

Create an AggregateFunction from the given Python function.

Parameters

transition_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out the state transition. It needs to follow the same convention as the wrapped_func parameter of create_function(), and the notes on serialization also applied here.
language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.

Returns

The newly created AggregateFunction.

Return type

AggregateFunction

Example

>>> @gp.create_aggregate
... def my_sum(cur_sum: int, val: int) -> int:
...     if cur_sum is None:
...         return val
...     return cur_sum + val

>>> rows = [(1,) for _ in range(10)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val"])
>>> results = numbers.group_by().assign(result=lambda t: my_sum(t["val"]))
>>> results
--------
 result
--------
     10
--------
(1 row)

func.create_column_function(wrapped_func=None, language_handler='plpython3u')

Create an ColumnFunction from the given Python function.

Parameters

wrapped_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out computation on columns. It needs to follow the same convention as the wrapped_func parameter of create_function(), and the notes on serialization also applied here.
language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.

Returns

The newly created ColumnFunction.

Return type

ColumnFunction

Example

>>> @gp.create_column_function
... def my_array_summary(val_list: List[int]) -> str:
...     return f'Length: {len(val_list)}, Sum: {sum(val_list)}'

>>> rows = [(1,), (2,), (3,)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val"])
>>> results = numbers.group_by().assign(
...     summary=lambda t: my_array_summary(t["val"]))
>>> results
-------------------
 summary
-------------------
 Length: 3, Sum: 6
-------------------
(1 row)

func.create_function(wrapped_func=None, language_handler='plpython3u')

Create a NormalFunction from the given Python function.

Parameters

wrapped_func (Optional[Callable[[...], Any]]) –
the Python function carrying out the computation. Its definition need to follow the conventions below:
- The function needs to be defined with the def keyword. Lambda expressions as the wrapped function are not supported yet.
- Each parameter and the return value needs to be annotated with native Python type. The type annotations will be mapped to the types in database automatically.
- A NormalFunction can return multiple values. In that case, the return type of the wrapped Python function needs to be a Python class with members annotated. It is recommended to use dataclasses.dataclass as return type.
language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.
schema – schema name

Returns

The newly created NormalFunction.

Return type

NormalFunction

Note

The created function is actually executed on the remote database server. To send it to the server, when creating the dataframe function,

Package dill, by the Uncertainty Quantification Foundation, is used to serialize the wrapped Python function and its dependencies when applicable. Therefore, it is recommended to install dill on the host of the backing database server.
If dill is not installed on the server, or the Python versions between client and server does not match, the source code of the wrapped Python function will be transmitted to the server, along with all the import statements for dependencies used by the function. In that case, the modules imported need to be installed on server in advance.

Example

>>> @gp.create_function
... def multiply(a: int, b: int) -> int:
...    return a * b

>>> db.assign(result=lambda: multiply(1, 2))
--------
 result
--------
      2
--------
(1 row)

func.function(name, schema=None)

Get access to a predefined dataframe NormalFunction from database.

Parameters

name (str) – Name of the function.
schema (Optional[str]) – Schema (a.k.a namespace) of the function in database.

Return type

NormalFunction

Returns: The NormalFunction with the specified name and schema.

Example

>>> generate_series = gp.function("generate_series")
>>> db.apply(lambda: generate_series(0, 2))
-----------------
 generate_series
-----------------
               0
               1
               2
-----------------
(3 rows)