Function

To create and call Greenplum/PostgreSQL UDFs or UDAs.

class func.AggregateFunction

Represent an aggregate function.

The function can be applied to:

An AggregateFunction is mapped to a User-Defined Aggregate (UDA) function in database.

When called, the arguments of an AggregateFunction can be

And the AggregateFunction returns one value aggregating data in all rows of the DataFrame or a group in the DataFrameGroupingSet.

__call__(*args)

Call the dataframe function with the given arguments.

Parameters

args (Any) –

Return type

FunctionExpr

distinct(*args)

Apply the current aggregate function to each distinct set of the arguments.

Parameters

args (Any) – Argument of the aggregate function.

Returns

An expression represents the function call.

Return type

FunctionExpr

Example

>>> rows = [(1,), (2,), (2,), (3,), (3,), (4,)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val"])
>>> count = gp.aggregate_function("count")
>>> results = numbers.group_by().assign(
...     unique_numbers_count=lambda t: count.distinct(t["val"]))
>>> results
----------------------
 unique_numbers_count
----------------------
                    4
----------------------
(1 row)
property transition_function: NormalFunction

Return the transition function of the aggregate function.

class func.ArrayFunctionExpr

Inherited from FunctionExpr.

Specialized for an Array Function. It will array aggregate all the columns given by the user.

class func.ColumnFunction

Represent a dataframe column function.

As NormalFunction, a ColumnFunction is mapped to a UDF in database.

The calling convention of a ColumnFunction is the same as a AggregateFunction. However, rather than operating on one row at a time, all rows of the entire column are aggregated into a list before passing to ColumnFunction as argument, except when the column is used as the grouping attribute in group_by().

A ColumnFunction returns

Note

The primary use case for column function is to implement complex analytics such as machine learning using your favorite Python packages.

Inside a column function, the user can operate on all the data, rather than only part of it. As a result, the operation does not to satisfy certain restrictions such as Additivity. This makes it possible to implement complex functions.

Warning

However, such good usability comes at the cost of scalability.

  • Gathering data into one place makes it hard to exploit inter-machine parallelism when backed by an MPP database system like Greenplum, especially when the number of groups is small. Fortunately, this can be alleviated because many Python packages are SIMD optimized.

  • When the backing database system is PostgreSQL-derived, such as Greenplum, the size of one value cannot be larger than 1 GB. This limits the size of problems column functions can solve. Currently, one way to mitigate this issue is to break a large DataFrame into smaller groups and somehow combine the results of the column function for all groups.

__call__(*args)

Call the dataframe function with the given arguments.

Parameters

args (Any) –

Return type

ArrayFunctionExpr

class func.FunctionExpr

Inherited from Expr.

A Function Expression object associated with a Greenplum/PostgreSQL function which can be called and applied to the data in the database.

class func.NormalFunction

Represent a (normal) dataframe function.

The function can be applied to:

A NormalFunction is mapped to a User-Defined Function (UDF) in database.

When called, the arguments of an AggregateFunction can be

and the AggregateFunction returns one value of the return type for each row of values in its arguments.

__call__(*args)

Call the dataframe function with the given arguments.

Parameters

args (Any) –

Return type

FunctionExpr

unwrap()

Get the wrapped Python function in the database function.

Return type

Callable[[…], Any]

func.aggregate_function(name, schema=None)

Get access to a predefined AggregateFunction from the database.

Parameters
  • name (str) – Name of the aggregate function.

  • schema (Optional[str]) – Schema (a.k.a namespace) of the aggregate function in database.

Returns

The AggregateFunction with the specified name and schema.

Return type

AggregateFunction

Example

>>> array_agg = gp.aggregate_function("array_agg")
>>> df = db.create_dataframe(columns={"i": range(3)})
>>> result = df.apply(lambda t: array_agg(t['i']), column_name="aggregate_result")
>>> result
------------------
 aggregate_result
------------------
 [0, 1, 2]
------------------
(1 row)
func.create_aggregate(transition_func=None, language_handler='plpython3u')

Create an AggregateFunction from the given Python function.

Parameters
  • transition_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out the state transition. It needs to follow the same convention as the wrapped_func parameter of create_function(), and the notes on serialization also applied here.

  • language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.

Returns

The newly created AggregateFunction.

Return type

AggregateFunction

Example

>>> @gp.create_aggregate
... def my_sum(cur_sum: int, val: int) -> int:
...     if cur_sum is None:
...         return val
...     return cur_sum + val

>>> rows = [(1,) for _ in range(10)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val"])
>>> results = numbers.group_by().assign(result=lambda t: my_sum(t["val"]))
>>> results
--------
 result
--------
     10
--------
(1 row)
func.create_column_function(wrapped_func=None, language_handler='plpython3u')

Create an ColumnFunction from the given Python function.

Parameters
  • wrapped_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out computation on columns. It needs to follow the same convention as the wrapped_func parameter of create_function(), and the notes on serialization also applied here.

  • language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.

Returns

The newly created ColumnFunction.

Return type

ColumnFunction

Example

>>> @gp.create_column_function
... def my_array_summary(val_list: List[int]) -> str:
...     return f'Length: {len(val_list)}, Sum: {sum(val_list)}'

>>> rows = [(1,), (2,), (3,)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val"])
>>> results = numbers.group_by().assign(
...     summary=lambda t: my_array_summary(t["val"]))
>>> results
-------------------
 summary
-------------------
 Length: 3, Sum: 6
-------------------
(1 row)
func.create_function(wrapped_func=None, language_handler='plpython3u')

Create a NormalFunction from the given Python function.

Parameters
  • wrapped_func (Optional[Callable[[...], Any]]) –

    the Python function carrying out the computation. Its definition need to follow the conventions below:

    • The function needs to be defined with the def keyword. Lambda expressions as the wrapped function are not supported yet.

    • Each parameter and the return value needs to be annotated with native Python type. The type annotations will be mapped to the types in database automatically.

    • A NormalFunction can return multiple values. In that case, the return type of the wrapped Python function needs to be a Python class with members annotated. It is recommended to use dataclasses.dataclass as return type.

  • language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.

  • schema – schema name

Returns

The newly created NormalFunction.

Return type

NormalFunction

Note

The created function is actually executed on the remote database server. To send it to the server, when creating the dataframe function,

  • Package dill, by the Uncertainty Quantification Foundation, is used to serialize the wrapped Python function and its dependencies when applicable. Therefore, it is recommended to install dill on the host of the backing database server.

  • If dill is not installed on the server, or the Python versions between client and server does not match, the source code of the wrapped Python function will be transmitted to the server, along with all the import statements for dependencies used by the function. In that case, the modules imported need to be installed on server in advance.

Example

>>> @gp.create_function
... def multiply(a: int, b: int) -> int:
...    return a * b

>>> db.assign(result=lambda: multiply(1, 2))
--------
 result
--------
      2
--------
(1 row)
func.function(name, schema=None)

Get access to a predefined dataframe NormalFunction from database.

Parameters
  • name (str) – Name of the function.

  • schema (Optional[str]) – Schema (a.k.a namespace) of the function in database.

Return type

NormalFunction

Returns

The NormalFunction with the specified name and schema.

Example

>>> generate_series = gp.function("generate_series")
>>> db.apply(lambda: generate_series(0, 2))
-----------------
 generate_series
-----------------
               0
               1
               2
-----------------
(3 rows)