Function
To create and call Greenplum/PostgreSQL UDFs or UDAs.
- class func.AggregateFunction
Represent an aggregate function.
The function can be applied to:
a
DataFramewithapply(), where the function will aggregate data in the entire dataframe;a
DataFrameGroupingSetwithassign()orapply(), where the function will aggregate each group of data.
An
AggregateFunctionis mapped to a User-Defined Aggregate (UDA) function in database.When called, the arguments of an
AggregateFunctioncan beAnd the
AggregateFunctionreturns one value aggregating data in all rows of theDataFrameor a group in theDataFrameGroupingSet.- __call__(*args)
Call the dataframe function with the given arguments.
- Parameters
args (Any) –
- Return type
- distinct(*args)
Apply the current aggregate function to each distinct set of the arguments.
- Parameters
args (Any) – Argument of the aggregate function.
- Returns
An expression represents the function call.
- Return type
Example
>>> rows = [(1,), (2,), (2,), (3,), (3,), (4,)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val"]) >>> count = gp.aggregate_function("count") >>> results = numbers.group_by().assign( ... unique_numbers_count=lambda t: count.distinct(t["val"])) >>> results ---------------------- unique_numbers_count ---------------------- 4 ---------------------- (1 row)
- property transition_function: NormalFunction
Return the transition function of the aggregate function.
- class func.ArrayFunctionExpr
Inherited from
FunctionExpr.Specialized for an Array Function. It will array aggregate all the columns given by the user.
- class func.ColumnFunction
Represent a dataframe column function.
a
DataFramewithapply(), where the function will operate on columns in the entire dataframe;a
DataFrameGroupingSetwithassign()orapply(), where the function will operate on columns of each group of data.
As
NormalFunction, aColumnFunctionis mapped to a UDF in database.The calling convention of a
ColumnFunctionis the same as aAggregateFunction. However, rather than operating on one row at a time, all rows of the entire column are aggregated into alistbefore passing toColumnFunctionas argument, except when the column is used as the grouping attribute ingroup_by().A
ColumnFunctionreturnsOne value of the return type when applied to a
DataFrame; orOne value for each group when applied to a
DataFrameGroupingSet.
Note
The primary use case for column function is to implement complex analytics such as machine learning using your favorite Python packages.
Inside a column function, the user can operate on all the data, rather than only part of it. As a result, the operation does not to satisfy certain restrictions such as Additivity. This makes it possible to implement complex functions.
Warning
However, such good usability comes at the cost of scalability.
Gathering data into one place makes it hard to exploit inter-machine parallelism when backed by an MPP database system like Greenplum, especially when the number of groups is small. Fortunately, this can be alleviated because many Python packages are SIMD optimized.
When the backing database system is PostgreSQL-derived, such as Greenplum, the size of one value cannot be larger than 1 GB. This limits the size of problems column functions can solve. Currently, one way to mitigate this issue is to break a large
DataFrameinto smaller groups and somehow combine the results of the column function for all groups.
- __call__(*args)
Call the dataframe function with the given arguments.
- Parameters
args (Any) –
- Return type
- class func.FunctionExpr
Inherited from
Expr.A Function Expression object associated with a Greenplum/PostgreSQL function which can be called and applied to the data in the database.
- class func.NormalFunction
Represent a (normal) dataframe function.
The function can be applied to:
A
NormalFunctionis mapped to a User-Defined Function (UDF) in database.When called, the arguments of an
AggregateFunctioncan beand the
AggregateFunctionreturns one value of the return type for each row of values in its arguments.- __call__(*args)
Call the dataframe function with the given arguments.
- Parameters
args (Any) –
- Return type
- unwrap()
Get the wrapped Python function in the database function.
- Return type
Callable[[…], Any]
- func.aggregate_function(name, schema=None)
Get access to a predefined
AggregateFunctionfrom the database.- Parameters
name (str) – Name of the aggregate function.
schema (Optional[str]) – Schema (a.k.a namespace) of the aggregate function in database.
- Returns
The
AggregateFunctionwith the specifiednameandschema.- Return type
Example
>>> array_agg = gp.aggregate_function("array_agg") >>> df = db.create_dataframe(columns={"i": range(3)}) >>> result = df.apply(lambda t: array_agg(t['i']), column_name="aggregate_result") >>> result ------------------ aggregate_result ------------------ [0, 1, 2] ------------------ (1 row)
- func.create_aggregate(transition_func=None, language_handler='plpython3u')
Create an
AggregateFunctionfrom the given Python function.- Parameters
transition_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out the state transition. It needs to follow the same convention as the
wrapped_funcparameter ofcreate_function(), and the notes on serialization also applied here.language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.
- Returns
The newly created
AggregateFunction.- Return type
Example
>>> @gp.create_aggregate ... def my_sum(cur_sum: int, val: int) -> int: ... if cur_sum is None: ... return val ... return cur_sum + val >>> rows = [(1,) for _ in range(10)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val"]) >>> results = numbers.group_by().assign(result=lambda t: my_sum(t["val"])) >>> results -------- result -------- 10 -------- (1 row)
- func.create_column_function(wrapped_func=None, language_handler='plpython3u')
Create an
ColumnFunctionfrom the given Python function.- Parameters
wrapped_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out computation on columns. It needs to follow the same convention as the
wrapped_funcparameter ofcreate_function(), and the notes on serialization also applied here.language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.
- Returns
The newly created
ColumnFunction.- Return type
Example
>>> @gp.create_column_function ... def my_array_summary(val_list: List[int]) -> str: ... return f'Length: {len(val_list)}, Sum: {sum(val_list)}' >>> rows = [(1,), (2,), (3,)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val"]) >>> results = numbers.group_by().assign( ... summary=lambda t: my_array_summary(t["val"])) >>> results ------------------- summary ------------------- Length: 3, Sum: 6 ------------------- (1 row)
- func.create_function(wrapped_func=None, language_handler='plpython3u')
Create a
NormalFunctionfrom the given Python function.- Parameters
wrapped_func (Optional[Callable[[...], Any]]) –
the Python function carrying out the computation. Its definition need to follow the conventions below:
The function needs to be defined with the
defkeyword. Lambda expressions as the wrapped function are not supported yet.Each parameter and the return value needs to be annotated with native Python type. The type annotations will be mapped to the types in database automatically.
A
NormalFunctioncan return multiple values. In that case, the return type of the wrapped Python function needs to be a Pythonclasswith members annotated. It is recommended to usedataclasses.dataclassas return type.
language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.
schema – schema name
- Returns
The newly created
NormalFunction.- Return type
Note
The created function is actually executed on the remote database server. To send it to the server, when creating the dataframe function,
Package dill, by the Uncertainty Quantification Foundation, is used to serialize the wrapped Python function and its dependencies when applicable. Therefore, it is recommended to install dill on the host of the backing database server.
If dill is not installed on the server, or the Python versions between client and server does not match, the source code of the wrapped Python function will be transmitted to the server, along with all the import statements for dependencies used by the function. In that case, the modules imported need to be installed on server in advance.
Example
>>> @gp.create_function ... def multiply(a: int, b: int) -> int: ... return a * b >>> db.assign(result=lambda: multiply(1, 2)) -------- result -------- 2 -------- (1 row)
- func.function(name, schema=None)
Get access to a predefined dataframe
NormalFunctionfrom database.- Parameters
name (str) – Name of the function.
schema (Optional[str]) – Schema (a.k.a namespace) of the function in database.
- Return type
- Returns
The
NormalFunctionwith the specifiednameandschema.
Example
>>> generate_series = gp.function("generate_series") >>> db.apply(lambda: generate_series(0, 2)) ----------------- generate_series ----------------- 0 1 2 ----------------- (3 rows)