Function
To create and call Greenplum/PostgreSQL UDFs or UDAs.
- class func.AggregateFunction
Represent an aggregate function.
The function can be applied to:
a
DataFrame
withapply()
, where the function will aggregate data in the entire dataframe;a
DataFrameGroupingSet
withassign()
orapply()
, where the function will aggregate each group of data.
An
AggregateFunction
is mapped to a User-Defined Aggregate (UDA) function in database.When called, the arguments of an
AggregateFunction
can beAnd the
AggregateFunction
returns one value aggregating data in all rows of theDataFrame
or a group in theDataFrameGroupingSet
.- __call__(*args)
Call the dataframe function with the given arguments.
- Parameters
args (Any) –
- Return type
- distinct(*args)
Apply the current aggregate function to each distinct set of the arguments.
- Parameters
args (Any) – Argument of the aggregate function.
- Returns
An expression represents the function call.
- Return type
Example
>>> rows = [(1,), (2,), (2,), (3,), (3,), (4,)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val"]) >>> count = gp.aggregate_function("count") >>> results = numbers.group_by().assign( ... unique_numbers_count=lambda t: count.distinct(t["val"])) >>> results ---------------------- unique_numbers_count ---------------------- 4 ---------------------- (1 row)
- property transition_function: NormalFunction
Return the transition function of the aggregate function.
- class func.ArrayFunctionExpr
Inherited from
FunctionExpr
.Specialized for an Array Function. It will array aggregate all the columns given by the user.
- class func.ColumnFunction
Represent a dataframe column function.
a
DataFrame
withapply()
, where the function will operate on columns in the entire dataframe;a
DataFrameGroupingSet
withassign()
orapply()
, where the function will operate on columns of each group of data.
As
NormalFunction
, aColumnFunction
is mapped to a UDF in database.The calling convention of a
ColumnFunction
is the same as aAggregateFunction
. However, rather than operating on one row at a time, all rows of the entire column are aggregated into alist
before passing toColumnFunction
as argument, except when the column is used as the grouping attribute ingroup_by()
.A
ColumnFunction
returnsOne value of the return type when applied to a
DataFrame
; orOne value for each group when applied to a
DataFrameGroupingSet
.
Note
The primary use case for column function is to implement complex analytics such as machine learning using your favorite Python packages.
Inside a column function, the user can operate on all the data, rather than only part of it. As a result, the operation does not to satisfy certain restrictions such as Additivity. This makes it possible to implement complex functions.
Warning
However, such good usability comes at the cost of scalability.
Gathering data into one place makes it hard to exploit inter-machine parallelism when backed by an MPP database system like Greenplum, especially when the number of groups is small. Fortunately, this can be alleviated because many Python packages are SIMD optimized.
When the backing database system is PostgreSQL-derived, such as Greenplum, the size of one value cannot be larger than 1 GB. This limits the size of problems column functions can solve. Currently, one way to mitigate this issue is to break a large
DataFrame
into smaller groups and somehow combine the results of the column function for all groups.
- __call__(*args)
Call the dataframe function with the given arguments.
- Parameters
args (Any) –
- Return type
- class func.FunctionExpr
Inherited from
Expr
.A Function Expression object associated with a Greenplum/PostgreSQL function which can be called and applied to the data in the database.
- class func.NormalFunction
Represent a (normal) dataframe function.
The function can be applied to:
A
NormalFunction
is mapped to a User-Defined Function (UDF) in database.When called, the arguments of an
AggregateFunction
can beand the
AggregateFunction
returns one value of the return type for each row of values in its arguments.- __call__(*args)
Call the dataframe function with the given arguments.
- Parameters
args (Any) –
- Return type
- unwrap()
Get the wrapped Python function in the database function.
- Return type
Callable[[…], Any]
- func.aggregate_function(name, schema=None)
Get access to a predefined
AggregateFunction
from the database.- Parameters
name (str) – Name of the aggregate function.
schema (Optional[str]) – Schema (a.k.a namespace) of the aggregate function in database.
- Returns
The
AggregateFunction
with the specifiedname
andschema
.- Return type
Example
>>> array_agg = gp.aggregate_function("array_agg") >>> df = db.create_dataframe(columns={"i": range(3)}) >>> result = df.apply(lambda t: array_agg(t['i']), column_name="aggregate_result") >>> result ------------------ aggregate_result ------------------ [0, 1, 2] ------------------ (1 row)
- func.create_aggregate(transition_func=None, language_handler='plpython3u')
Create an
AggregateFunction
from the given Python function.- Parameters
transition_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out the state transition. It needs to follow the same convention as the
wrapped_func
parameter ofcreate_function()
, and the notes on serialization also applied here.language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.
- Returns
The newly created
AggregateFunction
.- Return type
Example
>>> @gp.create_aggregate ... def my_sum(cur_sum: int, val: int) -> int: ... if cur_sum is None: ... return val ... return cur_sum + val >>> rows = [(1,) for _ in range(10)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val"]) >>> results = numbers.group_by().assign(result=lambda t: my_sum(t["val"])) >>> results -------- result -------- 10 -------- (1 row)
- func.create_column_function(wrapped_func=None, language_handler='plpython3u')
Create an
ColumnFunction
from the given Python function.- Parameters
wrapped_func (Optional[Callable[[...], Any]]) – the wrapped Python function carrying out computation on columns. It needs to follow the same convention as the
wrapped_func
parameter ofcreate_function()
, and the notes on serialization also applied here.language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.
- Returns
The newly created
ColumnFunction
.- Return type
Example
>>> @gp.create_column_function ... def my_array_summary(val_list: List[int]) -> str: ... return f'Length: {len(val_list)}, Sum: {sum(val_list)}' >>> rows = [(1,), (2,), (3,)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val"]) >>> results = numbers.group_by().assign( ... summary=lambda t: my_array_summary(t["val"])) >>> results ------------------- summary ------------------- Length: 3, Sum: 6 ------------------- (1 row)
- func.create_function(wrapped_func=None, language_handler='plpython3u')
Create a
NormalFunction
from the given Python function.- Parameters
wrapped_func (Optional[Callable[[...], Any]]) –
the Python function carrying out the computation. Its definition need to follow the conventions below:
The function needs to be defined with the
def
keyword. Lambda expressions as the wrapped function are not supported yet.Each parameter and the return value needs to be annotated with native Python type. The type annotations will be mapped to the types in database automatically.
A
NormalFunction
can return multiple values. In that case, the return type of the wrapped Python function needs to be a Pythonclass
with members annotated. It is recommended to usedataclasses.dataclass
as return type.
language_handler (Literal['plpython3u']) – language handler to run the function in database, defaults to plpython3u, will also support plcontainer later.
schema – schema name
- Returns
The newly created
NormalFunction
.- Return type
Note
The created function is actually executed on the remote database server. To send it to the server, when creating the dataframe function,
Package dill, by the Uncertainty Quantification Foundation, is used to serialize the wrapped Python function and its dependencies when applicable. Therefore, it is recommended to install dill on the host of the backing database server.
If dill is not installed on the server, or the Python versions between client and server does not match, the source code of the wrapped Python function will be transmitted to the server, along with all the import statements for dependencies used by the function. In that case, the modules imported need to be installed on server in advance.
Example
>>> @gp.create_function ... def multiply(a: int, b: int) -> int: ... return a * b >>> db.assign(result=lambda: multiply(1, 2)) -------- result -------- 2 -------- (1 row)
- func.function(name, schema=None)
Get access to a predefined dataframe
NormalFunction
from database.- Parameters
name (str) – Name of the function.
schema (Optional[str]) – Schema (a.k.a namespace) of the function in database.
- Return type
- Returns
The
NormalFunction
with the specifiedname
andschema
.
Example
>>> generate_series = gp.function("generate_series") >>> db.apply(lambda: generate_series(0, 2)) ----------------- generate_series ----------------- 0 1 2 ----------------- (3 rows)