Grouping
Definitions for the result of grouping DataFrame
.
- class group.DataFrameGroupingSet
Represent a set of groups of a
DataFrame
.It can be created from:
group_by()
, orDataFrameGroupingSet.union()
of multiplegroup_by()
.
An
AggregateFunction
can be applied to each of the groups inDataFrameGroupingSet
to obtain a summary.- apply(func, expand=False, column_name=None)
Apply a dataframe function to each group of the
self
grouping set.The arguemnts and the return type is the same as
apply()
.The differences between them are
apply()
operates on the entireDataFrame
, while this method operate on only one group.For
apply()
, the resultingDataFrame
will only contain the return value of the function, while for this method, the resultingDataFrame
will contain the grouping attributes as columns.
Warning
An exception will be raised when the data of the resulting
DataFrame
is observed if there is name conflict, possibly due toThe assigned column name in
column_name
orThe names of members in the composite type if
expend
isTrue
conflict with the name of the grouping attributes.
Example
>>> rows = [(i, i % 2 == 0) for i in range(10)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even"]) >>> count = gp.aggregate_function("count") >>> results = numbers.group_by("is_even").apply(lambda row: count(row["*"])) >>> results.order_by("is_even")[:] ----------------- count | is_even -------+--------- 5 | 0 5 | 1 ----------------- (2 rows) >>> results = numbers.group_by("is_even").apply(lambda row: count(row["*"]), column_name='cnt') >>> results.order_by("is_even")[:] --------------- cnt | is_even -----+--------- 5 | 0 5 | 1 --------------- (2 rows) >>> class array_sum: ... sum: int ... count: int ... >>> @gp.create_column_function ... def my_count_sum(val_list: List[int]) -> array_sum: ... return {"sum": sum(val_list), "count": len(val_list)} ... >>> results = numbers.group_by("is_even").apply(lambda t: my_count_sum(t["val"]), expand=True) >>> results ----------------------- is_even | sum | count ---------+-----+------- 0 | 25 | 5 1 | 20 | 5 ----------------------- (2 rows)
- Parameters
func (Callable[[DataFrame], FunctionExpr]) –
expand (bool) –
column_name (Optional[str]) –
- Return type
- assign(**new_columns)
Assign new columns to the current grouping set.
NOTE: Existing columns cannot be reassigned.
- Parameters
new_columns (Callable[[DataFrame], Any]) – a
dict
whose keys are column names and values areCallable
’s returning column data when applied to the currentDataFrameGroupingSet
.- Returns
DataFrame
with the new columns.- Return type
Example
>>> rows = [(i, i % 2 == 0) for i in range(10)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even"]) >>> count = gp.aggregate_function("count") >>> sum = gp.aggregate_function("sum") >>> results = numbers.group_by("is_even").assign( ... count=lambda t: count(t["val"]), ... sum=lambda t: sum(t["val"])) >>> results.order_by("is_even")[:] ----------------------- is_even | count | sum ---------+-------+----- 0 | 5 | 25 1 | 5 | 20 ----------------------- (2 rows)
- union(other)
Union with another
DataFrameGroupingSet
.When applying an aggregate function to the list, the function will be applied to each group in the grouping set individually.
- Parameters
other (Callable[[DataFrame], DataFrameGroupingSet]) – a
Callable
returning the result ofgroup_by()
when applied to the currentDataFrame
.- Returns
a new instance of
DataFrameGroupingSet
.- Return type
Example
>>> rows = [(i, i % 2 == 0, i % 3 == 0) for i in range(6)] # 0, 1, 2, 3, 4, 5 >>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even", "is_multiple_of_3"]) >>> count = gp.aggregate_function("count") >>> results = ( ... numbers.group_by("is_even") ... .union(lambda t: t.group_by("is_multiple_of_3")) ... .assign(count=lambda t: count(t["val"])) ... ) >>> results.order_by("is_even")[:].order_by("count")[:] ------------------------------------ is_even | is_multiple_of_3 | count ---------+------------------+------- | 1 | 2 0 | | 3 1 | | 3 | 0 | 4 ------------------------------------ (4 rows)