Grouping
Definitions for the result of grouping DataFrame.
- class group.DataFrameGroupingSet
Represent a set of groups of a
DataFrame.It can be created from:
group_by(), orDataFrameGroupingSet.union()of multiplegroup_by().
An
AggregateFunctioncan be applied to each of the groups inDataFrameGroupingSetto obtain a summary.- apply(func, expand=False, column_name=None)
Apply a dataframe function to each group of the
selfgrouping set.The arguemnts and the return type is the same as
apply().The differences between them are
apply()operates on the entireDataFrame, while this method operate on only one group.For
apply(), the resultingDataFramewill only contain the return value of the function, while for this method, the resultingDataFramewill contain the grouping attributes as columns.
Warning
An exception will be raised when the data of the resulting
DataFrameis observed if there is name conflict, possibly due toThe assigned column name in
column_nameorThe names of members in the composite type if
expendisTrue
conflict with the name of the grouping attributes.
Example
>>> rows = [(i, i % 2 == 0) for i in range(10)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even"]) >>> count = gp.aggregate_function("count") >>> results = numbers.group_by("is_even").apply(lambda row: count(row["*"])) >>> results.order_by("is_even")[:] ----------------- count | is_even -------+--------- 5 | 0 5 | 1 ----------------- (2 rows) >>> results = numbers.group_by("is_even").apply(lambda row: count(row["*"]), column_name='cnt') >>> results.order_by("is_even")[:] --------------- cnt | is_even -----+--------- 5 | 0 5 | 1 --------------- (2 rows) >>> class array_sum: ... sum: int ... count: int ... >>> @gp.create_column_function ... def my_count_sum(val_list: List[int]) -> array_sum: ... return {"sum": sum(val_list), "count": len(val_list)} ... >>> results = numbers.group_by("is_even").apply(lambda t: my_count_sum(t["val"]), expand=True) >>> results ----------------------- is_even | sum | count ---------+-----+------- 0 | 25 | 5 1 | 20 | 5 ----------------------- (2 rows)
- Parameters
func (Callable[[DataFrame], FunctionExpr]) –
expand (bool) –
column_name (Optional[str]) –
- Return type
- assign(**new_columns)
Assign new columns to the current grouping set.
NOTE: Existing columns cannot be reassigned.
- Parameters
new_columns (Callable[[DataFrame], Any]) – a
dictwhose keys are column names and values areCallable’s returning column data when applied to the currentDataFrameGroupingSet.- Returns
DataFramewith the new columns.- Return type
Example
>>> rows = [(i, i % 2 == 0) for i in range(10)] >>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even"]) >>> count = gp.aggregate_function("count") >>> sum = gp.aggregate_function("sum") >>> results = numbers.group_by("is_even").assign( ... count=lambda t: count(t["val"]), ... sum=lambda t: sum(t["val"])) >>> results.order_by("is_even")[:] ----------------------- is_even | count | sum ---------+-------+----- 0 | 5 | 25 1 | 5 | 20 ----------------------- (2 rows)
- union(other)
Union with another
DataFrameGroupingSet.When applying an aggregate function to the list, the function will be applied to each group in the grouping set individually.
- Parameters
other (Callable[[DataFrame], DataFrameGroupingSet]) – a
Callablereturning the result ofgroup_by()when applied to the currentDataFrame.- Returns
a new instance of
DataFrameGroupingSet.- Return type
Example
>>> rows = [(i, i % 2 == 0, i % 3 == 0) for i in range(6)] # 0, 1, 2, 3, 4, 5 >>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even", "is_multiple_of_3"]) >>> count = gp.aggregate_function("count") >>> results = ( ... numbers.group_by("is_even") ... .union(lambda t: t.group_by("is_multiple_of_3")) ... .assign(count=lambda t: count(t["val"])) ... ) >>> results.order_by("is_even")[:].order_by("count")[:] ------------------------------------ is_even | is_multiple_of_3 | count ---------+------------------+------- | 1 | 2 0 | | 3 1 | | 3 | 0 | 4 ------------------------------------ (4 rows)