Grouping

Definitions for the result of grouping DataFrame.

class group.DataFrameGroupingSet

Represent a set of groups of a DataFrame.

It can be created from:

An AggregateFunction can be applied to each of the groups in DataFrameGroupingSet to obtain a summary.

apply(func, expand=False, column_name=None)

Apply a dataframe function to each group of the self grouping set.

The arguemnts and the return type is the same as apply().

The differences between them are

  • apply() operates on the entire DataFrame, while this method operate on only one group.

  • For apply(), the resulting DataFrame will only contain the return value of the function, while for this method, the resulting DataFrame will contain the grouping attributes as columns.

Warning

An exception will be raised when the data of the resulting DataFrame is observed if there is name conflict, possibly due to

  • The assigned column name in column_name or

  • The names of members in the composite type if expend is True

conflict with the name of the grouping attributes.

Example

>>> rows = [(i, i % 2 == 0) for i in range(10)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even"])
>>> count = gp.aggregate_function("count")
>>> results = numbers.group_by("is_even").apply(lambda row: count(row["*"]))
>>> results.order_by("is_even")[:]
-----------------
 count | is_even
-------+---------
     5 |       0
     5 |       1
-----------------
(2 rows)

>>> results = numbers.group_by("is_even").apply(lambda row: count(row["*"]), column_name='cnt')
>>> results.order_by("is_even")[:]
---------------
 cnt | is_even
-----+---------
   5 |       0
   5 |       1
---------------
(2 rows)

>>> class array_sum:
...     sum: int
...     count: int
...
>>> @gp.create_column_function
... def my_count_sum(val_list: List[int]) -> array_sum:
...     return {"sum": sum(val_list), "count": len(val_list)}
...
>>> results = numbers.group_by("is_even").apply(lambda t: my_count_sum(t["val"]), expand=True)
>>> results
-----------------------
 is_even | sum | count
---------+-----+-------
       0 |  25 |     5
       1 |  20 |     5
-----------------------
(2 rows)
Parameters
Return type

DataFrame

assign(**new_columns)

Assign new columns to the current grouping set.

NOTE: Existing columns cannot be reassigned.

Parameters

new_columns (Callable[[DataFrame], Any]) – a dict whose keys are column names and values are Callable’s returning column data when applied to the current DataFrameGroupingSet.

Returns

DataFrame with the new columns.

Return type

DataFrame

Example

>>> rows = [(i, i % 2 == 0) for i in range(10)]
>>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even"])
>>> count = gp.aggregate_function("count")
>>> sum = gp.aggregate_function("sum")
>>> results = numbers.group_by("is_even").assign(
...     count=lambda t: count(t["val"]),
...     sum=lambda t: sum(t["val"]))
>>> results.order_by("is_even")[:]
-----------------------
 is_even | count | sum
---------+-------+-----
       0 |     5 |  25
       1 |     5 |  20
-----------------------
(2 rows)
union(other)

Union with another DataFrameGroupingSet.

When applying an aggregate function to the list, the function will be applied to each group in the grouping set individually.

Parameters

other (Callable[[DataFrame], DataFrameGroupingSet]) – a Callable returning the result of group_by() when applied to the current DataFrame.

Returns

a new instance of DataFrameGroupingSet.

Return type

DataFrameGroupingSet

Example

>>> rows = [(i, i % 2 == 0, i % 3 == 0) for i in range(6)]  # 0, 1, 2, 3, 4, 5
>>> numbers = db.create_dataframe(rows=rows, column_names=["val", "is_even", "is_multiple_of_3"])
>>> count = gp.aggregate_function("count")
>>> results = (
...     numbers.group_by("is_even")
...     .union(lambda t: t.group_by("is_multiple_of_3"))
...     .assign(count=lambda t: count(t["val"]))
... )
>>> results.order_by("is_even")[:].order_by("count")[:]
------------------------------------
 is_even | is_multiple_of_3 | count
---------+------------------+-------
         |                1 |     2
       0 |                  |     3
       1 |                  |     3
         |                0 |     4
------------------------------------
(4 rows)