Comparison with pandas

pandas is a popular Python library for data analysis and manipulation.

GreenplumPython strives to provide a pandas-like interface so that people can get started with it quickly.

[1]:
import pandas as pd
import greenplumpython as gp

gp
[1]:
<module 'greenplumpython' from '/home/gpadmin/.local/lib/python3.9/site-packages/greenplumpython/__init__.py'>

However, the two packages serve very different purposes, i.e.,

  • GreenplumPython is an interface to a remote database system, while

  • pandas is a library for manipulating local in-memory data.

There are still some important differences in their interface.

This document covers the similarities and the differences between GreenplumPython and pandas, as well as the rationales behind.

Data Structure

GreenplumPython’s core data structure DataFrame is fundamentally similar to Dataframe in pandas in that

  • Data are organized into rows and columns;

  • Columns can be of different types, and can be accessed by name;

  • Rows are of the same type, and are iterable.

Next, we will see similarities and differences between them in detail with examples.

Getting Access to the Structure

For example, suppose we have some information of students, including students’ names and ages:

[2]:
students = [("alice", 18), ("bob", 19), ("carol", 19)]
students
[2]:
[('alice', 18), ('bob', 19), ('carol', 19)]

For analyzing them, we might want to create a pandas DataFrame as follows:

[3]:
pd_df = pd.DataFrame.from_records(students, columns=["name", "age"])
pd_df
[3]:
name age
0 alice 18
1 bob 19
2 carol 19

We can also create a DataFrame in GreenplumPython from the same data in a very similar way:

[4]:
db = gp.database("postgresql://localhost/gpadmin")
gp_df = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
gp_df
[4]:
name age
alice 18
bob 19
carol 19

But here is an important difference:

  • a DataFrame in GreenplumPython must be created in a database, while

  • pandas does not have the concept of “database”.

Database in GreenplumPython is like “directory” in file systems, which helps to avoid name conflict on persistence.

A GreenplumPython DataFrame can be saved persistently in the database system as a table with

[5]:
gp_df.save_as("student", column_names=["name", "age"], temp=True)
[5]:
name age
alice 18
bob 19
carol 19

This is similar to how a pandas DataFrame is persisted as file:

[6]:
pd_df.to_csv("/tmp/student.csv")

Saving a DataFrame with temp=True in GreenplumPython is similar to saving a DataFrame into the /tmp directory in pandas:

  • GreenplumPython’s DataFrames saved with temp=True will be dropped automatically by the database system when the database session is terminated, while

  • pandas’ DataFrames saved in /tmp will be clean automatically by the operating system.

In order to access the data of a table in database, we can create a DataFrame from the table:

[7]:
student = db.create_dataframe(table_name="student")
student
[7]:
name age
alice 18
bob 19
carol 19

This is similar to loading a DataFrame from file in pandas, that is,

[8]:
pd.read_csv("/tmp/student.csv")
[8]:
Unnamed: 0 name age
0 0 alice 18
1 1 bob 19
2 2 carol 19

Accessing Data in Rows and Columns

In both GreenplumPython DataFrames and pandas DataFrames, rows can be accessed by iterating over the dataframe.

For example, in GreenplumPython:

[9]:
for row in gp_df:
    print(row["name"], row["age"])
alice 18
bob 19
carol 19

This is similar to how rows in a pandas DataFrame can be accessed:

[10]:
for row in pd_df.iterrows():
    print(row[1]["name"], row[1]["age"])
alice 18
bob 19
carol 19

Similar to pandas, Row in GreenplumPython is dict-like. The value of each column can be accessed by name.

In conclusion, from the user’s perspective, DataFrame in GreenplumPython is very similar to DataFrame in pandas. We expect this would make it easier for whoever interested to get started using GreenplumPython.

Data Selection

Data selection is probably the most fundamental set of operations on data.

In both pandas and GreenplumPython, data selection is done primarily with the [] operator.

Selecting Columns

In both pandas and GreenplumPython, columns are accessed by name.

For example, to select a subset of columns, such as name and age, from the dataframe containing student info, in pandas we can do:

[11]:
pd_df[["name", "age"]]
[11]:
name age
0 alice 18
1 bob 19
2 carol 19

The result of the [] operator is a new pandas DataFrame. In GreenplumPython, this is exactly the same:

[12]:
student[["name", "age"]]
[12]:
name age
alice 18
bob 19
carol 19

The result is a new GreenplumPython DataFrame containing the selected columns.

Accessing a Single Column

To refer to a single column, we can use the [] operator with the column name in both pandas and GreenplumPython.

For example, to access the names of the students in pandas, we can do:

[13]:
pd_df["name"]
[13]:
0    alice
1      bob
2    carol
Name: name, dtype: object

While in GreenplumPython, the column can be refered to in the same way:

[14]:
gp_df["name"]
[14]:
<greenplumpython.col.Column at 0x7fa4e536f880>

But you might notice the difference here:

  • for pandas, using the [] operator gives us the data immediately if we refer to a column, while

  • for GreenplumPython, it only gives a symbolic Column object. Column is supposed to be used for computation rather than for observing data.

The reasons behind this difference are:

  • Database systems behind GreenplumPython does not provide native one-dimensional data structure like Series in pandas.

  • It is much more efficient to retrieve all columns needed in a GreenplumPython DataFrame at once than one at a time.

We will see later how to add new columns to a GreenplumPython DataFrame so that they can be retrieved all at once.

Selecting Rows by Predicates

The [] operator can also be used to select a subset of rows, a.k.a filtering.

Say we want the information of student named “alice”, with pandas we can do:

[15]:
pd_df[lambda df: df["name"] == "alice"]
[15]:
name age
0 alice 18

With GreenplumPython, we can do it in exactly the same way:

[16]:
student[lambda t: t["name"] == "alice"]
[16]:
name age
alice 18

Here we see how a column in GreenplumPython, t["name"] in this case, is used for computation to form a more complex expression.

In this example, When the expression t["name"] == "alice" is evaluated, t will be bound to the current dataframe, i.e. student.

GreenplumPython provides such a functional interface so that the user does not have to refer to the possibly long intermediate variable name like student again and again when the expression becomes complicated.

Selecting Rows by Slices

We can get a quick glance of the data by selecting the first several rows. This can be achieved with slice in Python.

Like many built-in data structures in Python, such as list and tuple, DataFrame in GreenplumPython supports slicing.

For example, if we want only the first two rows of the DataFrame of students in GreenplumPython, we can do:

[17]:
student[:2]
[17]:
name age
alice 18
bob 19

In pandas, we can do exactly the same thing on a DataFrame:

[18]:
pd_df[:2]
[18]:
name age
0 alice 18
1 bob 19

But you might notice the difference: When selecting rows,

  • for pandas, rows in the output DataFrame preserves the same order as the input, while

  • for GreenplumPython, the order of rows in DataFrame might not be preserved.

The difference is due to the fact that database systems behind will not guarantee the order of rows unless otherwise specified.

Data Transformation

Data transformation is about changing the data to a desired form.

Like pandas, GreenplumPython provides powerful building blocks to make transformation easier.

Data Ordering

Having the data sorted in a desired order makes it convenient for many analytical tasks, such as statistics.

pandas supports sorting the data (a.k.a values) of a DataFrame by columns.

For example, we can sort in pandas the dataframe of student info by “age” and then “name”, both in descending order with:

[19]:
pd_df.sort_values(["age", "name"], ascending=[False, False])
[19]:
name age
2 carol 19
1 bob 19
0 alice 18

In GreenplumPython, order of data can be defined with the order_by() method:

[20]:
student.order_by("age", ascending=False).order_by("name", ascending=False)[:]
[20]:
name age
carol 19
bob 19
alice 18

There are some important difference compared with pandas:

  • GreenplumPython does not provide something like DataFrame.sort_index() in pandas, because DataFrame in GreenplumPython does not have an “index column”.

  • In GreenplumPython, slicing is requied after order_by() to get an ordered DataFrame due to the limitations of relational database systems.

Column Transformation

Column transformation is to transform one or more existing columns into a new one of the same length.

A new column may contains data resulting from whatever computation we want.

Both GreeplumPython and pandas support transforming columns by adding a new column. Specifically, we need to:

  • define the transfomation as an expression, and then

  • bind the expression to a new column of the source DataFrame or DataFrame to form a new one.

We can use assign() method to add new columns in both packages. For example, suppose we would like to know the year of birth for each student in the previous example.

In pandas, we can add a new column named year_of_birth like this:

[21]:
import datetime

this_year = datetime.date.today().year
pd_df.assign(year_of_birth=lambda df: -df["age"] + this_year)
[21]:
name age year_of_birth
0 alice 18 2005
1 bob 19 2004
2 carol 19 2004

In GreenplymPython, we can do exactly the same:

[22]:
student.assign(year_of_birth=lambda t: -t["age"] + this_year)
[22]:
name age year_of_birth
bob 19 2004
carol 19 2004
alice 18 2005

The column data can result from any expression, which can contain complex computations.

For example, in order to hide the names of students to protect privacy, we can write a function transforming names to something not human-readable:

[23]:
from hashlib import sha256


@gp.create_function
def hash_name(name: str) -> str:
    return sha256(name.encode("utf-8")).hexdigest()

The gp.create_function decorator converts a Python function into a User-Defined Function (UDF) in database so that it can be applied to Columns.

With the function defined, we can then apply it to generate a new Column:

[24]:
student.assign(name_=lambda t: hash_name(t["name"]))
[24]:
name age name_
bob 19 81b637d8fcd2c6da6359e6963113a1170de795e4b725b84d1e0b4cfd9ec58ce9
carol 19 4c26d9074c27d89ede59270c0ac14b71e071b15239519f75474b2f3ba63481f5
alice 18 2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90

After adding the new column, we can select the columns we care about into a new DataFrame with the [] operator.

To be more concise, GreenplumPython and pandas support transforming columns directly into a new DataFrame or DataFrame by apply()-ing the function.

In the previous example, using apply(), we can obtain the GreenplumPython DataFrame with the original names hidden:

[25]:
from dataclasses import dataclass, asdict


@dataclass
class Student:
    name: str
    age: int


def hide_name(name: str, age: int) -> Student:
    return Student(name=sha256(name.encode("utf-8")).hexdigest(), age=age)


student.apply(lambda t: gp.create_function(hide_name)(t["name"], t["age"]), expand=True)
[25]:
name age
2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90 18
81b637d8fcd2c6da6359e6963113a1170de795e4b725b84d1e0b4cfd9ec58ce9 19
4c26d9074c27d89ede59270c0ac14b71e071b15239519f75474b2f3ba63481f5 19

We can directly apply the same Python function without any modification to the DataFrame in pandas:

[26]:
pd_df.apply(lambda df: asdict(hide_name(df["name"], df["age"])), axis=1, result_type="expand")
[26]:
name age
0 2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac... 18
1 81b637d8fcd2c6da6359e6963113a1170de795e4b725b8... 19
2 4c26d9074c27d89ede59270c0ac14b71e071b15239519f... 19

But there are still some important differences between the two cases:

  • In pandas, what we apply to a DataFrame is a Python function, while in GreenplumPython, a Python function must be converted to a database function before being applied to a DataFrame.

  • pandas supports applying a function along different axes, while GreenplumPython only supports applying a function to each row due to the limitation of the database system behind.

Data Grouping

Like ordering, Grouping data based on distinct set of values of columns can also facilitate analytical tasks.

Data grouping is often associated with aggregate functions to obtain data summaries.

For example, suppose we want to the number of students of different age. In pandas, we can do:

[27]:
import numpy as np

pd_df.groupby("age").apply(lambda df: np.count_nonzero(df["name"]))
[27]:
age
18    1
19    2
dtype: int64

In GreenplumPython, what we need to do is:

[28]:
count = gp.aggregate_function("count")

student.group_by("age").apply(lambda t: count(t["name"]))
[28]:
count age
2 19
1 18

Data Deduplication

Data deduplication is to return a new data structure containing only the distinct set of values in the selected columns.

This operation is well supported in GreenplumPython, also in a very similar way to pandas.

For example, suppose we want to draw a representative sample containing students for each distinct age,

With pandas, we can do:

[29]:
pd_df.drop_duplicates("age")
[29]:
name age
0 alice 18
1 bob 19

With GreenplumPython, what we need to do is:

[30]:
student.distinct_on("age")
[30]:
name age
alice 18
carol 19

Moreover, GreenplumPython also supports aggregation on only the distinct values.

Suppose we want to know the number of different ages of the students, we can do:

[31]:
student.apply(lambda t: count.distinct(t["age"]))
[31]:
count
2

Joins

Joins are operations combining two data structures horizontally in a sensible way.

This makes it easier and more efficient to query one data structure based on the other.

For example, suppose we want to retrieve all pairs of sturents of the same age.

In pandas, we can join the DataFrame with itself on the “age” column:

[32]:
pd_df.merge(pd_df, on="age", suffixes=("", "_2"))
[32]:
name age name_2
0 alice 18 alice
1 bob 19 bob
2 bob 19 carol
3 carol 19 bob
4 carol 19 carol

Similarily, in GreenplumPython, we can do:

[33]:
student.join(student, on="age", other_columns={"name": "name_2"})
[33]:
name age name_2
bob 19 carol
bob 19 bob
carol 19 carol
carol 19 bob
alice 18 alice

In terms of associative query, there is an important difference between GreenplumPython and pandas.

pandas allows querying one DataFrame based on another without previously joining them.

For example, suppose we have two pandas DataFrame of numbers, the operation below is legal even though not sensible:

[34]:
num_1 = pd.DataFrame({"val": [1, 3, 5, 7, 9]})
num_2 = pd.DataFrame({"val": [2, 4, 6, 8, 10]})

num_1[num_2["val"] % 2 == 0]  # Even numbers?
[34]:
val
0 1
1 3
2 5
3 7
4 9

To avoid such kind of misuse, in GreenplumPython, it is impossible to refer to other DataFrames except for the “current” one in an expression except when using the in_() expression.

This is because GreenplumPython only accepts a Callable as argument for expression and will bind it to the current DataFrame automatically.