Comparison with pandas

pandas is a popular Python library for data analysis and manipulation.

GreenplumPython strives to provide a pandas-like interface so that people can get started with it quickly.

[1]:

import pandas as pd
import greenplumpython as gp

gp

[1]:

<module 'greenplumpython' from '/home/gpadmin/.local/lib/python3.9/site-packages/greenplumpython/__init__.py'>

However, the two packages serve very different purposes, i.e.,

GreenplumPython is an interface to a remote database system, while
pandas is a library for manipulating local in-memory data.

There are still some important differences in their interface.

This document covers the similarities and the differences between GreenplumPython and pandas, as well as the rationales behind.

Data Structure

GreenplumPython’s core data structure DataFrame is fundamentally similar to Dataframe in pandas in that

Data are organized into rows and columns;
Columns can be of different types, and can be accessed by name;
Rows are of the same type, and are iterable.

Next, we will see similarities and differences between them in detail with examples.

Getting Access to the Structure

For example, suppose we have some information of students, including students’ names and ages:

[2]:

students = [("alice", 18), ("bob", 19), ("carol", 19)]
students

[2]:

[('alice', 18), ('bob', 19), ('carol', 19)]

For analyzing them, we might want to create a pandas DataFrame as follows:

[3]:

pd_df = pd.DataFrame.from_records(students, columns=["name", "age"])
pd_df

[3]:

	name	age
0	alice	18
1	bob	19
2	carol	19

We can also create a DataFrame in GreenplumPython from the same data in a very similar way:

[4]:

db = gp.database("postgresql://localhost/gpadmin")
gp_df = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
gp_df

[4]:

name	age
alice	18
bob	19
carol	19

But here is an important difference:

a DataFrame in GreenplumPython must be created in a database, while
pandas does not have the concept of “database”.

Database in GreenplumPython is like “directory” in file systems, which helps to avoid name conflict on persistence.

A GreenplumPython DataFrame can be saved persistently in the database system as a table with

[5]:

gp_df.save_as("student", column_names=["name", "age"], temp=True)

[5]:

name	age
alice	18
bob	19
carol	19

This is similar to how a pandas DataFrame is persisted as file:

[6]:

pd_df.to_csv("/tmp/student.csv")

Saving a DataFrame with temp=True in GreenplumPython is similar to saving a DataFrame into the /tmp directory in pandas:

GreenplumPython’s DataFrames saved with temp=True will be dropped automatically by the database system when the database session is terminated, while
pandas’ DataFrames saved in /tmp will be clean automatically by the operating system.

In order to access the data of a table in database, we can create a DataFrame from the table:

[7]:

student = db.create_dataframe(table_name="student")
student

[7]:

name	age
alice	18
bob	19
carol	19

This is similar to loading a DataFrame from file in pandas, that is,

[8]:

pd.read_csv("/tmp/student.csv")

[8]:

	Unnamed: 0	name	age
0	0	alice	18
1	1	bob	19
2	2	carol	19

Accessing Data in Rows and Columns

In both GreenplumPython DataFrames and pandas DataFrames, rows can be accessed by iterating over the dataframe.

For example, in GreenplumPython:

[9]:

for row in gp_df:
    print(row["name"], row["age"])

alice 18
bob 19
carol 19

This is similar to how rows in a pandas DataFrame can be accessed:

[10]:

for row in pd_df.iterrows():
    print(row[1]["name"], row[1]["age"])

alice 18
bob 19
carol 19

Similar to pandas, Row in GreenplumPython is dict-like. The value of each column can be accessed by name.

In conclusion, from the user’s perspective, DataFrame in GreenplumPython is very similar to DataFrame in pandas. We expect this would make it easier for whoever interested to get started using GreenplumPython.

Data Selection

Data selection is probably the most fundamental set of operations on data.

In both pandas and GreenplumPython, data selection is done primarily with the [] operator.

Selecting Columns

In both pandas and GreenplumPython, columns are accessed by name.

For example, to select a subset of columns, such as name and age, from the dataframe containing student info, in pandas we can do:

[11]:

pd_df[["name", "age"]]

[11]:

	name	age
0	alice	18
1	bob	19
2	carol	19

The result of the [] operator is a new pandas DataFrame. In GreenplumPython, this is exactly the same:

[12]:

student[["name", "age"]]

[12]:

name	age
alice	18
bob	19
carol	19

The result is a new GreenplumPython DataFrame containing the selected columns.

Accessing a Single Column

To refer to a single column, we can use the [] operator with the column name in both pandas and GreenplumPython.

For example, to access the names of the students in pandas, we can do:

[13]:

pd_df["name"]

[13]:

0    alice
1      bob
2    carol
Name: name, dtype: object

While in GreenplumPython, the column can be refered to in the same way:

[14]:

gp_df["name"]

[14]:

<greenplumpython.col.Column at 0x7fa4e536f880>

But you might notice the difference here:

for pandas, using the [] operator gives us the data immediately if we refer to a column, while
for GreenplumPython, it only gives a symbolic Column object. Column is supposed to be used for computation rather than for observing data.

The reasons behind this difference are:

Database systems behind GreenplumPython does not provide native one-dimensional data structure like Series in pandas.
It is much more efficient to retrieve all columns needed in a GreenplumPython DataFrame at once than one at a time.

We will see later how to add new columns to a GreenplumPython DataFrame so that they can be retrieved all at once.

Selecting Rows by Predicates

The [] operator can also be used to select a subset of rows, a.k.a filtering.

Say we want the information of student named “alice”, with pandas we can do:

[15]:

pd_df[lambda df: df["name"] == "alice"]

[15]:

	name	age
0	alice	18

With GreenplumPython, we can do it in exactly the same way:

[16]:

student[lambda t: t["name"] == "alice"]

[16]:

name	age
alice	18

Here we see how a column in GreenplumPython, t["name"] in this case, is used for computation to form a more complex expression.

In this example, When the expression t["name"] == "alice" is evaluated, t will be bound to the current dataframe, i.e. student.

GreenplumPython provides such a functional interface so that the user does not have to refer to the possibly long intermediate variable name like student again and again when the expression becomes complicated.

Selecting Rows by Slices

We can get a quick glance of the data by selecting the first several rows. This can be achieved with slice in Python.

Like many built-in data structures in Python, such as list and tuple, DataFrame in GreenplumPython supports slicing.

For example, if we want only the first two rows of the DataFrame of students in GreenplumPython, we can do:

[17]:

student[:2]

[17]:

name	age
alice	18
bob	19

In pandas, we can do exactly the same thing on a DataFrame:

[18]:

pd_df[:2]

[18]:

	name	age
0	alice	18
1	bob	19

But you might notice the difference: When selecting rows,

for pandas, rows in the output DataFrame preserves the same order as the input, while
for GreenplumPython, the order of rows in DataFrame might not be preserved.

The difference is due to the fact that database systems behind will not guarantee the order of rows unless otherwise specified.

Data Transformation

Data transformation is about changing the data to a desired form.

Like pandas, GreenplumPython provides powerful building blocks to make transformation easier.

Data Ordering

Having the data sorted in a desired order makes it convenient for many analytical tasks, such as statistics.

pandas supports sorting the data (a.k.a values) of a DataFrame by columns.

For example, we can sort in pandas the dataframe of student info by “age” and then “name”, both in descending order with:

[19]:

pd_df.sort_values(["age", "name"], ascending=[False, False])

[19]:

	name	age
2	carol	19
1	bob	19
0	alice	18

In GreenplumPython, order of data can be defined with the order_by() method:

[20]:

student.order_by("age", ascending=False).order_by("name", ascending=False)[:]

[20]:

name	age
carol	19
bob	19
alice	18

There are some important difference compared with pandas:

GreenplumPython does not provide something like DataFrame.sort_index() in pandas, because DataFrame in GreenplumPython does not have an “index column”.
In GreenplumPython, slicing is requied after order_by() to get an ordered DataFrame due to the limitations of relational database systems.

Column Transformation

Column transformation is to transform one or more existing columns into a new one of the same length.

A new column may contains data resulting from whatever computation we want.

Both GreeplumPython and pandas support transforming columns by adding a new column. Specifically, we need to:

define the transfomation as an expression, and then
bind the expression to a new column of the source DataFrame or DataFrame to form a new one.

We can use assign() method to add new columns in both packages. For example, suppose we would like to know the year of birth for each student in the previous example.

In pandas, we can add a new column named year_of_birth like this:

[21]:

import datetime

this_year = datetime.date.today().year
pd_df.assign(year_of_birth=lambda df: -df["age"] + this_year)

[21]:

	name	age	year_of_birth
0	alice	18	2005
1	bob	19	2004
2	carol	19	2004

In GreenplymPython, we can do exactly the same:

[22]:

student.assign(year_of_birth=lambda t: -t["age"] + this_year)

[22]:

name	age	year_of_birth
bob	19	2004
carol	19	2004
alice	18	2005

The column data can result from any expression, which can contain complex computations.

For example, in order to hide the names of students to protect privacy, we can write a function transforming names to something not human-readable:

[23]:

from hashlib import sha256


@gp.create_function
def hash_name(name: str) -> str:
    return sha256(name.encode("utf-8")).hexdigest()

The gp.create_function decorator converts a Python function into a User-Defined Function (UDF) in database so that it can be applied to Columns.

With the function defined, we can then apply it to generate a new Column:

[24]:

student.assign(name_=lambda t: hash_name(t["name"]))

[24]:

name	age	name_
bob	19	81b637d8fcd2c6da6359e6963113a1170de795e4b725b84d1e0b4cfd9ec58ce9
carol	19	4c26d9074c27d89ede59270c0ac14b71e071b15239519f75474b2f3ba63481f5
alice	18	2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90

After adding the new column, we can select the columns we care about into a new DataFrame with the [] operator.

To be more concise, GreenplumPython and pandas support transforming columns directly into a new DataFrame or DataFrame by apply()-ing the function.

In the previous example, using apply(), we can obtain the GreenplumPython DataFrame with the original names hidden:

[25]:

from dataclasses import dataclass, asdict

@dataclass
class Student:
    name: str
    age: int

def hide_name(name: str, age: int) -> Student:
    return Student(name=sha256(name.encode("utf-8")).hexdigest(), age=age)

student.apply(lambda t: gp.create_function(hide_name)(t["name"], t["age"]), expand=True)

[25]:

name	age
2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90	18
81b637d8fcd2c6da6359e6963113a1170de795e4b725b84d1e0b4cfd9ec58ce9	19
4c26d9074c27d89ede59270c0ac14b71e071b15239519f75474b2f3ba63481f5	19

We can directly apply the same Python function without any modification to the DataFrame in pandas:

[26]:

pd_df.apply(lambda df: asdict(hide_name(df["name"], df["age"])), axis=1, result_type="expand")

[26]:

	name	age
0	2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac...	18
1	81b637d8fcd2c6da6359e6963113a1170de795e4b725b8...	19
2	4c26d9074c27d89ede59270c0ac14b71e071b15239519f...	19

But there are still some important differences between the two cases:

In pandas, what we apply to a DataFrame is a Python function, while in GreenplumPython, a Python function must be converted to a database function before being applied to a DataFrame.
pandas supports applying a function along different axes, while GreenplumPython only supports applying a function to each row due to the limitation of the database system behind.

Data Grouping

Like ordering, Grouping data based on distinct set of values of columns can also facilitate analytical tasks.

Data grouping is often associated with aggregate functions to obtain data summaries.

For example, suppose we want to the number of students of different age. In pandas, we can do:

[27]:

import numpy as np

pd_df.groupby("age").apply(lambda df: np.count_nonzero(df["name"]))

[27]:

age
18    1
19    2
dtype: int64

In GreenplumPython, what we need to do is:

[28]:

count = gp.aggregate_function("count")

student.group_by("age").apply(lambda t: count(t["name"]))

[28]:

count	age
2	19
1	18

Data Deduplication

Data deduplication is to return a new data structure containing only the distinct set of values in the selected columns.

This operation is well supported in GreenplumPython, also in a very similar way to pandas.

For example, suppose we want to draw a representative sample containing students for each distinct age,

With pandas, we can do:

[29]:

pd_df.drop_duplicates("age")

[29]:

	name	age
0	alice	18
1	bob	19

With GreenplumPython, what we need to do is:

[30]:

student.distinct_on("age")

[30]:

name	age
alice	18
carol	19

Moreover, GreenplumPython also supports aggregation on only the distinct values.

Suppose we want to know the number of different ages of the students, we can do:

[31]:

student.apply(lambda t: count.distinct(t["age"]))

[31]:

count
2

Joins

Joins are operations combining two data structures horizontally in a sensible way.

This makes it easier and more efficient to query one data structure based on the other.

For example, suppose we want to retrieve all pairs of sturents of the same age.

In pandas, we can join the DataFrame with itself on the “age” column:

[32]:

pd_df.merge(pd_df, on="age", suffixes=("", "_2"))

[32]:

	name	age	name_2
0	alice	18	alice
1	bob	19	bob
2	bob	19	carol
3	carol	19	bob
4	carol	19	carol

Similarily, in GreenplumPython, we can do:

[33]:

student.join(student, on="age", other_columns={"name": "name_2"})

[33]:

name	age	name_2
bob	19	carol
bob	19	bob
carol	19	carol
carol	19	bob
alice	18	alice

In terms of associative query, there is an important difference between GreenplumPython and pandas.

pandas allows querying one DataFrame based on another without previously joining them.

For example, suppose we have two pandas DataFrame of numbers, the operation below is legal even though not sensible:

[34]:

num_1 = pd.DataFrame({"val": [1, 3, 5, 7, 9]})
num_2 = pd.DataFrame({"val": [2, 4, 6, 8, 10]})

num_1[num_2["val"] % 2 == 0]  # Even numbers?

[34]:

	val
0	1
1	3
2	5
3	7
4	9

To avoid such kind of misuse, in GreenplumPython, it is impossible to refer to other DataFrames except for the “current” one in an expression except when using the in_() expression.

This is because GreenplumPython only accepts a Callable as argument for expression and will bind it to the current DataFrame automatically.