Comparison with pandas
pandas is a popular Python library for data analysis and manipulation.
GreenplumPython strives to provide a pandas-like interface so that people can get started with it quickly.
[1]:
import pandas as pd
import greenplumpython as gp
gp
[1]:
<module 'greenplumpython' from '/home/gpadmin/.local/lib/python3.9/site-packages/greenplumpython/__init__.py'>
However, the two packages serve very different purposes, i.e.,
GreenplumPython is an interface to a remote database system, while
pandas is a library for manipulating local in-memory data.
There are still some important differences in their interface.
This document covers the similarities and the differences between GreenplumPython and pandas, as well as the rationales behind.
Data Structure
GreenplumPython’s core data structure DataFrame
is fundamentally similar to Dataframe
in pandas in that
Data are organized into rows and columns;
Columns can be of different types, and can be accessed by name;
Rows are of the same type, and are iterable.
Next, we will see similarities and differences between them in detail with examples.
Getting Access to the Structure
For example, suppose we have some information of students, including students’ names and ages:
[2]:
students = [("alice", 18), ("bob", 19), ("carol", 19)]
students
[2]:
[('alice', 18), ('bob', 19), ('carol', 19)]
For analyzing them, we might want to create a pandas DataFrame
as follows:
[3]:
pd_df = pd.DataFrame.from_records(students, columns=["name", "age"])
pd_df
[3]:
name | age | |
---|---|---|
0 | alice | 18 |
1 | bob | 19 |
2 | carol | 19 |
We can also create a DataFrame
in GreenplumPython from the same data in a very similar way:
[4]:
db = gp.database("postgresql://localhost/gpadmin")
gp_df = gp.DataFrame.from_rows(students, column_names=["name", "age"], db=db)
gp_df
[4]:
name | age |
---|---|
alice | 18 |
bob | 19 |
carol | 19 |
But here is an important difference:
a
DataFrame
in GreenplumPython must be created in a database, whilepandas does not have the concept of “database”.
Database
in GreenplumPython is like “directory” in file systems, which helps to avoid name conflict on persistence.
A GreenplumPython DataFrame
can be saved persistently in the database system as a table with
[5]:
gp_df.save_as("student", column_names=["name", "age"], temp=True)
[5]:
name | age |
---|---|
alice | 18 |
bob | 19 |
carol | 19 |
This is similar to how a pandas DataFrame
is persisted as file:
[6]:
pd_df.to_csv("/tmp/student.csv")
Saving a DataFrame
with temp=True
in GreenplumPython is similar to saving a DataFrame
into the /tmp
directory in pandas:
GreenplumPython’s DataFrames saved with
temp=True
will be dropped automatically by the database system when the database session is terminated, whilepandas’ DataFrames saved in
/tmp
will be clean automatically by the operating system.
In order to access the data of a table in database, we can create a DataFrame
from the table:
[7]:
student = db.create_dataframe(table_name="student")
student
[7]:
name | age |
---|---|
alice | 18 |
bob | 19 |
carol | 19 |
This is similar to loading a DataFrame
from file in pandas, that is,
[8]:
pd.read_csv("/tmp/student.csv")
[8]:
Unnamed: 0 | name | age | |
---|---|---|---|
0 | 0 | alice | 18 |
1 | 1 | bob | 19 |
2 | 2 | carol | 19 |
Accessing Data in Rows and Columns
In both GreenplumPython DataFrame
s and pandas DataFrame
s, rows can be accessed by iterating over the dataframe.
For example, in GreenplumPython:
[9]:
for row in gp_df:
print(row["name"], row["age"])
alice 18
bob 19
carol 19
This is similar to how rows in a pandas DataFrame
can be accessed:
[10]:
for row in pd_df.iterrows():
print(row[1]["name"], row[1]["age"])
alice 18
bob 19
carol 19
Similar to pandas, Row
in GreenplumPython is dict
-like. The value of each column can be accessed by name.
In conclusion, from the user’s perspective, DataFrame
in GreenplumPython is very similar to DataFrame
in pandas. We expect this would make it easier for whoever interested to get started using GreenplumPython.
Data Selection
Data selection is probably the most fundamental set of operations on data.
In both pandas and GreenplumPython, data selection is done primarily with the []
operator.
Selecting Columns
In both pandas and GreenplumPython, columns are accessed by name.
For example, to select a subset of columns, such as name
and age
, from the dataframe containing student info, in pandas we can do:
[11]:
pd_df[["name", "age"]]
[11]:
name | age | |
---|---|---|
0 | alice | 18 |
1 | bob | 19 |
2 | carol | 19 |
The result of the []
operator is a new pandas DataFrame
. In GreenplumPython, this is exactly the same:
[12]:
student[["name", "age"]]
[12]:
name | age |
---|---|
alice | 18 |
bob | 19 |
carol | 19 |
The result is a new GreenplumPython DataFrame
containing the selected columns.
Accessing a Single Column
To refer to a single column, we can use the []
operator with the column name in both pandas and GreenplumPython.
For example, to access the names of the students in pandas, we can do:
[13]:
pd_df["name"]
[13]:
0 alice
1 bob
2 carol
Name: name, dtype: object
While in GreenplumPython, the column can be refered to in the same way:
[14]:
gp_df["name"]
[14]:
<greenplumpython.col.Column at 0x7fa4e536f880>
But you might notice the difference here:
for pandas, using the
[]
operator gives us the data immediately if we refer to a column, whilefor GreenplumPython, it only gives a symbolic
Column
object.Column
is supposed to be used for computation rather than for observing data.
The reasons behind this difference are:
Database systems behind GreenplumPython does not provide native one-dimensional data structure like
Series
in pandas.It is much more efficient to retrieve all columns needed in a GreenplumPython
DataFrame
at once than one at a time.
We will see later how to add new columns to a GreenplumPython DataFrame
so that they can be retrieved all at once.
Selecting Rows by Predicates
The []
operator can also be used to select a subset of rows, a.k.a filtering.
Say we want the information of student named “alice”, with pandas we can do:
[15]:
pd_df[lambda df: df["name"] == "alice"]
[15]:
name | age | |
---|---|---|
0 | alice | 18 |
With GreenplumPython, we can do it in exactly the same way:
[16]:
student[lambda t: t["name"] == "alice"]
[16]:
name | age |
---|---|
alice | 18 |
Here we see how a column in GreenplumPython, t["name"]
in this case, is used for computation to form a more complex expression.
In this example, When the expression t["name"] == "alice"
is evaluated, t
will be bound to the current dataframe, i.e. student
.
GreenplumPython provides such a functional interface so that the user does not have to refer to the possibly long intermediate variable name like student
again and again when the expression becomes complicated.
Selecting Rows by Slices
We can get a quick glance of the data by selecting the first several rows. This can be achieved with slice
in Python.
Like many built-in data structures in Python, such as list
and tuple
, DataFrame
in GreenplumPython supports slicing.
For example, if we want only the first two rows of the DataFrame
of students in GreenplumPython, we can do:
[17]:
student[:2]
[17]:
name | age |
---|---|
alice | 18 |
bob | 19 |
In pandas, we can do exactly the same thing on a DataFrame
:
[18]:
pd_df[:2]
[18]:
name | age | |
---|---|---|
0 | alice | 18 |
1 | bob | 19 |
But you might notice the difference: When selecting rows,
for pandas, rows in the output
DataFrame
preserves the same order as the input, whilefor GreenplumPython, the order of rows in
DataFrame
might not be preserved.
The difference is due to the fact that database systems behind will not guarantee the order of rows unless otherwise specified.
Data Transformation
Data transformation is about changing the data to a desired form.
Like pandas, GreenplumPython provides powerful building blocks to make transformation easier.
Data Ordering
Having the data sorted in a desired order makes it convenient for many analytical tasks, such as statistics.
pandas supports sorting the data (a.k.a values) of a DataFrame
by columns.
For example, we can sort in pandas the dataframe of student info by “age” and then “name”, both in descending order with:
[19]:
pd_df.sort_values(["age", "name"], ascending=[False, False])
[19]:
name | age | |
---|---|---|
2 | carol | 19 |
1 | bob | 19 |
0 | alice | 18 |
In GreenplumPython, order of data can be defined with the order_by()
method:
[20]:
student.order_by("age", ascending=False).order_by("name", ascending=False)[:]
[20]:
name | age |
---|---|
carol | 19 |
bob | 19 |
alice | 18 |
There are some important difference compared with pandas:
GreenplumPython does not provide something like
DataFrame.sort_index()
in pandas, becauseDataFrame
in GreenplumPython does not have an “index column”.In GreenplumPython, slicing is requied after
order_by()
to get an orderedDataFrame
due to the limitations of relational database systems.
Column Transformation
Column transformation is to transform one or more existing columns into a new one of the same length.
A new column may contains data resulting from whatever computation we want.
Both GreeplumPython and pandas support transforming columns by adding a new column. Specifically, we need to:
define the transfomation as an expression, and then
bind the expression to a new column of the source
DataFrame
orDataFrame
to form a new one.
We can use assign()
method to add new columns in both packages. For example, suppose we would like to know the year of birth for each student in the previous example.
In pandas, we can add a new column named year_of_birth
like this:
[21]:
import datetime
this_year = datetime.date.today().year
pd_df.assign(year_of_birth=lambda df: -df["age"] + this_year)
[21]:
name | age | year_of_birth | |
---|---|---|---|
0 | alice | 18 | 2005 |
1 | bob | 19 | 2004 |
2 | carol | 19 | 2004 |
In GreenplymPython, we can do exactly the same:
[22]:
student.assign(year_of_birth=lambda t: -t["age"] + this_year)
[22]:
name | age | year_of_birth |
---|---|---|
bob | 19 | 2004 |
carol | 19 | 2004 |
alice | 18 | 2005 |
The column data can result from any expression, which can contain complex computations.
For example, in order to hide the names of students to protect privacy, we can write a function transforming names to something not human-readable:
[23]:
from hashlib import sha256
@gp.create_function
def hash_name(name: str) -> str:
return sha256(name.encode("utf-8")).hexdigest()
The gp.create_function
decorator converts a Python function into a User-Defined Function (UDF) in database so that it can be applied to Column
s.
With the function defined, we can then apply it to generate a new Column
:
[24]:
student.assign(name_=lambda t: hash_name(t["name"]))
[24]:
name | age | name_ |
---|---|---|
bob | 19 | 81b637d8fcd2c6da6359e6963113a1170de795e4b725b84d1e0b4cfd9ec58ce9 |
carol | 19 | 4c26d9074c27d89ede59270c0ac14b71e071b15239519f75474b2f3ba63481f5 |
alice | 18 | 2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90 |
After adding the new column, we can select the columns we care about into a new DataFrame
with the []
operator.
To be more concise, GreenplumPython and pandas support transforming columns directly into a new DataFrame
or DataFrame
by apply()
-ing the function.
In the previous example, using apply()
, we can obtain the GreenplumPython DataFrame
with the original names hidden:
[25]:
from dataclasses import dataclass, asdict
@dataclass
class Student:
name: str
age: int
def hide_name(name: str, age: int) -> Student:
return Student(name=sha256(name.encode("utf-8")).hexdigest(), age=age)
student.apply(lambda t: gp.create_function(hide_name)(t["name"], t["age"]), expand=True)
[25]:
name | age |
---|---|
2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90 | 18 |
81b637d8fcd2c6da6359e6963113a1170de795e4b725b84d1e0b4cfd9ec58ce9 | 19 |
4c26d9074c27d89ede59270c0ac14b71e071b15239519f75474b2f3ba63481f5 | 19 |
We can directly apply the same Python function without any modification to the DataFrame
in pandas:
[26]:
pd_df.apply(lambda df: asdict(hide_name(df["name"], df["age"])), axis=1, result_type="expand")
[26]:
name | age | |
---|---|---|
0 | 2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac... | 18 |
1 | 81b637d8fcd2c6da6359e6963113a1170de795e4b725b8... | 19 |
2 | 4c26d9074c27d89ede59270c0ac14b71e071b15239519f... | 19 |
But there are still some important differences between the two cases:
In pandas, what we apply to a
DataFrame
is a Python function, while in GreenplumPython, a Python function must be converted to a database function before being applied to aDataFrame
.pandas supports applying a function along different axes, while GreenplumPython only supports applying a function to each row due to the limitation of the database system behind.
Data Grouping
Like ordering, Grouping data based on distinct set of values of columns can also facilitate analytical tasks.
Data grouping is often associated with aggregate functions to obtain data summaries.
For example, suppose we want to the number of students of different age. In pandas, we can do:
[27]:
import numpy as np
pd_df.groupby("age").apply(lambda df: np.count_nonzero(df["name"]))
[27]:
age
18 1
19 2
dtype: int64
In GreenplumPython, what we need to do is:
[28]:
count = gp.aggregate_function("count")
student.group_by("age").apply(lambda t: count(t["name"]))
[28]:
count | age |
---|---|
2 | 19 |
1 | 18 |
Data Deduplication
Data deduplication is to return a new data structure containing only the distinct set of values in the selected columns.
This operation is well supported in GreenplumPython, also in a very similar way to pandas.
For example, suppose we want to draw a representative sample containing students for each distinct age,
With pandas, we can do:
[29]:
pd_df.drop_duplicates("age")
[29]:
name | age | |
---|---|---|
0 | alice | 18 |
1 | bob | 19 |
With GreenplumPython, what we need to do is:
[30]:
student.distinct_on("age")
[30]:
name | age |
---|---|
alice | 18 |
carol | 19 |
Moreover, GreenplumPython also supports aggregation on only the distinct values.
Suppose we want to know the number of different ages of the students, we can do:
[31]:
student.apply(lambda t: count.distinct(t["age"]))
[31]:
count |
---|
2 |
Joins
Joins are operations combining two data structures horizontally in a sensible way.
This makes it easier and more efficient to query one data structure based on the other.
For example, suppose we want to retrieve all pairs of sturents of the same age.
In pandas, we can join the DataFrame
with itself on the “age” column:
[32]:
pd_df.merge(pd_df, on="age", suffixes=("", "_2"))
[32]:
name | age | name_2 | |
---|---|---|---|
0 | alice | 18 | alice |
1 | bob | 19 | bob |
2 | bob | 19 | carol |
3 | carol | 19 | bob |
4 | carol | 19 | carol |
Similarily, in GreenplumPython, we can do:
[33]:
student.join(student, on="age", other_columns={"name": "name_2"})
[33]:
name | age | name_2 |
---|---|---|
bob | 19 | carol |
bob | 19 | bob |
carol | 19 | carol |
carol | 19 | bob |
alice | 18 | alice |
In terms of associative query, there is an important difference between GreenplumPython and pandas.
pandas allows querying one DataFrame
based on another without previously joining them.
For example, suppose we have two pandas DataFrame
of numbers, the operation below is legal even though not sensible:
[34]:
num_1 = pd.DataFrame({"val": [1, 3, 5, 7, 9]})
num_2 = pd.DataFrame({"val": [2, 4, 6, 8, 10]})
num_1[num_2["val"] % 2 == 0] # Even numbers?
[34]:
val | |
---|---|
0 | 1 |
1 | 3 |
2 | 5 |
3 | 7 |
4 | 9 |
To avoid such kind of misuse, in GreenplumPython, it is impossible to refer to other DataFrame
s except for the “current” one in an expression except when using the in_()
expression.
This is because GreenplumPython only accepts a Callable
as argument for expression and will bind it to the current DataFrame
automatically.