Comparison with SQL

Prerequisites

To run this tutorial, we need

  • A database that is authorized to access, and

  • A table named demo as created with the following SQL command:

    CREATE TABLE demo AS
    SELECT n AS i, n AS j, n AS k
    FROM generate_series(0,9) AS n;
    

To create this table, if in a shell environment, psql can be used.

Or, inside a Jupyter Notebook, the SQL command can be executed directly in cells with ipython-sql, as shown below.

First we need to connect to the database (gpadmin in our example) specified with the URI using the %sql magic:

[1]:
%load_ext sql
%sql postgresql://localhost/gpadmin

Authentication methods and credentials can be specified in the URI. Please refer to the libpq document for detailed usage.

[2]:
%%sql

DROP TABLE IF EXISTS demo;

CREATE TABLE demo AS
SELECT n AS i, n AS j, n AS k
FROM generate_series(0,9) AS n;
 * postgresql://localhost/gpadmin
Done.
10 rows affected.
[2]:
[]

With the table created successfully, we are now good to go!

Getting Access to Database

To get access to the database we want:

[3]:
import greenplumpython as gp

db = gp.database(uri="postgresql://localhost/gpadmin")

Here, the uri follows the same specification in the libpq document as above.

As another example, if password is required, the uri might look like postgresql://user:password@hostname/dbname.

Accessing a DataFrame in the Database

After selecting the database, we can access a dataframe in the database by specifying its name:

[4]:
t = db.create_dataframe(table_name="demo")
t
[4]:
i j k
2 2 2
3 3 3
5 5 5
9 9 9
0 0 0
4 4 4
6 6 6
7 7 7
8 8 8
1 1 1

And of course, we can SELECT the first ordered N rows of a dataframe, like this:

[5]:
t.order_by("i")[:10]
[5]:
i j k
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9

Basic Data Manipulation

Now we have a dataframe. We can do basic data manipulation on it, just like in SQL.

For example, we can SELECT a subset of its columns:

[6]:
t_ij = t[["i", "j"]]
t_ij
[6]:
i j
2 2
3 3
5 5
9 9
0 0
4 4
6 6
7 7
8 8
1 1

And we can also SELECT a subset of its rows. Say we want all the even numbers:

[7]:
t_even = t_ij[lambda t: t["i"] % 2 == 0]
t_even
[7]:
i j
2 2
0 0
4 4
6 6
8 8

For a quick glance, we can SELECT the first unordered N rows of a dataframe, like this:

[8]:
t_n = t_even[:3]
t_n
[8]:
i j
2 2
0 0
4 4

Finally when we are done, we can save the resulting dataframe to the database, either temporarily or persistently:

[9]:
t_n.save_as(table_name="t_n", column_names=["i", "j"], temp=True)
[9]:
i j
0 0
2 2
4 4

Joining Two DataFrames

We can also JOIN two dataframes with GreenplumPython. For example, suppose we have two dataframes like this:

[10]:
rows = [
    (1, "'a'"),
    (2, "'b'"),
    (3, "'c'"),
    (4, "'d'"),
]
t1 = db.create_dataframe(rows=rows, column_names=["id", "val"])
t1
[10]:
id val
1 'a'
2 'b'
3 'c'
4 'd'
[11]:
rows = [
    (1, "'a'"),
    (2, "'b'"),
    (3, "'a'"),
    (4, "'b'"),
]
t2 = db.create_dataframe(rows=rows, column_names=["id", "val"])
t2
[11]:
id val
1 'a'
2 'b'
3 'a'
4 'b'

We can JOIN the two dataframe like this:

[12]:
t_join = t1.join(
    t2,
    on="val",
    self_columns={"id": "t1_id", "val": "t1_val"},
    other_columns={"id": "t2_id", "val": "t2_val"},
)
t_join
[12]:
t1_id t1_val t2_id t2_val
1 'a' 3 'a'
1 'a' 1 'a'
2 'b' 4 'b'
2 'b' 2 'b'

Creating and Calling Functions

Calling functions is essential for data analytics. GreenplumPython supports creating Greenplum UDFs and UDAs from Python functions and calling them in Python.

Suppose we have a dataframe of numbers:

[13]:
rows = [(i,) for i in range(10)]
numbers = db.create_dataframe(rows=rows, column_names=["val"])
numbers
[13]:
val
0
1
2
3
4
5
6
7
8
9

If we want to get the square of each number, we can write a function to do that:

[14]:
@gp.create_function
def square(a: int) -> int:
    return a**2


numbers.apply(lambda t: square(t["val"]), column_name="square")
[14]:
square
0
1
4
9
16
25
36
49
64
81

Note that this function is called in exactly the same way as ordinary Python functions.

If we also want to get the sum of these numbers, what we need is to write an aggregate function like this:

[15]:
@gp.create_aggregate
def my_sum(result: int, val: int) -> int:
    if result is None:
        return val
    return result + val


numbers.apply(lambda t: my_sum(t["val"]), column_name="sum")
[15]:
sum
45

Data Grouping

We can use DataFrame.group_by() to divide a dataframe into groups by distinct values of columns.

Afterwards, we can apply aggregate functions to individual groups to obtain group-wise summary.

For example, we can divide numbers into two group based on whether each number is even, and sum each of the groups:

[16]:
import greenplumpython.builtins.functions as F

(
    numbers.assign(is_even=lambda t: t["val"] % 2 == 0)
    .group_by("is_even")
    .apply(lambda t: F.sum(t["val"]))
)
[16]:
sum is_even
20 True
25 False