Comparison with SQL
Prerequisites
To run this tutorial, we need
A database that is authorized to access, and
A table named
demo
as created with the following SQL command:CREATE TABLE demo AS SELECT n AS i, n AS j, n AS k FROM generate_series(0,9) AS n;
To create this table, if in a shell environment, psql can be used.
Or, inside a Jupyter Notebook, the SQL command can be executed directly in cells with ipython-sql, as shown below.
First we need to connect to the database (gpadmin
in our example) specified with the URI using the %sql
magic:
[1]:
%load_ext sql
%sql postgresql://localhost/gpadmin
Authentication methods and credentials can be specified in the URI. Please refer to the libpq document for detailed usage.
[2]:
%%sql
DROP TABLE IF EXISTS demo;
CREATE TABLE demo AS
SELECT n AS i, n AS j, n AS k
FROM generate_series(0,9) AS n;
* postgresql://localhost/gpadmin
Done.
10 rows affected.
[2]:
[]
With the table created successfully, we are now good to go!
Getting Access to Database
To get access to the database we want:
[3]:
import greenplumpython as gp
db = gp.database(uri="postgresql://localhost/gpadmin")
Here, the uri
follows the same specification in the libpq document as above.
As another example, if password is required, the uri
might look like postgresql://user:password@hostname/dbname
.
Accessing a DataFrame in the Database
After selecting the database, we can access a dataframe in the database by specifying its name:
[4]:
t = db.create_dataframe(table_name="demo")
t
[4]:
i | j | k |
---|---|---|
2 | 2 | 2 |
3 | 3 | 3 |
5 | 5 | 5 |
9 | 9 | 9 |
0 | 0 | 0 |
4 | 4 | 4 |
6 | 6 | 6 |
7 | 7 | 7 |
8 | 8 | 8 |
1 | 1 | 1 |
And of course, we can SELECT
the first ordered N rows of a dataframe, like this:
[5]:
t.order_by("i")[:10]
[5]:
i | j | k |
---|---|---|
0 | 0 | 0 |
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 3 |
4 | 4 | 4 |
5 | 5 | 5 |
6 | 6 | 6 |
7 | 7 | 7 |
8 | 8 | 8 |
9 | 9 | 9 |
Basic Data Manipulation
Now we have a dataframe. We can do basic data manipulation on it, just like in SQL.
For example, we can SELECT
a subset of its columns:
[6]:
t_ij = t[["i", "j"]]
t_ij
[6]:
i | j |
---|---|
2 | 2 |
3 | 3 |
5 | 5 |
9 | 9 |
0 | 0 |
4 | 4 |
6 | 6 |
7 | 7 |
8 | 8 |
1 | 1 |
And we can also SELECT
a subset of its rows. Say we want all the even numbers:
[7]:
t_even = t_ij[lambda t: t["i"] % 2 == 0]
t_even
[7]:
i | j |
---|---|
2 | 2 |
0 | 0 |
4 | 4 |
6 | 6 |
8 | 8 |
For a quick glance, we can SELECT
the first unordered N rows of a dataframe, like this:
[8]:
t_n = t_even[:3]
t_n
[8]:
i | j |
---|---|
2 | 2 |
0 | 0 |
4 | 4 |
Finally when we are done, we can save the resulting dataframe to the database, either temporarily or persistently:
[9]:
t_n.save_as(table_name="t_n", column_names=["i", "j"], temp=True)
[9]:
i | j |
---|---|
0 | 0 |
2 | 2 |
4 | 4 |
Joining Two DataFrames
We can also JOIN
two dataframes with GreenplumPython. For example, suppose we have two dataframes like this:
[10]:
rows = [
(1, "'a'"),
(2, "'b'"),
(3, "'c'"),
(4, "'d'"),
]
t1 = db.create_dataframe(rows=rows, column_names=["id", "val"])
t1
[10]:
id | val |
---|---|
1 | 'a' |
2 | 'b' |
3 | 'c' |
4 | 'd' |
[11]:
rows = [
(1, "'a'"),
(2, "'b'"),
(3, "'a'"),
(4, "'b'"),
]
t2 = db.create_dataframe(rows=rows, column_names=["id", "val"])
t2
[11]:
id | val |
---|---|
1 | 'a' |
2 | 'b' |
3 | 'a' |
4 | 'b' |
We can JOIN
the two dataframe like this:
[12]:
t_join = t1.join(
t2,
on="val",
self_columns={"id": "t1_id", "val": "t1_val"},
other_columns={"id": "t2_id", "val": "t2_val"},
)
t_join
[12]:
t1_id | t1_val | t2_id | t2_val |
---|---|---|---|
1 | 'a' | 3 | 'a' |
1 | 'a' | 1 | 'a' |
2 | 'b' | 4 | 'b' |
2 | 'b' | 2 | 'b' |
Creating and Calling Functions
Calling functions is essential for data analytics. GreenplumPython supports creating Greenplum UDFs and UDAs from Python functions and calling them in Python.
Suppose we have a dataframe of numbers:
[13]:
rows = [(i,) for i in range(10)]
numbers = db.create_dataframe(rows=rows, column_names=["val"])
numbers
[13]:
val |
---|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
If we want to get the square of each number, we can write a function to do that:
[14]:
@gp.create_function
def square(a: int) -> int:
return a**2
numbers.apply(lambda t: square(t["val"]), column_name="square")
[14]:
square |
---|
0 |
1 |
4 |
9 |
16 |
25 |
36 |
49 |
64 |
81 |
Note that this function is called in exactly the same way as ordinary Python functions.
If we also want to get the sum of these numbers, what we need is to write an aggregate function like this:
[15]:
@gp.create_aggregate
def my_sum(result: int, val: int) -> int:
if result is None:
return val
return result + val
numbers.apply(lambda t: my_sum(t["val"]), column_name="sum")
[15]:
sum |
---|
45 |
Data Grouping
We can use DataFrame.group_by()
to divide a dataframe into groups by distinct values of columns.
Afterwards, we can apply aggregate functions to individual groups to obtain group-wise summary.
For example, we can divide numbers into two group based on whether each number is even, and sum each of the groups:
[16]:
import greenplumpython.builtins.functions as F
(
numbers.assign(is_even=lambda t: t["val"] % 2 == 0)
.group_by("is_even")
.apply(lambda t: F.sum(t["val"]))
)
[16]:
sum | is_even |
---|---|
20 | True |
25 | False |