Pandas-Compatible DataFrame

This package supports pandas compatible API which involves Dataframe.

This package contains classes and functions having same names and parameters as the equivalence in pandas to provide a Data Scientist familiar syntax. And at the same time, its DataFrame has same specifications as GreenplumPython DataFrame, which means: Data is located and manipulated on a remote database system.

To use this interface, do simply as follow:

import greenplumpython.pandas as pd

N.B.: This package contains fewer functions than GreenplumPython DataFrame, it is easy to convert to it.

class pandas.dataframe.DataFrame

Bases: object

Representation of Pandas-Compatible DataFrame object.

to_sql(name, con, schema=None, if_exists='fail', index=False, index_label=None, chunksize=None, dtype=None, method=None)

Write records stored in a DataFrame to a SQL database.

Tables in database can be newly created, appended to, or overwritten.

This aligns with the function pandas.DataFrame.to_sql().

Parameters

con (str) – str: URI of database connection
name (str) –
schema (Optional[str]) –
if_exists (Literal['fail', 'replace', 'append']) –
index (bool) –
index_label (Optional[Union[str, List[str]]]) –
chunksize (Optional[int]) –
dtype (Optional[Dict[str, type]]) –
method (Optional[Literal[None, 'multi']]) –

Returns

Number of rows affected by this function.

Return type

int

Example

>>> cursor.execute("DROP TABLE IF EXISTS test_to_sql")
>>> pd_df = pd.read_sql('SELECT unnest(ARRAY[1,2,3]) AS "a",unnest(ARRAY[1,2,3]) AS "b"', con)
>>> pd_df.to_sql(name="test_to_sql", con=con)
3
>>> pd.read_sql("SELECT * FROM test_to_sql", con=con).sort_values("a")
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)

to_native()

Convert a pandas-compatible DataFrame to a native Dataframe.

Returns: a native:class:~dataframe.Dataframe.
Return type: DataFrame

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=True, key=None)

Sort by the values along columns.

This aligns with the function pandas.DataFrame.sort_values().

Returns

Dataframe order by the given arguments.

Return type

Dataframe

Parameters

by (Union[str, List[str]]) –
axis (int) –
ascending (Union[bool, list[bool], tuple[bool, ...]]) –
inplace (bool) –
kind (Literal['quicksort', 'mergesort', 'heapsort', 'stable']) –
na_position (Literal['first', 'last']) –
ignore_index (bool) –
key (Optional[Callable[[Any], None]]) –

Example

>>> pd_df = pd.read_sql('SELECT unnest(ARRAY[3, 1, 2]) AS "id",unnest(ARRAY[1,2,3]) AS "b"', con)
>>> pd_df.sort_values(["id"])
--------
 id | b
----+---
  1 | 2
  2 | 3
  3 | 1
--------
(3 rows)

drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=True)

Return DataFrame with duplicate rows removed.

This aligns with the function pandas.DataFrame.drop_duplicates().

Returns

Dataframe with duplicates removed.

Return type

Dataframe

Parameters

subset (Optional[Union[str, List[str]]]) –
keep (Literal['first', 'last', False]) –
inplace (bool) –
ignore_index (bool) –

Example

>>> students = [("alice", 18), ("bob", 19), ("bob", 19)]
>>> cursor.execute("DROP TABLE IF EXISTS student")
>>> df = db.create_dataframe(rows=students, column_names=["name", "age"]).save_as("student", column_names=["name", "age"])
>>> student = pd.read_sql("SELECT * FROM student", con)
>>> student.drop_duplicates(subset=["name", "age"]).sort_values("name")
-------------
 name  | age
-------+-----
 alice |  18
 bob   |  19
-------------
(2 rows)
>>> cursor.execute("DROP TABLE IF EXISTS student")

merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes='', copy=True, indicator=False, validate=None)

Join the current DataFrame with another using the given arguments.

N.B: This function can’t handle yet automatically suffixes when having the same column names on both sides.

This aligns with the function pandas.DataFrame.merge().

Returns

DataFrame of the two merged class:DataFrame.

Return type

Dataframe

Parameters

right (DataFrame) –
how (Literal['left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, List[str]]]) –
left_on (Optional[str]) –
right_on (Optional[str]) –
left_index (bool) –
right_index (bool) –
sort (bool) –
suffixes (str) –
copy (bool) –
indicator (bool) –
validate (Optional[str]) –

Example

>>> students = [("alice", 18), ("bob", 19), ("carol", 19)]
>>> cursor.execute("DROP TABLE IF EXISTS student_1")
>>> cursor.execute("DROP TABLE IF EXISTS student_2")
>>> df_1 = db.create_dataframe(rows=students, column_names=["name", "age"]).save_as("student_1", column_names=["name_1", "age_1"])
>>> df_2 = db.create_dataframe(rows=students, column_names=["name", "age"]).save_as("student_2", column_names=["name_2", "age_2"])
>>> student_1 = pd.read_sql("SELECT * FROM student_1", con)
>>> student_2 = pd.read_sql("SELECT * FROM student_2", con)
>>> student_1.merge(
...        student_2,
...        how="inner",
...        left_on="age_1",
...        right_on="age_2",
...    ).sort_values(["name_1", "name_2"])
---------------------------------
 name_1 | age_1 | name_2 | age_2
--------+-------+--------+-------
alice  |    18 | alice  |    18
bob    |    19 | bob    |    19
bob    |    19 | carol  |    19
carol  |    19 | bob    |    19
carol  |    19 | carol  |    19
---------------------------------
(5 rows)
>>> cursor.execute("DROP TABLE IF EXISTS student_1")
>>> cursor.execute("DROP TABLE IF EXISTS student_2")

head(n)

Return the first n unordered rows.

Returns: The first n unordered rows of class:~pandas.dataframe.DataFrame.
Return type: Dataframe
Parameters: n (int) –

pandas.dataframe.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)

Read SQL query or database table into a DataFrame.

This aligns with the function pandas.read_sql().

Parameters

con (str) – str: URI of database connection
sql (str) –
index_col (Optional[Union[str, list[str]]]) –
coerce_float (bool) –
params (Optional[List[str]]) –
parse_dates (Optional[List[str]]) –
columns (Optional[list[str]]) –
chunksize (Optional[int]) –

Returns

Dataframe from SQL.

Return type

Dataframe

Example

>>> cursor.execute("DROP TABLE IF EXISTS test_read_sql")
>>> columns = {"a": [1, 2, 3], "b": [1, 2, 3]}
>>> df = db.create_dataframe(columns=columns).save_as("test_read_sql", column_names=["a", "b"])
>>> pd.read_sql("SELECT * FROM test_read_sql", con).sort_values("a")
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)
>>> pd.read_sql('SELECT unnest(ARRAY[1, 2, 3]) AS "a",unnest(ARRAY[1,2,3]) AS "b"', con).sort_values("a")
-------
 a | b
---+---
 1 | 1
 2 | 2
 3 | 3
-------
(3 rows)
>>> cursor.execute("DROP TABLE IF EXISTS test_read_sql")