Pandas-Compatible DataFrame
This package supports pandas compatible API which involves Dataframe
.
This package contains classes and functions having same names and parameters as the equivalence in pandas to provide a Data Scientist familiar syntax. And at the same time, its DataFrame has same specifications as GreenplumPython DataFrame, which means: Data is located and manipulated on a remote database system.
To use this interface, do simply as follow:
import greenplumpython.pandas as pd
N.B.: This package contains fewer functions than GreenplumPython DataFrame, it is easy to convert to it.
- class pandas.dataframe.DataFrame
Bases:
object
Representation of Pandas-Compatible DataFrame object.
- to_sql(name, con, schema=None, if_exists='fail', index=False, index_label=None, chunksize=None, dtype=None, method=None)
Write records stored in a DataFrame to a SQL database.
Tables in database can be newly created, appended to, or overwritten.
This aligns with the function pandas.DataFrame.to_sql().
- Parameters
con (str) – str: URI of database connection
name (str) –
schema (Optional[str]) –
if_exists (Literal['fail', 'replace', 'append']) –
index (bool) –
index_label (Optional[Union[str, List[str]]]) –
chunksize (Optional[int]) –
dtype (Optional[Dict[str, type]]) –
method (Optional[Literal[None, 'multi']]) –
- Returns
Number of rows affected by this function.
- Return type
int
Example
>>> cursor.execute("DROP TABLE IF EXISTS test_to_sql") >>> pd_df = pd.read_sql('SELECT unnest(ARRAY[1,2,3]) AS "a",unnest(ARRAY[1,2,3]) AS "b"', con) >>> pd_df.to_sql(name="test_to_sql", con=con) 3 >>> pd.read_sql("SELECT * FROM test_to_sql", con=con).sort_values("a") ------- a | b ---+--- 1 | 1 2 | 2 3 | 3 ------- (3 rows)
- to_native()
Convert a pandas-compatible
DataFrame
to a nativeDataframe
.- Returns
a native:class:~dataframe.Dataframe.
- Return type
DataFrame
- sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=True, key=None)
Sort by the values along columns.
This aligns with the function pandas.DataFrame.sort_values().
- Returns
Dataframe
order by the given arguments.- Return type
Dataframe
- Parameters
by (Union[str, List[str]]) –
axis (int) –
ascending (Union[bool, list[bool], tuple[bool, ...]]) –
inplace (bool) –
kind (Literal['quicksort', 'mergesort', 'heapsort', 'stable']) –
na_position (Literal['first', 'last']) –
ignore_index (bool) –
key (Optional[Callable[[Any], None]]) –
Example
>>> pd_df = pd.read_sql('SELECT unnest(ARRAY[3, 1, 2]) AS "id",unnest(ARRAY[1,2,3]) AS "b"', con) >>> pd_df.sort_values(["id"]) -------- id | b ----+--- 1 | 2 2 | 3 3 | 1 -------- (3 rows)
- drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=True)
Return DataFrame with duplicate rows removed.
This aligns with the function pandas.DataFrame.drop_duplicates().
- Returns
Dataframe
with duplicates removed.- Return type
Dataframe
- Parameters
subset (Optional[Union[str, List[str]]]) –
keep (Literal['first', 'last', False]) –
inplace (bool) –
ignore_index (bool) –
Example
>>> students = [("alice", 18), ("bob", 19), ("bob", 19)] >>> cursor.execute("DROP TABLE IF EXISTS student") >>> df = db.create_dataframe(rows=students, column_names=["name", "age"]).save_as("student", column_names=["name", "age"]) >>> student = pd.read_sql("SELECT * FROM student", con) >>> student.drop_duplicates(subset=["name", "age"]).sort_values("name") ------------- name | age -------+----- alice | 18 bob | 19 ------------- (2 rows) >>> cursor.execute("DROP TABLE IF EXISTS student")
- merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes='', copy=True, indicator=False, validate=None)
Join the current
DataFrame
with another using the given arguments.N.B: This function can’t handle yet automatically suffixes when having the same column names on both sides.
This aligns with the function pandas.DataFrame.merge().
- Returns
DataFrame
of the two merged class:DataFrame.- Return type
Dataframe
- Parameters
right (DataFrame) –
how (Literal['left', 'right', 'outer', 'inner', 'cross']) –
on (Optional[Union[str, List[str]]]) –
left_on (Optional[str]) –
right_on (Optional[str]) –
left_index (bool) –
right_index (bool) –
sort (bool) –
suffixes (str) –
copy (bool) –
indicator (bool) –
validate (Optional[str]) –
Example
>>> students = [("alice", 18), ("bob", 19), ("carol", 19)] >>> cursor.execute("DROP TABLE IF EXISTS student_1") >>> cursor.execute("DROP TABLE IF EXISTS student_2") >>> df_1 = db.create_dataframe(rows=students, column_names=["name", "age"]).save_as("student_1", column_names=["name_1", "age_1"]) >>> df_2 = db.create_dataframe(rows=students, column_names=["name", "age"]).save_as("student_2", column_names=["name_2", "age_2"]) >>> student_1 = pd.read_sql("SELECT * FROM student_1", con) >>> student_2 = pd.read_sql("SELECT * FROM student_2", con) >>> student_1.merge( ... student_2, ... how="inner", ... left_on="age_1", ... right_on="age_2", ... ).sort_values(["name_1", "name_2"]) --------------------------------- name_1 | age_1 | name_2 | age_2 --------+-------+--------+------- alice | 18 | alice | 18 bob | 19 | bob | 19 bob | 19 | carol | 19 carol | 19 | bob | 19 carol | 19 | carol | 19 --------------------------------- (5 rows) >>> cursor.execute("DROP TABLE IF EXISTS student_1") >>> cursor.execute("DROP TABLE IF EXISTS student_2")
- head(n)
Return the first n unordered rows.
- Returns
The first n unordered rows of class:~pandas.dataframe.DataFrame.
- Return type
Dataframe
- Parameters
n (int) –
- pandas.dataframe.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
Read SQL query or database table into a
DataFrame
.This aligns with the function pandas.read_sql().
- Parameters
con (str) – str: URI of database connection
sql (str) –
index_col (Optional[Union[str, list[str]]]) –
coerce_float (bool) –
params (Optional[List[str]]) –
parse_dates (Optional[List[str]]) –
columns (Optional[list[str]]) –
chunksize (Optional[int]) –
- Returns
Dataframe
from SQL.- Return type
Dataframe
Example
>>> cursor.execute("DROP TABLE IF EXISTS test_read_sql") >>> columns = {"a": [1, 2, 3], "b": [1, 2, 3]} >>> df = db.create_dataframe(columns=columns).save_as("test_read_sql", column_names=["a", "b"]) >>> pd.read_sql("SELECT * FROM test_read_sql", con).sort_values("a") ------- a | b ---+--- 1 | 1 2 | 2 3 | 3 ------- (3 rows) >>> pd.read_sql('SELECT unnest(ARRAY[1, 2, 3]) AS "a",unnest(ARRAY[1,2,3]) AS "b"', con).sort_values("a") ------- a | b ---+--- 1 | 1 2 | 2 3 | 3 ------- (3 rows) >>> cursor.execute("DROP TABLE IF EXISTS test_read_sql")