Installing Python Packages on Server without Internet (Experimental)

WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.

In many environments, access from the database server to the Internet is disabled for security reasons. This makes it hard to install the Python packages required for data analytics on server.

To overcome this limitation, GreenplumPython provides a function Database.install_packages() to help the user

  1. Download Python packages from a PyPI site to the client;

  2. Pack and upload the downloaded packages to the database server;

  3. Install the uploaded Python packages on server.

All these happen automatically and the user only need to declare what packages are needed.

In this way, as long as there is a database connection on a client with Internet access, the user can easily install the required packages, even if the database server cannot access the Internet by itself.

NOTE: This function only installs packages on the server host that GreenplumPython directly connects to. If your database server spreads across multiple hosts, additional operations are required to make the packages available on all hosts.

(Optional) Prerequisite: Sharing Python Environments in a Cluster with NFS

Setting up a NFS mount makes it easier to share a Python environment on multiple hosts and containers.

This is important for distributed database systems such as Greenplum because otherwise the same set of packages needs to be copied to every host in the cluster.

Starting an NFS server

First, we need to install and start an NFS server on one host. As an example, for Greenplum, we can start it on the coordinator host.

For how to do this, please refer to the documentation of the OS. For example, if you are using Rocky Linux, you might want to refer to the NFS page.

Mount a Python environment with NFS on Each Host

Next, we can mount a Python environment with NFS and share it to all hosts in the cluster.

In this way, we only need to install the packages on one host and the packages will be made available to all other hosts as well through NFS.

WARNING: This will affect all applications on the hosts. Please make sure that the database server is the only application that uses Python.

WARNING: This will hide all the files originally at the mount point. Please re-install them if they are needed by the database server.

[3]:
! python3 -m venv /tmp/test_venv
! sudo mount -t nfs "$(hostname):/tmp/test_venv" "$(python3 -m site --user-base)"
! ls -l "$(python3 -m site --user-base)"
total 8
drwxrwxr-x. 2 gpadmin gpadmin 4096 Oct  8 03:17 bin
drwxrwxr-x. 3 gpadmin gpadmin   21 Oct  8 03:17 etc
drwxrwxr-x. 2 gpadmin gpadmin    6 Oct  7 23:32 include
drwxrwxr-x. 3 gpadmin gpadmin   23 Oct  7 23:32 lib
lrwxrwxrwx. 1 gpadmin gpadmin    3 Oct  7 23:32 lib64 -> lib
-rw-rw-r--. 1 gpadmin gpadmin   80 Oct  8 03:47 pyvenv.cfg
drwxrwxr-x. 6 gpadmin gpadmin   65 Oct  8 03:17 share

Now Python environment is mounted at the Python user base directory as an NFS.

This means all packages installed with pip later will be available to all hosts with the NFS mounted.

Please note that if there is more than one hosts in the cluster, the commands above needs to be executed on each of them.

For example, if you are using Greenplum, this can be done by executing the commands in a gpssh session.

Note that the NFS can be unmounted by

[ ]:
! sudo umount "$(python3 -m site --user-base)"

Example: A UDF requiring a Third-Party Package

It is very common for a UDF that depends on a package that is not in the Python Standard Library. We can write one as a very simple example.

[5]:
%cd ../../../
!python3 -m pip install --upgrade .
/home/gpadmin/GreenplumPython
Processing /home/gpadmin/GreenplumPython
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting dill==0.3.6
  Using cached dill-0.3.6-py3-none-any.whl (110 kB)
Collecting psycopg2-binary==2.9.5
  Using cached psycopg2_binary-2.9.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Building wheels for collected packages: greenplum-python
  Building wheel for greenplum-python (PEP 517) ... done
  Created wheel for greenplum-python: filename=greenplum_python-1.0.1-py3-none-any.whl size=75737 sha256=88ea8683b2bba05bc714308aba1c612e06e40d849fd654296f549c697d3bd742
  Stored in directory: /tmp/pip-ephem-wheel-cache-n_iflnui/wheels/bb/1f/99/ff8594e48ec11df99af6e0ee8611a5e560e9f44d1a3fefb351
Successfully built greenplum-python
Installing collected packages: dill, psycopg2-binary, greenplum-python
Successfully installed dill-0.3.6 greenplum-python-1.0.1 psycopg2-binary-2.9.5
WARNING: You are using pip version 20.2.4; however, version 23.2.1 is available.
You should consider upgrading via the '/tmp/test_venv/bin/python3 -m pip install --upgrade pip' command.
[7]:
import greenplumpython as gp

db = gp.database("postgresql://localhost:7000")


@gp.create_function
def fake_name() -> str:
    from faker import Faker  # type: ignore reportMissingImports

    fake = Faker()
    return fake.name()

The UDF fake_name() generates fake names at random. This can be helpful for anonymizing the data.

However, if we try to call this UDF, we will get an error:

[10]:
db.apply(lambda: fake_name())
---------------------------------------------------------------------------
ExternalRoutineException                  Traceback (most recent call last)
File /tmp/test_venv/lib64/python3.9/site-packages/IPython/core/formatters.py:708, in PlainTextFormatter.__call__(self, obj)
    701 stream = StringIO()
    702 printer = pretty.RepresentationPrinter(stream, self.verbose,
    703     self.max_width, self.newline,
    704     max_seq_length=self.max_seq_length,
    705     singleton_pprinters=self.singleton_printers,
    706     type_pprinters=self.type_printers,
    707     deferred_pprinters=self.deferred_printers)
--> 708 printer.pretty(obj)
    709 printer.flush()
    710 return stream.getvalue()

File /tmp/test_venv/lib64/python3.9/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File /tmp/test_venv/lib64/python3.9/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/GreenplumPython/greenplumpython/dataframe.py:220, in DataFrame.__repr__(self)
    213 def __repr__(self) -> str:
    214     # noqa
    215     """
    216     :meta private:
    217
    218     Return a string representation for a dataframe
    219     """
--> 220     contents = list(self)
    221     row_num_string = f"({len(contents)} row{'s' if len(contents) != 1 else ''})\n"
    222     if len(contents) == 0:  # DataFrame is empty

File ~/GreenplumPython/greenplumpython/dataframe.py:749, in DataFrame.__iter__(self)
    747     return DataFrame.Iterator(self._contents)
    748 assert self._db is not None
--> 749 self._contents = self._fetch()
    750 assert self._contents is not None
    751 return DataFrame.Iterator(self._contents)

File ~/GreenplumPython/greenplumpython/dataframe.py:873, in DataFrame._fetch(self, is_all)
    868 output_name = "cte_" + uuid4().hex
    869 to_json_dataframe = DataFrame(
    870     f"SELECT to_json({output_name})::TEXT FROM {self._name} AS {output_name}",
    871     parents=[self],
    872 )
--> 873 result = self._db._execute(to_json_dataframe._serialize())
    874 return result if isinstance(result, Iterable) else []

File ~/GreenplumPython/greenplumpython/db.py:77, in Database._execute(self, query, has_results)
     75 if config.print_sql:
     76     print(query)
---> 77 cursor.execute(query)
     78 return cursor.fetchall() if has_results else cursor.rowcount

File /tmp/test_venv/lib64/python3.9/site-packages/psycopg2/extras.py:236, in RealDictCursor.execute(self, query, vars)
    234 self.column_mapping = []
    235 self._query_executed = True
--> 236 return super().execute(query, vars)

ExternalRoutineException: ModuleNotFoundError: No module named 'faker'
CONTEXT:  Traceback (most recent call last):
  PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 16, in <module>
    return GD['__func_16dc1c114c0344938bc1d85945e9f0ad']()
  PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 6, in fake_name
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad"

---------------------------------------------------------------------------
ExternalRoutineException                  Traceback (most recent call last)
File /tmp/test_venv/lib64/python3.9/site-packages/IPython/core/formatters.py:344, in BaseFormatter.__call__(self, obj)
    342     method = get_real_method(obj, self.print_method)
    343     if method is not None:
--> 344         return method()
    345     return None
    346 else:

File ~/GreenplumPython/greenplumpython/dataframe.py:281, in DataFrame._repr_html_(self)
    279 """:meta private:"""
    280 repr_html_str = ""
--> 281 ret = list(self)
    282 if len(ret) != 0:
    283     repr_html_str = "<table>\n"

File ~/GreenplumPython/greenplumpython/dataframe.py:749, in DataFrame.__iter__(self)
    747     return DataFrame.Iterator(self._contents)
    748 assert self._db is not None
--> 749 self._contents = self._fetch()
    750 assert self._contents is not None
    751 return DataFrame.Iterator(self._contents)

File ~/GreenplumPython/greenplumpython/dataframe.py:873, in DataFrame._fetch(self, is_all)
    868 output_name = "cte_" + uuid4().hex
    869 to_json_dataframe = DataFrame(
    870     f"SELECT to_json({output_name})::TEXT FROM {self._name} AS {output_name}",
    871     parents=[self],
    872 )
--> 873 result = self._db._execute(to_json_dataframe._serialize())
    874 return result if isinstance(result, Iterable) else []

File ~/GreenplumPython/greenplumpython/db.py:77, in Database._execute(self, query, has_results)
     75 if config.print_sql:
     76     print(query)
---> 77 cursor.execute(query)
     78 return cursor.fetchall() if has_results else cursor.rowcount

File /tmp/test_venv/lib64/python3.9/site-packages/psycopg2/extras.py:236, in RealDictCursor.execute(self, query, vars)
    234 self.column_mapping = []
    235 self._query_executed = True
--> 236 return super().execute(query, vars)

ExternalRoutineException: ModuleNotFoundError: No module named 'faker'
CONTEXT:  Traceback (most recent call last):
  PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 3, in <module>
    return GD['__func_16dc1c114c0344938bc1d85945e9f0ad']()
  PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 6, in fake_name
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad"

From the error message

ModuleNotFoundError: No module named ‘faker’

we learn that the error is due to missing of the module faker. We can fix it by installing it on server.

Installing Python Packages

To install the package on server, we can simply call Database.install_packages().

The packages will be installed to the currently activated environment. If there is no virtual environment activated, the packages will be installed to the user’s site-packages directory if the normal (system) site-packages directory is not writeable.

[12]:
import greenplumpython.experimental.file

db.install_packages("faker==19.6.1")

The installation succeeded if no error showed up. We can verify it by running fake_name() again:

[14]:
db.apply(lambda: fake_name(), column_name="name")
[14]:
name
Melinda Tran