Installing Python Packages on Server without Internet (Experimental)

WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.

In many environments, access from the database server to the Internet is disabled for security reasons. This makes it hard to install the Python packages required for data analytics on the server.

To overcome this limitation, GreenplumPython provides a function Database.install_packages() that helps the user to:

  1. Download Python packages from a PyPI site to the client;

  2. Pack and upload the downloaded packages to the database server;

  3. Install the uploaded Python packages on server.

All of this happens automatically; the user only needs to declare which packages are needed.

In this way, as long as there is a database connection on a client with Internet access, the user can easily install the required packages, even if the database server cannot access the Internet by itself.
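
The three steps above can be sketched with standard tools. The following is only an illustration of the idea, not GreenplumPython's actual implementation: the helper names (download_command, pack_directory, install_command) are hypothetical, the transfer of the archive to the server is left out, and pip is assumed to be available as "python3 -m pip" on both sides.

```python
import io
import tarfile
from typing import List


def download_command(requirements_file: str, dest_dir: str) -> List[str]:
    """Step 1 (client): download packages from PyPI without installing them."""
    return ["python3", "-m", "pip", "download", "-r", requirements_file, "-d", dest_dir]


def pack_directory(src_dir: str) -> bytes:
    """Step 2 (client): pack the downloaded files into one gzip-compressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # Store paths relative to the directory so the archive unpacks anywhere.
        archive.add(src_dir, arcname=".")
    return buffer.getvalue()


def install_command(requirements_file: str, unpacked_dir: str) -> List[str]:
    """Step 3 (server): install from the unpacked files only, never from the network."""
    return [
        "python3", "-m", "pip", "install",
        "--no-index", "--find-links", unpacked_dir,
        "-r", requirements_file,
    ]
```

Each command list could be passed to subprocess.run(..., check=True) on the respective machine. Note that pip download fetches files for the client's platform by default, so for packages with compiled extensions the client and the server would need to be compatible.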

NOTE: This function only installs packages on the server host that GreenplumPython directly connects to. If your database server spreads across multiple hosts, additional operations are required to make the packages available on all hosts.

(Optional) Prerequisite: Sharing Python Environments in a Cluster with NFS

Setting up an NFS mount makes it easier to share a Python environment across multiple hosts and containers.

This is important for distributed database systems such as Greenplum because otherwise the same set of packages needs to be copied to every host in the cluster.

Starting an NFS Server

First, we need to install and start an NFS server on one host. As an example, for Greenplum, we can start it on the coordinator host.

For how to do this, please refer to the documentation of the OS. For example, if you are using Rocky Linux, you might want to refer to the NFS page.

Mounting a Python Environment with NFS on Each Host

Next, we can mount a Python environment with NFS and share it with all hosts in the cluster.

In this way, we only need to install the packages on one host and the packages will be made available to all other hosts as well through NFS.

WARNING: This will affect all applications on the hosts. Please make sure that the database server is the only application that uses Python.

WARNING: This will hide all the files originally at the mount point. Please re-install them if they are needed by the database server.

! python3 -m venv /tmp/test_venv
! sudo mount -t nfs "$(hostname):/tmp/test_venv" "$(python3 -m site --user-base)"
! ls -l "$(python3 -m site --user-base)"
total 8
drwxrwxr-x. 2 gpadmin gpadmin 4096 Oct  8 03:17 bin
drwxrwxr-x. 3 gpadmin gpadmin   21 Oct  8 03:17 etc
drwxrwxr-x. 2 gpadmin gpadmin    6 Oct  7 23:32 include
drwxrwxr-x. 3 gpadmin gpadmin   23 Oct  7 23:32 lib
lrwxrwxrwx. 1 gpadmin gpadmin    3 Oct  7 23:32 lib64 -> lib
-rw-rw-r--. 1 gpadmin gpadmin   80 Oct  8 03:47 pyvenv.cfg
drwxrwxr-x. 6 gpadmin gpadmin   65 Oct  8 03:17 share

Now the Python environment is mounted at the Python user base directory over NFS.

This means all packages installed with pip later will be available to all hosts with the NFS mounted.
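
As a quick sanity check on any host, the mount can also be inspected from Python itself. The sketch below uses only the standard library; describe_user_base is a hypothetical helper name, not part of GreenplumPython.

```python
import os
import site
from typing import Dict, Union


def describe_user_base() -> Dict[str, Union[str, bool]]:
    """Report the Python user base directory and whether it is a mount point."""
    user_base = site.getuserbase()  # same path as `python3 -m site --user-base`
    return {
        "user_base": user_base,
        "user_site": site.getusersitepackages(),
        "is_mount": os.path.ismount(user_base),
    }


if __name__ == "__main__":
    for key, value in describe_user_base().items():
        print(f"{key}: {value}")
```

On a host where the NFS share has been mounted as described above, is_mount is expected to be true; on a host where the mount is missing it will be false.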

Please note that if there is more than one host in the cluster, the commands above need to be executed on each of them.

For example, if you are using Greenplum, this can be done by executing the commands in a gpssh session.

Note that the NFS mount can be removed with:

! sudo umount "$(python3 -m site --user-base)"

Example: A UDF requiring a Third-Party Package

It is very common for a UDF to depend on a package that is not in the Python Standard Library. We can write one as a very simple example.

%cd ../../../
!python3 -m pip install --upgrade .
import greenplumpython as gp

db = gp.database("postgresql://localhost:7000")

@gp.create_function
def fake_name() -> str:
    from faker import Faker  # type: ignore reportMissingImports

    fake = Faker()
    return fake.name()  # a randomly generated full name

The UDF fake_name() generates fake names at random. This can be helpful for anonymizing the data.

However, if we try to call this UDF, we will get an error:

db.apply(lambda: fake_name())

From the error message

ModuleNotFoundError: No module named 'faker'

we learn that the error is caused by the missing module faker. We can fix it by installing the package on the server.
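
Whether a module can be imported in a given Python environment can also be checked without triggering the import error. The following is a minimal sketch using the standard library; is_module_available is a hypothetical helper name, and the check only says something about the server if it runs in the server's Python environment (for example, inside a UDF).

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("faker available:", is_module_available("faker"))
```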

Installing Python Packages

To install the package on the server, we can simply call Database.install_packages().

The packages will be installed to the currently activated environment. If there is no virtual environment activated, the packages will be installed to the user’s site-packages directory if the normal (system) site-packages directory is not writeable.
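
The rule above mirrors how pip normally chooses its target directory. As a rough sketch (the helper names are hypothetical, and distributions that patch Python's directory layout may behave differently), the likely target can be estimated with the standard library:

```python
import os
import site
import sys
import sysconfig


def in_virtualenv() -> bool:
    """A virtual environment is active when the prefix differs from the base prefix."""
    return sys.prefix != sys.base_prefix


def likely_install_dir() -> str:
    """Estimate where a plain `pip install` would place pure-Python packages."""
    purelib = sysconfig.get_paths()["purelib"]  # site-packages of this interpreter
    if in_virtualenv() or os.access(purelib, os.W_OK):
        return purelib
    # System site-packages is not writable: fall back to the user's site-packages.
    return site.getusersitepackages()


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("likely install directory:", likely_install_dir())
```

To be meaningful for the database, this would have to run with the Python interpreter used by the server, not the one on the client.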

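import greenplumpython.experimental.file  # experimental module providing Database.install_packages()

db.install_packages("faker")  # assumed form: a requirements-style string naming the packages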

If no error is raised, the installation succeeded. We can verify it by running fake_name() again:

db.apply(lambda: fake_name(), column_name="name")
Melinda Tran