Installing Python Packages on Server without Internet (Experimental)
WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.
In many environments, access from the database server to the Internet is disabled for security reasons. This makes it hard to install the Python packages required for data analytics on server.
To overcome this limitation, GreenplumPython provides a function Database.install_packages()
to help the user
Download Python packages from a PyPI site to the client;
Pack and upload the downloaded packages to the database server;
Install the uploaded Python packages on server.
All these happen automatically and the user only need to declare what packages are needed.
In this way, as long as there is a database connection on a client with Internet access, the user can easily install the required packages, even if the database server cannot access the Internet by itself.
NOTE: This function only installs packages on the server host that GreenplumPython directly connects to. If your database server spreads across multiple hosts, additional operations are required to make the packages available on all hosts.
(Optional) Prerequisite: Sharing Python Environments in a Cluster with NFS
Setting up a NFS mount makes it easier to share a Python environment on multiple hosts and containers.
This is important for distributed database systems such as Greenplum because otherwise the same set of packages needs to be copied to every host in the cluster.
Starting an NFS server
First, we need to install and start an NFS server on one host. As an example, for Greenplum, we can start it on the coordinator host.
For how to do this, please refer to the documentation of the OS. For example, if you are using Rocky Linux, you might want to refer to the NFS page.
Mount a Python environment with NFS on Each Host
Next, we can mount a Python environment with NFS and share it to all hosts in the cluster.
In this way, we only need to install the packages on one host and the packages will be made available to all other hosts as well through NFS.
WARNING: This will affect all applications on the hosts. Please make sure that the database server is the only application that uses Python.
WARNING: This will hide all the files originally at the mount point. Please re-install them if they are needed by the database server.
[3]:
! python3 -m venv /tmp/test_venv
! sudo mount -t nfs "$(hostname):/tmp/test_venv" "$(python3 -m site --user-base)"
! ls -l "$(python3 -m site --user-base)"
total 8
drwxrwxr-x. 2 gpadmin gpadmin 4096 Oct 8 03:17 bin
drwxrwxr-x. 3 gpadmin gpadmin 21 Oct 8 03:17 etc
drwxrwxr-x. 2 gpadmin gpadmin 6 Oct 7 23:32 include
drwxrwxr-x. 3 gpadmin gpadmin 23 Oct 7 23:32 lib
lrwxrwxrwx. 1 gpadmin gpadmin 3 Oct 7 23:32 lib64 -> lib
-rw-rw-r--. 1 gpadmin gpadmin 80 Oct 8 03:47 pyvenv.cfg
drwxrwxr-x. 6 gpadmin gpadmin 65 Oct 8 03:17 share
Now Python environment is mounted at the Python user base directory as an NFS.
This means all packages installed with pip
later will be available to all hosts with the NFS mounted.
Please note that if there is more than one hosts in the cluster, the commands above needs to be executed on each of them.
For example, if you are using Greenplum, this can be done by executing the commands in a gpssh
session.
Note that the NFS can be unmounted by
[ ]:
! sudo umount "$(python3 -m site --user-base)"
Example: A UDF requiring a Third-Party Package
It is very common for a UDF that depends on a package that is not in the Python Standard Library. We can write one as a very simple example.
[5]:
%cd ../../../
!python3 -m pip install --upgrade .
/home/gpadmin/GreenplumPython
Processing /home/gpadmin/GreenplumPython
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Collecting dill==0.3.6
Using cached dill-0.3.6-py3-none-any.whl (110 kB)
Collecting psycopg2-binary==2.9.5
Using cached psycopg2_binary-2.9.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Building wheels for collected packages: greenplum-python
Building wheel for greenplum-python (PEP 517) ... done
Created wheel for greenplum-python: filename=greenplum_python-1.0.1-py3-none-any.whl size=75737 sha256=88ea8683b2bba05bc714308aba1c612e06e40d849fd654296f549c697d3bd742
Stored in directory: /tmp/pip-ephem-wheel-cache-n_iflnui/wheels/bb/1f/99/ff8594e48ec11df99af6e0ee8611a5e560e9f44d1a3fefb351
Successfully built greenplum-python
Installing collected packages: dill, psycopg2-binary, greenplum-python
Successfully installed dill-0.3.6 greenplum-python-1.0.1 psycopg2-binary-2.9.5
WARNING: You are using pip version 20.2.4; however, version 23.2.1 is available.
You should consider upgrading via the '/tmp/test_venv/bin/python3 -m pip install --upgrade pip' command.
[7]:
import greenplumpython as gp
db = gp.database("postgresql://localhost:7000")
@gp.create_function
def fake_name() -> str:
from faker import Faker # type: ignore reportMissingImports
fake = Faker()
return fake.name()
The UDF fake_name()
generates fake names at random. This can be helpful for anonymizing the data.
However, if we try to call this UDF, we will get an error:
[10]:
db.apply(lambda: fake_name())
---------------------------------------------------------------------------
ExternalRoutineException Traceback (most recent call last)
File /tmp/test_venv/lib64/python3.9/site-packages/IPython/core/formatters.py:708, in PlainTextFormatter.__call__(self, obj)
701 stream = StringIO()
702 printer = pretty.RepresentationPrinter(stream, self.verbose,
703 self.max_width, self.newline,
704 max_seq_length=self.max_seq_length,
705 singleton_pprinters=self.singleton_printers,
706 type_pprinters=self.type_printers,
707 deferred_pprinters=self.deferred_printers)
--> 708 printer.pretty(obj)
709 printer.flush()
710 return stream.getvalue()
File /tmp/test_venv/lib64/python3.9/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
407 return meth(obj, self, cycle)
408 if cls is not object \
409 and callable(cls.__dict__.get('__repr__')):
--> 410 return _repr_pprint(obj, self, cycle)
412 return _default_pprint(obj, self, cycle)
413 finally:
File /tmp/test_venv/lib64/python3.9/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
776 """A pprint that just redirects to the normal repr function."""
777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
779 lines = output.splitlines()
780 with p.group():
File ~/GreenplumPython/greenplumpython/dataframe.py:220, in DataFrame.__repr__(self)
213 def __repr__(self) -> str:
214 # noqa
215 """
216 :meta private:
217
218 Return a string representation for a dataframe
219 """
--> 220 contents = list(self)
221 row_num_string = f"({len(contents)} row{'s' if len(contents) != 1 else ''})\n"
222 if len(contents) == 0: # DataFrame is empty
File ~/GreenplumPython/greenplumpython/dataframe.py:749, in DataFrame.__iter__(self)
747 return DataFrame.Iterator(self._contents)
748 assert self._db is not None
--> 749 self._contents = self._fetch()
750 assert self._contents is not None
751 return DataFrame.Iterator(self._contents)
File ~/GreenplumPython/greenplumpython/dataframe.py:873, in DataFrame._fetch(self, is_all)
868 output_name = "cte_" + uuid4().hex
869 to_json_dataframe = DataFrame(
870 f"SELECT to_json({output_name})::TEXT FROM {self._name} AS {output_name}",
871 parents=[self],
872 )
--> 873 result = self._db._execute(to_json_dataframe._serialize())
874 return result if isinstance(result, Iterable) else []
File ~/GreenplumPython/greenplumpython/db.py:77, in Database._execute(self, query, has_results)
75 if config.print_sql:
76 print(query)
---> 77 cursor.execute(query)
78 return cursor.fetchall() if has_results else cursor.rowcount
File /tmp/test_venv/lib64/python3.9/site-packages/psycopg2/extras.py:236, in RealDictCursor.execute(self, query, vars)
234 self.column_mapping = []
235 self._query_executed = True
--> 236 return super().execute(query, vars)
ExternalRoutineException: ModuleNotFoundError: No module named 'faker'
CONTEXT: Traceback (most recent call last):
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 16, in <module>
return GD['__func_16dc1c114c0344938bc1d85945e9f0ad']()
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 6, in fake_name
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad"
---------------------------------------------------------------------------
ExternalRoutineException Traceback (most recent call last)
File /tmp/test_venv/lib64/python3.9/site-packages/IPython/core/formatters.py:344, in BaseFormatter.__call__(self, obj)
342 method = get_real_method(obj, self.print_method)
343 if method is not None:
--> 344 return method()
345 return None
346 else:
File ~/GreenplumPython/greenplumpython/dataframe.py:281, in DataFrame._repr_html_(self)
279 """:meta private:"""
280 repr_html_str = ""
--> 281 ret = list(self)
282 if len(ret) != 0:
283 repr_html_str = "<table>\n"
File ~/GreenplumPython/greenplumpython/dataframe.py:749, in DataFrame.__iter__(self)
747 return DataFrame.Iterator(self._contents)
748 assert self._db is not None
--> 749 self._contents = self._fetch()
750 assert self._contents is not None
751 return DataFrame.Iterator(self._contents)
File ~/GreenplumPython/greenplumpython/dataframe.py:873, in DataFrame._fetch(self, is_all)
868 output_name = "cte_" + uuid4().hex
869 to_json_dataframe = DataFrame(
870 f"SELECT to_json({output_name})::TEXT FROM {self._name} AS {output_name}",
871 parents=[self],
872 )
--> 873 result = self._db._execute(to_json_dataframe._serialize())
874 return result if isinstance(result, Iterable) else []
File ~/GreenplumPython/greenplumpython/db.py:77, in Database._execute(self, query, has_results)
75 if config.print_sql:
76 print(query)
---> 77 cursor.execute(query)
78 return cursor.fetchall() if has_results else cursor.rowcount
File /tmp/test_venv/lib64/python3.9/site-packages/psycopg2/extras.py:236, in RealDictCursor.execute(self, query, vars)
234 self.column_mapping = []
235 self._query_executed = True
--> 236 return super().execute(query, vars)
ExternalRoutineException: ModuleNotFoundError: No module named 'faker'
CONTEXT: Traceback (most recent call last):
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 3, in <module>
return GD['__func_16dc1c114c0344938bc1d85945e9f0ad']()
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad", line 6, in fake_name
PL/Python function "func_16dc1c114c0344938bc1d85945e9f0ad"
From the error message
ModuleNotFoundError: No module named ‘faker’
we learn that the error is due to missing of the module faker
. We can fix it by installing it on server.
Installing Python Packages
To install the package on server, we can simply call Database.install_packages()
.
The packages will be installed to the currently activated environment. If there is no virtual environment activated, the packages will be installed to the user’s site-packages directory if the normal (system) site-packages directory is not writeable.
[12]:
import greenplumpython.experimental.file
db.install_packages("faker==19.6.1")
The installation succeeded if no error showed up. We can verify it by running fake_name()
again:
[14]:
db.apply(lambda: fake_name(), column_name="name")
[14]:
name |
---|
Melinda Tran |