Libraries
Most of the power of a programming language is in its libraries. This is especially true for Python, which is an interpreted language and is therefore usually much slower than compiled languages. However, many libraries are written in compiled languages such as C/C++ and therefore run much faster than equivalent pure-Python code.
A library is a collection of functions that can be used by other programs. Python’s standard library includes many functions we worked with before (print, int, round, …) and is included with Python. The standard library also contains many additional modules, such as math:
print('pi is', pi)   # NameError: pi is not defined until math is imported
import math
print('pi is', math.pi)
You can also import math’s items directly:
from math import pi, sin
print('pi is', pi)
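sin(pi/6)
cos(pi)   # NameError: only pi and sin were imported above
help(math) # help for libraries works just like help for functions
from math import *   # imports every public name from math; can silently shadow your own names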
You can also create an alias for the library:
import math as m
print(m.pi)
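The three import styles above all give access to the same underlying module; they differ only in which names end up in your namespace. A minimal sketch illustrating this (the choice of cos is just for illustration):

```python
import math
import math as m
from math import pi

# The alias and the full name refer to the very same module object
print(m is math)        # True

# A directly imported item is the same object as the attribute on the module
print(pi is math.pi)    # True

# All three spellings can be mixed freely
print(math.cos(pi), m.cos(math.pi))   # -1.0 -1.0
```

Importing a few names explicitly, or using a short alias, keeps it clear where each name comes from.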
Question 10.1
What function from the math library can you use to calculate a square root without using sqrt?
Question 10.2
You want to select a random character from the string bases='ACTTGCTTGAC'. What standard library would you most expect to help? Which function would you select from that library? Are there alternatives?
Question 10.3
A colleague of yours types help(math) and gets an error: NameError: name 'math' is not defined. What has your colleague forgotten to do?
Question 10.4
Convert the angle 0.3 rad to degrees using the math library.
Virtual environments and packaging
To install a 3rd-party library into the current Python environment from inside a Jupyter notebook, run the following (you will probably need to restart the kernel before you can use the package):
%pip install <packageName> # e.g. try bson
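To check whether a package is already available in the current environment before (or after) installing it, one option is the standard library's importlib. This is a minimal sketch; the helper name is_installed is made up for this example, and bson is simply the package suggested above:

```python
import importlib.util

def is_installed(module_name):
    """Return True if the import system can locate `module_name` (hypothetical helper)."""
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None

print(is_installed("math"))   # True: math ships with Python
print(is_installed("bson"))   # True only if bson is installed in this environment
```

Note that this only looks the module up; it does not import it.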
In Python you can create an isolated environment for each project, into which all of its dependencies will be installed. This is useful when you have several projects with very different sets of dependencies. On the computer running your Jupyter notebooks, open the terminal and type:
(Important: on a cluster you must do this on the login node, not inside the JupyterLab terminal.)
module load python/3.9.6 # specific to HPC clusters
pip install virtualenv
virtualenv --no-download climate # create a new virtual environment in your current directory
source climate/bin/activate
which python && which pip
pip install --no-index netcdf4 ...
...
deactivate
To use this environment in the terminal, you would do:
source climate/bin/activate
...
deactivate
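From inside Python, one way to confirm that an environment is active is to compare sys.prefix with sys.base_prefix. The sketch below assumes an environment created with venv or a recent virtualenv; the function name in_virtual_environment is made up for this example:

```python
import sys

def in_virtual_environment():
    """Return True if this interpreter runs inside a virtual environment."""
    # Inside an environment sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```

This complements the `which python && which pip` check done in the terminal.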
Optionally, you can add your environment to Jupyter:
pip install --no-index ipykernel # install ipykernel (IPython kernel for Jupyter) into this environment
python -m ipykernel install --user --name=climate --display-name "My climate project" # add your env to Jupyter
...
deactivate
Quit all your currently running Jupyter notebooks and the Jupyter dashboard, and then restart. One of the options in New, below Python 3, should be climate.
To delete the environment, in the terminal type:
jupyter kernelspec list # `climate` should be one of them
jupyter kernelspec uninstall climate # remove your environment from Jupyter
/bin/rm -rf climate
Quick overview of some of the libraries
Python lists are very general and flexible, which is great for high-level programming, but it comes at a cost. The Python interpreter can’t make any assumptions about what will come next in a list, so it treats everything as a generic object with its own type and size. As lists get longer, eventually performance takes a hit.
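To see what "a generic object with its own type and size" means in practice, here is a small sketch that inspects the elements of a mixed list. The list contents are arbitrary, and the exact byte counts reported by sys.getsizeof are specific to the Python implementation and version:

```python
import sys

# A single list can hold objects of completely different types
mixed = [1, 2.5, "three", [4, 5], None]

for item in mixed:
    # Each element carries its own type information and has its own size
    print(f"{item!r:10} {type(item).__name__:10} {sys.getsizeof(item)} bytes")
```

Because any element may be of any type, the interpreter has to check each one as it goes.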
Python does not have any mechanism for a uniform/homogeneous list, where, to jump to element #1000, you just take the memory address of the very first element and then increment it by (element size in bytes) × 999. The NumPy library fills this gap by adding the concept of homogeneous collections to Python: numpy.ndarrays, which are multidimensional, homogeneous arrays of fixed-size items (most commonly numbers, but could be strings too). This brings huge performance benefits!
To speed up calculations with NumPy, typically you perform operations on entire arrays, and this by extension applies the same operation to each array element. Since NumPy was written in C, it is much faster for processing multiple data elements than manually looping over these elements in Python.
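As a rough illustration of both points, fixed-size items and whole-array operations, here is a minimal sketch assuming NumPy is installed. The array length and the formula 2x + 1 are arbitrary choices for this example:

```python
import numpy as np

# A homogeneous array: every item is a 64-bit float
a = np.arange(100_000, dtype=np.float64)

# Each item takes 8 bytes, and stepping to the next item moves 8 bytes in memory
print(a.dtype, a.itemsize, a.strides)   # float64 8 (8,)

# Byte offset of element #1000 (index 999) from the start of the data
print(999 * a.itemsize)                 # 7992

# One operation on the entire array, applied to every element by NumPy
b = a * 2.0 + 1.0

# The same calculation as an explicit element-by-element Python loop
c = [x * 2.0 + 1.0 for x in a.tolist()]

print(np.array_equal(b, np.array(c)))   # True
```

Both versions give the same result; the whole-array form hands the loop over to NumPy's compiled code instead of running it in the Python interpreter.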
Learning NumPy is outside the scope of this introductory workshop, but there are many packages built on top of NumPy that could be used in HSS:
- pandas is a library for working with 2D tables / spreadsheets, built on top of numpy
- scikit-image is a collection of algorithms for image processing, built on top of numpy
- Matplotlib and Plotly are two plotting packages for Python
- xarray is a library for working with labelled multi-dimensional arrays and datasets in Python
  - “pandas for multi-dimensional arrays”
  - great for large scientific datasets; writes into NetCDF files
  - we won’t study it in this workshop
We’ll also take a look at these two libraries (not based on NumPy):
- requests is an HTTP library to download HTML data from the web
- Beautiful Soup is a library to parse these HTML data