Optimizing Python code for computations of pair-wise distances - Part III
Article Series
Part I: Python implementation only using built-in libraries
Part II: Numpy implementation
Part III: Numba and Cython implementation (you are here)
This is Part III of a series of three posts. In Parts I and II, I discussed pure Python and NumPy implementations of computing pair-wise distances under periodic boundary conditions, respectively. In this post, I show how to use Numba and Cython to speed the code up further.
At some point, optimized Python code stops being strictly Python. For instance, in this post we make our code very efficient using Cython, but strictly speaking, Cython is not Python: it is a superset of Python, which means any Python code can be compiled as Cython code, but not vice versa. To see the performance boost, one needs to write Cython code. So what is stopping you from just writing C/C++ code and being done with it? I believe there is always a balance between the performance of the code and the effort you put into writing it. As I will show here, using Numba or writing Cython code is straightforward if you are familiar with Python. Hence, I always prefer to optimize the Python code rather than rewrite it in C/C++, because it is more cost-effective for me.
Background #
Just to reiterate, the computation calculates the pair-wise distances between every pair of particles under periodic boundary conditions. The positions of the particles are stored in an array/list of the form `[[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]]`.
The distance between two particles $i$ and $j$ is calculated as the following,

$$
d_{ij} = \sqrt{\sum_{k} \left( \Delta_{ij,k} - L \,\mathrm{round}\!\left( \frac{\Delta_{ij,k}}{L} \right) \right)^{2}},
\qquad \Delta_{ij,k} = p_{i,k} - p_{j,k},
$$

where $L$ is the length of the simulation box edge and $p_i$ and $p_j$ are the positions of the two particles. For more information, please read Part I.
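To make the minimum-image convention concrete, here is a small numeric example in plain NumPy; the box size and coordinates are made up for illustration:

```python
import numpy as np

l = 10.0                          # edge length of the cubic box (made up)
p1 = np.array([1.0, 1.0, 1.0])    # position of particle 1
p2 = np.array([9.0, 9.0, 9.0])    # position of particle 2

d = p1 - p2                       # raw separation:  [-8, -8, -8]
d -= np.round(d / l) * l          # minimum image:   [ 2,  2,  2]
print(np.sqrt(np.sum(d**2)))      # ~3.46 rather than the naive ~13.86
```

Without the minimum-image correction, the two particles would appear almost a full box apart, even though they are close neighbors through the periodic boundary.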
Using Numba #
Numba is an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code.
Numba has been around for a few years. I remember trying it a few years ago and not having a good experience with it. It is now much more mature and very easy to use, as I will show in this post.
Serial Numba Implementation #
On their website, it is stated that Numba can make Python code as fast as C or Fortran. Numba also provides a way to parallelize `for` loops. First, let's implement a serial version. Numba's official documentation recommends using NumPy with Numba. Following that suggestion, and starting from the NumPy code demonstrated in Part II, I arrived at the Numba version,
```python
import numba
import numpy as np
from numba import jit

@jit(nopython=True, fastmath=True)
def pdist_numba_serial(positions, l):
    """
    Compute the pair-wise distances between every possible pair of particles.

    positions: a numpy array of form np.array([[x1,y1,z1],[x2,y2,z2],...,[xN,yN,zN]])
    l: the length of the edge of the box (cubic/square box)
    return: a condensed 1D numpy array of distances
    """
    # determine the number of particles
    n = positions.shape[0]
    # create an empty array storing the computed distances
    pdistances = np.empty(int(n * (n - 1) / 2.0))
    for i in range(n - 1):
        # displacement vectors from particle i to all later particles
        D = positions[i] - positions[i+1:]
        # apply the minimum-image convention; nopython mode requires the
        # three-argument form of np.round (see the note below)
        out = np.empty_like(D)
        D = D - np.round(D / l, 0, out) * l
        distance = np.sqrt(np.sum(np.power(D, 2.0), axis=1))
        # slice of the condensed array that belongs to particle i
        idx_s = int((2 * n - i - 1) * i / 2.0)
        idx_e = int((2 * n - i - 2) * (i + 1) / 2.0)
        pdistances[idx_s:idx_e] = distance
    return pdistances
```
Using Numba is almost (see the note below) as simple as adding the decorator `@jit(nopython=True, fastmath=True)` to our function. Inside `pdist_numba_serial`, we basically copied the original code, except for the line `D = D - np.round(D / l) * l`. Instead, we need the three-argument form `np.round(D / l, 0, out)`, as pointed out in this GitHub issue.
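As a quick sanity check, here is a hypothetical usage sketch (the particle count, seed, and box size are made up); note that the first call includes JIT compilation time, so warm the function up before timing it:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
positions = rng.random((1000, 3)) * 10.0  # 1000 made-up particles, box edge 10

pdist_numba_serial(positions, 10.0)       # first call triggers JIT compilation

start = time.perf_counter()
pdist_numba_serial(positions, 10.0)       # subsequent calls run compiled code
print(f"{(time.perf_counter() - start) * 1e3:.1f} ms")
```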
Parallel Numba Implementation #
`pdist_numba_serial` is a serial implementation. The nature of pair-wise distance computation allows us to parallelize the process by simply distributing pairs to multiple cores/threads. Fortunately, Numba provides a very simple way to do that. The `for` loop in `pdist_numba_serial` can be parallelized by replacing `range` with `prange` and adding `parallel=True` to the decorator,
```python
import numpy as np
from numba import jit, prange

# add parallel=True to the decorator
@jit(nopython=True, fastmath=True, parallel=True)
def pdist_numba_parallel(positions, l):
    # determine the number of particles
    n = positions.shape[0]
    # create an empty array storing the computed distances
    pdistances = np.empty(int(n * (n - 1) / 2.0))
    # use prange here instead of range
    for i in prange(n - 1):
        D = positions[i] - positions[i+1:]
        out = np.empty_like(D)
        D = D - np.round(D / l, 0, out) * l
        distance = np.sqrt(np.sum(np.power(D, 2.0), axis=1))
        idx_s = int((2 * n - i - 1) * i / 2.0)
        idx_e = int((2 * n - i - 2) * (i + 1) / 2.0)
        pdistances[idx_s:idx_e] = distance
    return pdistances
```
There are some caveats when using `prange` in situations where a race condition could occur. However, in our case there is no race condition, since the distance calculations for different pairs are independent of each other, i.e., there is no communication between cores/threads. For more information on parallelization with Numba, refer to their documentation.
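As an aside, one pattern that looks like a race condition, accumulating into a single scalar, is in fact supported: Numba recognizes scalar reductions inside `prange` and combines the per-thread partial results safely. A minimal sketch (the function is mine, for illustration only):

```python
from numba import jit, prange

@jit(nopython=True, parallel=True)
def total_distance(pdistances):
    # Numba treats the `+=` accumulation in a prange loop as a reduction,
    # giving each thread a private partial sum that is combined at the end.
    total = 0.0
    for i in prange(pdistances.shape[0]):
        total += pdistances[i]
    return total
```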
Benchmark #
Now let's benchmark the two Numba implementations. The result is shown below,

Compared to the fastest NumPy implementation shown in Part II, the serial Numba implementation provides more than a threefold speedup. As one can see, the parallel version is about twice as fast as the serial version on my 2-core laptop. I didn't test machines with more cores, but I expect the speedup to scale roughly linearly with the number of cores.
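If you want to probe that scaling yourself, recent Numba versions let you control the thread count at runtime; a small sketch, e.g. to reproduce the 2-core result on a larger machine:

```python
import numba

numba.set_num_threads(2)        # pin the parallel version to 2 threads
print(numba.get_num_threads())  # -> 2
```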
I am sure there are more advanced techniques to make the Numba version even faster (using CUDA, for instance), but I would argue that the implementations above are the most cost-effective.
Using Cython #
As demonstrated above, Numba provides a very simple way to speed up Python code with minimal effort. However, if we want to go even further, it is probably better to use Cython.

Cython is basically a superset of Python. It allows Cython/Python code to be translated to C/C++ and then compiled to machine code with a C/C++ compiler. In the end, you get a compiled extension module that you can import directly in Python.
Similar to the Numba versions, I show both serial and parallel Cython implementations.
Serial Cython implementation #
First, load Cython in the Jupyter Notebook:

```python
%load_ext Cython
```

Then compile the serial version in a `%%cython` cell:

```python
%%cython --force --annotate
import cython
import numpy as np
from libc.math cimport sqrt
from libc.math cimport nearbyint

@cython.boundscheck(False)  # skip bounds checks on array indexing
@cython.wraparound(False)   # disable negative-index wrapping
@cython.cdivision(True)     # skip checks on division; may lead to ~20% speedup
def pdist_cython_serial(double [:,:] positions not None, double l):
    cdef Py_ssize_t n = positions.shape[0]
    cdef Py_ssize_t ndim = positions.shape[1]
    pdistances = np.zeros(n * (n - 1) // 2, dtype=np.float64)
    cdef double [:] pdistances_view = pdistances
    cdef double d, dd
    cdef Py_ssize_t i, j, k
    for i in range(n - 1):
        for j in range(i + 1, n):
            dd = 0.0
            for k in range(ndim):
                d = positions[i, k] - positions[j, k]
                # minimum-image convention, using the C nearbyint
                d = d - nearbyint(d / l) * l
                dd += d * d
            # condensed index of the pair (i, j)
            pdistances_view[j - 1 + (2 * n - 3 - i) * i // 2] = sqrt(dd)
    return pdistances
```
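Once the cell has compiled, the function can be called like any other Python function. A hypothetical usage sketch (the array sizes are made up); note that the `double [:,:]` memoryview expects a float64 array:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = np.asarray(rng.random((1000, 3)) * 10.0, dtype=np.float64)

d = pdist_cython_serial(positions, 10.0)
print(d.shape)  # -> (499500,), i.e. n * (n - 1) / 2 condensed distances
```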
Some Remarks #
- Declare static types for variables using `cdef`. For instance, `cdef double d` declares that the variable `d` has a double/float type.
- Import `sqrt` and `nearbyint` from the C library instead of using Python functions. The general rule is to always use C functions directly whenever possible.
- `positions` is a NumPy array and is declared using Typed Memoryviews.
- Similar to `positions`, `pdistances_view` accesses the memory buffer of the NumPy array `pdistances`. Value assignments to `pdistances` are done through `pdistances_view`.
- It is useful to run `%%cython --annotate` to display an analysis of the Cython code, so you can inspect potential slowdowns: the analysis highlights lines where Python interaction occurs. In this particular example, it is very important to keep the core part, the nested loop, free of Python interaction. For instance, if we don't use `sqrt` and `nearbyint` from `libc.math` but instead use Python's built-in `sqrt` and `round`, you won't see much speedup, since there is a lot of overhead in calling these Python functions inside the loop.
Parallel Cython Implementation #
Similar to Numba, Cython also allows parallelization, which is achieved using OpenMP. First, to use OpenMP with Cython, we need to import the needed modules,

```python
from cython.parallel import prange, parallel
```
Then, replace the `for i in range(n-1)` loop in the serial version with

```python
with nogil, parallel():
    for i in prange(n - 1, schedule='dynamic'):
```
Everything else remains the same. Here I follow the example in Cython's official documentation. `schedule='dynamic'` lets the loop iterations be distributed to threads on request; other options include `static`, `guided`, etc. See the full documentation.
I had some trouble compiling the parallel version directly in the Jupyter Notebook. Instead, I compiled it as a standalone module. The `.pyx` and `setup.py` files can be found in this gist.
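For reference, a minimal `setup.py` along these lines might look like the sketch below; the module and file names here are assumptions (the exact files are in the gist), and `-fopenmp` is the GCC/Clang flag (MSVC uses `/openmp`):

```python
# setup.py -- minimal sketch; compile with `python setup.py build_ext --inplace`
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "pdist_cython_parallel",                # assumed module name
    sources=["pdist_cython_parallel.pyx"],  # assumed file name
    extra_compile_args=["-fopenmp"],        # enable OpenMP for prange
    extra_link_args=["-fopenmp"],
    include_dirs=[np.get_include()],
)

setup(ext_modules=cythonize(ext))
```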
Benchmark #
The results of benchmarking `pdist_cython_serial` and `pdist_cython_parallel` are shown in the figure below,
As expected, the serial version is about half the speed of the parallel version on my 2-core laptop. The Cython serial version is also more than two times faster than its Numba counterpart.
Summing up #
In this series of posts, using the computation of pair-wise distances under periodic boundary conditions as an example, I showed various ways to optimize Python code: built-in Python functions (Part I), NumPy (Part II), and Numba and Cython (this post). The benchmark results for all of the functions tested are summarized in the table below,
| Function | Averaged Speed (ms) | Speedup |
|---|---|---|
| `pdist` | 1270 | 1 |
| `pdist_v2` | 906 | 1.4 |
| `pdist_np_broadcasting` | 160 | 7.9 |
| `pdist_np_naive` | 110 | 11.5 |
| `pdist_numba_serial` | 20.7 | 61 |
| `pdist_numba_parallel` | 12.6 | 101 |
| `pdist_cython_serial` | 5.84 | 217 |
| `pdist_cython_parallel` | 3.19 | 398 |
The times are measured on the same system size used throughout this series. The parallel versions are tested on a 2-core machine.