Quoted from http://ict.swisscom.ch/2015/02/pyperformance/
- I/O-bound problems can make good use of multi-threading (where the GIL is released during I/O) or asynchronous programming.
- CPU-bound problems can be addressed by better algorithms (nothing beats an algorithm with less computational complexity), using array-based programming (NumPy), using various problem-specific packages written in a compiled language, or using Cython, a mix of C and Python.
- Application-level caches are also helpful, because skipping a computation entirely is always faster than even the fastest possible computation.
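To illustrate the first point, here is a minimal sketch of threads overlapping I/O waits. It is not from the quoted article: `fake_io` is a hypothetical stand-in for a blocking network or disk call, and `time.sleep` is used only because it releases the GIL the same way real blocking I/O does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Hypothetical stand-in for a blocking I/O call; sleep releases the GIL."""
    time.sleep(delay)
    return delay


def run_sequential(delays):
    # Each call waits for the previous one to finish.
    return [fake_io(d) for d in delays]


def run_threaded(delays):
    # One thread per call, so the waits overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        return list(pool.map(fake_io, delays))


def timed(func, delays):
    """Return (result, elapsed seconds) for func(delays)."""
    start = time.perf_counter()
    result = func(delays)
    return result, time.perf_counter() - start


delays = [0.2] * 5
seq_result, seq_time = timed(run_sequential, delays)
thr_result, thr_time = timed(run_threaded, delays)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The threaded run should take roughly as long as the single longest wait, while the sequential run takes roughly the sum of all waits. The same trick would not help a CPU-bound function, because only one thread can hold the GIL at a time.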
The GIL constraint is removed when multiple processes are used, each with its own Python interpreter and GIL. This works nicely for problems where the processes neither need to exchange much data nor each need access to massive amounts of read-only data.
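A minimal sketch of the multi-process approach, using the standard library's `concurrent.futures.ProcessPoolExecutor`. The prime-counting task is a made-up CPU-bound workload chosen only for illustration, and the function names are assumptions, not part of the quoted article.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """Toy CPU-bound task: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, workers=2):
    # Each worker is a separate process with its own interpreter and GIL,
    # so the tasks can run on different cores at the same time.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # this module.
    limits = [20_000, 30_000, 40_000, 50_000]
    print(count_primes_parallel(limits))
```

Arguments and results are pickled and copied between processes, which is why this pattern suits tasks with small inputs and outputs and little need to talk to each other.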
In my own work with Quantax, the Swisscom Market Risk System, which is written in Python, we always face demand for increased speed. Using a lot of NumPy and many levels of application caches, we achieve about 25000 valuations of financial instruments per second on one core of a laptop CPU.
However, the price for this is complex cache invalidation logic and complicated code to map the problem onto NumPy.
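The cache invalidation cost can be seen even in a toy example. The sketch below is purely illustrative and is not how Quantax works: `ValuationCache`, the single `rate` input, and the pricing formula are all hypothetical. It shows the basic obligation that every input a cached result depends on must trigger invalidation when it changes.

```python
class ValuationCache:
    """Toy cache whose results depend on one market rate (hypothetical)."""

    def __init__(self, rate):
        self._rate = rate
        self._cache = {}
        self.computations = 0  # counts real computations, for demonstration

    def set_rate(self, rate):
        # Every cached value depends on the rate, so a change must drop them all.
        if rate != self._rate:
            self._rate = rate
            self._cache.clear()

    def value(self, notional):
        # Compute only on a cache miss; otherwise reuse the stored result.
        if notional not in self._cache:
            self.computations += 1
            self._cache[notional] = notional * (1 + self._rate)
        return self._cache[notional]


cache = ValuationCache(rate=0.5)
cache.value(100)        # computed
cache.value(100)        # served from the cache
cache.set_rate(0.25)    # invalidates all cached values
cache.value(100)        # computed again with the new rate
print(cache.computations)
```

With several levels of caches and many inputs (transactions, rates, and so on), tracking which cached results depend on which inputs is where the complexity comes from.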
We use processes at a relatively coarse-grained level, as worker processes to calculate reports. The main issue with processes is the massive amount of common data the financial calculations require, leading to rather large memory consumption per process. However, there is rarely more than one logical process that modifies the objects (by changing transactions or rates).
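A rough sketch of coarse-grained report workers that each load the common data once, assuming a hypothetical `load_common_data` loader and made-up conversion rates. It is not the Quantax design, only an illustration of why memory use grows with the number of workers: each process holds its own copy of the shared data.

```python
from concurrent.futures import ProcessPoolExecutor

_COMMON_DATA = None  # one independent copy lives in every worker process


def load_common_data():
    """Hypothetical loader; real market data would be far larger."""
    return {"EUR": 1.0, "USD": 0.5, "CHF": 2.0}  # made-up rates


def init_worker():
    # Runs once per worker process, so the load cost is paid per process,
    # not per report.
    global _COMMON_DATA
    _COMMON_DATA = load_common_data()


def make_report(positions):
    """Total value of (currency, amount) pairs using the worker's data copy."""
    return sum(amount * _COMMON_DATA[ccy] for ccy, amount in positions)


if __name__ == "__main__":
    books = [[("EUR", 100), ("USD", 200)], [("CHF", 50)]]
    with ProcessPoolExecutor(max_workers=2, initializer=init_worker) as pool:
        print(list(pool.map(make_report, books)))
```

Because the workers only read the common data, no synchronisation is needed between them. The trade-off is the duplicated memory, which matches the per-process consumption described above.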