PyParallel is a research project by Trent Nelson. Its goal is to bring the power of Windows’ I/O Completion Ports to Python, enabling high-performance asynchronous I/O.
Python’s asynchronous support is somewhat problematic. It is designed around the Unix/Linux model of synchronous, non-blocking I/O, in which a thread continuously polls for incoming data and dispatches it accordingly. While Linux is tuned for this pattern, it is disastrous for performance on Windows, where copying the data from the polling thread to the thread that will actually process the work is expensive.
So what PyParallel delivers instead is true asynchronous I/O using the native I/O Completion Ports (IOCP). Under the IOCP model there is one thread per core. Each thread handles completing the I/O request (e.g. copying data from the network card) and executes the application level callback associated with it.
This alone isn’t enough to scale out Python; the GIL, or Global Interpreter Lock, also needs to be addressed. Otherwise you are still limited to one thread executing at a time. Replacing the GIL with fine-grained locks has been found to be even worse, and software transactional memory, as seen in PyPy, usually ends up with one thread making progress and N-1 threads continuously retrying. So something else is needed.
For the PyParallel team, that something else is to disallow free threading. That is to say, the application cannot arbitrarily create new threads. Instead, parallel operations are tied to the async callback mechanism and the concept of a parallel context.
Before we dive into parallel contexts, we need to look at their counterpart: when a parallel context isn’t running, the main thread is, and vice versa. The main thread is what you think of in normal Python development. It holds the GIL and has full access to the global namespace.
Conversely, a parallel context has read-only access to the global namespace. This means that the developer needs to pay attention to whether something is a main thread object or parallel context object. (COM programmers dealing with apartment threading models know this pain all too well.)
For non-I/O tasks, the main thread queues up work using the async.submit_work function, then switches to the parallel context using the async.run function. This suspends the main thread and activates the parallel interpreter. Multiple parallel contexts can run at the same time, with the Windows OS handling the thread pool management.
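A minimal sketch of what this looks like, using the async.submit_work and async.run calls named above. PyParallel builds on Python 3.3, where async is still a legal module name; the exact signatures here are approximations based on the description, not a confirmed API reference.

```python
import async  # PyParallel's async module (Python 3.3 base, so the name is legal)

def word_count(text):
    # Runs in a parallel context: it may read globals, but cannot mutate
    # them or create objects that outlive the context.
    return len(text.split())

# The main thread queues up work...
async.submit_work(word_count, "the quick brown fox")
async.submit_work(word_count, "jumps over the lazy dog")

# ...then suspends itself and activates the parallel interpreter. Windows
# schedules the callbacks across a thread pool, one thread per core.
async.run()
```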
Parallelism with a GIL
At this point it is important to note that multiple processes are not created. Though that technique is commonly used in Python development, PyParallel keeps everything in one process to reduce cross-process communication costs. Normally this wouldn’t be possible because the CPython interpreter isn’t thread safe:
- Global statics are frequently used
- Reference counting isn’t atomic
- Objects are not protected by locks
- Garbage collection isn’t thread safe
- Interned string creation isn’t thread safe
- The bucket memory allocator isn’t thread safe
- The arena memory allocator isn’t thread safe
Greg Stein tried to fix this by adding fine-grained locking to Python 1.4, but his project was rejected because it caused a 40% slowdown in single-threaded code. So Trent Nelson decided on a different approach: while in the main thread, the GIL operates as normal, but when running in a parallel context, thread-safe alternatives to the core functions are used instead.
The overhead of Trent’s approach is 0.01%, much better than Greg’s attempt. PyPy’s software transactional memory, by comparison, carries an overhead of roughly 200 to 500% in the single-threaded case.
An interesting feature of this design is that code running in a parallel context doesn’t need to acquire a lock to read objects in the global namespace. But again, that access is read-only.
PyParallel doesn’t have a garbage collector
In order to avoid dealing with locks for memory allocation, access, and garbage collection, PyParallel uses a share-nothing model. Each parallel context gets its own heap and no garbage collector. That’s right: there is no garbage collector associated with a parallel context. Here’s what happens instead:
- Memory allocation is done using a simple block allocator; each allocation just bumps a pointer (see the sketch after this list).
- As needed, new pages of either 4K or 2MB are allocated, depending on the parallel context’s large pages setting.
- No reference counting is used.
- When the parallel context terminates, all pages associated with it are freed at one time.
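To make the bump-allocation idea concrete, here is a toy illustration in pure Python. The real PyParallel allocator works on raw memory pages in C; this sketch only models the pointer-bumping and free-everything-at-once behavior described above.

```python
class BumpAllocator:
    PAGE_SIZE = 4096  # new pages are 4K (or 2MB when large pages are enabled)

    def __init__(self):
        self.pages = [bytearray(self.PAGE_SIZE)]
        self.offset = 0  # next free byte in the current page

    def alloc(self, size):
        if size > self.PAGE_SIZE:
            raise ValueError("allocation larger than a page")
        if self.offset + size > self.PAGE_SIZE:
            # Current page exhausted: grab a fresh page and start over.
            self.pages.append(bytearray(self.PAGE_SIZE))
            self.offset = 0
        view = memoryview(self.pages[-1])[self.offset:self.offset + size]
        self.offset += size  # just bump the pointer: no free list, no GC
        return view

    def release_all(self):
        # Mirrors context teardown: every page is freed at once.
        self.pages.clear()
        self.offset = 0
```

There is no per-object bookkeeping at all, which is what makes allocation in a parallel context so cheap.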
This design eliminates the cost of a thread-safe garbage collector or thread-safe reference counting. It also allows for the aforementioned block allocator, which is probably the fastest possible way to allocate memory.
The PyParallel team thinks they can get away with this design because parallel contexts are meant to be short lived and finite in scope. A good example would be a parallel sorting algorithm or a web page request handler.
In order to make this work, objects created in a parallel context cannot escape into the main thread. This is enforced by the read-only access to the global namespace.
Reference Counting and Main Thread Objects
At this point we have two types of objects: main thread objects and parallel context objects. Main thread objects are reference counted because at some point they are going to need to be deallocated. Parallel context objects are not reference counted. But what about the interaction between the two?
Since a parallel context cannot modify a main thread object, it cannot alter that object’s reference count. But since the main thread’s garbage collector cannot run while parallel contexts are running, that’s not a problem: by the time the main thread GC starts, all of the parallel contexts have been destroyed and nothing points from them back to the main thread objects.
The end result of all this is that code executing in a parallel context is actually faster than code executing on the main thread.
Parallel Contexts and Async I/O
The memory model discussed above starts to break down when you introduce asynchronous I/O calls. These calls can keep a parallel context alive far longer than the system was designed for, and in the case of something like a web page request handler, there could be an unbounded number of calls.
To deal with this problem, Trent added the concept of snapshots. When an asynchronous callback starts, a snapshot of the parallel context’s memory is taken. At the end of the callback, the changes are reverted and the newly allocated memory is freed. Again, this works well for something stateless like a web page request handler, but is a bad fit if you need to retain data.
Snapshots can be nested up to 64 layers deep, but Trent didn’t go into details about how that works.
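Since Trent didn’t describe the mechanism, here is one plausible way snapshots could compose with the bump allocator sketched earlier: record the allocation pointer on entry and roll back to it on exit. This is a guess for illustration, not PyParallel’s actual implementation.

```python
class SnapshotAllocator(BumpAllocator):  # extends the toy allocator above
    MAX_DEPTH = 64  # snapshots can nest up to 64 layers deep

    def __init__(self):
        super().__init__()
        self.snapshots = []  # stack of (page_count, offset) pairs

    def snapshot(self):
        # Taken when an asynchronous callback starts.
        if len(self.snapshots) == self.MAX_DEPTH:
            raise RuntimeError("snapshot depth exceeded")
        self.snapshots.append((len(self.pages), self.offset))

    def revert(self):
        # At the end of the callback: drop every page allocated since the
        # snapshot and rewind the pointer, freeing the callback's memory.
        page_count, offset = self.snapshots.pop()
        del self.pages[page_count:]
        self.offset = offset
```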
Balancing Synchronous and Asynchronous I/O
Asynchronous I/O isn’t free. When you want the most throughput with the least latency, synchronous I/O can actually be faster, but only if the number of concurrent requests is less than the number of available cores.
Since the developer doesn’t necessarily know what the load will be at any given time, expecting them to make the decision is unreasonable. So PyParallel offers a socket library that makes the decision at runtime based on the number of active clients. As long as the number of active clients is less than the number of cores, synchronous code is executed; if the client count exceeds that, the library automatically switches to asynchronous mode. Either way, the application code is unaware of the change.
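The heuristic itself is simple. The sketch below restates the rule from the paragraph above; the function name and placement are illustrative, not PyParallel’s actual socket library code.

```python
import os

def io_mode(active_clients):
    # Synchronous I/O wins while the concurrency fits on the available
    # cores; past that point the library flips to asynchronous (IOCP) mode.
    return "sync" if active_clients < os.cpu_count() else "async"
```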
Asynchronous HTTP Server
As part of the proof of concept, PyParallel comes with an asynchronous HTTP server based on the SimpleHTTPServer module in the standard library. One of its key features is support for the Win32 function TransmitFile, which sends data directly from the file cache to the socket. PyParallel’s HTTP server can asynchronously prepend and append data to the transmitted file.
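A hypothetical handler sketch of what that could look like. The transport.sendfile call and its before/after parameters are assumptions for illustration, not a confirmed PyParallel API; they mirror TransmitFile’s TRANSMIT_FILE_BUFFERS structure, which lets a head and a tail buffer be sent around the file straight from the Windows file cache.

```python
class StaticHandler:
    def on_get(self, transport, request):
        # Head buffer: HTTP headers prepended before the file contents.
        header = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
        # Tail buffer: data appended after the file contents.
        trailer = b"\r\n"
        # Hypothetical call; the file itself never passes through user space.
        transport.sendfile(before=header, filename="index.html", after=trailer)
```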
Future Plans
In the future, Trent plans to continue improving the memory model by introducing a new set of interlocked data types and using context managers to control memory allocation protocols.
Integration with Numba is also in the works. The idea is to launch Numba asynchronously and atomically swap out CPython for natively generated code when Numba completes.
Another planned change is support for pluggable PxSocket_IOLoop endpoints. This would allow different protocols to be chained together in a pipeline fashion. Where possible, he wants to use pipes instead of sockets for this to reduce the amount of copying between steps.
For more information see Trent Nelson’s presentation, PyParallel - How We Removed the GIL and Exploited All Cores (Without Needing to Remove the GIL at all).
About the Author
Jonathan Allen has been writing news reports for InfoQ since 2006 and is currently the lead editor for the .NET queue. If you are interested in writing news or educational articles for InfoQ, please contact him at jonathan@infoq.com.