I used not to care about parallelizing small programs: I thought parallelizing them would cost me more time than writing the programs themselves, and my laptop has only four cores. Unless a task was IO intensive, I would leave it running on a single thread. Everything changed once I got access to a machine with 32 cores and 128GB of memory. Looking at all those idle cores in htop, I felt the urge to utilize them. And it turns out it's super easy to parallelize a Python program.
The Python standard library covers a large part of everyday needs and is very handy, which is one of the reasons I love Python so much. It has two built-in libraries for parallelism: `multiprocessing` and `threading`. Using threads is a natural first thought, especially if you have experience with parallelism in other languages. Threads are awesome: the cost of multithreading is relatively low, and threads share memory by default. However, I have to tell you the bad news: if you are using the CPython implementation, using the `threading` library for computation-intensive tasks means saying goodbye to parallelism. In fact, it will run even slower than a single thread.
Global Interpreter Lock
The CPython I just mentioned is the implementation of the Python language from python.org. Yes, Python is a language, and there are many different implementations, such as PyPy, Jython, IronPython, and so on. CPython is the most widely used one; when people talk about Python, they are almost always talking about CPython.
CPython uses a Global Interpreter Lock (GIL) to simplify the implementation of the interpreter. The GIL lets the interpreter execute bytecode from only one thread at a time. Unless the threads are waiting on IO operations, multithreading in CPython is entirely a lie! Of course, if you don't use CPython and don't have the GIL issue, you might consider multithreading.
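A quick way to see the GIL in action is to time a CPU-bound function on one thread versus two. The sketch below is my own illustration (the function name and workload size are arbitrary); the exact numbers depend on your machine, but on CPython the two-thread version is typically no faster:

```python
import threading
import time

def count_down(n):
    # pure-Python CPU-bound loop; the running thread holds the GIL
    while n > 0:
        n -= 1

N = 5_000_000

# single thread does all the work
start = time.perf_counter()
count_down(N)
single = time.perf_counter() - start

# two threads split the same work
start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N // 2,))
t2 = threading.Thread(target=count_down, args=(N // 2,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

# on CPython, `threaded` is usually close to (or worse than) `single`
print('single: %.2fs, two threads: %.2fs' % (single, threaded))
```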
To learn more about the GIL, you might be interested in the following materials:
So, ruling out `threading`, let's take a look at `multiprocessing`. `multiprocessing.Pool` is a handy tool. If you can solve a task in the form `ys = map(f, xs)`, which, as we all know, is perfectly parallelizable, it's super easy to parallelize it in Python. For example, to square each element of a list:
```python
import multiprocessing

def f(x):
    return x * x

cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=cores)
xs = range(5)

# method 1: map
print(pool.map(f, xs))  # prints [0, 1, 4, 9, 16]

# method 2: imap
for y in pool.imap(f, xs):
    print(y)  # 0, 1, 4, 9, 16, respectively

# method 3: imap_unordered
for y in pool.imap_unordered(f, xs):
    print(y)  # may be in any order
```
`map` returns a list. The two functions prefixed with `i` return an iterator instead, and the `_unordered` suffix means the results come back in whatever order they finish.
When a task takes a long time, we might want a progress bar. Printing `\r` moves the cursor back to the start of the line without advancing to a new line, so successive writes overwrite each other. We can have this:
```python
import sys

cnt = 0
for y in pool.imap_unordered(f, xs):
    cnt += 1
    sys.stdout.write('done %d/%d\r' % (cnt, len(xs)))
    # deal with y
```
Also, there is a handy library, `tqdm`, that handles the progress bar for you:

```python
from tqdm import tqdm

# pass total= so tqdm knows the length,
# since imap_unordered returns an iterator with no len()
for y in tqdm(pool.imap_unordered(f, xs), total=len(xs)):
    pass  # deal with y
```
More Complicated Tasks
If the task is more complicated, you can use `multiprocessing.Process` to spawn subprocesses. There are several ways to have those subprocesses communicate with each other:
I strongly recommend `Queue`. Many applications fit the producer-consumer model. For example, you can have the parent create a `Queue` and pass it as one of the arguments when spawning children. Then the parent can send tasks to the children, and the children can send back results via the `Queue`.
Watch Out When Importing Libraries
Some side effects can happen when importing libraries, such as setting global variables, opening files, listening on sockets, creating threads, and so on. For example, if you import libraries that use CUDA, such as Theano and TensorFlow, and then spawn a subprocess, things might go wrong:
```
could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED
```
The solution is not to import those libraries in the parent, but to import them in the subprocesses instead. Here is how you can do it with `multiprocessing.Process`:
```python
import multiprocessing

def hello(taskq, resultq):
    import tensorflow as tf
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
    while True:
        name = taskq.get()
        res = sess.run(tf.constant('hello ' + name))
        resultq.put(res)

if __name__ == '__main__':
    taskq = multiprocessing.Queue()
    resultq = multiprocessing.Queue()
    p = multiprocessing.Process(target=hello, args=(taskq, resultq))
    p.start()
    taskq.put('world')
    taskq.put('abcdabcd987')
    taskq.close()
    print(resultq.get())
    print(resultq.get())
    p.terminate()
    p.join()
```
With `multiprocessing.Pool`, you can instead pass an initializer function to the pool:
```python
import multiprocessing

def init():
    global tf
    global sess
    import tensorflow as tf
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)

def hello(name):
    return sess.run(tf.constant('hello ' + name))

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2, initializer=init)
    xs = ['world', 'abcdabcd987', 'Lequn Chen']
    print(pool.map(hello, xs))
```