Parallel Python

Introduction for instructors

This page describes a course module for teaching parallelism with OpenMPI in Python. The materials are adapted from those used in teaching the BIO 371 DIVAS Seminar II, which was the capstone course for our Digital Imaging and Vision Applications in Science (DIVAS) program. The course description for this course is:

This seminar is a capstone to the Doane Digital Imaging and Vision Applications in Science Project and is intended for DIVA scholars. Students mentor incoming DIVA scholars by occasionally participating in DIVAS Seminar I. Students will develop and present a summary of their research to students in DIVAS Seminar I, most likely at MindExpo. Finally, students will explore various ways to use supercomputing and parallelism to solve a variety of problems. DIVA scholars are required to enroll in this course after completing summer research.

Some specific activities for our semester are listed below:

Git, bash, Python, OpenCV review

Use of Onyx or other supercomputer

Parallel architectures, problem-solving methodologies, etc.

When / what to parallelize

Language-specific implementation

SLURM batch files, etc.

Solving a variety of problems from different disciplines

Application of above in a project

Mentoring DIVAS I cohort

MindExpo presentation regarding DIVAS I activities / summer research

The DIVAS Seminar II course was taught to non-computing majors who had minimal programming experience. So, our approach was to introduce the subject of parallelism in as simple a way as possible.

Module introduction

We will use a single problem to motivate our discussion for this module: estimating the mathematical constant π. There are a wide variety of different infinite series formulae for estimating π, but the approach we will take here uses a Monte Carlo method, which means that it relies on randomness.

Using a circle’s area to estimate π

Consider the formula for calculating the area of a circle with radius r:

Area of a circle

If we use one for the radius of our circle, then the area formula simplifies to:

A = π.

And, suppose we look at only one quarter of the circle, say, the upper right quadrant. Then, we would have

A = π / 4.

We’re trying to find an estimate of π, so solving for π in the last equation gives us:

π = 4A.

So, if we have a way to estimate the area of one quadrant of a unit circle, we can multiply that by four and arrive at an estimate of π.

The Monte Carlo method

To find the area of our circle quadrant, we’re going to pretend that the unit square containing the quadrant is a dart board, and throw lots and lots of virtual darts at it!

Our Monte Carlo method works like this: we’ll generate many, many (x, y) coordinates, where each x and y is a randomly chosen number in the range [0, 1). For each coordinate, we will calculate the distance from the origin (0, 0) to the point (x, y). If that distance is less than or equal to one, then the point lands inside the unit circle quadrant; if the distance is greater than one, then the point landed outside the circle quadrant. We will count how many of our darts / coordinates were inside the circle, and if we throw enough darts, dividing that number by the total number of darts thrown will give us an estimate of the area of the circle quadrant.

This image may help you understand the process. The leftmost image shows our dartboard after just a few darts were thrown; those that are inside the circle quadrant are colored red, while those outside the quadrant are blue. The middle image has more darts, and the rightmost image has even more.

Monte Carlo method

As we have more and more darts, the ratio of red to blue points will be closer and closer to the area of the circle quadrant.

In order to determine if a dart is inside the circle quadrant, we need to calculate its distance from the origin. Generally speaking, the distance between two points (x₁, y₁) and (x₂, y₂) can be determined by this formula:

General distance formula

We can simplify things to make the formula simpler and faster to calculate. First, recall that one of our points in this situation is always (0, 0). Let’s assume that (x₁, y₁) = (0, 0). Now, the subtractions are eliminated, because subtracting zero from any value leaves you with the same value. So, our distance formula is now:

Simplified distance formula

Squaring both sides, we have:

Further simplified distance formula

We are interested in the case where d ≤ 1. Squaring both sides of that inequality, we have d² ≤ 1² = 1. So, if the Boolean expression x * x + y * y <= 1.0 is true, our dart was inside the circle quadrant.

Python solution, take one

Here is a Python function to implement our Monte Carlo method of estimating π.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
def monte_pi(n):
    '''
    Use a Monte Carlo method to estimate the value of π.

    parameters
    ----------
    n : integer
      Number of "darts" to throw at a unit circle quadrant.

    returns
    -------
    Estimate of π, as a floating-point number.
    '''
    import random

    num_in = 0
    for i in range(n):
        x = random.random() # random number in [0, 1)
        y = random.random()
        if x * x + y * y <= 1.0:
            num_in += 1
    
    return 4.0 * num_in / n

We use the for statement to throw our darts, and the random.random() method to get the (x, y) value for each dart. Then, we use our simplified distance formula to determine if each dart landed inside the circle quadrant. We use an accumulator pattern to keep track of how many darts landed inside the quadrant. Finally, we return four times our area estimate as our estimate of π.

We called this function for 100, 10,000, and 10,000,000 darts, and as expected, the estimate of π gets better as the number of darts increases.

Estimate using 100 darts: 3.040000
Estimate using 10000 darts: 3.145600
Estimate using 10000000 darts: 3.141642

There is one problem, however. As the number of darts increases, it takes longer and longer for our code to execute. So, is there a way to speed things up? Well, now that you asked…

Parallelism with OpenMPI

One way we can speed up our Monte Carlo simulation is parallelism. The previous Python code uses a single Central Processing Unit (CPU) core – think of this as a single computer – to do its work. But, modern computers have multiple CPU cores; the author’s desktop has eight. And, there are cluster computers that have hundreds or thousands of cores. In parallelism, we use more than one core at the same time, all working on the same problem, to speed up our program. Here, we will see how we can use OpenMPI to parallelize our Monte Carlo estimation of π.

Introduction to OpenMPI

Note: in the discussion below, we will use the terms processors and processes interchangeably, to refer to a single CPU.

The Message Passing Interface (MPI) is a standard way to use parallelism, especially on cluster computers. Here we will use an open source implementation of MPI, OpenMPI.

The key concept in MPI is message passing. That is, all of the different processors we have working on our problem will communicate with each other by sending messages back and forth. These messages will be either data (e.g., the number of darts that landed in the circle quadrant) or control information (e.g, here’s how many darts you should throw).

In an MPI program the collection of processors working together on a problem is called a communicator group. Inside the group, each processor is assigned a unique integer number, called its rank. The process with rank zero is known as the root process; it is responsible for organizing the work that the remaining processors will do. We will call the non-root processes drones. The drones do the work while the root controls the algorithm.

Each processor in the communicator group will be running the same program. So, we will use logic (an if / else statement) to separate the code the root process should execute from the code the drone processes should use.

Template Python OpenMPI program

Generally speaking, a Python program using OpenMPI might look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# we need this import to use MPI in our Python code
from mpi4py import MPI 

# get the communications object
comm = MPI.COMM_WORLD

# get rank of current processor
rank = comm.Get_rank()

if rank == 0:
    ###########################################################################
    # root processor section                                                  #
    ###########################################################################

else:
    ###########################################################################
    # drone processor section                                                 #
    ###########################################################################

In this template, the statements the root processor should execute go in the root processor section, while the statements the drone processors should use go in the drone processor section.

Messages are passed back and forth between processes in an MPI program using the send() and recv() methods of the MPI.COMM_WORLD object. The way we will use it, the send() method takes three parameters:

first, the object we want to send, then,
second, the dest parameter, which specifies the rank of the process the message should go to, and finally
third, a tag parameter, which is a number used to identify the type of message this is.

The recv() method is simpler; it takes two parameters:

first, the source parameter, which identifies the rank of the process we are looking to receive a message from, and
second, a tag parameter, which is a number used to identify the type of message we expect to receive.

Parallelized Monte Carlo method

Given the template above, and what we know about sending and receiving messages, we can now put together an OpenMPI program in Python for estimating the value of π with a Monte Carlo method. The strategy will be this:

the drone processes will each throw 10,000,000 darts at their own unit circle quadrant, and will send the number of darts that landed in the quadrant to the root processor, while
the root process will collect all of the drone processor’s counts, add them all up, and use that result to estimate the value of π.

Here’s the program to implement this strategy:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
import datetime
import random

# we need this import to use MPI in our Python code
from mpi4py import MPI 

def monte_pi_count(seed, n):
    '''
    Count how many darts land in the unit circle quadrant in a Monte Carlo
    estimation of π.

    parameters
    ----------
    seed : integer
      Seed value for the random number generator.
    n : integer 
      Number of "darts" to throw at a unit circle quadrant.

    returns
    -------
    Number of darts that landed inside the unit circle quadrant.
    '''
    import random

    # seed the random number generator
    random.seed(seed)

    num_in = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            num_in += 1

    return num_in

###############################################################################
# main program                                                                #
###############################################################################

# The following collects information about the process. We will almost always 
# need the rank, but the others are just illustrative in this example.

# get the communications object
comm = MPI.COMM_WORLD

# get total number of processors
size = comm.Get_size()

# get rank of current processor
rank = comm.Get_rank()

# get name of the processor
name = MPI.Get_processor_name()

# just for illustration, each process (root and drones) prints the information
# it gathered.
print(size, rank, name)

# define a tag number for the messages we will pass
COUNT_FROM_DRONE = 1

# constant for number of darts each drone throws
N = 10000000

if rank == 0:
    ###########################################################################
    # root processor section                                                  #
    ###########################################################################
    
    # accumulate the number of darts in the unit circle quadrant from each of 
    # the drone processes
    count_acc = 0
    for proc in range(1, size): # proc is rank of each drone process
        # get messages from each drone, with their count. Parameters:
        #     source: rank number of drone process to receive from
        #     tag: number identifying which type of message to look for
        count = comm.recv(source = proc, tag = COUNT_FROM_DRONE)

        # illustrative only: report counts from each drone
        print('Received count {0:d} from processor {1:d}'.format(count, proc))

        # accumulate the value received
        count_acc += count

    # now estimate π
    pi = 4.0 * count_acc / ((size - 1) * N)

    # print our results
    print('Estimate of pi: {0:f}'.format(pi))

    # root process is complete!

else:
    ###########################################################################
    # drone processor section                                                 #
    ###########################################################################

    # each processor needs to have a different random number seed, otherwise 
    # they all will do exactly the same thing. So, use the current time in
    # microseconds, plus the drone's rank, as a seed value
    seed = datetime.datetime.now().microsecond + rank

    # count number of darts in the unit circle quadrant, for this drone
    count = monte_pi_count(seed, N)

    # illustrative only: print out this drone's count
    print('Drone process {0:d} count: {1:d}'.format(rank, count))

    # send the count value back to the root process. Parameters:
    #   count: the value we're sending
    #   dest: which process we're sending it to; in this case, the root
    #   tag: number identifying which type of message we're sending
    comm.send(count, dest = 0, tag = COUNT_FROM_DRONE)

    # and that's it! Drone process is complete

We have modified our Monte Carlo function to just count and return the number of darts that land on the unit circle quadrant. The drone process code calls the function and sends the results back to the root process with the send() method. The root process uses a for loop to receive counts from each of the drones with the recv() method, accumulates the results, and calculates the estimate of π. In the send() and recv() methods we use a tag value of 1, represented by the variable COUNT_FROM_DRONE. The tag field is sort of like the subject field of an email message – it tells the receiver what the received value is referring to.

We ran our code on the author’s desktop computer, which has eight cores. Here are the results:

8 2 LI231.doane.local
Drone process 2 count: 7853298
8 1 LI231.doane.local
Drone process 1 count: 7856538
8 3 LI231.doane.local
Drone process 3 count: 7852901
8 4 LI231.doane.local
Drone process 4 count: 7855042
8 7 LI231.doane.local
Drone process 7 count: 7853070
8 5 LI231.doane.local
Drone process 5 count: 7853401
8 6 LI231.doane.local
Drone process 6 count: 7855272
8 0 LI231.doane.local
Received count 7856538 from processor 1
Received count 7853298 from processor 2
Received count 7852901 from processor 3
Received count 7855042 from processor 4
Received count 7853401 from processor 5
Received count 7855272 from processor 6
Received count 7853070 from processor 7
Estimate of pi: 3.141687

First, notice that the order of each drone process finishing up is not sequential. In fact, each time we run the program, it’s likely that the order will be different! That is an important aspect of parallel programming. Each of the processes is separate, and due to uncontrollable differences in the load from other programs on each processor, we cannot rely on them finishing in any particular order, even if they are all executing the same program!

Our root process does receive the messages passed to in in order, since we put the recv() call inside a for loop. The root waits for the message from drone one before looking for the message from drone two, and so on. If, say, the message from drone four arrives before the message from drone one, the message from drone four is saved until the root process executes the appropriate recv() call.

We can see that we got a workable estimate of π, based on throwing a total of 70,000,000 darts. And, the overall process took less time than it would have taken to throw 70,000,000 darts using our first, single-processor program.

Running Python OpenMPI on your own PC

You do not have to have access to a Linux cluster computer to use OpenMPI in your Python programs! Here’s how you can set up your Windows PC to run MPI in Python, assuming you already have Python installed:

Install the Microsoft MPI Software Development Kit (SDK), and
Install the mpi4py Python module, via the command pip install mpi4py.
Execute your program (here named MontePiMPI.py) from the command line like this:

mpiexec python MontePiMPI.py

That’s it! You are now ready to run OpenMPI programs in Python on your very own computer!