Posted 13 Sep 2018
Licensed CPOL

# Python Single Thread vs Multi-Thread vs Multi-Process on Monty Hall Problem

Updated 11 Oct 2018

This is a probability-based simulation demonstrating that 'swapping' is the best option. We also look at the effect of multi-threading and multi-processing in Python.

## Introduction

With all the moving for both work and residence I have had since 2015, I had really dropped out of doing any software or gadget/micro related stuff. I needed to get back into it again. I wasn't sure with what, but I just needed something to prod the grey matter again.

The first part of this was to get my main development box under control. It was driving me mad with the noise, and it was also chock full. I took the plunge into building a new watercooled box which I named Serenity. That gave me some enjoyment, as it had been a while since I had given my machine an upgrade, and adding the watercooling not only helped with the noise but also gave me some hardware to play with. I wrote a build log complete with pictures and did some benchmarking; you can read about it in the article here.

I was then poking around YouTube and stumbled across a video about the Monty Hall Problem that someone had submitted to Engineer Man. He basically showed the output of the program, but no details on how it was constructed, though he did give a basic explanation of the Monty Hall Problem in his introduction. These two things gave me enough of a poke, and off I went, and here we are now.

Armed with Visual Studio Code, Python installed on my laptop and Google at the ready... it was time to begin.

## The Monty Hall Problem

You can read up more on this over on Wikipedia, but the short form is as follows:

You are a contestant on a game show, and you have to pick from three options, one of which is the prize. The host, in this case Monty Hall, knows which option is the winner. Once you, the contestant, have selected your option, the host will show you one of the other options, revealing the losing one. You now have the choice of either sticking with your original choice, or swapping to the last unrevealed option.

Probability shows that you should always swap: swapping wins 2/3 (about 66%) of the time, while sticking with your original choice wins only 1/3 (about 33%) of the time.
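Before getting into the full program, the claim is easy to sanity-check in a few lines. This is a standalone sketch, not the article's code, and the helper name `play_round` is my own:

```python
import random

def play_round():
    """Play one round; return (stick_won, swap_won) as booleans."""
    doors = [1, 2, 3]
    winning = random.choice(doors)
    pick = random.choice(doors)
    # Host opens a losing door that is not the participant's pick.
    shown = random.choice([d for d in doors if d != winning and d != pick])
    # Swapping means taking the one remaining unopened door.
    swapped = next(d for d in doors if d != pick and d != shown)
    return pick == winning, swapped == winning

rounds = 100000
stick = swap = 0
for _ in range(rounds):
    s, w = play_round()
    stick += s
    swap += w

print("Stick win rate:", stick / float(rounds))  # hovers around 0.333
print("Swap win rate:", swap / float(rounds))    # hovers around 0.667
```

Note that exactly one of the two strategies wins each round, which is why the two rates always sum to 1.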

I had also decided to write this in Python, as I had only ever touched the language once before, and that was really just a copy and paste from an example to get a temperature sensor reading via a Raspberry Pi, so not enough to get me thinking. Python is one of those languages that seems to be doing the rounds more and more and gaining in popularity; at least, I see it mentioned more and more on the net.

That was where it was originally planned to end. However, I started looking into the performance of the run, which took me into the realms of multi-threading and a surprising outcome of it in Python, and this in turn led to looking into how multi-processing fits in and can be achieved. The latter turned into a proper headache to begin with!

## Getting into the Code

When I first wrote the program and got it working, I was like 'Yeah!'. Then I wanted to be able to change how many rounds the code would run, and also to turn off the round output to speed up execution for longer runs. This led me to add functionality to set these options via command line inputs, without having to change variables in the code each time.

What we are going to walk through here is the code that did just that: a single-threaded, end-to-end program.

The first thing we needed to do was import the various libraries that we would need functionality from.

```
import argparse # argparse added to support command line parameter functionality
from random import randint, choice # used for selections
from timeit import default_timer as timer # used for timing the runs.
```

`argparse` provides the functionality to do the command line parameter parsing and handling. It is a bit of a complicated beast to set up correctly, and I had to search a few different resources for examples. I wasn't doing anything complicated, as we will see later, but still had issues getting it to behave.

You can also see that for the `default_timer`, we also give it an alias name `timer`.

Next up, we will set a couple of variables with defaults for the program. Namely, how many rounds we will run and if we will show the individual round output.

```
numberOfRounds = 1000     # Set default for number of rounds
roundOutput = False       # Set default for display of individual round output
```

The next section relates to the handling of the command line parameters by configuring how the `argparse` will deal with the various parameters.

```
# Setup the argparse
parser = argparse.ArgumentParser(prog="montyhallsim",
    description='''Monty Hall Simulation. This is a basic Monty Hall Simulation,
    the program will run for a given number of rounds
    and display the number of wins for the different methods (stick/random/swap).''',
    epilog='''For more information on the Monty Hall Problem, see
    https://en.wikipedia.org/wiki/Monty_Hall_problem''')

# Add argument for displaying the round output.
parser.add_argument("-o", "--output", action="store_true",
    help="Display individual round output. Default is hidden.")

# Add argument for setting the number of rounds.
parser.add_argument("-r", "--rounds", nargs=1, type=int, default=numberOfRounds,
    help="Set the number of rounds. Integer. Default is 1000.")

args = parser.parse_args()

if args.output:
    roundOutput = True

if args.rounds:
    if type(args.rounds) is int:        # If not supplied on cli, default value returns int type
        numberOfRounds = args.rounds
    elif type(args.rounds) is list:     # If supplied on cli, it returns a list of int, need 1st one.
        numberOfRounds = args.rounds[0]
    if numberOfRounds == 0:             # Prevent user providing 0 as a number
        numberOfRounds = 1
```

The first part declares a variable `parser` that is the instance of the `argparse` parser, and then sets some parameters containing information that will be presented to the user about the program.
The arguments are added next, and you can see that both short and long forms can be set up, e.g., `-o` and `--output`. In this instance, for `--output`, an action is defined to set the parameter to `True` if it is specified.
Logic is then applied using the `args` collection to determine how the program should respond to the user's command line inputs.
One interesting thing to note is how the type of the `--rounds` parameter value changes depending on whether it is supplied or defaulted. If it is defaulted, it is an `int`, and if it is provided, it is a `list`, so you need to test accordingly.
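This int-versus-list quirk comes from combining `nargs=1` with a plain scalar `default`: argparse wraps parsed values in a list when `nargs=1` is set, but hands the default back untouched. A minimal, self-contained demonstration (the `prog` name here is just for illustration):

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("-r", "--rounds", nargs=1, type=int, default=1000)

# Not supplied: the default comes back untouched, as an int.
args = parser.parse_args([])
print(type(args.rounds))            # <class 'int'>

# Supplied: nargs=1 wraps the parsed value in a one-element list.
args = parser.parse_args(["--rounds", "10"])
print(type(args.rounds), args.rounds)  # <class 'list'> [10]
```

Dropping `nargs=1` entirely is the usual way to avoid the quirk, as argparse then returns a bare `int` in both cases.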

We need somewhere to store the data of each `round`, and also the final `results` of the simulation, so two variables are declared for these. `round` will be re-used on each iteration.

```
# current round array contains, [RoundNumber, WinningNumber, ParticipantPick,
# HostShow, ResultStick, ResultRandom, ResultSwap]
round = [0,0,0,0,False,False,False]

# count of wins for each strategy, stick, random, swap
results = [0,0,0]
```

And the last real part of the setup is the variables to do with the execution timing. We declare two variables to hold the start and finish times.

```
# Timings for Run
startTime = timer()
finishTime = timer()
```

Now we are pretty much ready to begin execution of a run. Before we can actually trigger off the run, we define the various methods that build up the main execution loop.

The first part of this is the `main()` method. In this, we generate some output to tell the user what we are doing, using some basic `print()` statements. We check if the user wants to see the individual round output, and if so, print the header so they know what the contents of the round output mean. This is a basic `if` test of the `roundOutput` variable declared at the start and set via the command line. Following this, we move into a `for` loop which repeats until the `range` of `numberOfRounds` is reached; this was also declared at the start and set via the command line if required. The loop calls the `runRound()` method on each iteration.

Once the loop has completed, the output prints the results from the run, and also calculates the run time, and then the program exits.

```
def main():
    # Print the run parameters
    print("Monty Hall Simulator, 3 boxes.")
    print("Number of Rounds: " + str(numberOfRounds))

    if roundOutput == True:
        print("RoundNumber, WinningNumber, ParticipantPick, HostShow, ResultStick, ResultRandom, ResultSwap")

    for round[0] in range(numberOfRounds):
        runRound()

    finishTime = timer()
    duration = finishTime - startTime
    print("Results for Number of Rounds: " + str(numberOfRounds))
    print("============================================================")
    print("Duration, " + str(duration) + " seconds.")
    print("Stick  = " + str(results[0]) + " : " + str((float(results[0]) / numberOfRounds) * 100) + " %")
    print("Random = " + str(results[1]) + " : " + str((float(results[1]) / numberOfRounds) * 100) + " %")
    print("Swap   = " + str(results[2]) + " : " + str((float(results[2]) / numberOfRounds) * 100) + " %")
```

In the `main()` method loop, we called the `runRound()` method. This starts the process of performing the logic associated with an individual round. The first part of this is to randomly select the winning box, and to perform the random selection by the participant. After this, it calls the next phase of the round, which is the `hostPick()` function.

```
def runRound():
    # Increment Round Number
    round[0] += 1

    # Select the rounds winning box, random choice
    round[1] = randint(1,3)

    # Select the participant random choice
    round[2] = randint(1,3)

    # Host does their reveal next.
    hostPick()
```

As the host knows what the participant has selected, and also where the winning box is, using some basic logic the host determines which option to reveal to the participant. This is, of course, a non-winning box. The method then calls `participantChoiceResult()`, which is where the final choice is made by the participant and the results of the round are determined.

```
def hostPick():
    # host compares winning box with participant choice and shows a losing box

    # 1st Case, Participant has chosen the right box
    if round[1] == round[2]:
        if round[1] == 1:
            round[3] = choice([2,3]) # Participant Pick 1, Host Show 2 or 3
        if round[1] == 2:
            round[3] = choice([1,3]) # Participant Pick 2, Host Show 1 or 3
        if round[1] == 3:
            round[3] = choice([1,2]) # Participant Pick 3, Host Show 1 or 2

    # 2nd Case, Participant has chosen the wrong box
    if round[1] != round[2]:
        if round[1] == 1 and round[2] == 2:
            round[3] = 3 # Winning box 1, Participant Picked 2, Host Show 3
        if round[1] == 1 and round[2] == 3:
            round[3] = 2 # Winning box 1, Participant Picked 3, Host Show 2
        if round[1] == 2 and round[2] == 1:
            round[3] = 3 # Winning box 2, Participant Picked 1, Host Show 3
        if round[1] == 2 and round[2] == 3:
            round[3] = 1 # Winning box 2, Participant Picked 3, Host Show 1
        if round[1] == 3 and round[2] == 1:
            round[3] = 2 # Winning box 3, Participant Picked 1, Host Show 2
        if round[1] == 3 and round[2] == 2:
            round[3] = 1 # Winning box 3, Participant Picked 2, Host Show 1

    # Participant has their 2nd choice next
    participantChoiceResult()
```

After the host has presented their option to the participant, the participant now makes their final choice, i.e., do they stick with their original choice, or do they swap to the last unrevealed box? In the code, I have also added another option where the participant can't decide whether to stick or swap, so makes a random selection from the two valid choices. Through each check, any win is recorded in the `results` array. Lastly, if the `roundOutput` flag has been set, the round is also printed out.

```
def participantChoiceResult():
    # 1st Case Participant Sticks
    if round[1] == round[2]:
        round[4] = True
        results[0] += 1 # Increment Win count
    else:
        round[4] = False

    # 2nd Case Participant Picks Random box from remaining 2,
    # winning if the random pick matches the winning box round[1]
    if round[3] == 1:
        if choice([2,3]) == round[1]:
            round[5] = True
            results[1] += 1 # Increment Win count
        else:
            round[5] = False

    if round[3] == 2:
        if choice([1,3]) == round[1]:
            round[5] = True
            results[1] += 1 # Increment Win count
        else:
            round[5] = False

    if round[3] == 3:
        if choice([1,2]) == round[1]:
            round[5] = True
            results[1] += 1 # Increment Win count
        else:
            round[5] = False

    # 3rd Case Participant Swaps box
    if round[2] == round[1]:
        # If Participant had originally picked the winning box, then loses.
        round[6] = False
    else:
        round[6] = True
        results[2] += 1 # Increment win count

    # Show round output
    if roundOutput == True:
        printRoundOutput()
```

The method below is the one used to print the round output.

```
def printRoundOutput():
    # Display the output for the current round
    print(str(round[0]) + ":" + str(round[1]) + ":" + str(round[2]) + ":" +
          str(round[3]) + ":" + str(round[4]) + ":" + str(round[5]) + ":" + str(round[6]))
```

With all the necessary methods defined, it is now possible to kick everything off by calling the `main()` method.

```
# Let's Go!
main()
```

### Execution and Output

Opening a command prompt and executing the following command with no parameters will give you the default behaviour, which is 1000 rounds and no individual round output:

`python montyhallsim.py`

The next example specifies that the user wants to see the output and set the number of rounds to 10 using the command line:

`python montyhallsim.py --output --rounds 10`

If you remember, we added `description` and `epilog` parameters in the `argparse` definition at the start. These can be seen when the user asks for help using the `--help` option.

`python montyhallsim.py --help`

### Results

As you can see from the two examples, for the default 1000 rounds and for 10 rounds, swapping was, from a probability point of view, the right choice. If we scale this up to 1,000,000 rounds, the output is:

```
Results for Number of Rounds: 1000000
=====================================
Duration, 4.97863848097 seconds.
Stick  = 332312 : 33.2312 %
Random = 499819 : 49.9819 %
Swap   = 667688 : 66.7688 %
```

So, looks like Swap is definitely the way to go!

## Wait a Minute, What About Performance?

Ah, good question! Running the 1,000,000 rounds on my laptop, which has an Intel Core i7-5600U @ 2.6GHz, yields an execution time of just under 5 seconds (you can see the time in the results above). The program, however, is only executing in a single thread. The CPU in my laptop is 2 Cores/4 Threads. If I run 10,000,000 rounds and take a snapshot of the CPU performance, the load appears to be spread across the 4 logical CPUs, for an average total load of 35% and a duration of 49.9470172114 seconds.

So the question must be, can we convert the program to be multi-threaded, and take advantage of all 4 threads?

The switch to multi-threading needs some changes to the code, not only in how we set up the threads, but in how common variables are shared between them. The outline of the changes is shown below as we walk through the code.

Starting off with the usual additions to any library requirements.

```
import threading       # Required for multi-threading
import multiprocessing # Required to get CPU max logical cores.
```

A new variable is added to hold how many threads we will limit ourselves to. This is set to default to the number of logical processors reported by the system using a call to the multiprocessing library.

`threadLimit = multiprocessing.cpu_count() # Set default number of threads.`

Rather than restricting ourselves to a fixed number of threads, we can provide the ability to adjust the number of threads used by adding a new parameter, `--threads`, to our command line handling. This will default to the `threadLimit` previously determined.

```
parser.add_argument("-t", "--threads", nargs=1, type=int, default=threadLimit,
    help="Set the number of threads. Integer. Default is CPU Logical Cores. " + str(threadLimit))
```

The command line arguments will also be used to adjust the `threadLimit` variable declared earlier. A check is done to prevent the user setting 0 threads.

```
if args.threads:
    if type(args.threads) is int:    # If not supplied on cli, default value returns int type
        threadLimit = args.threads
    elif type(args.threads) is list: # If supplied on cli, it returns a list of int, need 1st one.
        threadLimit = args.threads[0]
    if threadLimit == 0:             # Prevent user providing 0 as a number
        threadLimit = 1
```

To prevent threads simultaneously accessing a common resource, thread locking mechanisms are required. There are three different locks we need: one for the round counter, one for the results variable and one for the output. A variable `threads[]` is also added to store the collection of threads.

```
# Threads collection
threads = []

# Current Round Number
currentRound = 0
currentRoundLock = threading.Lock() # Each Thread needs to acquire this lock
                                    # for updating the currentRound counter.

# Output Print Lock
outputLock = threading.Lock()       # Each Thread needs to acquire this lock for printing output,
                                    # helps keeping alignment of text.

# count of wins for each strategy, stick, random, swap
results = [0,0,0]
resultsLock = threading.Lock()      # Each Thread needs to acquire this lock for updating results.
```

In the `main()` function, the threads are now created and added to the `threads[]` collection; notice the `target` argument, which contains the function name `runRound` that the thread should execute. Each thread is also given a simple name, which can be displayed in the output so you can see which thread executed which round. The threads are then started. The program will now wait for all threads to return, using the `join` method on each thread, before ending as normal.

```
# Register the threads up to the thread limit
for i in range(threadLimit):
    t = threading.Thread(target=runRound, name="t" + str(i))
    threads.append(t)

# Start all the threads
for t in threads:
    t.start()

# Wait for all the threads to complete
for t in threads:
    t.join()
```

The `runRound` function needs to be modified to support the multi-threading.

When the function is first called, it checks that the current round is less than the target number of rounds. A lock is then acquired for the `currentRound` variable, and the check is performed again, as one of the other threads may have incremented `currentRound` between the first check and the lock acquire.

If there is still work to be done, then the function continues. Once the `round[0]` value is set to the `currentRound` value, we are finished with the threading lock, so release it. If there is no more work to be done, then we release the lock that was acquired, and simply fall through the function.

This is one area where I managed to get myself into a knot and had a race condition lock up the program; the lock `release()` in the `else` branch was the key to the problem.

```
def runRound():
    global currentRound
    while currentRound < numberOfRounds:
        currentRoundLock.acquire()
        if currentRound < numberOfRounds:
            currentRound += 1
            # current round array contains, [RoundNumber, WinningNumber, ParticipantPick,
            # HostShow, ResultStick, ResultRandom, ResultSwap]
            round = [0,0,0,0,False,False,False]

            # Store the Round Number
            round[0] = currentRound
            currentRoundLock.release()

            # Select the rounds winning box, random choice
            round[1] = randint(1,3)

            # Select the participant random choice
            round[2] = randint(1,3)

            # Host does their reveal next.
            hostPick(round)
        else:
            currentRoundLock.release()
```

There are two other things to notice in the changed function. The first is that we have moved the `round[]` data from being a global variable into the thread itself, because each thread needs its own copy. We also pass this variable on to the next called function so the thread continues to work with the same data. The other thing to note is the addition of the `global currentRound` statement at the start of the function. Without it, Python treats `currentRound` as local because it is assigned within the function; the `global` keyword tells the interpreter that the variable is at the global level.
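As an aside, the manual acquire/release pairing that caused the deadlock is exactly what Python's `with` statement exists to remove: the lock is released automatically on every exit path, including early returns and exceptions. A standalone sketch (not the article's code; `claim_round` is an illustrative name) of how the round-claiming could look:

```python
import threading

currentRound = 0
numberOfRounds = 100
currentRoundLock = threading.Lock()

def claim_round():
    """Atomically claim the next round number, or return None when work is done."""
    global currentRound
    with currentRoundLock:  # released automatically on both return paths
        if currentRound >= numberOfRounds:
            return None
        currentRound += 1
        return currentRound

# Drive the claim loop; each worker thread would run a loop like this.
claimed = []
while True:
    n = claim_round()
    if n is None:
        break
    claimed.append(n)

print(len(claimed))  # 100, each round claimed exactly once
```

With this shape, there is no `else` branch to forget a `release()` in.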

Other changes were required in each function to handle the round data being passed in, and also in the handling of the results. To minimise the lock duration on the results variable, all the logic is first applied and recorded in the round data, and then the lock is acquired just to increment the results.

A typical change to a function to handle the round data being passed in:

```
def hostPick(roundData):
    # host compares winning box with participant choice and shows a losing box
    round = roundData

    ...
```

Results updating pulled out of core logic and standalone:

```
# Now update results if required, first acquiring lock object
if round[4] or round[5] or round[6]:
    resultsLock.acquire()
    if round[4]:
        results[0] += 1
    if round[5]:
        results[1] += 1
    if round[6]:
        results[2] += 1
    resultsLock.release()
```

If we now run the code with 4 threads and show the output for 10 rounds, you can see each thread name appearing at the end of the round data.

`python monty_hall_threaded_test.py --threads 4 --rounds 10 --output`
```
Monty Hall Simulator
Number of Rounds: 10
RoundNumber, WinningNumber, ParticipantPick, HostShow, ResultStick, ResultRandom, ResultSwap, Thread
1:3:1:2:False:False:True:t0
2:3:3:2:True:False:False:t1
3:1:3:2:False:False:True:t0
4:3:1:2:False:True:True:t2
5:2:3:1:False:True:True:t3
6:3:3:1:True:False:False:t1
7:2:3:1:False:True:True:t0
8:2:1:3:False:True:True:t2
9:3:1:2:False:False:True:t3
10:2:3:1:False:False:True:t1
Results for 10 rounds using 4 threads.
============================================================
Duration, 0.0241714037581 seconds.
Stick  = 2 : 20.0 %
Random = 4 : 40.0 %
Swap   = 8 : 80.0 %
```

You can see above that `t0` completed 3 rounds, `t1` 3 rounds, and both `t2` and `t3` completed 2 each.

Excellent, we now have a working multi-threaded code base.

We will run the program several times on 1,000,000 rounds, using 1, 2, 4, 8 and 128 threads and see what we get.

```
Threads | Time (seconds)
------------------------
1       | 6.04301673772
2       | 31.2029141007
4       | 38.0319919469
8       | 38.4018383862
128     | 38.6696277436
```

Hang on! Those durations get worse as the thread count increases. You will also notice that after 4 threads, there is basically no further change. It is also worse with 1 thread than the original application, which was written as single threaded with no threading capability built in.

This highlights a key consideration when writing any program: where are the bottlenecks? Here, we are doing a CPU-intensive activity, and in Python, threading really only benefits IO-bound workloads. As it states in the documentation for the threading module, "due to the Global Interpreter Lock, only one thread can execute Python code at once". We can see this if we look at the CPU usage for a run using the logical processor count, in this case 4:

The CPU utilisation doesn't exceed 50%. The fact that the single threaded version is quicker than the threaded version run with 1 thread can possibly be explained by the additional overhead of having to acquire and release locks frequently.
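The GIL effect is easy to reproduce in isolation: splitting a pure-Python CPU-bound loop across two threads does not halve the wall time, and on most machines it comes out slower than the single-threaded run. A small standalone sketch (the `burn` helper is my own; absolute timings will vary by machine):

```python
import threading
from timeit import default_timer as timer

def burn(n):
    # Pure-Python CPU work; each bytecode step needs the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2000000

# Single-threaded baseline
start = timer()
burn(N)
single = timer() - start

# Same total work split across two threads
start = timer()
workers = [threading.Thread(target=burn, args=(N // 2,)) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
threaded = timer() - start

print("single thread: %.3fs, two threads: %.3fs" % (single, threaded))
```

Swapping the loop body for `time.sleep()` (IO-style waiting) reverses the picture, which is why threading still pays off for IO-bound code.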

So, for this type of CPU-intensive activity, what can we do to improve performance? Well, the threading module documentation also tells us that multi-processing may be the way to go.

## The Move to Multi-Processing

When I looked at the documentation for the multiprocessing module, and a couple of the basic examples around the net, it looked fairly simple: you basically swap thread for process and all is good. Well, that couldn't be further from the truth! The first thing to realise is how the child processes are created and data is shared. No state information is copied across, so you need to consider first how to coordinate access to shared data.

This is where things got a little bit head-scratchy. I tried various approaches based on the documentation in the multiprocessing module, and searched for various examples when my code didn't work. It all got a bit heavy and my head really began to hurt. It felt like I was going round in circles; all the basic examples just didn't cut it, and all the heavy examples were way over the top.

In the end, I decided to use the Pool approach, which showed promise; I then stumbled across and added the use of a Partial, which further helped, and finally added in an Initializer and a Queue. And BOOM! It all worked a treat.

This all came about by breaking the problems encountered into chunks and building as we go:

• How to get multiple processes running in a pool
• How to get inputs to, and results from, the pool processes
• How to get additional flags into the processes
• How to synchronise the print statements from the processes

Given the way Python generates the child processes, it is essential to distinguish the code that should run only in the main process, so that it does not get re-executed in each child. This is achieved with a simple `if` statement wrapping the main-process-only code:

```
if __name__ == "__main__":
    # main process only code
```

Within this, we did the usual setting up of the defaults, etc., and initialised the `argparse`. I added an additional parameter, similar to the threading version, but to set the process limit:

```
parser.add_argument("-p", "--procs", nargs=1, type=int, default=processLimit,
    help="Set the number of processes. Integer. Default is CPU Logical Cores. " + str(processLimit))

if args.procs:
    if type(args.procs) is int:    # If not supplied on cli, default value returns int type
        processLimit = args.procs
    elif type(args.procs) is list: # If supplied on cli, it returns a list of int, need 1st one.
        processLimit = args.procs[0]
    if processLimit == 0:          # Prevent user providing 0 as a number
        processLimit = 1
```

The approach here is slightly different from the previous versions: each round returns its own copy of the result, rather than all rounds using a common shared array. After all processes have completed, the results are aggregated back into a single result array. This was defined as a simple array:

```
# count of wins for each strategy, stick, random, swap,
finalResults = [0,0,0]
```

You will see how that is used later on.

Next up was to define the Queue. This handles any individual round output printing, if the `--output` flag has been set on the CLI. When I originally left the `print()` statements in the individual rounds, all the processes printing to the same console jumbled up the output. Now, each process pushes its output text onto the queue, and the main process is the one that offloads it back to the console.

```
# Global queue for passing print output from pool processes to main for display
outputQ = multiprocessing.Queue()
```

As mentioned earlier, I used a Partial (from the `functools` module). This is a way of passing multiple variables into the pool's map function.

```
from functools import partial # provides partial for pre-binding arguments

# Variable for passing multiple arguments to the process.map
target = partial(processRound, output=roundOutput)
```

`processRound` is the function executed by the pool in each child process, and `roundOutput` is the flag that says whether the round should print out its individual data. The latter is fixed and won't change.
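`partial` is worth a quick look on its own, since `Pool.map` only hands each worker a single item from the iterable; `partial` pre-binds the extra arguments. A standalone sketch using the built-in `map` (the stand-in worker below is hypothetical, it just echoes its arguments):

```python
from functools import partial

def process_round(current_round, output):
    # Stand-in for the real worker: return a tuple so both arguments are visible.
    return (current_round, output)

# Bind output=False once; the result is a one-argument callable suitable for map().
target = partial(process_round, output=False)

print(list(map(target, range(3))))  # [(0, False), (1, False), (2, False)]
```

Calling `target(n)` is exactly equivalent to calling `process_round(n, output=False)`, which is what each pool worker ends up doing.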

Now we can create the Pool and the child process work.

```
# Setup the pool and initiate the work
p = multiprocessing.Pool(processLimit, initializer=initProc, initargs=(outputQ, ))
results = p.map(target, range(numberOfRounds))
```

The first parameter of the Pool specifies the limit on how many processes to create. The `initializer` parameter specifies a function that will initialise each process, and the `initargs` parameter specifies the variables that will be passed into the initializer function; in this case, the reference to the Queue for the output messages.

`results` is an array containing all the results from the processes. In this case, each round returns an array indicating whether 'Stick, Random, Swap' were a win or not, e.g., `[0,0,1]`, where 0 is a loss and 1 is a win.

`p.map` is passed the partial defined earlier, plus the data to be distributed and processed by the child processes; in this case, a range from `0` to `numberOfRounds`.

`results = p.map()` is effectively a map/reduce function.

Now the child processes are off and running. If the output flag is set, we need to pull those messages off the queue and print them. The loop keeps getting messages off the queue and printing them, while the queue is not empty and the pool has child processes.

```
# While the shared queue is not empty and we have child processes in the pool,
# print the queue to the standard output.
while (not outputQ.empty()) and (len(p._pool) > 0):
    print(outputQ.get_nowait())
```

Once the work is complete, we have to close the pool and join the pool back in.

```
p.close()
p.join()
```

With all the results sitting in the result array, we need to aggregate them into the final results.

The results from the map function look like this: `[[0,1,0],[1,1,0],[1,0,0],...]`. To aggregate, we loop over them and increment the final results tally as we go. For the above, the final result would be `[2,2,0]`.

```
# Aggregate results from pool results.
for result in results:
    finalResults[0] += result[0]
    finalResults[1] += result[1]
    finalResults[2] += result[2]
```
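For what it's worth, the same aggregation can also be written in one line with `zip`, which transposes the list of per-round results into per-strategy columns (the sample data below is made up to match the shape described above):

```python
# Hypothetical pool output: one [stick, random, swap] result per round.
results = [[0, 1, 0], [1, 1, 0], [1, 0, 0]]

# zip(*results) yields the three strategy columns; sum each to get the tallies.
finalResults = [sum(column) for column in zip(*results)]
print(finalResults)  # [2, 2, 0]
```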

Throw in the same timers and output printing of the results and that completes the requirements in the main process.

All the other functions are defined earlier in the code. One of these is the initializer, `initProc`; it receives the `outputQ` and creates the global reference for the child processes.

```
def initProc(outQ):
    # Used by the process pool to initialize the shared global queue on the child processes
    global outputQ # The shared queue
    outputQ = outQ
```

When the round processing is kicked off, there are a couple of other changes now required.

First off, we now pass in the `currentRound` to be processed and also whether the `output` should be printed. Each round now holds its own round array, containing all the data required for determining the result. This is passed down the chain of functions, each function doing its part and updating the round array as it goes.

Each round also has its own result array; this is passed down the chain too, and at the end of the chain, the results check populates it and returns it all the way back up the stack to hand back to the map results.

If we look at the `processRound` function, you can see this:

```
def processRound(currentRound, output):
    # Local Round Data containing
    # [RoundNumber, WinningNumber, ParticipantPick, HostShow, ResultStick, ResultRandom, ResultSwap]
    round = [0,0,0,0,False,False,False]
    result = [0,0,0] # [stick, random, swap]

    # store the round number
    round[0] = currentRound

    # Select the rounds winning box, random choice
    round[1] = randint(1,3)

    # Select the participant random choice
    round[2] = randint(1,3)

    # Host does their reveal next. pass on the local data
    hostPick(round, output, result)

    # Pass result back to caller
    return result
```

The last function in the stack not only calculates the results, but determines if the round data needs to be printed. If required, it calls the necessary `printRoundOutput` function. This function then uses the global queue referenced previously and pushes the output onto the queue.

```
def printRoundOutput(round):
    # Place the output text for the current round onto the shared queue.
    text = (str(round[0]) + ":" + str(round[1]) + ":" + str(round[2]) + ":" +
            str(round[3]) + ":" + str(round[4]) + ":" + str(round[5]) + ":" + str(round[6]) + ":" +
            multiprocessing.current_process().name)
    outputQ.put(text)
```

Excellent! We now have a multi-processing version of the Monty Hall Problem.

## Multi-process Performance

For comparison purposes, let us run the same 1,000,000 rounds, using the same 1, 2, 4, 8 and 128 processes as we did with the threads version, and also compare against the single thread version.

```
Processes | Time (seconds)    Threads | Time (seconds)    Single Thread Version (seconds)
---------------------------   -------------------------   -------------------------------
1         | 7.84796339471     1       | 6.04301673772     4.9596014663
2         | 5.13545195655     2       | 31.2029141007
4         | 4.09028748777     4       | 38.0319919469
8         | 4.19029514611     8       | 38.4018383862
128       | 9.12885394124     128     | 38.6696277436
```

Great, so taking advantage of all 4 logical cores for 1,000,000 rounds is about 18% quicker than the single-threaded version.

We can discount the threaded version, as it clearly doesn't work for this type of problem. If we push the workload up to 10,000,000 rounds and compare the single-threaded version to the multi-process one, the results are less impressive: 49.5728323937 vs 45.8644244836 seconds.

Not as much of an improvement, but an improvement nonetheless. The type of work being done, and how it is distributed, is clearly key. Distributing the data differently and running independent processes, rather than a pool, may still yield further improvement, especially if it removes the overhead of pooling.

You can also see that as you exceed the logical CPU count, in this case 4, performance starts to drop off. This is expected, as the extra processes now have to be shared across cores that are already being kept fully occupied doing work.
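Rather than hard-coding the pool size, the logical core count can be queried from the standard library and used directly. A small sketch (the `work` function is a trivial stand-in, not from the article's code):

```python
import multiprocessing

def work(n):
    # Trivial stand-in workload
    return n * n

if __name__ == "__main__":
    logical = multiprocessing.cpu_count()  # 4 on the laptop used above
    # Pool() already defaults to cpu_count(), but being explicit documents intent
    with multiprocessing.Pool(processes=logical) as pool:
        print(logical, pool.map(work, range(8)))
```

Sizing the pool to match the machine avoids both the under-utilisation of too few processes and the drop-off seen at 128.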

We can clearly see, however, that the CPU is now being utilised at 100%, which is what we were aiming for by moving away from a single thread towards multi-process.

What next? There must be better performance still to be had... maybe a job for another day to explore different multi-processing options.

## Update on 11th Oct 2018: Running on Serenity

The main content of the article above was developed and run on my laptop. I was away working at the time, so that was all I had at hand. Now back home, I had the chance to run the code on the new computer I recently built. You can read about it here.

Serenity is an Intel i9-7920X, with 12 cores / 24 threads, which is miles ahead of the laptop CPU!

I ran the multi-process version of the code across all 24 threads for the same 1,000,000 rounds. The laptop's best time to complete this was 4.09028748777 seconds. As you can also see from the screenshots above, this completely maxed out the laptop's CPU at 100%.

So how did Serenity fare? It completed the same task in 1.5631458167 seconds. It was so quick in comparison that the CPU didn't even have time to fully ramp up to boost speed. If you look at the CPU graph, there is a brief spike on each thread. Safe to say the CPU didn't really feel a thing!

OK, so what if we take it up to the next comparison, the 10,000,000 rounds? What happens then? The fastest the laptop could achieve was 45.8644244836 seconds; Serenity managed the same task in 16.0187814005 seconds. Again, as you can see below, there wasn't much load on the CPU, just the initial spikes, after which it stabilises for the duration of the run.

## Conclusions

So there we have it, taking a single algorithm problem and exploring the journey through single threaded, multi-threading and multi-processing.

Having not given Python any real attention, this was more an educational piece for myself. There are probably still lots of improvements that can be made, lots of bad design choices, lack of standard/best practices being used, etc., but regardless of all that stuff, it was an experience. The move to the multi-processing really turned out to be a lot harder than I originally anticipated from just reading the documentation.

I may go back at a later date and revisit. I am sure the performance can be improved further for the multi-process approach. Maybe that is something you can take away and explore, and of course share your findings and knowledge for all of us to learn from.

## History

• 11th October 2018 - Added test runs on Serenity
• 13th September 2018 - First release

## About the Author

Engineer, Scotland
I have been working in the Oil & Gas Industry for over 25 years now.

Core Discipline is Instrumentation and Control Systems.

Completed a BSc Honours Degree (B29 in Computing) with the Open University in 2012.

Currently, Offshore Installation Manager in the Al Shaheen oil field, which is located off the coast of Qatar.

Prior to this, 25 years of North Sea Oil & Gas experience.
