Posted 13 May 2013

Programming Massively Parallel Processors (second edition) by Kirk and Hwu

John Michael Hauck, 18 May 2013

Editorial Note

This article is in the Book Review chapter. Reviews are intended to provide you with information on books - both paid and free - that others consider useful and of value to developers. Read a good programming book? Write a review!

“Programming Massively Parallel Processors (second edition)” by Kirk and Hwu is a very good second book for those interested in getting started with CUDA. A first must-read is “CUDA by Example: An Introduction to General-Purpose GPU Programming” by Jason Sanders and Edward Kandrot. After reading all of Sanders and Kandrot’s book, feel free to jump right to chapters 8 and 9 of this Kirk and Hwu publication.


In chapter 8, the authors do a nice job of explaining how to write an efficient convolution algorithm that is useful for smoothing and sharpening data sets. Their explanation of how shared memory can play a key role in improving performance is well written. They also handle the issue of “halo” data very well. Benchmark data would have served as a nice conclusion to this chapter.
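For readers who have not met the “halo” idea before, here is a minimal CPU-side sketch in Python. It is not the book's CUDA code. It assumes a 1D convolution, an odd-length mask, out-of-range inputs treated as 0, and a hypothetical tile size. Each outer iteration plays the role of one thread block: it copies its tile plus the halo elements on either side into a local list (a stand-in for shared memory), then each "thread" reads only from that local copy.

```python
def convolve_1d_tiled(data, mask, tile=4):
    """1D convolution computed tile by tile, loading halo cells per tile.

    `tile` is a hypothetical block size; out-of-range inputs count as 0.
    """
    radius = len(mask) // 2          # mask is assumed to have odd length
    n = len(data)
    out = [0.0] * n
    for start in range(0, n, tile):  # one iteration ~ one thread block
        # Load the tile plus left/right halo into a local buffer
        # (stands in for shared memory).
        local = [
            data[i] if 0 <= i < n else 0.0
            for i in range(start - radius, start + tile + radius)
        ]
        for t in range(min(tile, n - start)):   # one iteration ~ one thread
            acc = 0.0
            for k, w in enumerate(mask):
                acc += w * local[t + k]
            out[start + t] = acc
    return out


def convolve_1d_direct(data, mask):
    """Reference version reading straight from `data`, for comparison."""
    radius = len(mask) // 2
    n = len(data)
    return [
        sum(w * (data[i + k - radius] if 0 <= i + k - radius < n else 0.0)
            for k, w in enumerate(mask))
        for i in range(n)
    ]
```

The tiled version loads each input element once per tile that needs it, rather than once per output element. That reuse is where shared memory would be expected to help on a GPU. Whether it does in practice is exactly what benchmark data would show.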

In chapter 9, the authors provide the best description of the Prefix Sum algorithm I have seen to date. They describe the problem being solved in terms that I can easily relate to: food. They write, “We can illustrate the applications of inclusive scan operations using an example of cutting sausage for a group of people.” They first describe a simple algorithm, then a “work-efficient” algorithm, and then an extension for larger data sets. What puzzles me here is that the authors seem fixated on solving the problem with the fewest total operations (across all threads) as opposed to the fewest operations per thread. They do not mention that the “work-efficient” algorithm requires almost twice as many operations for the longest-path thread as the simple algorithm. Actual performance benchmarks showing a net throughput gain would be needed to convince a skeptical reader.
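The trade-off between total work and longest path can be checked with a small counting sketch in Python. This is not the book's code. It assumes the “simple” algorithm is the usual doubling-stride scan and the “work-efficient” one is the usual reduction-tree-then-distribution-tree scan, that the input length is a power of two, and that each thread performs at most one addition per step, so the number of steps stands in for the longest-path thread's work.

```python
def simple_scan(values):
    """Inclusive scan, doubling-stride style. Returns (result, adds, steps)."""
    x = list(values)
    n = len(x)
    adds = steps = 0
    stride = 1
    while stride < n:
        prev = list(x)               # all "threads" read before any writes
        for i in range(stride, n):
            x[i] = prev[i] + prev[i - stride]
            adds += 1
        steps += 1
        stride *= 2
    return x, adds, steps


def work_efficient_scan(values):
    """Inclusive scan via a reduction tree, then a distribution tree.

    Assumes len(values) is a power of two. Returns (result, adds, steps).
    """
    x = list(values)
    n = len(x)
    adds = steps = 0
    stride = 1
    while stride < n:                # reduction phase
        for i in range(2 * stride - 1, n, 2 * stride):
            x[i] += x[i - stride]
            adds += 1
        steps += 1
        stride *= 2
    stride = n // 4
    while stride >= 1:               # distribution phase
        for i in range(3 * stride - 1, n, 2 * stride):
            x[i] += x[i - stride]
            adds += 1
        steps += 1
        stride //= 2
    return x, adds, steps
```

Under these assumptions, for 1024 elements the simple scan performs 9217 additions in 10 steps, while the work-efficient scan performs 2036 additions in 19 steps. Far less total work, but nearly twice the number of dependent steps, which is the concern raised above.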

Now before moving forward, let's back up a bit. Even though we have already read CUDA by Example, it is worth reading chapter 6… at least the portion regarding the reduction algorithm starting at the top of page 128. The discussion is rather well written and insightful. Now, onward.
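As a reminder of what a parallel reduction looks like, here is a generic tree-reduction sketch in Python. It is not the book's listing, and the function name is only illustrative. It assumes a non-empty input whose length is a power of two. At each step the active "threads" are the contiguous range 0..stride-1, each folding one element from the upper half onto the lower half.

```python
def tree_reduce_sum(values):
    """Sum by repeatedly folding the upper half onto the lower half.

    Assumes len(values) is a non-zero power of two. Returns (total, steps).
    """
    x = list(values)
    steps = 0
    stride = len(x) // 2
    while stride >= 1:
        for t in range(stride):      # threads 0..stride-1 are active
            x[t] += x[t + stride]
        steps += 1
        stride //= 2
    return x[0], steps
```

Eight elements are summed in three steps instead of seven sequential additions.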

In chapter 13, the authors list the three-fold goals of parallel computing: solve a given problem in less time, solve bigger problems in the same amount of time, and achieve better solutions for a given problem in a given amount of time. These all make sense, but have not been the reasons I have witnessed for the transition to parallel computing. I believe the biggest motivation for utilizing CUDA is to solve problems that would otherwise be unsolvable. For example, the rate of data generated by many scientific instruments could simply not be processed without a massively parallel computing solution. In other words, CUDA makes things possible.

Also in Chapter 13, they bring up a very important point. Solving problems with thousands of threads requires that software developers think differently. To think of the resources of a GPU as a means by which you can make a parallel-for-loop run faster completely misses the point – and the opportunity the GPU provides. These three chapters then make the book worthwhile.

The chapters on OpenCL, OpenACC, and AMP seem a bit misplaced in a book like this. The authors’ coverage of these topics is a bit too superficial to make them useful for serious developers. On page 402, they list the various data types that AMP supports. It would have made sense for the authors to point out that AMP does not support byte and short. When processing large data sets of these types, AMP introduces serious performance penalties.

This then brings me to my biggest concern about this book. There is very little attention paid to the technique of overlapping data transfer operations with kernel execution. I did happen upon a discussion of streaming in chapter 19, “Programming a Heterogeneous Computing Cluster.” However, the context of the material is with respect to MPI, and those not interested in MPI might easily miss it. Because overlapping I/O with kernel operations can easily double throughput, I believe this topic deserves at least one full chapter. Perhaps in the next edition, we can insert it between chapters 8 and 9? Oh, and let’s add “Overlapped I/O”, “Concurrent” and “Streams” as first class citizens in the index. While we are editing the index, let’s just drop the entry for “Apple’s iPhone Interfaces”. Seriously.
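To see where a doubling could come from, here is a back-of-the-envelope model in Python. It is a timing sketch, not a measurement and not CUDA code. It assumes the data is split into equal chunks, that each chunk needs a host-to-device copy, a kernel, and a device-to-host copy, and that the hardware can run one of each at the same time (in CUDA this is typically arranged with streams and `cudaMemcpyAsync` on pinned host memory). The function names and the time units are hypothetical.

```python
def serial_time(chunks, copy_in, kernel, copy_out):
    """Total time when every chunk is copied in, processed, and copied out in turn."""
    return chunks * (copy_in + kernel + copy_out)


def overlapped_time(chunks, copy_in, kernel, copy_out):
    """Finish time when the three stages form a pipeline across chunks."""
    stage_costs = (copy_in, kernel, copy_out)
    stage_free = [0.0] * len(stage_costs)   # when each stage is next idle
    for _ in range(chunks):
        ready = 0.0                         # when this chunk's previous stage ended
        for s, cost in enumerate(stage_costs):
            start = max(ready, stage_free[s])
            ready = start + cost
            stage_free[s] = ready
    return stage_free[-1]
```

With 8 chunks costing 1, 2, and 1 time units per stage, the serial schedule takes 32 units and the pipelined one takes 18. As the number of chunks grows, the ratio approaches 2 in this model, because the total time is dominated by the slowest stage rather than by the sum of all three.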

In summary, I believe this is a very helpful book and well written. I would consider it a good resource for CUDA developers. It is not, however, a must-have CUDA resource.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

John Michael Hauck
Software Developer (Senior) LECO Corporation
United States
John Hauck has been developing software professionally since 1981, and focused on Windows-based development since 1988. For the past 17 years John has been working at LECO, a scientific laboratory instrument company, where he manages software development. John also served as the manager of software development at Zenith Data Systems, as the Vice President of software development at TechSmith, as the lead medical records developer at Instrument Makar, as the MSU student who developed the time and attendance system for Dart container, and as the high school kid who wrote the manufacturing control system at Wohlert. John loves the Lord, his wife, their three kids, and sailing on Lake Michigan.


Comments and Discussions

Question: OpenCL DOES support async/overlapped memory transfers (jimrayvaughn, 17-May-13 19:27)
Answer: Re: OpenCL DOES support async/overlapped memory transfers (John Michael Hauck, 18-May-13 5:23)
Suggestion: This should probably be moved to the "Book Reviews" section. (SoMad, 14-May-13 11:14)
Answer: Re: This should probably be moved to the "Book Reviews" section. (John Michael Hauck, 15-May-13 4:46)


Last Updated 18 May 2013
Article Copyright 2013 by John Michael Hauck