Parallel programming on PlayStation 3 (Cell architecture) by example: Puzzle solving
Feb 19, 2008
A bird's eye view of the programming model for the new PlayStation 3 console and an interesting example of useful concurrency.
Introduction
The purpose of this article is to offer programmers a bird's eye view of the programming model for the new PlayStation 3 console. To do this, I structured the article in two parts: the first part presents the hardware architecture, the programming model, where to find resources, and the APIs and how to use them. The second part shows an example application: we will try to solve a puzzle using as many of the programming features on offer as possible.
Part 1: The Cell
Historical Context
As the need for processing power grows, hardware designers find it increasingly difficult to satisfy the demand.
The first obstacle is the so-called "frequency wall": we can only go so far by increasing the processor frequency, because deeper pipelines give diminishing returns. A related obstacle is the "power wall": as power consumption grows, heat dissipation becomes ever more difficult.
Another obstacle is the "memory wall". Memory works at a very low speed compared with the CPU; the several layers of the memory hierarchy (hard drive, RAM, L3 cache, L2 cache, L1 cache) do their best to hide this latency, but today's systems are often limited by memory speed. There are two possible ways to increase performance: increasing efficiency and increasing concurrency.
The alliance formed in 2000 by Sony, IBM, and Toshiba aimed at both. The processor architecture they created, named the Cell Broadband Engine, is now used both in consumer products that require great processing power, like the PlayStation 3 and its 3D games, and in servers.
Architecture Overview
The architecture is a heterogeneous multicore architecture: we have one Power Processing Element (PPE) for control tasks and eight Synergistic Processing Elements (SPEs) for data-intensive processing.
From the point of view of Flynn's taxonomy, the Cell architecture is MIMD (Multiple Instruction streams operating on Multiple Data streams). That is, we can see the Cell as nine separate processors on a single die: one of them is the master that hands out work, and the others do the actual work.
The Cell PPE is a 64-bit RISC PowerPC processor running at 3.2 GHz, two-way hardware multithreaded, with separate L1 caches for instructions and data and a unified L2 cache.
There are eight SPEs per chip; each offers integer and floating point operations and a local store of 256 KB. An interesting choice of the designers was to replace the on-chip cache with this local store (ordinary memory, not associative like a cache). Each SPE offers four floating point units and integer units, so that at any moment four mathematical operations can take place. As you can see, performance comes not only from concurrency across cores, but also from concurrency within each core. To control these units, we use a SIMD (Single Instruction, Multiple Data) flow; this means we can process a vector of four 32-bit elements at the same time, but we can apply only one (and the same) operation to all elements of the vector. Alternatively, we can use half-word data types for the same operations.
Programming Tools and APIs
To program for the Cell, we must use the Cell SDK found in the IBM Cell Broadband Engine Resource Center (http://www.ibm.com/developerworks/power/cell/). You can find there a lot of useful information on both the hardware and the programming model used.
The quickest way to get a functional environment for Cell programming is to download a Cell Virtual Machine, for example, from this page.
Actual programming takes place in Linux using this SDK and, optionally, the Eclipse IDE, for which a couple of plug-ins offer integration with the Cell compiler. The Cell SDK also includes a simulator, which allows you to program for the Cell without access to actual Cell hardware.
The Programming Model
From the programmer's perspective, the language we use is C++, with special extensions. Let's analyse the programming model for Cell.
The usable libraries for the SPEs and the PPE are different (this is normal, considering that the hardware resources are different); this means that some functions work only in SPE code, others only in PPE code; you have to check the manual for this. Usually, the header file that contains the functions gives a hint (for example, spu_mfcio.h is used for SPU memory I/O access).
SPE Threads
A typical Cell program will have a piece of code like the following in the PPE code; it loads the executable images onto the SPEs and starts them, passing the parameters they expect.
void *spe_thread( void *voidarg )
{
    thread_args_t *arg = (thread_args_t *)voidarg;
    unsigned int runflags = 0;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    // runs the SPE program; returns when the SPE code exits
    spe_context_run( arg->spe_context, &entry,
                     runflags, arg->argp, arg->envp, NULL );
    pthread_exit( NULL );
}
void StartSpu(int i, int* originalPieceAddr, int** data)
{
    // pass the piece size and the address of the original piece as a string
    sprintf(buffer, "%d %d", piece_height, (int)originalPieceAddr);
    printf("Started SPU with %d %d %d and buffer %s",
           originalPieceAddr[0], originalPieceAddr[1],
           originalPieceAddr[2], buffer);

    // create a context and register for its interrupt mailbox events
    spe_contexts[i] = spe_context_create( SPE_EVENTS_ENABLE, NULL );
    events[i].spe = spe_contexts[i];
    events[i].events = SPE_EVENT_OUT_INTR_MBOX;
    events[i].data.u32 = i;
    spe_event_handler_register( handler, &events[i] );

    // load the embedded SPE image and start it on its own pthread
    spe_program_load( spe_contexts[i], &spu );
    thread_args[i].spe_context = spe_contexts[i];
    thread_args[i].argp = buffer;
    thread_args[i].envp = buffer;
    pthread_create( &threads[i], NULL, &spe_thread, &thread_args[i] );
}
Mailboxes
Naturally, we want to support communication between the SPEs and the PPE. This is usually done using a feature called mailboxes. A mailbox is used to send short messages (32 bits) from an SPE to the PPE or from the PPE to an SPE, and is generally used for synchronization between the two. We can use the mailboxes in two ways:
- By blocking. The receiver waits until it receives a message, doing nothing in that time.
- By polling. The receiver periodically checks for a message, then does some work and repeats this process until it receives a message.
Obviously, the second option performs better, because it avoids "busy waiting" (with blocking, the PPU waits for a message and cannot do anything else in that time). To initialize the mailbox, a small amount of code is required, like this:
events[i].events = SPE_EVENT_OUT_INTR_MBOX;
// setting the type of events we use
spe_event_handler_register(handler, &events[i] );
//register the handler for the specified events
In the mailbox, the PPE will receive messages from any SPE, so we want to know exactly which SPE sent a message. We do this by associating a number with each SPE, a number that we can obtain along with the received data:
events[i].data.u32 = i;
Waiting for a message in the PPE is done like this:
spe_event_wait( handler, events_generated, NUM_THREADS, -1 );
// events_generated[0].data.u32 identifies the SPE that raised the event
spe_out_intr_mbox_read( events_generated[0].spe,
                        (unsigned int*) &data, 1, SPE_MBOX_ANY_BLOCKING );
And sending a message from the PPE is done like this:
spe_in_mbox_write( events_generated[0].spe, value_p, 1, SPE_MBOX_ANY_BLOCKING);
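On the SPE side, the counterpart functions come from spu_mfcio.h. A minimal sketch of the request/reply pair used throughout this article:
#include <spu_mfcio.h>

/* Send a 32-bit status word to the PPE through the outbound interrupt
   mailbox, then block until the PPE writes a reply to the inbound mailbox. */
unsigned int ask_ppe(unsigned int status)
{
    spu_write_out_intr_mbox(status);   /* stalls if the mailbox is full */
    return spu_read_in_mbox();         /* stalls until the PPE answers  */
}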
DMA Access
As I said earlier, mailboxes are used for synchronization between the PPE and the SPEs, so that the PPE knows what each SPE is doing at a certain time. But we also need to send data to the SPEs. The SPEs are good at tasks involving intense data processing, so sending 32 bits at a time is out of the question. To send data from the PPE to an SPE, we use DMA access. In normal x86 C++ programming, you don't have to use DMA explicitly, but using it like this offers better control, flexibility, and thus performance.
int tag = 30;
int tag_mask = 1<<tag;

// start the transfer from PPE memory into the local store...
mfc_get( (volatile void*)original_piece, (uint64_t)originalPieceAddrAsInt,
         piece_height*piece_height*sizeof(int), tag, 0, 0 );
// ...and wait for the tagged transfer to complete
mfc_write_tag_mask( tag_mask );
mfc_read_tag_status_any();
The actual transfer is done using the mfc_get function. This function starts the transfer of piece_height*piece_height*sizeof(int) bytes from the originalPieceAddrAsInt address in PPE memory to the SPE the code runs on; the data is written starting at original_piece. Tagging is an optional feature of DMA transfers that allows selective waiting for DMA transfers to complete. Generally, when we want to wait for a DMA to complete, we use a read_status function, like this:
mfc_read_tag_status_all();
When we want to wait for a particular DMA, we apply a tag to it when we start it, and when we wait for it, we apply a tag mask:
mfc_write_tag_mask(tag_mask);
mfc_read_tag_status_any();
The tag mask is an integer in which the bit in the position corresponding to the tag of the desired transfer(s) is 1. This corresponds to a bit shift:
int tag_mask = 1<<tag;
Tags make possible an optimization technique called "double buffering". Double buffering essentially means starting a DMA transfer before you actually need the data: you request the data for the next step while processing the data of the current step. By the time you finish processing, the next step's data has already arrived. This hides the memory latency mentioned at the beginning of the article.
Double buffering is well-known for its use in graphical systems where, by keeping the next frame to display in memory, you can reduce flickering.
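To make the pattern concrete, here is a minimal sketch of a double-buffered SPE loop. It is not the project's code (that appears in Part 2): process(), SIZE, and the list of effective addresses ea[] are assumptions for illustration.
#include <stdint.h>
#include <spu_mfcio.h>

#define SIZE 4096                 /* bytes per block; must be a multiple of 16 */

extern void process(char *block); /* hypothetical processing step */

volatile char buf[2][SIZE] __attribute__((aligned(128)));

void double_buffered(uint64_t ea[], int n)
{
    int i, cur = 0;
    if (n > 0)
        mfc_get(buf[0], ea[0], SIZE, 0, 0, 0);   /* prefetch the first block */
    for (i = 0; i < n; i++)
    {
        int next = cur ^ 1;
        if (i + 1 < n)                           /* kick off the next transfer */
            mfc_get(buf[next], ea[i + 1], SIZE, next, 0, 0);
        mfc_write_tag_mask(1 << cur);            /* wait only for the current tag */
        mfc_read_tag_status_all();
        process((char *)buf[cur]);               /* work while the DMA runs */
        cur = next;
    }
}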
SIMD Processing
The Cell SPU libs offer functions that map to the intrinsic ASM instructions of the Cell; the most interesting of these are the SIMD instructions, which allow operations on more than one variable at a time, specifically up to four 32-bit variables (integer or floating point). These operations require a new data type called vector that groups these variables. The main limitation of these vectors is that they must be aligned in memory on 128-bit boundaries. For statically allocated data, this is done using the aligned attribute:
int temp[MAX_LEN] __attribute__ ((aligned(128)));
and for dynamic data by using the memalign function:
piece_pos = (int*)memalign(128, piece_height*piece_height*sizeof(int));
Vector types are vector int, vector float, vector char, etc., and operations include addition, subtraction, division, etc. We can add the four elements found in two vectors A and B and put the result in another vector C (C[0]=A[0]+B[0] ... C[3]=A[3]+B[3]) using the spu_add function, which takes A and B as parameters and returns C. More complicated SIMD functions allow other operations, like comparing four 32-bit numbers concurrently:
// vi - vector int
cmp = spu_sub( *((vi*)puzzle_piece), *((vi*)original_piece) ); // element-wise difference
zero = spu_orx(cmp);  // ORs the four words into element 0; it is 0 only if all are equal
if (*( (int*)(&zero) ) != 0) return 0;
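Putting alignment and the vector types together, a minimal sketch of the spu_add call described above (aligned(16) gives the required 128-bit alignment):
#include <spu_intrinsics.h>

typedef vector signed int vi;

int A[4] __attribute__((aligned(16))) = { 1,  2,  3,  4};
int B[4] __attribute__((aligned(16))) = {10, 20, 30, 40};
int C[4] __attribute__((aligned(16)));

void add_example(void)
{
    /* C[0]=A[0]+B[0] ... C[3]=A[3]+B[3] in a single SIMD operation */
    *((vi*)C) = spu_add( *((vi*)A), *((vi*)B) );
}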
Part 2: Puzzle Solving
The goal of this exercise is to solve a puzzle. We start with two images: the original image and a puzzle image, obtained from the original image by splitting it into pieces and shuffling and rotating the pieces (by 0, 90, 180, or 270 degrees). Knowing the puzzle piece size (in pixels), we want to reconstruct the original image from the two images. The output will be the reconstructed image (so that we can check the result) and a text file showing where each original piece is located in the puzzle image. We want to make the most efficient use of the Cell processor. Also, the program must work correctly with images that contain multiple identical pieces.
My solution includes the use of:
- concurrency through SPEs
- DMA access
- Mailboxes
- Double buffering
I decided to use a simple format for the images: the text-based PPM format (http://netpbm.sourceforge.net/doc/ppm.html); this format stores RGB pixels as ASCII text separated by simple delimiters (like spaces). This makes the task of understanding the image format easy and lets us focus on the processing to be done. It is not the most performance-oriented format, though. We want to be able to process images of normal size (at least 1280*1024 resolution) using pieces of size from 2*2 to 64*64 and 65500 colors (so that a pixel fits a single integer).
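For reference, a complete 2*2 image in the plain (P3) PPM format looks like this: the magic number, the width and height, the maximum channel value, then one R G B triplet per pixel, all in ASCII:
P3
2 2
255
255 0 0    0 255 0
0 0 255    255 255 0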
The basic protocol of communication between the SPEs and the PPE goes like this:
- The PPE starts the SPE threads.
- The PPE remembers the current original_piece and the current puzzle_piece sent to each SPE.
- Each SPE gets an original_piece, then receives puzzle_pieces until it identifies a puzzle piece as identical to the original piece.
- The requests for original_pieces and puzzle_pieces use mailboxes; the actual data is sent through DMA.
- To request another puzzle_piece, the SPU sends back a 0. To announce that the puzzle_piece and the original_piece match, the SPE returns 1, 2, 3, or 4, according to the rotation identified.
- When there are no more puzzle pieces for comparison, the PPE refuses the puzzle_piece request (it sends 0). If there are more puzzle_pieces, it sends a pointer to the region of memory in which the puzzle piece resides.
The main loop for the SPE looks like this:
while ( NULL != (original = RequestNewOriginalPiece(length)) )
{
    FindPuzzlePieceForOriginalPiece(piece_height);
}
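RequestNewOriginalPiece is not shown in the listings above; a plausible sketch of it, following the mailbox/DMA protocol described earlier (the real implementation may differ), would be:
#include <stdint.h>
#include <spu_mfcio.h>

extern int *original_piece;  /* 128-byte-aligned local store buffer (see memalign above) */

int* RequestNewOriginalPiece(int length)
{
    unsigned int addr;
    int tag = 30;

    spu_write_out_intr_mbox(0);          /* ask the PPE for a new original piece */
    addr = spu_read_in_mbox();           /* the PPE answers 0 when out of work   */
    if (addr == 0)
        return NULL;

    /* pull the piece from PPE memory into the local store */
    mfc_get((volatile void*)original_piece, (uint64_t)addr,
            length * 4, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    return original_piece;
}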
The problem that appears is that if we want to transfer a piece with a single DMA transfer, the piece must occupy a contiguous area of memory. If we read the image as it is laid out in the file, each puzzle piece is spread out; thus we need to read each puzzle piece into its own array by recomputing each pixel's coordinates:
for (j = 0; j < height; j++)
    for (i = 0; i < width; i++)
    {
        p_no_y = j / piece_height;
        piece_offset = (j - p_no_y*piece_height) * piece_height;
        p_no_x = i / piece_height;
        piece_offset += i - p_no_x*piece_height;  // pixel offset in piece
        // get number of piece from piece coordinates
        piece_no = p_no_y * (width/piece_height) + p_no_x;

        fscanf(f_original, "%d", &r);
        fscanf(f_original, "%d", &g);
        fscanf(f_original, "%d", &b);
        // cram pixel data in a single int
        original[piece_no][piece_offset] = (r<<16) + (g<<8) + b;
    }
The SPE double buffering is another interesting code area:
turn++;
if (turn % 2 == 1)
{
    // wait for transfer1
    value = (float*)transfer1.addr;
    if ( transfer1.addr != 0 )
    {
        mfc_write_tag_mask(1 << transfer1.tag);
        mfc_read_tag_status_any();
        // use transfer1
        memcpy(input, transfer1.buffer, length*4);

        // start transfer2:
        // - request a new DMA transfer address from the PPE
        spu_write_out_intr_mbox(match_result);
        addr = spu_read_in_mbox();
        transfer2.addr = addr;
        // - initialize the actual transfer; keep the tag in the valid range
        transfer2.tag = transfer1.tag + 1;
        if (transfer2.tag > 30)
        {
            transfer2.tag -= 30;
        }
        tag_mask = 1 << transfer2.tag;
        if (addr != 0)
        {
            mfc_get((volatile void*)transfer2.buffer,
                    (uint64_t)addr, length * 4, transfer2.tag, 0, 0);
        }
        // if addr == 0 there is no transfer to wait for next turn
    }
    // ... (even turns use transfer2 and prefetch into transfer1, symmetrically)
}
The turn variable selects which buffer to actually use; but because the double-buffering structs transfer1 and transfer2 are not held in an array, we cannot easily extend the scheme to triple or quadruple buffering to see whether it further improves performance. Changing this code is a good exercise for the reader.

We take advantage of the SIMD capabilities when we test the two pieces for equality:
int test_equality(int* puzzle_piece, int* original_piece, int piece_height)
{
    vi zero, cmp;
    int i = 0;
    // each iteration compares one 128-bit vector, i.e. four 32-bit pixels
    for (i = 0; i < piece_height*piece_height/4; i++)
    {
        cmp = spu_sub( *((vi*)puzzle_piece), *((vi*)original_piece) );
        zero = spu_orx(cmp);                   // 0 only if the four pixels match
        if (*( (int*)(&zero) ) != 0) return 0;
        puzzle_piece += 4;                     // advance by one vector (4 ints)
        original_piece += 4;
    }
    return 1;
}
An interesting situation that needs to be handled adequately is this: let's assume that the image contains multiple identical pieces. Then the two places in the original image that are the same must be filled with the two distinct pieces from the puzzle image, and not with the same puzzle piece! If the program were sequential, we could prevent this simply by not sending an already-fitted puzzle piece to be checked against another position. But, considering we have 8 concurrent instruction streams (and thus 8 pieces being processed at the same time), this cannot be done. The simplest way to handle this case is to check, when a positive result comes back from an SPE, that the position of the puzzle piece has not already been used. (Attention: checking this only before sending to the SPU is not enough!) Another, more interesting approach would be to always keep different pieces in processing at the same time, but that approach is more complicated to code. A sketch of the simple check follows.
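A minimal sketch of that check on the PPE side; used[], puzzle_pos[], RecordSolution(), and SendNextPuzzlePiece() are hypothetical names introduced for illustration, not the author's exact code:
extern int used[];                 /* one flag per position in the puzzle image  */
extern int puzzle_pos[];           /* the position last sent to each SPE         */
extern void RecordSolution(int spe_no, int rotation);
extern void SendNextPuzzlePiece(int spe_no);

void HandleSpeResult(int spe_no, int match_result)
{
    if (match_result > 0)                     /* 1..4 = match + rotation */
    {
        if (!used[puzzle_pos[spe_no]])
        {
            used[puzzle_pos[spe_no]] = 1;     /* claim this puzzle position */
            RecordSolution(spe_no, match_result);
        }
        else
        {
            /* claimed by another SPE in the meantime: keep searching */
            SendNextPuzzlePiece(spe_no);
        }
    }
    else
    {
        SendNextPuzzlePiece(spe_no);          /* 0 = no match yet */
    }
}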
References
- The SDK and the Programming Standards sections on the Docs page of the IBM Cell Resource Center
- Cell Broadband Engine Programmer's Guide (updated)
- Cell Broadband Engine Programming Tutorial (updated)
- Cell Broadband Engine Programming Handbook
- SIMD Math Library Specification for Cell Broadband Engine Architecture
- C/C++ Language Extensions for Cell Broadband Engine Architecture
- The wonderful course on the Structure of Computer Systems held at the Politehnica University of Bucharest, faculty of Computer Science
License
The code is licensed under a proprietary license. Any use of this code is allowed only with my explicit permission.