Parallel programming on PlayStation 3 (Cell architecture) by example: Puzzle solving
Feb 19, 2008
A bird's eye view of the programming model for the new PlayStation 3 console and an interesting example of useful concurrency.
Introduction
The purpose of this article is to offer programmers a bird's eye view of the programming model for the new PlayStation 3 console. To do this, I structured the article in two parts: the first part presents the hardware architecture, the programming model, where to find resources, and the APIs and how to use them. The second part shows an example application: we will try to solve a puzzle using as many of the programming features on offer as possible.
Part 1: The Cell
Historical Context
As the need for processing power grows, hardware designers find it increasingly difficult to satisfy the demand.
The first obstacle is the so-called "frequency wall": we can only go so far by increasing the processor frequency, because deeper pipelines give diminishing returns. A related obstacle is the "power wall": as power consumption grows, heat dissipation becomes ever more difficult.
Another obstacle is the "memory wall". Memory works at a very low speed compared with the CPU; the several layers of the memory hierarchy (hard drive, RAM, L3 cache, L2 cache, L1 cache) do their best to hide this latency, but today's systems are often limited by memory speed. There are two possible ways to increase performance: increasing efficiency and increasing concurrency.
The alliance formed in 2000 by Sony, IBM, and Toshiba aimed at both. The processor architecture they created, named the Cell Broadband Engine, is now used both in consumer products that require great processing power, like the PlayStation 3 and its 3D games, and in servers.
Architecture Overview
The architecture is a heterogeneous multicore architecture: we have one Power Processing Element (PPE) for control tasks and eight Synergistic Processing Elements (SPEs) for data-intensive processing.
From the point of view of Flynn's taxonomy, the Cell architecture is MIMD (Multiple Instruction streams operating on Multiple Data streams). That is, we can see the Cell as nine separate processors on a single die: one of them is the master that hands out work, and the others do the actual work.
The Cell PPE is a 64-bit RISC PowerPC processor running at 3.2 GHz, two-way hardware multithreaded, with separate L1 caches for instructions and data and a unified L2 cache.
There are eight SPEs per chip; each offers integer and floating point operations and a local store of 256 KB. An interesting choice of the designers was to replace the on-chip cache with this local store (ordinary memory, not associative like a cache). Each SPE offers four floating point units and integer units, so that at any moment four mathematical operations can take place. As you can see, performance comes not only from concurrency across cores, but also from concurrency within each core. To control these units, we use a SIMD (Single Instruction, Multiple Data) flow; this means we can process a vector of four 32-bit elements at the same time, but we can apply only one (and the same) operation to all elements of the vector. Alternatively, we can use half-word data types for the same operations.
Programming Tools and APIs
To program for the Cell, we must use the Cell SDK found in the IBM Cell Broadband Engine Resource Center (http://www.ibm.com/developerworks/power/cell/). You can find there a lot of useful information on both the hardware and the programming model used.
The quickest way to get a functional environment for Cell programming is to download a Cell Virtual Machine, for example, from this page.
Actual programming takes place in Linux using this SDK and, optionally, the Eclipse IDE, for which a couple of plug-ins offer integration with the Cell compiler. The Cell SDK also includes a simulator, which allows you to program for the Cell without access to actual Cell hardware.
The Programming Model
From the programmer's perspective, the language we use is C++, with special extensions. Let's analyse the programming model for Cell.
The usable libraries for the SPEs and the PPE are different (this is normal, considering that the hardware resources are different); this means that some functions work only in SPE code, others only in PPE code; you have to check the manual for this. Usually, the header file that contains the functions gives a hint (for example, spu_mfcio.h is used for SPU memory I/O access).
SPE Threads
A typical Cell program will have a piece of code like the following in the PPE code; it loads the executable images onto the SPEs and starts them, passing the parameters they expect.
void *spe_thread( void *voidarg )
{
    thread_args_t *arg = (thread_args_t *)voidarg;
    unsigned int runflags = 0;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    // runs the SPE program; returns when the SPE code exits
    spe_context_run( arg->spe_context, &entry,
                     runflags, arg->argp, arg->envp, NULL );
    pthread_exit( NULL );
}
void StartSpu(int i, int* originalPieceAddr, int** data)
{
    // pass the piece size and the address of the original piece as a string
    sprintf(buffer, "%d %d", piece_height, (int)originalPieceAddr);
    printf("Started SPU with %d %d %d and buffer %s",
           originalPieceAddr[0], originalPieceAddr[1],
           originalPieceAddr[2], buffer);

    // create a context and register for its interrupt mailbox events
    spe_contexts[i] = spe_context_create( SPE_EVENTS_ENABLE, NULL );
    events[i].spe = spe_contexts[i];
    events[i].events = SPE_EVENT_OUT_INTR_MBOX;
    events[i].data.u32 = i;
    spe_event_handler_register( handler, &events[i] );

    // load the embedded SPE image and start it on its own pthread
    spe_program_load( spe_contexts[i], &spu );
    thread_args[i].spe_context = spe_contexts[i];
    thread_args[i].argp = buffer;
    thread_args[i].envp = buffer;
    pthread_create( &threads[i], NULL, &spe_thread, &thread_args[i] );
}
Mailboxes
Naturally, we want to support communication between the SPEs and the PPE. This is usually done using a feature called mailboxes. A mailbox is used to send short messages (32 bits) from an SPE to the PPE or from the PPE to an SPE, and is generally used for synchronization between the two. We can use the mailboxes in two ways:
- By blocking. The receiver waits until it receives a message, doing nothing in that time.
- By polling. The receiver periodically checks for a message, then does some work and repeats this process until it receives a message.
Obviously, the second option performs better, because it avoids "busy waiting" (with blocking, the PPU waits for a message and cannot do anything else in that time). To initialize the mailbox, a small amount of code is required, like this:
events[i].events = SPE_EVENT_OUT_INTR_MBOX;
// setting the type of events we use
spe_event_handler_register(handler, &events[i] );
//register the handler for the specified events
In the mailbox, the PPE will receive messages from any SPE, so we want to know exactly which SPE sent a message. We do this by associating a number with each SPE, a number that we can obtain along with the received data:
events[i].data.u32 = i;
Waiting for a message in the PPE is done like this:
spe_event_wait( handler, events_generated, NUM_THREADS, -1 );
// events_generated[0].data.u32 identifies the SPE that raised the event
spe_out_intr_mbox_read( events_generated[0].spe,
                        (unsigned int*) &data, 1, SPE_MBOX_ANY_BLOCKING );
And sending a message from the PPE is done like this:
spe_in_mbox_write( events_generated[0].spe, value_p, 1, SPE_MBOX_ANY_BLOCKING);
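On the SPE side, the counterpart functions come from spu_mfcio.h. A minimal sketch of the request/reply pair used throughout this article:
#include <spu_mfcio.h>

/* Send a 32-bit status word to the PPE through the outbound interrupt
   mailbox, then block until the PPE writes a reply to the inbound mailbox. */
unsigned int ask_ppe(unsigned int status)
{
    spu_write_out_intr_mbox(status);   /* stalls if the mailbox is full */
    return spu_read_in_mbox();         /* stalls until the PPE answers  */
}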
DMA Access
As I said earlier, mailboxes are used for synchronization between the PPE and the SPEs, so that the PPE knows what each SPE is doing at a certain time. But we also need to send data to the SPEs. The SPEs are good at tasks involving intense data processing, so sending 32 bits at a time is out of the question. To send data from the PPE to an SPE, we use DMA access. In normal x86 C++ programming, you don't have to use DMA explicitly, but using it like this offers better control, flexibility, and thus performance.
int tag = 30;
int tag_mask = 1<<tag;

// start the transfer from PPE memory into the local store...
mfc_get( (volatile void*)original_piece, (uint64_t)originalPieceAddrAsInt,
         piece_height*piece_height*sizeof(int), tag, 0, 0 );
// ...and wait for the tagged transfer to complete
mfc_write_tag_mask( tag_mask );
mfc_read_tag_status_any();
The actual transfer is done using the mfc_get function. This function starts the transfer of piece_height*piece_height*sizeof(int) bytes from the originalPieceAddrAsInt address in PPE memory to the SPE the code runs on; the data is written starting at original_piece. Tagging is an optional feature of DMA transfers that allows selective waiting for DMA transfers to complete. Generally, when we want to wait for a DMA to complete, we use a read_status function, like this:
mfc_read_tag_status_all();
When we want to wait for a particular DMA, we apply a tag to it when we start it, and when we wait for it, we apply a tag mask:
mfc_write_tag_mask(tag_mask);
mfc_read_tag_status_any();
The tag mask is an integer in which the bit in the position corresponding to the tag of the desired transfer(s) is 1. This corresponds to a bit shift:
int tag_mask = 1<<tag;
Tags make possible an optimization technique called "double buffering". Double buffering essentially means starting a DMA transfer before you actually need the data: you request the data for the next step while processing the data of the current step. By the time you finish processing, the next step's data has already arrived. This hides the memory latency mentioned at the beginning of the article.
Double buffering is well-known for its use in graphical systems where, by keeping the next frame to display in memory, you can reduce flickering.
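To make the pattern concrete, here is a minimal sketch of a double-buffered SPE loop. It is not the project's code (that appears in Part 2): process(), SIZE, and the list of effective addresses ea[] are assumptions for illustration.
#include <stdint.h>
#include <spu_mfcio.h>

#define SIZE 4096                 /* bytes per block; must be a multiple of 16 */

extern void process(char *block); /* hypothetical processing step */

volatile char buf[2][SIZE] __attribute__((aligned(128)));

void double_buffered(uint64_t ea[], int n)
{
    int i, cur = 0;
    if (n > 0)
        mfc_get(buf[0], ea[0], SIZE, 0, 0, 0);   /* prefetch the first block */
    for (i = 0; i < n; i++)
    {
        int next = cur ^ 1;
        if (i + 1 < n)                           /* kick off the next transfer */
            mfc_get(buf[next], ea[i + 1], SIZE, next, 0, 0);
        mfc_write_tag_mask(1 << cur);            /* wait only for the current tag */
        mfc_read_tag_status_all();
        process((char *)buf[cur]);               /* work while the DMA runs */
        cur = next;
    }
}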
SIMD Processing
The Cell SPU libs offer functions that map to the intrinsic ASM instructions of the Cell; the most interesting of these are the SIMD instructions, which allow operations on more than one variable at a time, specifically up to four 32-bit variables (integer or floating point). These operations require a new data type called vector that groups these variables. The main limitation of these vectors is that they must be aligned in memory on 128-bit boundaries. For statically allocated data, this is done using the aligned attribute:
int temp[MAX_LEN] __attribute__ ((aligned(128)));
and for dynamic data by using the memalign function:
piece_pos = (int*)memalign(128, piece_height*piece_height*sizeof(int));
Vector types are vector int, vector float, vector char, etc., and operations include addition, subtraction, division, etc. We can add the four elements found in two vectors A and B and put the result in another vector C (C[0]=A[0]+B[0] ... C[3]=A[3]+B[3]) using the spu_add function, which takes A and B as parameters and returns C. More complicated SIMD functions allow other operations, like comparing four 32-bit numbers concurrently:
// vi - vector int
cmp = spu_sub( *((vi*)puzzle_piece), *((vi*)original_piece) ); // element-wise difference
zero = spu_orx(cmp);  // ORs the four words into element 0; it is 0 only if all are equal
if (*( (int*)(&zero) ) != 0) return 0;
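Putting alignment and the vector types together, a minimal sketch of the spu_add call described above (aligned(16) gives the required 128-bit alignment):
#include <spu_intrinsics.h>

typedef vector signed int vi;

int A[4] __attribute__((aligned(16))) = { 1,  2,  3,  4};
int B[4] __attribute__((aligned(16))) = {10, 20, 30, 40};
int C[4] __attribute__((aligned(16)));

void add_example(void)
{
    /* C[0]=A[0]+B[0] ... C[3]=A[3]+B[3] in a single SIMD operation */
    *((vi*)C) = spu_add( *((vi*)A), *((vi*)B) );
}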
Part 2: Puzzle Solving
The goal of this exercise is to solve a puzzle. We start with two images: the original image and a puzzle image, obtained from the original image by splitting it into pieces and shuffling and rotating the pieces (by 0, 90, 180, or 270 degrees). Knowing the puzzle piece size (in pixels), we want to reconstruct the original image from the two images. The output will be the reconstructed image (so that we can check the result) and a text file showing where each original piece is located in the puzzle image. We want to make the most efficient use of the Cell processor. Also, the program must work correctly with images that contain multiple identical pieces.
My solution includes the use of:
- concurrency through SPEs
- DMA access
- Mailboxes
- Double buffering
I decided to use a simple format for the images: the text-based PPM format (http://netpbm.sourceforge.net/doc/ppm.html); this format stores RGB pixels as ASCII text separated by simple delimiters (like spaces). This makes the task of understanding the image format easy and lets us focus on the processing to be done. It is not the most performance-oriented format, though. We want to be able to process images of normal size (at least 1280*1024 resolution) using pieces of size from 2*2 to 64*64 and 65500 colors (so that a pixel fits a single integer).
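For reference, a complete 2*2 image in the plain (P3) PPM format looks like this: the magic number, the width and height, the maximum channel value, then one R G B triplet per pixel, all in ASCII:
P3
2 2
255
255 0 0    0 255 0
0 0 255    255 255 0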
The basic protocol of communication between the SPEs and the PPE goes like this:
- The PPE starts the SPE threads.
- The PPE remembers the current original_piece and the current puzzle_piece sent to each SPE.
- Each SPE gets an original_piece, then receives puzzle_pieces until it identifies a puzzle piece as identical to the original piece.
- The requests for original_pieces and puzzle_pieces use mailboxes; the actual data is sent through DMA.
- To request another puzzle_piece, the SPU sends back a 0. To announce that the puzzle_piece and the original_piece match, the SPE returns 1, 2, 3, or 4, according to the rotation identified.
- When there are no more puzzle pieces for comparison, the PPE refuses the puzzle_piece request (it sends 0). If there are more puzzle_pieces, it sends a pointer to the region of memory in which the puzzle piece resides.
The main loop for the SPE looks like this:
while ( NULL != (original = RequestNewOriginalPiece(length)) )
{
    FindPuzzlePieceForOriginalPiece(piece_height);
}
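RequestNewOriginalPiece is not shown in the listings above; a plausible sketch of it, following the mailbox/DMA protocol described earlier (the real implementation may differ), would be:
#include <stdint.h>
#include <spu_mfcio.h>

extern int *original_piece;  /* 128-byte-aligned local store buffer (see memalign above) */

int* RequestNewOriginalPiece(int length)
{
    unsigned int addr;
    int tag = 30;

    spu_write_out_intr_mbox(0);          /* ask the PPE for a new original piece */
    addr = spu_read_in_mbox();           /* the PPE answers 0 when out of work   */
    if (addr == 0)
        return NULL;

    /* pull the piece from PPE memory into the local store */
    mfc_get((volatile void*)original_piece, (uint64_t)addr,
            length * 4, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    return original_piece;
}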
The problem that appears is that if we want to transfer a piece with a single DMA transfer, the piece must occupy a contiguous area of memory. If we read the image as it is laid out in the file, each puzzle piece is spread out; thus we need to read each puzzle piece into its own array by recomputing each pixel's coordinates:
for (j = 0; j < height; j++)
    for (i = 0; i < width; i++)
    {
        p_no_y = j / piece_height;
        piece_offset = (j - p_no_y*piece_height) * piece_height;
        p_no_x = i / piece_height;
        piece_offset += i - p_no_x*piece_height;  // pixel offset in piece
        // get number of piece from piece coordinates
        piece_no = p_no_y * (width/piece_height) + p_no_x;

        fscanf(f_original, "%d", &r);
        fscanf(f_original, "%d", &g);
        fscanf(f_original, "%d", &b);
        // cram pixel data in a single int
        original[piece_no][piece_offset] = (r<<16) + (g<<8) + b;
    }
The SPE double buffering is another interesting code area:
turn++;
if (turn % 2 == 1)
{
    // wait for transfer1
    value = (float*)transfer1.addr;
    if ( transfer1.addr != 0 )
    {
        mfc_write_tag_mask(1 << transfer1.tag);
        mfc_read_tag_status_any();
        // use transfer1
        memcpy(input, transfer1.buffer, length*4);

        // start transfer2:
        // - request a new DMA transfer address from the PPE
        spu_write_out_intr_mbox(match_result);
        addr = spu_read_in_mbox();
        transfer2.addr = addr;
        // - initialize the actual transfer; keep the tag in the valid range
        transfer2.tag = transfer1.tag + 1;
        if (transfer2.tag > 30)
        {
            transfer2.tag -= 30;
        }
        tag_mask = 1 << transfer2.tag;
        if (addr != 0)
        {
            mfc_get((volatile void*)transfer2.buffer,
                    (uint64_t)addr, length * 4, transfer2.tag, 0, 0);
        }
        // if addr == 0 there is no transfer to wait for next turn
    }
    // ... (even turns use transfer2 and prefetch into transfer1, symmetrically)
}
The turn variable selects which buffer to actually use; but because the double-buffering structs transfer1 and transfer2 are not held in an array, we cannot easily extend the scheme to triple or quadruple buffering to see whether it further improves performance. Changing this code is a good exercise for the reader.

We take advantage of the SIMD capabilities when we test the two pieces for equality:
int test_equality(int* puzzle_piece, int* original_piece, int piece_height)
{
    vi zero, cmp;
    int i = 0;
    // each iteration compares one 128-bit vector, i.e. four 32-bit pixels
    for (i = 0; i < piece_height*piece_height/4; i++)
    {
        cmp = spu_sub( *((vi*)puzzle_piece), *((vi*)original_piece) );
        zero = spu_orx(cmp);                   // 0 only if the four pixels match
        if (*( (int*)(&zero) ) != 0) return 0;
        puzzle_piece += 4;                     // advance by one vector (4 ints)
        original_piece += 4;
    }
    return 1;
}
An interesting situation that needs to be handled adequately is this: let's assume that the image contains multiple identical pieces. Then the two places in the original image that are the same must be filled with the two distinct pieces from the puzzle image, and not with the same puzzle piece! If the program were sequential, we could prevent this simply by not sending an already-fitted puzzle piece to be checked against another position. But, considering we have 8 concurrent instruction streams (and thus 8 pieces being processed at the same time), this cannot be done. The simplest way to handle this case is to check, when a positive result comes back from an SPE, that the position of the puzzle piece has not already been used. (Attention: checking this only before sending to the SPU is not enough!) Another, more interesting approach would be to always keep different pieces in processing at the same time, but that approach is more complicated to code. A sketch of the simple check follows.
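A minimal sketch of that check on the PPE side; used[], puzzle_pos[], RecordSolution(), and SendNextPuzzlePiece() are hypothetical names introduced for illustration, not the author's exact code:
extern int used[];                 /* one flag per position in the puzzle image  */
extern int puzzle_pos[];           /* the position last sent to each SPE         */
extern void RecordSolution(int spe_no, int rotation);
extern void SendNextPuzzlePiece(int spe_no);

void HandleSpeResult(int spe_no, int match_result)
{
    if (match_result > 0)                     /* 1..4 = match + rotation */
    {
        if (!used[puzzle_pos[spe_no]])
        {
            used[puzzle_pos[spe_no]] = 1;     /* claim this puzzle position */
            RecordSolution(spe_no, match_result);
        }
        else
        {
            /* claimed by another SPE in the meantime: keep searching */
            SendNextPuzzlePiece(spe_no);
        }
    }
    else
    {
        SendNextPuzzlePiece(spe_no);          /* 0 = no match yet */
    }
}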
References
- The SDK and the Programming Standards sections on the Docs page of the IBM Cell Resource Center
- Cell Broadband Engine Programmer's Guide (updated)
- Cell Broadband Engine Programming Tutorial (updated)
- Cell Broadband Engine Programming Handbook
- SIMD Math Library Specification for Cell Broadband Engine Architecture
- C/C++ Language Extensions for Cell Broadband Engine Architecture
- The wonderful course on the Structure of Computer Systems held at the Politehnica University of Bucharest, faculty of Computer Science
License
The code is licensed under a proprietary license. Any use of this code is allowed only with my explicit permission.