Introduction
The purpose of this article is to offer programmers a bird's-eye view of the programming model for the new PlayStation 3 console. To do this, I structured the article in two parts. The first part presents the hardware architecture, the programming model, its APIs and how to use them, and where to find resources. The second part shows an example application: we will try to solve a puzzle using as many of the programming features offered as possible.

Part 1: The Cell

Historical Context
As the need for processing power grows, hardware designers find it increasingly difficult to satisfy the demand.

Architecture overview
The Cell is a heterogeneous multicore architecture: it has one Power Processing Element (PPE) for control tasks and 8 Synergistic Processing Elements (SPEs) for data-intensive processing.

Programming Tools and APIs
To program for the Cell, we must use the Cell SDK found in the IBM Cell Broadband Engine Resource Center (http://www.ibm.com/developerworks/power/cell/). You can find there a lot of useful information on both the hardware and the programming model.

The Programming Model
From the programmer's perspective, the language we use is C++ with special extensions.

SPE Threads
A typical Cell program has a piece of code like this on the PPE side. It loads the executable images onto the SPEs and starts them, passing the parameters they expect.

// Body of each PPE thread: runs one SPE context until the SPE program exits
void *spe_thread( void *voidarg )
{
    thread_args_t *arg = (thread_args_t *)voidarg;
    unsigned int runflags = 0;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run( arg->spe_context, &entry, runflags, arg->argp, arg->envp, NULL );
    pthread_exit( NULL );
}

// Creates SPE context i, registers it for mailbox events, loads the program and starts its thread
void StartSpu(int i, int* originalPieceAddr, int** data)
{
    sprintf(buffer,"%d %d", piece_height, (int)originalPieceAddr);
    printf("Started SPU with %d %d %d and buffer %s", originalPieceAddr[0], originalPieceAddr[1], originalPieceAddr[2], buffer);
    spe_contexts[i] = spe_context_create( SPE_EVENTS_ENABLE, NULL );
    events[i].spe = spe_contexts[i];
    events[i].events = SPE_EVENT_OUT_INTR_MBOX;
    events[i].data.u32 = i;
    spe_event_handler_register(handler, &events[i] );
    spe_program_load( spe_contexts[i], &spu );
    thread_args[i].spe_context = spe_contexts[i];
    thread_args[i].argp = buffer;
    thread_args[i].envp = buffer;
    pthread_create( &threads[i], NULL, &spe_thread, &thread_args[i] );
}

Mailboxes
Naturally, we want to support communication between the SPEs and the PPE. This is usually done using a feature called mailboxes. A mailbox is used to send short (32-bit) messages from an SPE to the PPE or from the PPE to an SPE, and is generally used for synchronization between the two. We can use the mailboxes in two ways:
events[i].events = SPE_EVENT_OUT_INTR_MBOX; // set the type of events we use
spe_event_handler_register(handler, &events[i] ); // register the handler for the specified events
events[i].data.u32 = i;
spe_event_wait(handler, events_generated, NUM_THREADS, -1);
//printf("Got event! from spe no %d:", events_generated[0].data.u32);
spe_out_intr_mbox_read (events_generated[0].spe,(unsigned int*) &data, 1, SPE_MBOX_ANY_BLOCKING);
spe_in_mbox_write( events_generated[0].spe, value_p, 1, SPE_MBOX_ANY_BLOCKING);
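
To picture how a mailbox behaves, here is a rough portable model of one as a small FIFO of 32-bit messages. This is not the libspe2 API: the type mbox_model_t, the functions mbox_model_write and mbox_model_read, and the depth of 4 entries are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MBOX_DEPTH 4  /* assumed queue depth for this model only */

typedef struct {
    uint32_t slots[MBOX_DEPTH];
    unsigned head;   /* index of the oldest queued message */
    unsigned count;  /* number of messages currently queued */
} mbox_model_t;

/* Queues one 32-bit message. Returns 1 on success, 0 if the mailbox is full. */
static int mbox_model_write(mbox_model_t *m, uint32_t msg)
{
    assert(m != 0);
    if (m->count == MBOX_DEPTH)
        return 0;
    m->slots[(m->head + m->count) % MBOX_DEPTH] = msg;
    m->count++;
    return 1;
}

/* Removes the oldest message into *msg. Returns 1 on success, 0 if empty. */
static int mbox_model_read(mbox_model_t *m, uint32_t *msg)
{
    assert(m != 0 && msg != 0);
    if (m->count == 0)
        return 0;
    *msg = m->slots[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->count--;
    return 1;
}
```

In this model a full mailbox rejects the write and an empty one rejects the read. A blocking call such as the ones above would wait instead of returning 0.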

DMA Access
As I said earlier, mailboxes are used for synchronization between the PPE and the SPEs, so that the PPE knows what each SPE is doing at a given time. But we also need to send data to the SPEs. The SPEs are good at data-intensive tasks, so sending 32 bits at a time is out of the question. To send data from the PPE to an SPE we use DMA. In ordinary x86 C++ programming you don't have to use DMA explicitly, but using it like this offers better control and flexibility, and thus performance.

int tag = 30;
int tag_mask = 1<<tag;
mfc_get((volatile void*)original_piece, (uint64_t)originalPieceAddrAsInt, piece_height*piece_height*sizeof(int), tag, 0, 0);
mfc_write_tag_mask(tag_mask);
mfc_read_tag_status_any();
mfc_read_tag_status_all();
When we want to wait for a particular DMA transfer, we give it a tag when we start it, and when we wait for it we apply the corresponding tag mask:
mfc_write_tag_mask(tag_mask);
mfc_read_tag_status_any();
int tag_mask = 1<<tag;
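
The relation between tags, masks and the two wait variants can be sketched in portable C. The helpers tag_to_mask, tags_any_done and tags_all_done are hypothetical names, not SDK functions. The `completed` word stands in for the set of tag groups with no outstanding transfers, which on the SPU is reported by the MFC.

```c
#include <assert.h>
#include <stdint.h>

/* Bit mask selecting one tag group. Tags are assumed to be in the range 0..31.
   An unsigned shift is used so that tag 31 does not overflow a signed int. */
static uint32_t tag_to_mask(unsigned tag)
{
    assert(tag < 32);
    return (uint32_t)1 << tag;
}

/* "any" condition: at least one of the selected tag groups has completed. */
static int tags_any_done(uint32_t completed, uint32_t mask)
{
    return (completed & mask) != 0;
}

/* "all" condition: every selected tag group has completed. */
static int tags_all_done(uint32_t completed, uint32_t mask)
{
    return (completed & mask) == mask;
}
```

With a single tag in the mask, as in the fragments above, the two conditions coincide. They differ only when several tags are combined with a bitwise OR.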

SIMD Processing
The Cell SPU libraries offer functions that map to intrinsic assembly instructions of the Cell. The most interesting of these are the SIMD instructions, which operate on more than one variable at a time: at most four 32-bit variables (integers or floating point). These operations require a new data type, the vector, which groups these variables. The main limitation of vectors is that they must be aligned in memory on 128-bit (16-byte) boundaries. The examples here request 128-byte alignment, which is stricter and therefore also satisfies that requirement. For statically allocated data this is done with the aligned attribute:
int temp[MAX_LEN] __attribute__ ((aligned(128)));
For dynamically allocated data, memalign does the same:
piece_pos = (int*)memalign(128, piece_height*piece_height*sizeof(int));
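
A quick way to check that a buffer really has the requested alignment is to test the low bits of its address. The sketch below assumes a GCC-compatible compiler for the aligned attribute. MAX_LEN is given an arbitrary value here, and is_aligned is a hypothetical helper, not part of the SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LEN 64  /* arbitrary length chosen for this sketch */

/* Same GCC-style attribute as in the article: 128-byte alignment. */
static int temp[MAX_LEN] __attribute__ ((aligned(128)));

/* Returns 1 if p is aligned to `alignment` bytes; alignment must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

An address that is 128-byte aligned is also 16-byte aligned, so is_aligned(temp, 16) holds as well.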
// vi - vector int
cmp = spu_sub( *((vi*)puzzle_piece), *((vi*)original_piece));
zero = spu_orx(cmp);
if (*( (int*)(&zero) ) != 0)
    return 0;

Part 2: Puzzle Solving
while ( NULL != (original = RequestNewOriginalPiece(length)) )
{
FindPuzzlePieceForOriginalPiece(piece_height);
}
for(j=0; j<height; j++)
    for(i=0; i< width; i++)
    {
        p_no_y = j / piece_height;
        piece_offset = (j - p_no_y*piece_height) * piece_height;
        p_no_x = i / piece_height;
        piece_offset += i - p_no_x*piece_height; // pixel offset in piece
        piece_no = p_no_y* (width/piece_height) +p_no_x; // get number of piece from piece coordinates
        fscanf(f_original, "%d", &r);
        fscanf(f_original, "%d", &g);
        fscanf(f_original, "%d", &b);
        original[piece_no][piece_offset] = (r<<16) + (g<<8) + b; // cram pixel data in a single int
    }
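
The index arithmetic in the loop above can be isolated into two small functions and checked without any image file. The names piece_pos_t, locate_pixel and pack_rgb are introduced here for illustration only. The sketch assumes square pieces and an image width that is a multiple of piece_height, as the loop above seems to.

```c
#include <assert.h>

typedef struct {
    int piece_no;      /* index of the piece, row-major over the grid of pieces */
    int piece_offset;  /* index of the pixel inside that piece, row-major */
} piece_pos_t;

/* Maps image coordinates (i = column, j = row) to a piece and an offset inside it. */
static piece_pos_t locate_pixel(int i, int j, int width, int piece_height)
{
    piece_pos_t pos;
    int p_no_x, p_no_y;

    assert(piece_height > 0 && width % piece_height == 0);
    assert(i >= 0 && i < width && j >= 0);

    p_no_y = j / piece_height;   /* row of the piece in the grid */
    p_no_x = i / piece_height;   /* column of the piece in the grid */
    pos.piece_offset = (j - p_no_y * piece_height) * piece_height
                     + (i - p_no_x * piece_height);
    pos.piece_no = p_no_y * (width / piece_height) + p_no_x;
    return pos;
}

/* Packs three 8-bit channels into one int, red in the highest used byte. */
static int pack_rgb(int r, int g, int b)
{
    assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);
    return (r << 16) + (g << 8) + b;
}
```

For an image 8 pixels wide cut into 4x4 pieces, the pixel in column 5, row 6 falls in piece 3 at offset 9.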
turn++;
if (turn%2 == 1)
{
    // wait for transfer1
    // !!!value = transfer1.addr;
    value = (float*)transfer1.addr;
    // !!!printf("Waiting for transfer 1 at addr %d.\n", value);
    //printf("Waiting for transfer 1 at addr %d.\n", (int)value);
    if ( transfer1.addr != 0)
    {
        //printf("Transfer1 not 0.\n");
        //printf("Checking transfer status for tag = %d",transfer1.tag);
        mfc_write_tag_mask(1<<transfer1.tag);
        mfc_read_tag_status_any();
        // use transfer1
        //printf("Transfer 1 data (%d %d %d) length %d\n", transfer1.buffer[0],transfer1.buffer[1], transfer1.buffer[2], length);
        memcpy(input, transfer1.buffer, length*4);
        // start transfer 2:
        // -request a new backup DMA transfer address
        //printf("Sending request for transfer2.\n");
        spu_write_out_intr_mbox(match_result);
        addr = spu_read_in_mbox();
        //printf("Got addr=%d", addr);
        transfer2.addr = addr;
        // -initialize actual transfer
        transfer2.tag = transfer1.tag + 1;
        if (transfer2.tag>30)
        {
            transfer2.tag -= 30;
            //printf("Decremented tag.\n");
        }
        tag_mask = 1<<transfer2.tag;
        if (addr != 0)
        {
            mfc_get((volatile void*)transfer2.buffer, (uint64_t)addr, length * 4, transfer2.tag, 0, 0);
        }
        else
        {
            //printf("Don't wait\n");
        }
    }
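
The fragment above appears to alternate between two transfers, so that the next piece is already on its way while the current one is being used. A minimal portable sketch of that double-buffering idea follows. It assumes that memcpy can stand in for the asynchronous mfc_get and that summing can stand in for the real work. CHUNK, next_tag and sum_double_buffered are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4  /* hypothetical number of ints per transfer */

/* Next tag in the rotation used by the fragment: increment, wrapping back after 30. */
static int next_tag(int tag)
{
    tag += 1;
    if (tag > 30)
        tag -= 30;
    return tag;
}

/* Sums n_chunks chunks of CHUNK ints. Chunk k+1 is copied into the idle
   buffer before chunk k is processed, mimicking the two alternating transfers. */
static long sum_double_buffered(const int *src, int n_chunks)
{
    int buffers[2][CHUNK];
    long total = 0;
    int k, i;

    assert(src != 0);
    if (n_chunks <= 0)
        return 0;

    memcpy(buffers[0], src, sizeof buffers[0]);      /* prime the first buffer */
    for (k = 0; k < n_chunks; k++) {
        int cur = k % 2;                              /* buffer holding chunk k */
        if (k + 1 < n_chunks)                         /* start fetching chunk k+1 */
            memcpy(buffers[1 - cur], src + (k + 1) * CHUNK, sizeof buffers[0]);
        for (i = 0; i < CHUNK; i++)                   /* use the current buffer */
            total += buffers[cur][i];
    }
    return total;
}
```

On the SPU the copy would run in the background, and a wait on the transfer's tag would be needed before the buffer is read. The synchronous memcpy used here hides that step.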
We take advantage of the SIMD capabilities when we test two pieces for equality:

// Assumes piece_height*piece_height is a multiple of 4 and both buffers are 16-byte aligned
int test_equality(int* puzzle_piece, int* original_piece, int piece_height)
{
    vi zero,cmp;
    int i = 0;
    for(i=0; i<piece_height*piece_height/4; i++)
    {
        cmp = spu_sub( *((vi*)puzzle_piece), *((vi*)original_piece)); // element-wise difference of 4 ints
        zero = spu_orx(cmp); // OR of the 4 differences, placed in the first element
        if (*( (int*)(&zero) ) != 0)
            return 0;
        puzzle_piece += 4;
        original_piece += 4;
    }
    return 1;
}
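
For readers without the SPU toolchain, the same subtract-then-OR test can be modeled in scalar C. In this sketch diff_or4 is a hypothetical stand-in for the spu_sub and spu_orx pair applied to one group of four ints. The function test_equality_scalar also checks any leftover elements when the piece size is not a multiple of 4, which the vector loop does not do.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for spu_sub followed by spu_orx on four 32-bit ints:
   the result is 0 only if all four pairs are equal. */
static uint32_t diff_or4(const int *a, const int *b)
{
    uint32_t acc = 0;
    int k;
    for (k = 0; k < 4; k++)
        acc |= (uint32_t)a[k] - (uint32_t)b[k];
    return acc;
}

/* Returns 1 if the two pieces hold identical pixels, 0 otherwise. */
static int test_equality_scalar(const int *puzzle_piece, const int *original_piece,
                                int piece_height)
{
    int n = piece_height * piece_height;
    int i;

    assert(puzzle_piece != 0 && original_piece != 0 && piece_height >= 0);

    for (i = 0; i + 4 <= n; i += 4)       /* groups of four, like the vector loop */
        if (diff_or4(puzzle_piece + i, original_piece + i) != 0)
            return 0;
    for (; i < n; i++)                     /* leftover elements, if any */
        if (puzzle_piece[i] != original_piece[i])
            return 0;
    return 1;
}
```

Unsigned subtraction is used so that the difference wraps instead of overflowing. It is zero exactly when the two values are equal.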
An interesting situation that needs to be handled properly is this: let's assume that the image contains multiple identical pieces. Then the two identical places in the original image must be filled with two different pieces from the puzzle image, and NOT with the same puzzle piece! If the program were sequential, we could prevent this simply by not sending a puzzle piece that has already been fitted into the image to be checked against another position. But since we have 8 concurrent instruction streams (and thus 8 pieces being processed at the same time), this cannot be done. The simplest way to handle this case is to check, when a positive result comes back from the SPE, that the position of the puzzle piece has not already been used. (Attention: checking this before sending to the SPU is not enough!) Another, more interesting approach would be to always have different pieces in processing at the same time, but this approach is more complicated to code.

References
1. The SDK and the Programming Standards sections in the Docs page of the IBM Cell Resource Center
2. The wonderful course on the Structure of Computer Systems held at the Politehnica University of Bucharest, Faculty of Computer Science

License
The code is licensed under a proprietary license. Any use of this code is allowed only after my explicit permission.