AFAIK, CUDA is strictly for NVIDIA GPUs. Earlier toolkit versions offered an emulation mode that ran on the CPU without an NVIDIA device, but it was dropped around v3.0 (deprecated in 3.0 and removed in 3.1, if I remember correctly). For other hardware there is OpenCL, which has both CPU (SIMD) and GPU implementations. But I prefer CUDA because, in my experience, it is faster than OpenCL on NVIDIA hardware, and its syntax is very similar to OpenCL's. On top of that, NVIDIA, ATI and (to some extent) S3 Graphics all support OpenCL. NVIDIA also has more customers than ATI, though ATI's customer base has grown strongly over the last 6 months.
To run CUDA, an NVIDIA 8-series or later card is required. I work with a GeForce 8500 GT (compute capability 1.1), which is a very low-end card, and its performance is only a little better than that of my Intel Core 2 Duo E8400 @ 3.00 GHz.
Is there a library that wraps the whole thing using CPU / CPU SIMD implementation if no CUDA-supporting card is found, or do you have to require a newer NVIDIA card / code it on your own?
Get Thrust and use the OpenMP back-end. Unfortunately, the back-end is chosen at compile time. I was talking with one of the developers during an NVIDIA class on Thrust, and he is considering a run-time version. You can certainly set up one call to an OpenMP routine and another to a CUDA routine, but that means coding everything twice. When Thrust gets run-time OpenMP/CUDA selection working, it will be much easier.
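To make the "double coding" cost concrete, here is a minimal sketch of choosing between two implementations at run time. Every name in it (`cuda_device_available`, `sum_cpu`, `sum_gpu`, `sum`) is hypothetical and not part of Thrust or CUDA. The device probe is stubbed so the sketch builds without the CUDA toolkit; in a real build it would presumably wrap the CUDA runtime call `cudaGetDeviceCount()`.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical probe. A real version would wrap cudaGetDeviceCount();
// it is stubbed to "no device found" so this compiles without CUDA.
bool cuda_device_available() { return false; }

// CPU path. In practice this could be an OpenMP loop.
long long sum_cpu(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// GPU path placeholder. A real version would launch a kernel or call a
// CUDA library; here it only forwards to the CPU code.
long long sum_gpu(const std::vector<int>& v) {
    assert(cuda_device_available());  // only valid when a device exists
    return sum_cpu(v);
}

// Single entry point: the caller never sees which path was taken.
long long sum(const std::vector<int>& v) {
    return cuda_device_available() ? sum_gpu(v) : sum_cpu(v);
}
```

Callers just write `sum(values)`. The price is that every routine needs both a CPU and a GPU body, which is exactly the duplication a run-time Thrust back-end would remove.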
I don't know if people realise this, but only certain tasks are suitable for parallelism: specifically, tasks that can be split into completely separate units with no data sharing between threads. Once you have data shared by several threads, you have thread synchronisation issues and you are in for some *pain*.
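A minimal sketch of what that shared-data pain looks like in practice, assuming plain C++11 threads rather than CUDA. The function name `count_above_shared` is made up for illustration. Every worker updates one shared counter, so each update has to be guarded by a mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: counts the values above `threshold` using
// `num_threads` workers that all update ONE shared counter.
std::size_t count_above_shared(const std::vector<int>& data, int threshold,
                               unsigned num_threads) {
    assert(num_threads > 0);
    std::size_t count = 0;   // shared by every worker
    std::mutex count_mutex;  // guards `count`

    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            if (data[i] > threshold) {
                // Without this lock, concurrent ++count is a data race.
                std::lock_guard<std::mutex> lock(count_mutex);
                ++count;
            }
        }
    };

    // Split the input into contiguous chunks, one per worker.
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        threads.emplace_back(worker, begin, end);
    }
    for (auto& th : threads) th.join();
    return count;
}
```

The result is correct, but the workers queue up on the lock whenever matches are frequent. Giving each worker its own local count and adding the counts up after the joins would turn this back into a task with completely separate units.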
I write image analysis software, and parallel processing is an absolute must. BUT you'd better be careful - coding by the seat of your pants is asking for trouble. You have to spend some time designing how it's going to work, otherwise it probably won't, or (worse) it will work sometimes, or (even worse) it will work slightly differently each time. Yes, I have been reduced to swearing at my own software. With some careful planning it shouldn't be too painful, just tedious. But you get to watch every core running flat out, which is pretty funny. Hah! Who's making who work now?
I don't know if people realise this, but only certain tasks are suitable for parallelism
Perhaps, perhaps... Obviously you are correct, and at the same time some programmers will always prove you correct, no matter what they write. I have one programmer here who puts all variables, including STL iterators, in the class's private members and accesses them through an Instance pointer. That way every thread always has access to the root thread's data, so he is always sharing data and always has to use a mutex: no routine is re-entrant, and no routine can run in a thread without synchronization.
Some things can be rephrased, and some algorithms can be restructured so that multiple operations can occur at once. A large loop that operates on a large structure (thus sharing data), but touches a different area of memory on every increment of the counter, can be run in parallel because no two iterations operate on the same area of memory. Other methods, such as interleaving, allow even shared areas to be worked on simultaneously. Thus, even in shared-data systems, once you learn the techniques, you can thread without blocking on synchronization.
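Both ideas can be sketched with plain C++11 threads. This is an illustration under assumptions, not anyone's production code, and the names `square_chunked` and `square_interleaved` are made up. The vector is shared by all workers, but each element is written by exactly one of them, so no mutex is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker squares one contiguous slice [begin, end) of the shared vector.
void square_chunked(std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        threads.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] *= data[i];
        });
    }
    for (auto& th : threads) th.join();
}

// Interleaved: worker t handles indices t, t + num_threads, t + 2*num_threads...
// The workers sweep the same region together but never touch the same element.
void square_interleaved(std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&data, t, num_threads] {
            for (std::size_t i = t; i < data.size(); i += num_threads)
                data[i] *= data[i];
        });
    }
    for (auto& th : threads) th.join();
}
```

Either function can be called as `square_chunked(pixels, 4)`. On a CPU the chunked layout is usually the friendlier one, since neighbouring elements written by different threads can share a cache line. On a GPU, adjacent threads touching adjacent elements is typically the preferred access pattern.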