Today I would like to share a couple of interesting references about optimizing CUDA. There are many similariries among these presentations, but still its interesting as reading through give you new ideas about whats possible.
1.) Optimization Techniques for Large Data Structures on CUDA
2.) AstroGPU - CUDA Optimization Part I
3.) AstroGPU - CUDA Optimization Part II
4.) CUDA Programming Notes
5.) NVISION08: Advanced CUDA: Optimizing to Get 20x Performance
6.) Top 5 Optimization Strategies for CUDA
7.) CUDA at MIT - IAP2009
Looking at foil 3 of the first presentation, using the GPU should give an average speedup of factor 10 compared to the CPU in case the algorithm can be fully SIMD parallized. ( GPU: GTX280, 933GFlops/141.7 GB/s Mem, CPU: Intel Core 2 QX9650, 96 GFlops/12.8 GB/s Mem).
Now looking at NVidias CUDA page, I am often surprised to see that some algorithms seem to have been sped up like 100x or even more, compared to CPU - this seems to be rather hard to believe, taking the numbers above into account.