Here is a scene from Voxelstein.
Surprisingly, the rendering is very quick :-)
Download Demo (Cuda 1.1)
It's an alpha version,
so don't expect too much ;-)
Saturday, February 28, 2009
Friday, February 27, 2009
First Color for the new Version
Today the scene got a bit more color.
Benchmarks so far:
1024x1024, 1024 rays : 40ms / 25 fps avg.
1024x768 , 1024 rays : 39ms / 25 fps avg.
1024x768 , 512 rays : 36ms / 27 fps avg.
1024x768 , 256 rays : 36ms / 27 fps avg.
512x512 , 512 rays : 21ms / 47 fps avg.
512x512 , 256 rays : 21ms / 47 fps avg.
512x512 , 128 rays : 21ms / 47 fps avg.
So far I couldn't figure out why fewer rays do not increase the frame rate significantly - the computation cost should be proportional to the number of rays.
Thursday, February 26, 2009
Wednesday, February 25, 2009
New Download
Here you can download the demos shown below.
It's still the old version, so rendering at 1024x768 is not fast yet.
Required: Cuda 1.1
Filesize: 22MB
VoxelDemos.zip
Tuesday, February 24, 2009
Thursday, February 19, 2009
Performance
Today I measured the performance of the current implementation. The result: the scene on the right has about 8.0M RLE elements in the view frustum, of which 4.7M are not culled and 280k are visible, rendered as 450k pixels. At a frame rate of about 25 fps, this means the renderer processes about 117M RLE elements per second (4.7M x 25).
My graphics card's maximum untextured triangle throughput is 280M/s if the triangles share vertices, and about 133M/s if the triangles have independent vertices. The maximum vertex transform rate is about 400M/s.
This means that if the landscape were visualized using splats, each rendered as a single triangle, at least 8M triangles would be required. Without any culling, this would give about 133/8 = 16 fps. Here, the geometry shader might perhaps be used to accelerate the rendering: it would be possible to send only one vertex, from which the geometry shader generates a quad or triangle.
If we visualized each voxel in the landscape using conventional polygons, we would need at least 2 triangles per voxel to create a quad. Taking shared vertices into account, we would have to render at least 16M triangles, resulting in a theoretical frame rate of 280/16 = 17.5 fps.
Friday, February 13, 2009
Still some work
Thursday, February 12, 2009
More Speed
After adjusting a couple of parameters and doing further optimizations, I now get 20 fps at 1024x768. Update: after some more optimizations, 30 fps seems possible at 1024x768 :-)
Wednesday, February 11, 2009
Tips for CUDA programming
If you are thinking of writing a CUDA program, here are a couple of things to keep in mind:
1.) Reduce the number of registers used so that more threads can run in parallel
2.) Reduce the number of memory accesses
3.) Store runtime variables in registers
4.) Do not use local arrays in your code like int a[3]={1,2,3} - better use variables such as a0=1;a1=.. etc. if possible.
5.) Write small kernels. If you have one large kernel, try to split it up into multiple small ones - it might be faster because fewer registers are used.
6.) Use textures to store your data where possible. Texture reads are cached - global memory reads aren't.
7.) Conditional jumps should branch the same way for all threads
8.) Avoid loops that are run only by a minority of threads while the others are idle
9.) Use fast math routines where possible
10.) A complex calculation is often faster than a large lookup table
11.) Writing your own cache manager that uses shared memory for caching might not be an advantage
12.) Try to avoid multiple threads accessing the same memory element (accesses get serialized - also for shared memory)
13.) Try to coalesce global memory accesses.
14.) Try to avoid bank conflicts when reading memory
15.) Small lookup tables can be stored in shared memory
16.) Experiment with the number of parallel threads to find the optimum. In case you run out of registers, use --maxrregcount=...
17.) If you can implement your method using GLSL, it might be faster than CUDA. In GLSL you get a lot of calculations for free, like alpha blending, fog, z-buffer testing and interpolation of variables between pixels, and perhaps better thread handling too. Also, you do not have to copy the rendered image around as a PBO, and you'll save development time since there is no bluescreen from a bad pointer.
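To make tips 1 and 16 a bit more concrete, here is a simplified host-side sketch in C++ of how the register count per thread limits the number of threads that can be resident on one multiprocessor. The two limits are assumptions taken from NVIDIA's documentation for compute capability 1.0/1.1 hardware, and the function name `resident_threads` is made up for illustration. Real hardware allocates per block, so the true number can be lower.

```cpp
#include <algorithm>

// Assumed per-multiprocessor limits for compute capability 1.0/1.1 devices.
// Check the figures for your own GPU before relying on them.
constexpr int kRegistersPerSM  = 8192; // 32-bit registers per multiprocessor
constexpr int kMaxThreadsPerSM = 768;  // maximum resident threads per multiprocessor

// Simplified estimate of resident threads per multiprocessor.
// It ignores block granularity, shared memory use and the limit
// on the number of resident blocks.
inline int resident_threads(int registers_per_thread) {
    int by_registers = kRegistersPerSM / registers_per_thread;
    return std::min(by_registers, kMaxThreadsPerSM);
}
```

Under these assumptions a kernel using 32 registers per thread can keep at most 256 threads resident, one using 16 registers can keep 512, and at 10 registers or fewer the 768-thread limit takes over, so shrinking the register count further would not add more threads.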
Wednesday, February 4, 2009
First Colorful Screenshots