Here is a first attempt at raycasting Perlin noise on the fly for volumetric terrain rendering. The demo uses a 128^3 volume of random data as the base for the scenes in the screenshots above.
By optimizing the empty-space skipping, it is possible to raycast reasonably large outdoor scenes at interactive frame rates (20-40 fps) on an Nvidia GTX 260 GPU. The advantage of this kind of landscape is that it is extremely easy to handle and very memory friendly (just 128^3 RGBA voxels = 8 MB of data). The performance can also be adjusted easily for older graphics cards through the empty-space skipping configuration.
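To illustrate the idea, here is a minimal CPU-side sketch of empty-space skipping in C++. It is not the code from the demo: the brick size of 8, the names `SkipVolume`, `march` and `volumeBytes`, and the fixed unit step are all assumptions made for illustration. The volume keeps one flag per 8^3 brick, and the ray jumps straight to the exit face of any brick that contains no filled voxel.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Bytes needed for a dense n^3 volume, e.g. 128^3 voxels with 4 bytes (RGBA) each.
constexpr std::size_t volumeBytes(std::size_t n, std::size_t bytesPerVoxel) {
    return n * n * n * bytesPerVoxel;
}

// Dense occupancy volume plus a coarse grid of "brick is non-empty" flags.
// Assumption: the edge length n is a multiple of the brick size.
struct SkipVolume {
    int n, brick, nb;
    std::vector<std::uint8_t> solid;      // n^3 voxels, 1 = filled
    std::vector<std::uint8_t> brickUsed;  // nb^3 bricks, 1 = contains a filled voxel

    SkipVolume(int size, int brickSize)
        : n(size), brick(brickSize), nb(size / brickSize),
          solid(static_cast<std::size_t>(size) * size * size, 0),
          brickUsed(static_cast<std::size_t>(nb) * nb * nb, 0) {}

    void fill(int x, int y, int z) {
        solid[(static_cast<std::size_t>(z) * n + y) * n + x] = 1;
        brickUsed[(static_cast<std::size_t>(z / brick) * nb + y / brick) * nb + x / brick] = 1;
    }
};

struct Hit { bool hit; int x, y, z; int steps; };

// March a ray through the volume. Origin o and direction d are in voxel
// units; d is assumed to have unit length so that t += 1 is a one-voxel step.
Hit march(const SkipVolume& v, const double o[3], const double d[3]) {
    const double eps = 1e-3;  // small nudge so the ray lands inside the next brick
    if (d[0] == 0.0 && d[1] == 0.0 && d[2] == 0.0) return {false, 0, 0, 0, 0};
    double t = 0.0;
    int steps = 0;
    for (;;) {
        ++steps;
        double p[3];
        int c[3];
        for (int a = 0; a < 3; ++a) {
            p[a] = o[a] + t * d[a];
            c[a] = static_cast<int>(std::floor(p[a]));
            if (c[a] < 0 || c[a] >= v.n) return {false, 0, 0, 0, steps};  // left the volume
        }
        const std::size_t b =
            (static_cast<std::size_t>(c[2] / v.brick) * v.nb + c[1] / v.brick) * v.nb + c[0] / v.brick;
        if (!v.brickUsed[b]) {
            // Empty brick: advance to the nearest exit face along the ray.
            double tExit = std::numeric_limits<double>::infinity();
            for (int a = 0; a < 3; ++a) {
                if (d[a] == 0.0) continue;
                const int lo = (c[a] / v.brick) * v.brick;
                const double face = d[a] > 0.0 ? lo + v.brick : lo;
                tExit = std::min(tExit, (face - p[a]) / d[a]);
            }
            t += tExit + eps;
            continue;
        }
        const std::size_t i = (static_cast<std::size_t>(c[2]) * v.n + c[1]) * v.n + c[0];
        if (v.solid[i]) return {true, c[0], c[1], c[2], steps};
        t += 1.0;  // occupied brick: sample voxel by voxel
    }
}
```

For a ray that crosses about 100 empty voxels before hitting something, the loop above needs far fewer iterations than plain unit stepping. A larger brick size skips more per jump but makes the flags coarser, which is one way such a configuration could be tuned for slower cards. `volumeBytes(128, 4)` reproduces the 8 MB figure.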
The demo can be downloaded here: Perlin_Noise_Raycasting.zip. Controls are w, s, a, d.
Thursday, 26 November 2009
Saturday, 14 November 2009
SVO-Voxel-Raycasting
Here are some demos of my new sparse voxel octree (SVO) raycaster.
Technical details:
-Storage: approx. 100 bits/voxel
-Stack-based
-Uses a variant of persistent threads
Demo download: SVO-Demo-Cuda.2.3.7z
Monday, 22 June 2009
Tile-based memory layout
Tuesday, 31 March 2009
More Videos
Here are two videos showing the Happy Buddha scene (1024x2048x1024).
High-quality video here: Buddha avi [mirror]
The updated demo download from today (right side, first position in the links) also includes the endless Buddha executable.
Monday, 30 March 2009
Video
For those of you who cannot run the demo for some reason, I have captured a short video of it. You can watch it in the window below or download the larger, higher-quality version to see more detail.
Landscape AVI [mirror]
Saturday, 28 March 2009
CUDA optimizations II
Today I would like to share a couple of interesting references on optimizing CUDA. There are many similarities among these presentations, but they are still worth reading, as going through them gives you new ideas about what is possible.
1.) Optimization Techniques for Large Data Structures on CUDA
2.) AstroGPU - CUDA Optimization Part I
3.) AstroGPU - CUDA Optimization Part II
4.) CUDA Programming Notes
5.) NVISION08: Advanced CUDA: Optimizing to Get 20x Performance
6.) Top 5 Optimization Strategies for CUDA
7.) CUDA at MIT - IAP2009
Looking at slide 3 of the first presentation, using the GPU should give an average speedup of about a factor of 10 over the CPU when the algorithm can be fully SIMD-parallelized (GPU: GTX 280, 933 GFLOPS, 141.7 GB/s memory bandwidth; CPU: Intel Core 2 QX9650, 96 GFLOPS, 12.8 GB/s memory bandwidth).
Looking at Nvidia's CUDA page, I am often surprised to see algorithms that have apparently been sped up 100x or more compared to the CPU. That seems rather hard to believe given the numbers above.
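The factor of about 10 can be checked directly from the peak figures quoted above. The small C++ sketch below only divides the GPU peaks by the CPU peaks; the names `Peak`, `computeRatio` and `bandwidthRatio` are made up for this example, and peak numbers are of course only an upper bound on what real code achieves.

```cpp
// Peak throughput figures of one processor.
struct Peak {
    double gflops;    // peak floating-point rate in GFLOPS
    double gbPerSec;  // peak memory bandwidth in GB/s
};

// Best-case speedup for a purely compute-bound workload.
constexpr double computeRatio(Peak gpu, Peak cpu) { return gpu.gflops / cpu.gflops; }

// Best-case speedup for a purely bandwidth-bound workload.
constexpr double bandwidthRatio(Peak gpu, Peak cpu) { return gpu.gbPerSec / cpu.gbPerSec; }
```

With the GTX 280 (933 GFLOPS, 141.7 GB/s) and the QX9650 (96 GFLOPS, 12.8 GB/s), this gives roughly 9.7 for compute and roughly 11.1 for bandwidth, both consistent with the factor of about 10.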
Monday, 23 March 2009
New Benchmark Version
Today I ported the CUDA version to the CPU (multicore); it is included in the updated demo.
[-Download-] (CUDA 2.1 Required - Driver version 181.20 or newer )
The first results so far are:
CPU (3 GHz Pentium D) - Single/Repeated/Repeated 2xAA: 3/1.2/0.6 fps
CPU (Intel Core2 Quad Q6600, 4x 3 GHz) - Single/Repeated/Repeated 2xAA: 15/8/5 fps
GPU (8800GTS) - Single/Repeated/Repeated 2xAA: 33/24/17 fps
GPU (285GTX) - Single/Repeated/Repeated 2xAA: 44/34/36 fps
The scene this time is the complex version of the one shown in the pictures below (spherescape_complex.rle4).
The low CPU performance is, I guess, mostly due to the many floating-point operations. Changing the calculations to integer might improve the speed. However, this is the fairest possible comparison for now, since CPU and GPU execute the same C++ code.
Thursday, 19 March 2009
CPU vs. GPU
Today I made a comparison of CPU vs. GPU to see whether it was really worth the work to write everything in CUDA rather than for the CPU. [detailed pics] [-CPU-Demo-]
The opponents:
CPU: 3.0 GHz Pentium D, 1 GB vs.
GPU: Nvidia GTX 285, 1 GB
In the first round the CPU performs well compared to the GPU: the GPU is just 3x faster than the CPU.
In the second round, however, the GPU already beats the CPU by a factor of 7.3:1.
In the third round the CPU loses all ground and the GPU wins by about 20:1 (47.5:2.4).
Finally, it would be interesting to know why the GPU does not scale linearly at all. I have no idea why the frame rate is not halved when the computations are doubled, or vice versa.
Wednesday, 18 March 2009
Demo with 2x AA
Small update: the demo linked below now also includes 2xAA (not 2x2!), which significantly reduces the aliasing of distant pixels. On the 8800 GTS it is quite slow right now, but on the GTX 285 I found almost no difference to the normal version.
For the GTS I may apply AA only to distant geometry to increase the speed.
Tuesday, 17 March 2009
Now the algorithm works entirely on the GPU
Today I finished moving the ray generation to the GPU, saving another 1-4 ms as well as an unnecessary memcopy. Silhouette smoothing is also working well, together with basic anti-aliasing (so far only for GTX 2xx cards).
As for the smoothing, I tried two variants (left) and found that the one in the middle looks best so far. The unsmoothed original (top) is too edgy, and the one at the bottom smooths too much for the tree scene, which makes nearby geometry look like a 2D impostor.
The updated demo is here: [-download-] (CUDA 2.1)
It now also contains softening for the Buddha and dragon scenes.
For the experienced among you, the shader folder contains the shader in GLSL (soft.frag). You can experiment a bit by modifying the smoothing.
Sunday, 15 March 2009
Silhouette Smoothing
Saturday, 14 March 2009
Soft Voxels II
Today I improved the filtering a bit. The softening looks nicer than yesterday (it is also a little slower). [-dl-new shaders-]
I'm still not sure whether soft voxels look better than hard-edged voxels in general. They give the impression of missing detail and low resolution, both of which are unwanted.
Real filtering to approximate the surface would be better.
Friday, 13 March 2009
Soft Voxels
Thursday, 12 March 2009
New Release
Today it's time for a new release. Major mapping bugs are fixed and the colors look better now (I hope).
[-Demo Version v2-] (CUDA 2.1)
I also posted the demo as IOTD on GDev, as I think it is worth seeing.
[-link-]
Tuesday, 10 March 2009
Happy Buddha reloaded
Any limit?
View distance set to 4,000,000: still interactive (18 fps). Having unique voxels everywhere is a problem in this case, however.
Here we can also see an advantage of the RLE structure: it is very easy to generate procedural mountains. It might be possible with octree raycasting too, but right now I have no idea how this could be done easily.
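A rough sketch of why RLE makes this easy: for heightfield-style terrain, every (x, z) column is just one solid run followed by one empty run, so a column can be produced directly from a height function. The C++ below is only an illustration under that assumption. The `Run` layout, the function names and the sine-based height function are hypothetical and are not the format used by the demo's .rle4 files.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One run of identical voxels along the vertical axis of a column.
struct Run {
    std::uint16_t length;  // number of voxels in the run
    bool solid;            // true = filled, false = empty
};
using Column = std::vector<Run>;

// Hypothetical procedural height function; any noise function could be used instead.
int terrainHeight(int x, int z, int maxHeight) {
    const double h = 0.5 + 0.25 * std::sin(x * 0.05) + 0.25 * std::cos(z * 0.07);
    return std::max(0, std::min(maxHeight, static_cast<int>(h * maxHeight)));
}

// A heightfield column needs at most two runs: ground, then air.
Column makeColumn(int height, int worldHeight) {
    Column c;
    if (height > 0) c.push_back(Run{static_cast<std::uint16_t>(height), true});
    if (height < worldHeight) c.push_back(Run{static_cast<std::uint16_t>(worldHeight - height), false});
    return c;
}

// Build all columns of a sizeX x sizeZ terrain, stored row by row.
std::vector<Column> makeTerrain(int sizeX, int sizeZ, int worldHeight) {
    std::vector<Column> columns;
    columns.reserve(static_cast<std::size_t>(sizeX) * sizeZ);
    for (int z = 0; z < sizeZ; ++z)
        for (int x = 0; x < sizeX; ++x)
            columns.push_back(makeColumn(terrainHeight(x, z, worldHeight), worldHeight));
    return columns;
}
```

Because each column is generated independently from its coordinates, columns could in principle be produced on demand as the view moves, which fits the very large view distance mentioned above.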
Monday, 9 March 2009
Anti-Aliasing
Friday, 6 March 2009
Maximum complexity?
Thursday, 5 March 2009
Better Performance
Saturday, 28 February 2009
Voxelstein
Here is a scene from Voxelstein.
Surprisingly, the rendering is very quick :-)
Download Demo (Cuda 1.1)
It's an alpha version, so don't expect anything ;-)
Friday, 27 February 2009
First Color for the new Version
Today the scene got a bit more color.
Benchmarks so far:
1024x1024, 1024 rays : 40ms / 25 fps avg.
1024x768 , 1024 rays : 39ms / 25 fps avg.
1024x768 , 512 rays : 36ms / 27 fps avg.
1024x768 , 256 rays : 36ms / 27 fps avg.
512x512 , 512 rays : 21ms / 47 fps avg.
512x512 , 256 rays : 21ms / 47 fps avg.
512x512 , 128 rays : 21ms / 47 fps avg.
So far I couldn't figure out why fewer rays do not increase the frame rate significantly, since the computation cost is proportional to the number of rays.
Thursday, 26 February 2009
Wednesday, 25 February 2009
New Download
Here you can download the demos shown below.
It is still the old version, so rendering at 1024x768 is not fast yet.
Required: CUDA 1.1
Filesize: 22MB
VoxelDemos.zip
Tuesday, 24 February 2009
Thursday, 19 February 2009
Performance
Today I measured the performance of the current implementation. The result: the scene on the right has about 8.0M RLE elements in the view frustum, of which 4.7M are not culled and 280k are visible, rendered as 450k pixels. At a frame rate of about 25 fps, this means the renderer processes about 117M RLE elements per second.
My graphics card's maximum untextured triangle throughput is 280M/s when the triangles share vertices, and about 133M/s when the triangles have independent vertices. The maximum vertex transform rate is about 400M/s.
This means that if the landscape were visualized using splats, each rendered as a single triangle, at least 8M triangles would be required. Without any culling, this would give about 133/8 = 16 fps. Here the geometry shader might be used to accelerate the rendering: it would be possible to send only one vertex, from which the geometry shader generates a quad or triangle.
If we visualized each voxel in the landscape using conventional polygons, we would need at least 2 triangles per voxel to create a quad. Taking shared vertices into account, we would have to render at least 16M triangles, resulting in a theoretical frame rate of 280/16 = 17.5 fps.
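The estimates above are simple rate calculations and can be written down in a few lines of C++. The helper names below are invented for this sketch, and the 117M figure appears to be based on the 4.7M non-culled elements per frame.

```cpp
// Elements processed per second, given the work per frame and the frame rate.
constexpr double elementsPerSecond(double elementsPerFrame, double fps) {
    return elementsPerFrame * fps;
}

// Frame rate reachable if the GPU sustains a given primitive rate
// and every frame needs a given number of primitives.
constexpr double framesPerSecond(double primitivesPerSecond, double primitivesPerFrame) {
    return primitivesPerSecond / primitivesPerFrame;
}

// Figures quoted in the post.
constexpr double kRleThroughput = elementsPerSecond(4.7e6, 25.0);  // about 117.5M elements/s
constexpr double kSplatFps = framesPerSecond(133e6, 8e6);          // about 16.6 fps
constexpr double kQuadFps = framesPerSecond(280e6, 16e6);          // 17.5 fps
```

These are upper bounds derived from peak triangle rates, so a real polygon renderer would likely end up below them.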
Friday, 13 February 2009
Still some work
Thursday, 12 February 2009
More Speed
After adjusting a couple of parameters and doing further optimizations, I now get 20 fps at 1024x768. Update: after some more optimizations, 30 fps seems possible at 1024x768 :-)
Wednesday, 11 February 2009
Tips for CUDA programming
If you are thinking of writing a CUDA program, here are a couple of things to keep in mind:
1.) Reduce the number of registers used so that more threads can run in parallel
2.) Reduce the number of memory accesses
3.) Store runtime variables in registers
4.) Do not use local arrays in your code, such as int a[3]={1,2,3}; use individual variables such as a0=1;a1=.. etc. instead if possible.
5.) Write small kernels. If you have one large kernel, try to split it up into multiple small ones; it might be faster because fewer registers are used.
6.) Use textures to store your data where possible. Texture reads are cached; global memory reads aren't.
7.) Conditional jumps should branch the same way for all threads
8.) Avoid loops that are run by only a minority of threads while the others sit idle
9.) Use fast math routines where possible
10.) A complex calculation is often faster than a large lookup table
11.) Writing your own cache manager that uses shared memory for caching might not be an advantage
12.) Try to avoid multiple threads accessing the same memory element (accesses get serialized, also for shared mem)
13.) Try to coalesce global memory accesses.
14.) Try to avoid bank conflicts when reading memory
15.) Small lookup tables can be stored in shared mem
16.) Experiment with the number of parallel threads to find the optimum. In case you run out of registers, use --maxrregcount=...
17.) If you can implement your method using GLSL, it might be faster than CUDA. In GLSL you get a lot of calculations for free, such as alpha blending, fog, z-buffer testing, and interpolation of variables between pixels, and perhaps better thread handling too. You also do not have to copy the rendered image around as a PBO, and you'll save development time since there is no bluescreen from a bad pointer.
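As a small illustration of tip 13, the host-side C++ sketch below counts how many aligned memory segments one group of threads touches when thread `tid` reads the 4-byte word at index `tid * stride`. It is a deliberately simplified model, not the exact hardware rule: the group size of 16 threads (a half-warp) and the 128-byte segment size are assumptions chosen to resemble GPUs of this generation, and the function name `segmentsTouched` is made up for this example.

```cpp
#include <cstddef>
#include <set>

// Count the distinct aligned segments touched when each thread of a group
// reads one word at index tid * strideWords, starting from an aligned base.
// Fewer segments means fewer memory transactions in this simplified model.
std::size_t segmentsTouched(std::size_t strideWords,
                            std::size_t threads = 16,
                            std::size_t wordBytes = 4,
                            std::size_t segmentBytes = 128) {
    std::set<std::size_t> segments;
    for (std::size_t tid = 0; tid < threads; ++tid) {
        const std::size_t byteAddress = tid * strideWords * wordBytes;
        segments.insert(byteAddress / segmentBytes);
    }
    return segments.size();
}
```

With a stride of 1, neighbouring threads read neighbouring words and the whole group fits into a single segment. With a stride of 32, every thread lands in its own segment, so the same amount of useful data costs 16 segments in this model. Laying out data so that the thread index is the fastest-varying array index is the usual way to get the first case.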
Wednesday, 4 February 2009
First Colorful Screenshots