Monday, May 1, 2017

fast kdtree traversal

I tested several methods for fast kd-tree traversal on the CPU and GPU.

The test scene consists of several meshes with about 30,000 triangles in total. Test rays were sampled uniformly on a sphere. Each leaf holds 16 bounding boxes, all of which are tested for intersection.
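For reference, one common way to generate uniformly sphere-sampled rays is sketched below. It assumes that the ray directions (not the origins) are what is sampled; the function name `sampleUnitSphere` is made up for this example and is not from the test code.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <random>

using Vec3 = std::array<float, 3>;

// Draw one direction uniformly distributed on the unit sphere.
// A uniform z plus a uniform azimuth gives a uniform point on the sphere.
template <class Rng>
Vec3 sampleUnitSphere(Rng& rng) {
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    const float z = 1.0f - 2.0f * uni(rng);       // height along the z axis
    const float phi = 6.28318530718f * uni(rng);  // azimuth in [0, 2*pi)
    const float r = std::sqrt(std::max(0.0f, 1.0f - z * z));
    return {r * std::cos(phi), r * std::sin(phi), z};
}
```

Normalizing a vector of three independent uniform values would not work here, because it biases directions toward the cube corners.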

Test machine: i7 6700K, 32 GB RAM, NVIDIA GTX 1070.
cpu:
2.7 mrs (millions of rays per second). 1 thread, batch = 1 ray
13.4 mrs. 1 thread, batch = 8 rays via AVX
15.5 mrs. 8 threads, batch = 1 ray
65 mrs. 8 threads, batch = 8 rays via AVX

Note that each ray batch must contain very similar rays (origin and direction); otherwise performance is very poor (traversal becomes almost linear for random rays).
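To illustrate why coherence matters, here is a minimal sketch of an 8-ray packet tested against one bounding box with the slab method. The struct layout and the names `RayPacket8`, `Box` and `intersectBox8` are assumptions for this example, and the plain loop over lanes stands in for AVX: with intrinsics each per-lane operation would become one 8-wide instruction.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

constexpr int kLanes = 8;  // one AVX register holds 8 floats

// Structure-of-arrays packet: lane i of every array belongs to ray i.
struct RayPacket8 {
    float ox[kLanes], oy[kLanes], oz[kLanes];  // origins
    float ix[kLanes], iy[kLanes], iz[kLanes];  // 1 / direction, precomputed
};

struct Box { float lo[3], hi[3]; };

// Slab test for all 8 rays; bit i of the result is set if ray i hits the box.
inline std::uint8_t intersectBox8(const RayPacket8& r, const Box& b) {
    std::uint8_t mask = 0;
    for (int i = 0; i < kLanes; ++i) {
        const float o[3] = {r.ox[i], r.oy[i], r.oz[i]};
        const float inv[3] = {r.ix[i], r.iy[i], r.iz[i]};
        float tmin = 0.0f;
        float tmax = std::numeric_limits<float>::max();
        for (int a = 0; a < 3; ++a) {
            const float t0 = (b.lo[a] - o[a]) * inv[a];
            const float t1 = (b.hi[a] - o[a]) * inv[a];
            tmin = std::max(tmin, std::min(t0, t1));
            tmax = std::min(tmax, std::max(t0, t1));
        }
        if (tmin <= tmax) mask = static_cast<std::uint8_t>(mask | (1u << i));
    }
    return mask;
}
```

A packet has to visit a node whenever any bit of the mask is set. For similar rays the mask is usually all ones or all zeros, so the 8 lanes do useful work together; for random rays almost every node has at least one bit set, and the whole packet ends up walking most of the tree.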

gpu (cuda):
46 mrs (grid size 16384, block size 128)

I implemented the GPU traversal from the paper
"Stackless KD-Tree Traversal for High Performance GPU Ray Tracing".
The main compute bottlenecks are warp divergence and low global memory load efficiency.
I interleaved execution with streams, but the host-to-device transfer cost still dominates.
Also, the GPU implementation achieves high performance only for very large batches (>= 10^6 rays). The stackless approach gave about a 5-10% boost.
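The stackless idea in that paper rests on "ropes": each leaf stores, for each of its six faces, a link to the neighbouring node, so a ray leaving a leaf follows the link instead of popping a stack. Below is a minimal CPU-side sketch of that idea in C++, not the GPU kernel itself. The node layout, the hand-built three-leaf tree and the names `Node`, `exampleTree` and `visitLeaves` are all made up for illustration, and primitive intersection with early exit is left out, so the function only reports the order in which leaves are visited.

```cpp
#include <array>
#include <cassert>
#include <limits>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical node layout, for illustration only.
struct Node {
    bool leaf = false;
    int axis = 0;               // inner: split axis (0 = x, 1 = y, 2 = z)
    float split = 0.0f;         // inner: split plane position
    int left = -1, right = -1;  // inner: child indices
    Vec3 lo{}, hi{};            // leaf: bounding box
    std::array<int, 6> rope{{-1, -1, -1, -1, -1, -1}};  // leaf: -x,+x,-y,+y,-z,+z
};

Node makeInner(int axis, float split, int left, int right) {
    Node n; n.axis = axis; n.split = split; n.left = left; n.right = right;
    return n;
}

Node makeLeaf(Vec3 lo, Vec3 hi, std::array<int, 6> rope) {
    Node n; n.leaf = true; n.lo = lo; n.hi = hi; n.rope = rope;
    return n;
}

// Scene box [0,2]x[0,1]x[0,1]: leaf A on the left, leaves B and C on the right.
std::vector<Node> exampleTree() {
    return {
        makeInner(0, 1.0f, 1, 2),                                             // 0: root
        makeLeaf({0.0f, 0.0f, 0.0f}, {1.0f, 1.0f, 1.0f}, {-1, 2, -1, -1, -1, -1}),  // 1: A
        makeInner(1, 0.5f, 3, 4),                                             // 2: y split
        makeLeaf({1.0f, 0.0f, 0.0f}, {2.0f, 0.5f, 1.0f}, {1, -1, -1, 4, -1, -1}),   // 3: B
        makeLeaf({1.0f, 0.5f, 0.0f}, {2.0f, 1.0f, 1.0f}, {1, -1, 3, -1, -1, -1}),   // 4: C
    };
}

// Walk the tree without a stack; the origin is assumed to lie inside the scene box.
std::vector<int> visitLeaves(const std::vector<Node>& nodes, Vec3 o, Vec3 d) {
    std::vector<int> visited;
    float tEntry = 0.0f;
    int n = 0;
    while (n >= 0) {
        // Descend to the leaf containing the current entry point.
        while (!nodes[n].leaf) {
            const Node& in = nodes[n];
            const float p = o[in.axis] + tEntry * d[in.axis];
            const bool goLeft = p < in.split || (p == in.split && d[in.axis] < 0.0f);
            n = goLeft ? in.left : in.right;
        }
        const Node& lf = nodes[n];
        visited.push_back(n);  // a real tracer would intersect the leaf's primitives here

        // Find the face through which the ray leaves this leaf.
        float tExit = std::numeric_limits<float>::max();
        int face = -1;
        for (int a = 0; a < 3; ++a) {
            if (d[a] == 0.0f) continue;
            const float plane = d[a] > 0.0f ? lf.hi[a] : lf.lo[a];
            const float t = (plane - o[a]) / d[a];
            if (t < tExit) { tExit = t; face = 2 * a + (d[a] > 0.0f ? 1 : 0); }
        }
        assert(face >= 0 && "direction must not be the zero vector");
        n = lf.rope[face];  // -1 means the ray left the scene
        tEntry = tExit;
    }
    return visited;
}
```

In this tree, a ray from (0.25, 0.9, 0.5) with direction (1, -0.5, 0) starts in A, follows A's +x rope to the inner node on the right, descends into C, and then follows C's -y rope straight into B. Each thread only carries the current node index and entry distance, which is why the approach suits the GPU, where a per-thread stack costs registers or local memory.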
