Vulcan - Stream compaction

I incorporated from-scratch stream compaction into Vulcan.  If you are not familiar with stream compaction, here is the gist:

Imagine we have a list of "streams", which you can think of as independent portions of the image we have to evaluate in order to finish the render.  Without stream compaction, each of these streams represents a pixel.  This is intuitive to think about - you simply render each pixel independently from each other.  So, on the GPU, each thread is rendering a pixel.  This is where the initial gain in speed comes from, since on a CPU you would usually have 1-8 threads.

Now, some pixels in the image are going to take longer to render than others.  In simple path tracing, if you hit a light you can terminate the algorithm.  So if your light is directly visible in the image, the pixels that it covers will be done after 1 bounce.  Other surfaces, like diffuse, take a much longer time to finish.  So, if you parallelize by pixel, there are a lot of the threads that finish really soon, and a lot that take forever to finish.  That is extremely wasteful, which is where the idea of stream compaction comes into play.

To start, you have to think of these streams in a different way.  Now, each stream is going to represent a single bounce along a path.  In comparison to per-pixel streams, these stream represent a smaller portion of the total image.  So, at each call to the rendering algorithm (in this case path tracing), you take the streams and bounce them 1 more time through the scene.  However, some of these streams may finish their path on that bounce (they hit a light for example).  We no longer need to account for these paths.  So stream compaction is the algorithm to cull these "dead" streams from your dataset.  By compacting your streams, you can work on less data, which means your program will run faster.

Now for some stats.

Here is the "Fresnel flower" render from a few days ago:

This image took 0.304902 seconds per iteration, compared to the render using stream compaction:

which took 0.056904 seconds/iteration.  A roughly 5-6x speedup.