kajiya uses a range of techniques to render an approximation of global illumination in real time. It strikes a compromise between performance and correctness. Being a toy renderer free from the constraints of shipping games, it makes this compromise a bit differently than the large engines out there. The renderer is a vehicle for learning rather than something strictly pragmatic, and some well-known algorithms are deliberately skipped so as not to retrace the same steps.
Here's a 1920x1080 image rendered by kajiya in 8.4 milliseconds on a Radeon RX 6800 XT.
The scene by Olga Shachneva was exported from Unreal Engine 4 via Epic's GLTF Exporter.
For reference, kajiya's built-in path tracer produces the following image in 30 seconds, tracing around 1000 paths per pixel (with caustic suppression, which increases specular roughness for further bounces; see here for an image without caustic suppression):
This serves to illustrate both the renderer's strengths and its weaknesses.
The overall brightness of the scene is similar, with many features preserved, including complex shadowing on rough specular reflections, and roughness map detail (there are no normal maps in those shots):
Rougher surfaces are more difficult to denoise though, and some explicit bias is used, which can distort the shape and intensity of the reflections. This becomes obvious when flipping between the above images.
Sometimes this can manifest as feature loss, for example the thin lines on the floor seemingly disappearing. Note that this is not due to over-filtering but bias in BRDF sampling.
Indirect shadows tend to be rather blurry:
Reflections are not traced recursively, resulting in a less punchy look:
Complex geometry below a certain scale can result in light leaking and temporal instability:
And finally, comparing against the reference image without caustic suppression, multi-bounce specular light transport turns diffuse, reducing contrast, and clamping potentially important features:
Some of those will be possible to improve, but ultimately sacrifices will be necessary to have the global illumination update in real time:
Note that due to how the images are captured here, there's frame-to-frame variability, e.g. different rays being shot, TAA shimmering, GI fluctuations.
Lighting only
Indirect diffuse
Reflections
Direct lighting
The geometry is rasterized into a G-buffer packed in a single RGBA32 image. The four dwords store:
- Albedo (8:8:8, with one byte to spare)
- Normal (11:10:11)
- Roughness & metalness (2xf16; could be packed more)
- Emissive (shared-exponent rgb9e5)
All dielectrics are forced to 4% F0 reflectance.
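For illustration, here's a minimal CPU-side sketch in Rust of how those four dwords could be packed. The helper names are hypothetical, and the f16/rgb9e5 conversions are simplified; kajiya's actual packing lives in its HLSL shaders.

```rust
// Quantize x in [0, 1] to an n-bit unsigned integer.
fn pack_unorm(x: f32, bits: u32) -> u32 {
    let max = ((1u32 << bits) - 1) as f32;
    (x.clamp(0.0, 1.0) * max + 0.5) as u32
}

// Crude f32 -> f16 bit conversion; ignores NaN/Inf/denormals for brevity.
fn f32_to_f16_bits(x: f32) -> u32 {
    let b = x.to_bits();
    let sign = (b >> 16) & 0x8000;
    let exp = ((b >> 23) & 0xff) as i32 - 127 + 15;
    let mant = (b >> 13) & 0x3ff;
    if exp <= 0 { sign } else { sign | ((exp.min(30) as u32) << 10) | mant }
}

// Shared-exponent rgb9e5 packing, simplified relative to the Khronos spec.
fn pack_rgb9e5(rgb: [f32; 3]) -> u32 {
    let max_c = rgb[0].max(rgb[1]).max(rgb[2]).max(1e-20);
    let e = (max_c.log2().floor() as i32 + 1).clamp(-15, 16);
    let scale = (2.0f32).powi(9 - e); // 9-bit mantissas
    let q = |c: f32| ((c * scale) as u32).min(511);
    q(rgb[0]) | (q(rgb[1]) << 9) | (q(rgb[2]) << 18) | (((e + 15) as u32) << 27)
}

fn pack_gbuffer(
    albedo: [f32; 3],
    normal: [f32; 3], // unit length, components in [-1, 1]
    roughness: f32,
    metalness: f32,
    emissive: [f32; 3],
) -> [u32; 4] {
    [
        // Dword 0: albedo at 8:8:8, one byte to spare.
        pack_unorm(albedo[0], 8) | (pack_unorm(albedo[1], 8) << 8) | (pack_unorm(albedo[2], 8) << 16),
        // Dword 1: normal at 11:10:11, remapped from [-1, 1] to [0, 1].
        pack_unorm(normal[0] * 0.5 + 0.5, 11)
            | (pack_unorm(normal[1] * 0.5 + 0.5, 10) << 11)
            | (pack_unorm(normal[2] * 0.5 + 0.5, 11) << 21),
        // Dword 2: roughness and metalness as two f16s.
        f32_to_f16_bits(roughness) | (f32_to_f16_bits(metalness) << 16),
        // Dword 3: emissive as shared-exponent rgb9e5.
        pack_rgb9e5(emissive),
    ]
}
```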
G-buffer albedo
G-buffer roughness
G-buffer metalness
G-buffer normals
Indirect diffuse starts with a half-resolution trace. Rays are launched from the world-space positions corresponding to g-buffer pixels. Since the trace happens at half-resolution, only one in four pixels traces a ray. The pixel in a 2x2 tile chosen for this changes every frame.
Following ReSTIR GI, rays are traced with a hemispherical distribution (not cosine-shaped).
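A minimal sketch of what uniform (rather than cosine-weighted) hemisphere sampling looks like, assuming two random numbers from the renderer's noise source; the function name is hypothetical:

```rust
// Uniform hemisphere sampling from two uniform random numbers in [0, 1).
// The PDF is the constant 1 / (2*pi), unlike the z / pi of cosine sampling.
fn uniform_hemisphere_sample(urand: [f32; 2]) -> [f32; 3] {
    use std::f32::consts::PI;
    let z = urand[0];                      // cos(theta), uniform in [0, 1)
    let r = (1.0 - z * z).max(0.0).sqrt(); // sin(theta)
    let phi = 2.0 * PI * urand[1];
    // Direction in the local frame; rotate into a basis around the
    // g-buffer normal before tracing.
    [r * phi.cos(), r * phi.sin(), z]
}
```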
If the hit point of the ray happens to be visible from the primary camera's point of view, the irradiance from the previous frame is reprojected. Otherwise geometric attributes returned in the ray payload are used by the ray generation shader to perform lighting. An additional ray is potentially used for sun shadows.
The output of this pass is not merely radiance but also:
- Normal of the hit point;
- Ray offset from the trace origin to the hit point.
The results are not used directly for lighting calculations, but fed into ReSTIR reservoirs.
ReSTIR ELI5: Each reservoir remembers its favorite sample. Every frame (ish) you feed new candidates into reservoirs, and they maybe change their minds. They can also gossip between each other (spatial resampling). W makes the math happy. M controls the length of the reservoirs' memory. With just the temporal part, you get a slowdown of noise, but lower variance; that means slower temporal convergence though! Spatial resampling speeds it up again because neighbors likely contain "just as good" samples, and favorites flip often again. Spatial reduces quality unless you're VERY careful and also use ray tracing to check visibility. Clamp M to reduce the reservoirs' memory, and don't feed spatial back into temporal unless starved for samples.
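In code, the favorite-picking above boils down to one-sample weighted reservoir sampling. Here's a textbook sketch (after Bitterli et al. 2020), not kajiya's exact implementation:

```rust
// One-sample weighted reservoir. `p_hat` is the target function
// (here: luminance during the temporal phase).
struct Reservoir<S> {
    sample: Option<S>, // the current "favorite"
    w_sum: f32,        // running sum of candidate weights
    m: f32,            // how many candidates this reservoir has seen
    w: f32,            // the "makes the math happy" factor: w_sum / (m * p_hat)
}

impl<S> Reservoir<S> {
    // Feed one candidate; returns true if the reservoir changed its mind.
    fn update(&mut self, candidate: S, weight: f32, urand: f32) -> bool {
        self.w_sum += weight;
        self.m += 1.0;
        // Keep the new candidate with probability proportional to its weight.
        if urand * self.w_sum < weight {
            self.sample = Some(candidate);
            true
        } else {
            false
        }
    }

    fn finalize(&mut self, p_hat: f32) {
        self.w = if p_hat > 0.0 { self.w_sum / (self.m * p_hat) } else { 0.0 };
    }

    // Clamping M limits the reservoir's memory so that new candidates
    // can still win; kajiya uses e.g. 10 for temporal diffuse resampling.
    fn clamp_m(&mut self, max_m: f32) {
        if self.m > max_m {
            self.w_sum *= max_m / self.m;
            self.m = max_m;
        }
    }
}
```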
One-sample reservoirs are stored at half resolution, and along with them, additional information needed for ReSTIR:
- Origin of the ray currently selected by the reservoir;
- Incident radiance seen through the selected ray;
- Normal of the hit point of the selected ray;
- Offset of the hit point from the trace origin for the selected ray.
Through temporal reservoir exchange and an interpretation of permutation sampling, ReSTIR selects promising samples. Their incident radiance looks much brighter on average, meaning that sample quality is improving.
With just temporal reservoir exchange (M clamped to 10):
Temporal resampling here uses only luminance as the target weight function. The Lambertian BRDF terms will only appear later.
When we add permutation sampling (a form of spatial resampling which gets fed back into temporal resampling in subsequent frames):
Note that we have lost some micro-detail due to naively running the spatial part without any occlusion checks, but our subsequent spatial reuse passes will recover that by being a bit more careful.
After one spatial reuse pass using 8 samples:
After the second spatial reuse pass using 5 samples:
The micro-shadowing is regained because the final pass of spatial reuse performs a minimal screen-space ray march between the center pixel and the hit point of the neighbor (max 6 taps into a half-res depth buffer). Such shadowing is hugely approximate and lossy, but considerably cheaper than additional ray tracing would be.
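A hedged sketch of such a screen-space occlusion check. Depth access is abstracted as a closure; linear depth (larger = farther) is assumed, and the bias is illustrative rather than kajiya's constant:

```rust
// March up to 6 taps of a half-res depth buffer along the screen-space
// segment between the center pixel and the neighbor's hit point.
fn screen_space_occluded(
    start_uv: [f32; 2],
    end_uv: [f32; 2],
    start_depth: f32,
    end_depth: f32,
    sample_depth: impl Fn([f32; 2]) -> f32, // half-res depth buffer lookup
) -> bool {
    const TAPS: u32 = 6;
    for i in 1..=TAPS {
        let t = i as f32 / (TAPS + 1) as f32;
        let uv = [
            start_uv[0] + (end_uv[0] - start_uv[0]) * t,
            start_uv[1] + (end_uv[1] - start_uv[1]) * t,
        ];
        let ray_depth = start_depth + (end_depth - start_depth) * t;
        // If the depth buffer is closer than the interpolated segment
        // (with a small bias), something occludes the neighbor's hit point.
        if sample_depth(uv) < ray_depth - 1e-3 {
            return true;
        }
    }
    false
}
```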
Unlike temporal resampling, the spatial resampling passes use the product of luminance and BRDF weight as the target function. This keeps the samples hemispherically distributed through the temporal phase, then leans towards cosine-distributed in spatial resampling. This approach results in better directionality and lower noise on small elements, while keeping noise reasonably low in general. If the temporal phase also weighted by the BRDF, small elements would often find themselves without good samples.
The spatial resampling passes adjust their kernel radius depending on how many samples the reservoirs hold, becoming sharper over time. SSAO is also used to narrow the kernel in corners. The first resampling pass varies between 12 and 32 pixels in radius, and the second one between 6 and 16. Both use spiral sampling patterns. In order to reduce bias, contributions are weighted based on their normal, depth, and SSAO similarity with the center (half-res) pixel.
To get rid of the 2x2 pixel artifacts, the final ReSTIR resolve uses 4 samples (reservoirs) to reconstruct a full-resolution image. It uses a tiny spiral kernel, jittered per pixel and scaled depending on proximity to surfaces (estimated from ray tracing), and takes a weighted average over the half-resolution contributions based on normal, depth, and SSAO similarity:
This is then thrown at a fairly basic temporal denoiser which uses color bounding box clamping (and is informed by ReSTIR):
Additional noise reduction is performed by TAA at the end of the frame:
The above forms the foundation of a fairly stable, but very laggy, diffuse bounce. If the lighting in the scene changes, the stored reservoir, ray, and radiance information will not be updated, and stale radiance values will be reused through the temporal reservoir exchange. To fix this, we must introduce sample validation from ReSTIR GI.
The basic premise is simple: we must re-trace the samples kept in reservoirs, and check if the radiance they were tracking is still the same.
Ideally we should do that without a 2x cost on ray tracing.
Due to the spatiotemporal reuse of reservoirs, especially the permutation sampling, we can't do this for a fraction of pixels at a time -- if we update some, they might be replaced by the stale ones in the next frame.
We must instead update all reservoirs at the same time. In order to hide the cost, this happens every third frame, and on that frame, no new candidates are generated for ReSTIR. That is, each frame is either a candidate generation frame, or a validation frame. Note that this should not be a hard split -- newly disoccluded pixels should be detected and traced instead of validated.
As for the actual validation process: when the old and new radiance differ significantly, the M of the corresponding reservoir is reduced. Additionally, whenever the ray hits approximately the same point as before, its tracked radiance is also updated. The M clamping ensures that next time new candidates are generated, they will take precedence. The radiance update makes the reaction even faster. Its position check is necessary because the validation rays are shot from old positions, which can cause self-intersection problems on moving geometry.
In order to avoid fireflies, when radiance is updated in this pass, it's only allowed to become 10x brighter than the previous value. This keeps low-probability samples from suddenly hitting bright pixels and having their intensity explode as a product of the high luminance and a large inverse-PDF factor.
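A sketch of such a clamp, assuming it's applied via luminance (the exact rule in kajiya may differ):

```rust
// Allow at most a 10x luminance increase relative to the old value.
fn clamped_radiance_update(prev: [f32; 3], new: [f32; 3]) -> [f32; 3] {
    let lum = |c: [f32; 3]| 0.2126 * c[0] + 0.7152 * c[1] + 0.0722 * c[2];
    let prev_lum = lum(prev).max(1e-8);
    // Scale the new radiance down if it exceeds the 10x budget.
    let scale = (10.0 * prev_lum / lum(new).max(1e-8)).min(1.0);
    [new[0] * scale, new[1] * scale, new[2] * scale]
}
```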
For the sake of performance, the ReSTIR implementation in kajiya is the biased flavor (see the paper). Preserving micro-scale light bounce has proven to be difficult. Unless a very aggressive normal cutoff is used, every spatial resampling pass erodes detail a bit; after the spatiotemporal permutation sampling and two spatial passes, the image is visibly affected.
First the path-traced reference at 10k paths/pixel:
And a naive real-time version. Notice how the corner on the left is darkened, and that the door frame looks rather artificial:
An observation can be made that the corners are not a major source of variance, and don't require all of the ReSTIR machinery:
Following this observation, the diffuse resolve pass performs a near-field / far-field split, and constructs the image from two different sources of information:
- For far hits: ReSTIR reservoirs and their associated ray and radiance data;
- For near hits: the raw ray data which is traced every frame to provide candidates for ReSTIR.
A smooth blending factor is used to combine the two. "Nearness" is determined based on screen-space metrics: for points near the camera, the near threshold is low; for points far from the camera, the near threshold is high.
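As a sketch, the blend factor could look like the following; the constant and the smoothstep shape are assumptions, not kajiya's exact values:

```rust
// The near threshold grows with eye distance, so "near" stays roughly
// constant in screen space.
fn near_field_weight(hit_dist: f32, eye_to_surface_dist: f32) -> f32 {
    let near_threshold = eye_to_surface_dist * 0.2; // scales with depth
    // 1.0 = fully near-field (raw candidate rays),
    // 0.0 = fully far-field (ReSTIR reservoirs); smoothstep in between.
    let t = (hit_dist / near_threshold.max(1e-5)).clamp(0.0, 1.0);
    1.0 - t * t * (3.0 - 2.0 * t)
}
```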
With this tweak applied, we are able to recover much of the micro-detail:
A final complication here comes in the form of the aforementioned ReSTIR sample validation. Since one in three frames does not produce candidates for ReSTIR, it wouldn't have data for the near-field either. While not having new ReSTIR candidates is fine, excluding the near-field from the diffuse resolve pass would bring back some of the darkening and introduce temporal instability. To overcome this, the ray tracing pass is brought back for the validation frame, but it only traces very short rays for the near field. Even with this, the cost of validation frames tends to be lower than that of candidate generation frames.
The diffuse ray tracing described above is not recursive, therefore it only provides a single bounce of light. If that was the entire story, the image would be too dark:
Compared to the reference:
One could use path tracing instead of the single-bounce trace, and that's pretty much what ReSTIR GI does, however that's a rather expensive proposition. The additional bounces of light are often very blurry, and sometimes (mostly in outdoor scenes) don't significantly contribute to the image.
Instead, kajiya uses a low-resolution irradiance cache. It's stored as a set of 12 camera-aligned, sparsely-allocated 32x32x32 clip maps -- meaning that there's a dense top-level 32x32x32x12 indirection array which indexes into a set of payload buffers pre-allocated to a maximum number (65536) of entries.
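Structurally, the layout could be sketched like this; the field names and the SH payload layout are assumptions for illustration:

```rust
const CASCADE_COUNT: usize = 12;
const CASCADE_SIZE: usize = 32; // 32x32x32 voxels per clipmap cascade
const MAX_ENTRIES: usize = 65536;

struct IrradianceCache {
    // Dense top-level indirection: one slot per voxel per cascade, holding
    // an index into the payload buffers (u32::MAX = "unallocated").
    indirection: Vec<u32>, // CASCADE_SIZE^3 * CASCADE_COUNT slots
    // Pre-allocated payload storage, indexed through `indirection`.
    irradiance_sh: Vec<[[f32; 4]; 3]>, // L1 SH (4 coeffs) per RGB channel
    last_used_frame: Vec<u32>,         // for deallocating stale entries
}

impl IrradianceCache {
    fn new() -> Self {
        let voxel_count = CASCADE_SIZE * CASCADE_SIZE * CASCADE_SIZE * CASCADE_COUNT;
        Self {
            indirection: vec![u32::MAX; voxel_count],
            irradiance_sh: vec![[[0.0; 4]; 3]; MAX_ENTRIES],
            last_used_frame: vec![0; MAX_ENTRIES],
        }
    }

    // Flatten a (cascade, voxel coordinate) pair into an indirection slot.
    fn voxel_slot(cascade: usize, coord: [usize; 3]) -> usize {
        ((cascade * CASCADE_SIZE + coord[2]) * CASCADE_SIZE + coord[1]) * CASCADE_SIZE
            + coord[0]
    }
}
```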
Entries (voxels) are allocated on-demand, and deallocated a few frames after they're last used. Note that no requests to the irradiance cache are made if the irradiance can be reprojected from the last frame's indirect diffuse output, therefore voxels in the debug visualization will often flicker to black as they're deallocated:
The cache is not temporally stable, and does not provide a spatially-smooth sampling method.
On the other hand, it is very quick to react to lighting changes, provides a reasonable approximation to multi-bounce diffuse light transport, and, for its relative simplicity, is quite resistant to light leaks.
Unlike other volumetric GI techniques (such as DDGI), this one does not have a canonical point within each voxel from which the rays would be traced. In fact, that point changes every frame for every voxel. The animation below shows cubes at ray trace origins:
The role of the irradiance cache is to answer queries coming from other ray-traced effects. It's never queried directly from any screen pixel; instead, when a diffuse or reflection ray wants to know what the "tail" of light bounces is at its hit point, it asks the cache.
Each query location becomes a candidate for the cache to trace rays from. Among the candidates in a given voxel, one is chosen with uniform probability every frame.
Notice how the candidate positions are offset slightly away from hit points; this is because the cache uses spherical traces in order to calculate directional irradiance.
This voting system makes the cache adapt to how it's used. It tackles the otherwise nightmarish case of thin walls in buildings, where the outside is exposed to intense sunlight, while the inside, not seeing the light source, must be pitch black. If an irradiance cache is not output-sensitive, it will eventually run out of resolution, and produce leaks in this case. Here, when the camera is on the inside, the candidates will also be inside, therefore leaks should not happen. Once the camera moves out, the candidates also appear on the outside.
If every voxel is only ever queried at one point, the irradiance cache can even be exact (although many factors make this impossible in practice). Averaged over time, voxels yield the mean irradiance of their query points. This is somewhat inspired by the multiresolution hash encoding by Müller et al.: their hash maps allow collisions, and then neural nets learn how to resolve them. The cache in kajiya doesn't have any neural nets or multiple overlapping hash maps, but (partially) resolves collisions via a ranking system and normal biasing.
In the animation below, the resolution of the irradiance cache has been reduced, and sky lighting disabled. The interior starts lit, then the sun angle changes, leaving the interior pitch black. Despite the sun still striking one side of the structure, the light does not leak inside as long as the camera is also inside.
For multi-bounce lighting to work, irradiance cache entries should be instantiated not just from the indirect diffuse and reflection rays that originate from the g-buffer, but from the rays that the irradiance cache itself traces to calculate lighting.
This can create a situation where irradiance cache entries on the outside of a structure (such as a building) vote for positions visible from their point of view. If the camera is on the inside, the outside votes can cause leaks:
To demonstrate this in practice, we need a more complex scene. Let's consider Epic's Sun Temple, but instantiated a few times:
On the inside, there is a secluded area lit by emissive torches:
The sun takes many bounces to get there, losing most of its energy. If we disable the torches, then at this exposure level, the image should be black. And yet, the outside votes cause the inside to light up:
Note that for illustration purposes this is still using a reduced irradiance cache resolution.
Intuitively, we don't want a candidate from a further light bounce (counting from the camera) to replace a candidate from an earlier light bounce. To achieve this, each irradiance cache entry stores the lowest light bounce index which accessed it. Anything visible from rays traced from the screen gets rank 1. Any irradiance cache entry spawned from a rank 1 entry gets rank 2, and so on. When a new trace origin vote comes in, it is only considered if its rank is less than or equal to that of the previous vote.
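In sketch form (field names hypothetical):

```rust
struct CacheEntryVote {
    position: [f32; 3],
    rank: u32, // 1 = seen by screen rays, 2 = spawned by rank-1 entries, ...
}

// A new trace-origin vote only replaces the old one if it comes from an
// equal or earlier light bounce.
fn consider_vote(current: &mut CacheEntryVote, new: CacheEntryVote) {
    if new.rank <= current.rank {
        *current = new;
    }
}
```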
With ranking in place, the leaks disappear:
Even with the irradiance cache at normal resolution, there can still be cases where thin surfaces can be seen by indirect rays from both directions. A common occurrence of that is... tables. A table lit from the top should not be causing light leaks at the bottom -- yet that's a difficult case for a meshless irradiance cache.
In order to reduce those leaks, the look-up position into the irradiance cache is offset by the surface normal:
Please note that this is a tradeoff, and sometimes can result in other kinds of collisions, but it tends to work a bit better on average.
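A sketch of the biased lookup; the offset magnitude is an assumption (something on the order of a voxel separates the two sides of a thin surface):

```rust
// Offset the cache lookup along the surface normal so that the top and
// bottom of a thin surface resolve to different cache entries.
fn cache_lookup_pos(pos: [f32; 3], normal: [f32; 3], voxel_size: f32) -> [f32; 3] {
    [
        pos[0] + normal[0] * voxel_size,
        pos[1] + normal[1] * voxel_size,
        pos[2] + normal[2] * voxel_size,
    ]
}
```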
Each cache entry uses temporal reservoir resampling to calculate irradiance. The reservoirs are stratified via a tiny 4x4 octahedral map, and each frame four of the octahedral map pixels generate new candidates. At hit positions of candidate rays, direct lighting from the sun is calculated, and indirect lighting from the irradiance cache is fed back into itself (no double-buffering; race conditions are fine here).
ReSTIR GI-style sample validation is done with another four rays per entry per frame.
After the raygen shader has generated new reservoir candidates, a compute pass convolves the incident spherical radiance from reservoirs into directional irradiance, and stores it as L1 spherical harmonics for sampling by other shaders.
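As a sketch, projecting radiance samples into L1 SH looks like this (one scalar channel shown; the cosine convolution into irradiance and the RGB layout are omitted):

```rust
// Project (direction, radiance) samples over the sphere into the four
// L0/L1 real spherical harmonic coefficients.
fn project_sh_l1(samples: &[([f32; 3], f32)]) -> [f32; 4] {
    let mut sh = [0.0f32; 4];
    for &(dir, radiance) in samples {
        // Real SH basis: Y00, Y1-1 (y), Y10 (z), Y11 (x).
        let basis = [0.282095, 0.488603 * dir[1], 0.488603 * dir[2], 0.488603 * dir[0]];
        for i in 0..4 {
            sh[i] += radiance * basis[i];
        }
    }
    // Monte Carlo normalization for uniform spherical samples:
    // each sample represents 4*pi / N steradians.
    let norm = 4.0 * std::f32::consts::PI / samples.len().max(1) as f32;
    sh.map(|c| c * norm)
}
```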
Much like indirect diffuse, reflections are traced at half resolution. Screen-space irradiance is used whenever the ray's hit point is visible from the primary camera. Reflections are calculated after diffuse, therefore the current frame's data can be used instead of reprojecting the previous frame.
The quality of samples (ray directions) matters a lot here, with blue noise and VNDF sampling being essential.
Note that even with VNDF, some of the generated rays can end up being "invalid" because they point towards the surface rather than away from it. This is where multiple scattering happens -- the ray bounces off a microfacet, and heads inwards towards another one. Following potentially more bounces, the light either gets absorbed, or emerges out. As suggested by the simulations done by Eric Heitz et al., the multiply-scattered ray distribution still resembles the original BRDF shape. For this reason, when VNDF "fails" to generate an outgoing ray direction, it's simply attempted again (up to a few times), until a valid outgoing direction is found. Conservation of energy is assured by using a preintegrated term at the end of the reflection process instead -- along with accounting for the increase in saturation that multiple scattering causes in metals.
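A sketch of that retry loop; `sample_vndf` stands in for a VNDF sampler (Heitz 2018), and the attempt count is an assumption ("up to a few times"):

```rust
// Try to generate a valid outgoing reflection direction by re-drawing
// microfacet normals when the reflected ray points into the surface.
fn sample_reflection_dir(
    wo: [f32; 3], // view direction in the local frame, z = surface normal
    mut sample_vndf: impl FnMut() -> [f32; 3],
) -> Option<[f32; 3]> {
    const MAX_ATTEMPTS: u32 = 4;
    for _ in 0..MAX_ATTEMPTS {
        let m = sample_vndf(); // microfacet normal
        let dot = wo[0] * m[0] + wo[1] * m[1] + wo[2] * m[2];
        // Reflect: wi = 2 * (wo . m) * m - wo.
        let wi = [
            2.0 * dot * m[0] - wo[0],
            2.0 * dot * m[1] - wo[1],
            2.0 * dot * m[2] - wo[2],
        ];
        // Valid only if the ray leaves the surface; otherwise this models a
        // multiple-scattering event, so draw another microfacet normal.
        if wi[2] > 0.0 {
            return Some(wi);
        }
    }
    None // energy is accounted for by the preintegrated term instead
}
```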
When roughness is above a threshold, reflection rays are not traced; instead, the previously-traced diffuse GI rays are used. Despite the different ray generation strategies, PDF weighting ensures correct output.
By following this procedure, we make every ray matter. Even then, the image is not very useful at this stage:
Back in the days of screen space reflections, we could rely on filtered importance sampling to get significant variance reduction. No such luck here -- with ray tracing we don't get prefiltering. Instead, we need to be much better at using those samples.
BRDF importance sampling is great when the scene has fairly uniform radiance. That generally isn't the case in practice. What we need is product sampling: generation of samples proportional to the product of BRDF and incident radiance terms. This is once again accomplished by using ReSTIR.
Similarly to how the indirect diffuse works, we throw the generated samples at temporal reservoir resampling (M clamped to 8). The reservoirs will track the more promising samples.
At present, kajiya doesn't have spatial reservoir exchange for reflections, but it will certainly come in handy for rough surfaces. Even so, the temporal part alone helps tremendously with smooth and mid-rough materials.
Now that we have the reservoir data, we can proceed to resolve a full-resolution image:
Once again, using half-resolution input results in a pixelated look; the noise level is also way too high. To address both issues, eight half-resolution samples are used in the resolve pass. The spatial sample pattern is based on a projection of the BRDF lobe footprint.
Combining reservoir resampling with a neighbor-reusing reconstruction filter provides great sample efficiency, although at the expense of implementation complexity. ReSTIR is not directly compatible with the simple ratio estimation techniques used in some previous work, but they can be mashed together through enough voodoo magic and lerps. Great care is needed to avoid fireflies and black pixels, especially with very smooth materials; more on that in another write-up.
This is too noisy, but it's stable enough to feed into a temporal filter. The one here uses dual-source reprojection and color bounding box clamping (informed by ReSTIR sample validation). Despite its simplicity, it provides decent noise reduction:
TAA handles the final denoising:
To illustrate the win from temporal reservoir resampling, here's how the image looks without it:
Since reflections only use temporal reservoir resampling, they are less sensitive to reusing invalidated reservoirs; we don't need to check them all in the same frame. As such, a simpler scheme is applied here. Instead of temporally staggering the validation traces, they are simply done every frame, but at quarter-resolution (half of the trace resolution).
When a previous ReSTIR sample is detected to have changed sufficiently, its 2x2 quad neighbors are inspected. If a neighbor tracks a point of similar radiance, it is invalidated as well. This gets part of the way towards running the validation at the full trace resolution, at a tiny fraction of the cost.
Shadows are traced at full resolution towards points chosen randomly (with blue noise) on the sun's disk:
They're denoised using a slightly modified version of AMD's FidelityFX Shadow Denoiser. The changes are primarily about integrating it with kajiya's temporal reprojection pipeline -- using a shared reprojection map instead of recalculating it from scratch.
The denoised shadow mask is used in a deferred pass, and attenuates both the diffuse and specular contribution from the sun (sorry, @self_shadow...)
kajiya uses screen-space ambient occlusion, but not for directly modulating any lighting. Instead, the AO informs certain passes, e.g. as a cross-bilateral guide in indirect diffuse denoising, and for determining the kernel radius in spatial reservoir resampling.
It is based on GTAO, but keeps the radius fixed in screen-space. Due to how it's used, we can get away with low sample counts and sloppy denoising:
Without using a feature guide like this, it's easy to over-filter detail:
With the cheap and simple SSAO-based guiding, we get better feature definition:
Note that normally, kajiya uses very little in terms of spatial filtering, but it's forced to do so when ReSTIR reservoirs are starved for samples (e.g. upon camera jumps). If we force the spatial filters to actually run, the difference is a lot more pronounced.
Without the SSAO guide:
And with:
Atmospheric scattering directly uses Felix Westin's MinimalAtmosphere. It drives both the sky and sun color.
A tiny 64x64x6 cube map is generated every frame for the sky. It is used for reflection rays and for sky pixels directly visible to the camera. An even smaller, 16x16x6 cube map is also convolved from this one, and used for diffuse rays.
As alluded to earlier, the global illumination described here is far from perfect. It is a spare-time research project of one person. Getting it to a shippable state would be a journey of its own.
Reflections are not currently traced recursively. At their hit points, direct lighting is calculated as normal, but indirect lighting is directly sampled from the irradiance cache. This is at odds with the design goals of the irradiance cache -- it is merely a Monte Carlo integration shortcut, and not something to be displayed on the screen. As such, whenever irradiance can't be reprojected from the screen, the blocky nature of the cache is revealed:
The irradiance cache is also not temporally stable, which once again becomes clear in reflections (as large-scale fluctuations):
It should be possible to improve the stability of the irradiance cache, and hopefully recursive tracing and filtering of reflections will make these issues less severe.
In the latest version, stochastic interpolation of irradiance cache entries makes this problem less severe.
If the scene contains sources of very high variance, ReSTIR will fail to sufficiently reduce it. For example, in this scene by burunduk lit by emissive torches and candles:
The artifacts become even more pronounced in motion, as newly revealed pixels will not have good samples in reservoirs yet (render frame rate reduced to 10Hz for illustration purposes):
While it might be possible to improve on this with better spatiotemporal reservoir exchange, this is starting to reach a limit of what ReSTIR can do with reasonable quality. A path traced version of this scene at one path per pixel looks like this:
Those emissive surfaces should be handled as explicit light sources in the future.
The denoising presented here needs additional work. Newly revealed areas in particular can appear very noisy.
Stable-state frame:
After moving a large distance to the left within one frame:
In such circumstances, aggressive spatial filtering could help. Conditionally feeding back the output of spatial reservoir resampling into the temporal reservoirs might also speed up convergence.
"Events" view in Radeon GPU Profiler; please observe the additional annotations under the top chart:
kajiya's own performance counters averaged over 30 frames; note that there is some overlap between passes, making this not entirely accurate:
There are two types of rays being traced: shadow and "gbuffer". The latter return gbuffer-style information from hit points, and don't recursively launch more rays. Lighting is done in a deferred way. There is just one light: the sun.
- Irradiance cache: usually fewer than 16k cache entries:
  - Main trace: 4/entry * (1 gbuffer ray + 1 shadow ray for the sun)
  - ReSTIR validation trace: 4/entry * (1 gbuffer ray + 1 shadow ray for the sun)
  - Accessibility check: 16/entry short shadow rays
- Sun shadow pass: 1/pixel shadow ray
- Indirect diffuse trace (final gather) done at half-res; every third frame is a ReSTIR validation frame, which checks the old candidates and updates their radiance instead of tracing new ones. The validation frame also traces very short contact rays; on paper it seems like it would be doing more work, but it's actually slightly cheaper, so it's counted conservatively here:
  - 2/3 frames: regular trace: 0.25/pixel * (1 gbuffer ray + 1 shadow ray)
  - 1/3 frames:
    - validation trace: 0.25/pixel * (1 gbuffer ray + 1 shadow ray)
    - contact trace: 0.25/pixel * (1 gbuffer ray + 1 shadow ray)
- Reflections done at half-res, validation every frame at quarter-res:
  - Main trace: 0.25/pixel * (1 gbuffer ray + 1 shadow ray)
  - Validation trace: 0.0625/pixel * (1 gbuffer ray + 1 shadow ray)
Summing it up, we have:
- Irradiance cache: 128k gbuffer rays and 384k shadow rays
- Sun shadows: 1 shadow ray per pixel
- Final gather: 0.25..0.5 gbuffer rays and 0.25..0.5 shadow rays per pixel
- Reflections: 0.3125 gbuffer rays and 0.3125 shadow rays per pixel
Therefore, averaging:
(0.65/pixel + 128k) gbuffer rays and (1.65/pixel + 384k) shadow rays per frame.