Real-time Physically Based Rendering on the CPU using AVX-512
Sample scenes rendered at 1080p on a 4-core laptop CPU @ ~3.5GHz (i5-11320H)
- Programmable vertex and pixel shading via concepts and template specialization
- Deferred rendering
- PBR + Image Based Lighting (or something close to it)
- Shadow Mapping with rotated-disk sampling
- Screen Space Ambient Occlusion at 1/2 resolution (with terrible results)
- Temporal Anti-Aliasing
- Hierarchical-Z occlusion
- Highly SIMD-parallelized pipeline: vertex/pixel shading and triangle setup all work on 16 elements in parallel per thread
- Texture sampling: bilinear filtering, mip mapping, seamless cube mapping, generic pixel formats
- Multi-threaded tiled rasterizer (binning)
- Guard-band clipping
Most of the effects are neither quite correct nor well tuned, since this is a toy project and my focus was more on performance and simplicity than final quality.
The project uses CMake and vcpkg for project and dependency management. Clang or GCC must be used for building, as MSVC builds were initially about 2x slower, and the code now relies on some `__builtin_*` functions. Debug builds are also too slow for most purposes, so release/optimized builds are preferred except for heavy debugging.
- Ensure the `VCPKG_ROOT` environment variable is set to the vcpkg root directory (and pray that CMake picks it up properly on the first try).
- Open the project in VS, VSC, or wherever, or build through the CLI:
cmake -S ./src/SwRast -B ./build/ -G Ninja -DCMAKE_CXX_COMPILER="C:/Program Files/LLVM/bin/clang++.exe" -DCMAKE_C_COMPILER="C:/Program Files/LLVM/bin/clang.exe"
cmake --build ./build/ --config RelWithDebInfo
- Optimizing Software Occlusion Culling
- A trip through the Graphics Pipeline
- Rasterising a Triangle - Interactive Tutorial
- Physically Based Rendering in Filament
- Image Based Lighting with Multiple Scattering
- https://github.com/rswinkle/PortableGL
- https://github.com/karltechno/SoftRast
- https://github.com/Zielon/CPURasterizer
- https://github.com/Mesa3D/mesa
- https://github.com/google/swiftshader
A boring brain dump about a few intricacies I have and haven't tried.
Mip-mapping has a quite significant impact on performance, making sampling at least 2x faster thanks to better caching and reduced memory bandwidth. LOD selection requires screen-space derivatives, but these can be easily approximated by subtracting two permutations (depending on the axis) of the scaled UVs packed in a SIMD fragment - it only takes 3 instructions per dFdx/dFdy call.
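As a rough illustration of the derivative trick, here is a scalar sketch (not the actual SIMD code). It assumes a hypothetical layout where the 16 lanes form a 4x4 pixel tile with `lane = y * 4 + x`; the names `Lanes`, `dFdx`, `dFdy`, and `selectMip` are made up for this example. Each loop body corresponds to what two lane permutes and one subtract would do in the vectorized version.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Assumed layout: 16 lanes form a 4x4 pixel tile, lane = y * 4 + x.
using Lanes = std::array<float, 16>;

// Horizontal difference within each 2x2 quad: right pixel minus left pixel.
Lanes dFdx(const Lanes& v) {
    Lanes d{};
    for (int i = 0; i < 16; i++) d[i] = v[i | 1] - v[i & ~1];
    return d;
}

// Vertical difference within each 2x2 quad: rows are 4 lanes apart.
Lanes dFdy(const Lanes& v) {
    Lanes d{};
    for (int i = 0; i < 16; i++) d[i] = v[i | 4] - v[i & ~4];
    return d;
}

// Common LOD estimate: log2 of the longest footprint axis, measured in texels.
// u and v must already be scaled by the texture size.
std::array<int, 16> selectMip(const Lanes& u, const Lanes& v, int mipCount) {
    Lanes ux = dFdx(u), vx = dFdx(v), uy = dFdy(u), vy = dFdy(v);
    std::array<int, 16> lod{};
    for (int i = 0; i < 16; i++) {
        float lenSq = std::max(ux[i] * ux[i] + vx[i] * vx[i],
                               uy[i] * uy[i] + vy[i] * vy[i]);
        // 0.5 * log2(len^2) == log2(len); the epsilon avoids log2(0).
        float level = 0.5f * std::log2(std::max(lenSq, 1e-12f));
        lod[i] = std::clamp(static_cast<int>(level), 0, mipCount - 1);
    }
    return lod;
}
```

All four pixels of a quad end up with the same derivative, which is also how GPUs typically compute them.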
For RGBA8 textures, bilinear interpolation is done in 16-bit fixed-point using `_mm512_mulhrs_epi16()`, operating on 2 channels at once. It still costs about 2.5x more than nearest sampling, so the shader switches between bilinear for magnification and nearest+nearest mipmap for minification at the fragment level. This hybrid filtering turns out to be quite effective because most samples fall on lower mips for most camera angles, and the aliasing introduced by nearest sampling is relatively subtle.
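To make the fixed-point math concrete, here is a scalar model of what a single 16-bit lane does. `mulhrs` mirrors the documented per-lane behavior of `_mm512_mulhrs_epi16()`; `lerpQ15` and `bilinear` are hypothetical helpers for this sketch, and the channel values are left unscaled for simplicity (a real implementation would likely shift them up first to keep more precision).

```cpp
#include <cstdint>

// Scalar model of one lane of _mm512_mulhrs_epi16: (a * b) >> 15, rounded.
// Relies on arithmetic right shift for negative values (guaranteed since C++20).
int16_t mulhrs(int16_t a, int16_t b) {
    return static_cast<int16_t>((int32_t(a) * int32_t(b) + 0x4000) >> 15);
}

// Linear interpolation between a and b; frac is Q15 in [0, 32767].
int16_t lerpQ15(int16_t a, int16_t b, int16_t frac) {
    return static_cast<int16_t>(a + mulhrs(static_cast<int16_t>(b - a), frac));
}

// Bilinear blend of a 2x2 texel footprint for one 8-bit channel.
uint8_t bilinear(uint8_t c00, uint8_t c10, uint8_t c01, uint8_t c11,
                 int16_t fracX, int16_t fracY) {
    int16_t top = lerpQ15(c00, c10, fracX);
    int16_t bottom = lerpQ15(c01, c11, fracX);
    return static_cast<uint8_t>(lerpQ15(top, bottom, fracY));
}
```

Three multiplies per channel is where the extra cost over nearest sampling comes from, on top of the four texel fetches.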
A micro-optimization for sampling multiple textures of the same size is to pack them into a single layered texture, which can be implemented essentially for free with a single offset. If sample() calls are even partially inlined, the compiler can eliminate duplicated UV setup for all of those calls (wrapping, scaling, rounding, and mip selection), since it can more easily prove that all layers have the same size. This is currently done for the BaseColor + NormalMap/MetallicRoughness + Emission textures, saving about 3-5% from rasterization time.
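A minimal sketch of the layered idea, assuming power-of-two dimensions, repeat wrapping, and nearest sampling at a single mip level. `LayeredTexture`, `texelIndex`, and `fetch` are hypothetical names, not the project's API.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// All layers share the same size and are stored back to back.
struct LayeredTexture {
    uint32_t width, height, layers;
    std::vector<uint32_t> texels;  // width * height * layers entries

    // UV setup (scaling, rounding, wrapping) is done once for all layers.
    uint32_t texelIndex(float u, float v) const {
        auto wrap = [](float t, uint32_t size) {
            int32_t i = static_cast<int32_t>(std::floor(t * size));
            return static_cast<uint32_t>(i) & (size - 1);  // pow2 repeat
        };
        return wrap(v, height) * width + wrap(u, width);
    }

    // Selecting a layer only adds a constant offset to the shared index.
    uint32_t fetch(uint32_t index, uint32_t layer) const {
        return texels[index + layer * width * height];
    }
};
```

A shader would call `texelIndex()` once and then `fetch()` for base color, normal, and emission with layers 0, 1, and 2.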
Seamless cubemaps are relatively important, as they will otherwise cause quite noticeable artifacts on high-roughness materials and pre-filtered environment maps. The current implementation uses a LUT to redirect UVs and offsets near face edges to the nearest adjacent face at sampling time. An easier and likely faster way would be to pre-bake adjacent texels on the same face (or maybe some dummy memory location to avoid non-pow2 strides), so the sampling function could remain mostly unchanged. (But I just don't want to think about cubemaps again any time soon.)
GPUs typically rearrange texture data in some way to improve memory spatial locality, so that nearby texels are always close to each other independent of transformations like rotations. In my limited experiments, this didn't seem to have a significant impact outside of artificial benchmarks. After mip-mapping, sampling seems to get bound by compute more than memory latency.
Below are the benchmark results from scanning over a 2048x2048 texture in row-major order, after applying the specified rotation or zeroing UVs. With smaller textures (1024x1024) the difference is quite negligible, since the data will fit almost entirely in the L3 cache.
Indexing | Zero | 0deg | 45deg | 90deg |
---|---|---|---|---|
Linear | 4.12 ms | 4.30 ms | 11.7 ms | 17.8 ms |
Tiled 4x4 | 4.49 ms | 4.94 ms | 13.7 ms | 15.6 ms |
Z-Order | 4.55 ms | 5.96 ms | 11.0 ms | 12.3 ms |
(By the way, for the Z-order encoding function, it may be slightly faster to use the Galois field instructions instead of SIMD LUTs, see this post.)
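For reference, the three indexing schemes from the table could look roughly like this in scalar form. This is only a sketch: the function names are made up, and the Z-order encoder uses the classic shift-and-mask bit spreading rather than the SIMD LUTs or Galois field instructions mentioned above.

```cpp
#include <cstdint>

// Row-major ("Linear") index.
uint32_t linearIndex(uint32_t x, uint32_t y, uint32_t width) {
    return y * width + x;
}

// 4x4 tiles stored contiguously; assumes width is a multiple of 4.
uint32_t tiledIndex4x4(uint32_t x, uint32_t y, uint32_t width) {
    uint32_t tilesPerRow = width / 4;
    uint32_t tile = (y / 4) * tilesPerRow + (x / 4);
    return tile * 16 + (y % 4) * 4 + (x % 4);
}

// Spread the low 16 bits of v so that a zero bit sits between each of them.
uint32_t spreadBits(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Z-order (Morton) index: x goes to even bits, y to odd bits.
// Assumes a square power-of-two texture.
uint32_t zOrderIndex(uint32_t x, uint32_t y) {
    return spreadBits(x) | (spreadBits(y) << 1);
}
```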
The basic rasterizer always traverses fixed-size tiles (in this case, 4x4 pixels) to test whether any pixels are covered, before doing the depth test and invoking the shader. A coarse rasterizer first rasterizes bigger tiles and collects masks indicating which fragments are partially or fully covered, so it can skip tiles of the bounding box that are outside the triangle much more quickly.
That sounds neat in theory, but the benchmark results for my experimental implementation weren't very promising:
Rasterizer | Time |
---|---|
Fine only | 6.17 ms |
Coarse | 6.63 ms |
Ultimately, this isn't that surprising considering that the basic rasterizer can skip non-covered fragments in about 5 cycles, and most triangles in even relatively simple scenes are very small. For Sponza at 1080p, more than 90% of triangles have bounding boxes smaller than 16x16 when viewed from the center. Setup cost and other overhead account for at least a third of processing time, so it might make sense to use smaller SIMD fragments, or even dynamically switch between them depending on triangle size (with maybe some preprocessor/template magic).
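For context, the coarse test described above boils down to evaluating each edge function at the most and least favorable corner of a tile. The sketch below is a generic scalar version, not the experimental implementation that was benchmarked; `Edge`, `Coverage`, and `classifyTile` are hypothetical names, and samples are assumed to sit at integer pixel coordinates.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Edge function E(x, y) = a*x + b*y + c; a sample is inside when E >= 0
// for all three edges.
struct Edge { int32_t a, b, c; };

enum class Coverage { Outside, Partial, Full };

// Classify a square tile whose samples span [x0, x0+size-1] x [y0, y0+size-1].
Coverage classifyTile(const std::array<Edge, 3>& edges,
                      int32_t x0, int32_t y0, int32_t size) {
    const int32_t span = size - 1;
    bool full = true;
    for (const Edge& e : edges) {
        int32_t base = e.a * x0 + e.b * y0 + e.c;
        // Largest and smallest value of E over the tile's corners.
        int32_t maxV = base + (std::max<int32_t>(e.a, 0) + std::max<int32_t>(e.b, 0)) * span;
        int32_t minV = base + (std::min<int32_t>(e.a, 0) + std::min<int32_t>(e.b, 0)) * span;
        if (maxV < 0) return Coverage::Outside;  // every sample fails this edge
        if (minV < 0) full = false;              // some sample fails this edge
    }
    // Partial is conservative: the tile may still contain no covered pixels.
    return full ? Coverage::Full : Coverage::Partial;
}
```

Tiles classified as `Full` can skip per-pixel coverage tests entirely, while `Outside` tiles are skipped altogether; only `Partial` tiles need the fine rasterizer.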
Use of multi-threading is currently fairly limited and only done for bin rasterization and full-screen passes, using the parallel loops from `std::for_each(par_unseq)`. This is far from optimal because it leads to stalls between vertex shading/triangle setup and rasterization, so the CPU is never fully busy. It could probably be improved to some extent without complicating state and memory management too much, but threading is hard... Maybe OpenMP would be nice for this.