Transposed Backbuffer

Ethan Watson edited this page Oct 4, 2020 · 3 revisions

The original Doom renderer

If you dig into the internals of Doom, you will come across something entirely non-intuitive - textures are stored rotated and flipped. If you consider a bitmap as a matrix of width*height dimensions, Doom stores the transpose of that matrix: rows are switched with columns, so each column of the image occupies one contiguous run of memory.
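To make the layout concrete, here is a minimal sketch of a transpose on a flat 8-bit bitmap. It assumes a plain width*height array of bytes; `transpose_bitmap` is a hypothetical helper written for this page, not a function from the Doom source, and it says nothing about how the data is packed on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy a row-major bitmap (index = y * width + x) into a column-major
 * one (index = x * height + y). Both buffers hold width * height bytes.
 * transpose_bitmap is a hypothetical helper, not a function from Doom. */
static void transpose_bitmap(const uint8_t *src, uint8_t *dst,
                             size_t width, size_t height)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            /* Pixel (x, y) moves from row y to column x. */
            dst[x * height + y] = src[y * width + x];
        }
    }
}
```

After the call, walking down one column of the image means walking forward through `dst` one byte at a time.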

BIGDOOR2 as you know it; and how it's laid out in memory

This seems unintuitive, but it is a very important optimisation. Allowing only one degree of rotational freedom (yaw) ensured that walls could be rendered straight down the screen with only a scaling operation required. Transposing the textures optimised the read operations: instead of advancing the data pointer by width to reach the next texel down a wall, the renderer only needed to advance it by 1. This was far nicer on the data bus, especially with caching hardware appearing on 386 systems and becoming standard on 486-class systems.
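The access pattern can be sketched as follows. This is an illustration only, assuming an 8-bit row-major framebuffer and 16.16 fixed-point texture stepping; `draw_column_rowmajor` and its parameters are hypothetical names, and the modulo wrap stands in for whatever masking the real renderer uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Draw one vertical wall column into a row-major framebuffer.
 * texcol points at a single texture column whose texels are contiguous
 * in memory (the transposed layout). frac and fracstep are 16.16
 * fixed-point texture coordinates, so scaling costs one add per pixel.
 * All names here are illustrative, not Doom's own. */
static void draw_column_rowmajor(uint8_t *screen, size_t screen_width,
                                 size_t x, size_t y_top, size_t count,
                                 const uint8_t *texcol, size_t tex_height,
                                 uint32_t frac, uint32_t fracstep)
{
    assert(screen != NULL && texcol != NULL && tex_height > 0);
    size_t offset = y_top * screen_width + x;
    for (size_t i = 0; i < count; ++i) {
        /* Read: neighbouring texels are 1 byte apart. */
        screen[offset] = texcol[(frac >> 16) % tex_height];
        /* Write: the next pixel down is a whole row away. */
        offset += screen_width;
        frac += fracstep;
    }
}
```

The reads stay inside one small run of memory, while each write lands `screen_width` bytes after the previous one.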

Write operations needed far less consideration. The VGA hardware's memory was mapped directly into the address space. As there was no cache in between and reading back was not required, the code could write to the output address and get back to reading.

Source ports

Source ports have retained this behavior. The code is portable: Carmack used the same code to render to x86/VGA as well as on the NeXT workstations the game was developed on. The changes source ports make to the renderer are generally to add features.

The problem on modern hardware

Remember how textures are laid out to optimise cache reads? The clue to the problem lies there. We've got optimised reads. But when rendering a column down a conventional row-major backbuffer, you need to advance the write pointer by width instead of 1 for every pixel. This is hell on the cache.

The solution

Transpose the backbuffer. Creating it is simple - just swap your width and height values for whatever API you're using. After that is where it gets a bit tricky. There's a bunch of code that calculates output positions by doing y * width + x. Against a transposed buffer this will give you entirely wrong results. Every one of these needs to become x * height + y.
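A minimal sketch of what that change looks like, assuming an 8-bit backbuffer. The `backbuffer_t` struct, `bb_index` and `bb_fill_column` are hypothetical names made up for this example and do not come from the Rum and Raisin Doom source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backbuffer description; field names are illustrative. */
typedef struct {
    uint8_t *pixels;  /* width * height bytes, stored column by column */
    size_t   width;   /* visible width in pixels */
    size_t   height;  /* visible height in pixels; also the column stride */
} backbuffer_t;

/* Transposed addressing: x * height + y replaces y * width + x. */
static size_t bb_index(const backbuffer_t *bb, size_t x, size_t y)
{
    assert(x < bb->width && y < bb->height);
    return x * bb->height + y;
}

/* Fill a vertical run of pixels; consecutive writes are 1 byte apart. */
static void bb_fill_column(backbuffer_t *bb, size_t x, size_t y_top,
                           size_t count, uint8_t colour)
{
    assert(y_top + count <= bb->height);
    uint8_t *dest = bb->pixels + bb_index(bb, x, y_top);
    for (size_t i = 0; i < count; ++i) {
        dest[i] = colour;
    }
}
```

Funnelling every position calculation through one helper like `bb_index` is a design choice for the sketch: it makes stray y * width + x expressions easier to find, at the cost of touching more call sites.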

There are also a number of functions that rely on the backbuffer being laid out exactly to the original specifications. The status bar is a big one. The automap both defines its bounds according to the status bar and writes its output pixels to those specifications. It's a bit of annoying grunt work, but once it's done it's done.

How it looks in memory now

Thanks to the magic of modern hardware, actually presenting this to the end user in the correct orientation is exactly as efficient as presenting a non-transposed backbuffer is. And in case you're not on modern hardware, Rum and Raisin Doom still retains the SDL 2D renderer usage that Chocolate Doom has. As long as your target supports SDL, you're good to go.

And once it's done, well, everything else we're going to do to the renderer becomes possible.

The performance benefit

You will get a performance benefit straight off the bat.

This is an early performance test, comparing Rum and Raisin Doom's newly-transposed backbuffer with a Chocolate Doom containing only the modifications necessary for high resolution rendering and the profiling code required to report function timings.

The target resolution at that time was 1280x800. The test will be a standard test you'll see repeated often in these documents - Ultimate Doom, 700 frames captured from program startup through to DEMO1.

We have some clear performance gains there. Notice in DEMO1 that the viewpoint gets fairly close to walls. This is where the cache benefits of transposing your buffer truly shine. We're not jumping all over memory any more to render to a given point; we're reading from one cache line and writing to another without large changes in pointer values.

You'll notice a curiosity too: the performance metrics basically match in a lot of instances. One of the presumed performance benefits of not transposing the render buffer is that flats render in a cache-coherent manner. With the transposed buffer, we now advance the write pointer by a whole column (the x * height + y stride) for every pixel of a flat, which is awful on the cache as we know... yet performance is basically not affected. This suggests that reads are in fact the bottleneck with flat rendering on x64 hardware. And we have a way of cutting down reads. But that's for another article.
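For reference, the write pattern for a flat against a transposed buffer looks roughly like this. It is a sketch under assumptions: the texels are taken as already fetched into a small array, and `draw_span_transposed` and its parameters are hypothetical names rather than the project's real span drawer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a horizontal run of texels into a column-major backbuffer.
 * height is the backbuffer height, which is also the distance in bytes
 * between horizontally adjacent pixels. Names are illustrative. */
static void draw_span_transposed(uint8_t *pixels, size_t height,
                                 size_t y, size_t x_left, size_t count,
                                 const uint8_t *texels)
{
    assert(pixels != NULL && texels != NULL && y < height);
    for (size_t i = 0; i < count; ++i) {
        /* One pixel to the right is one whole column further on. */
        pixels[(x_left + i) * height + y] = texels[i];
    }
}
```

Every write here lands `height` bytes after the previous one, which is the same strided pattern that wall columns had before the transpose.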