A better hashtable implementation #10897

Akaricchi · 2024-09-18T21:39:02Z

The original SDL hashtable implementation is pretty naive. Every key-value pair is stored in a separate allocation, and collisions are resolved via a linked list. While this scheme is very simple to implement, it's bad for cache locality (and therefore performance).

This new implementation uses an open-addressing scheme, which places every pair in one contiguous array instead of separate buckets. It automatically grows as the load factor increases, which was a TODO in the old code. Linear probing is used to resolve collisions, which keeps colliding items close in memory. Robin hood hashing is used to greatly reduce variance in the probe sequence length across all items. The backward shifting optimization for deletions (described at the previous link) is also implemented, but "smart search" is not.

This is a very versatile hashtable optimized for lookup performance. I originally wrote this for Taisei Project, where it served us well in hot paths for years. The motivation for porting it was to speed up some hash lookups in the Vulkan GPU backend.

It's definitely not the most sophisticated or optimal algorithm, but I think it's a good balance between performance and implementation simplicity. The main thing holding it back, though, is that SDL's hashtables are not templated, so we're stuck with pointer-sized keys and values. It's often just barely not enough, requiring us to malloc the key or do other silly things. This throws a wrench into all the cache-locality goodness, though doing less pointer chasing is still beneficial in the end.

slouken · 2024-09-18T22:00:49Z

Thanks for the contribution! Do you have any performance numbers to compare before and after this change?

Akaricchi · 2024-09-18T22:25:28Z

I only had some perf traces from an experimental vulkan GPU branch. I wanted to write an actual test and a benchmark for this, but since it's an internal API, I'm not sure where to put that code. I had to resort to some embarrassing hacks to get some temporary test code to run as I was working on this. Can you help me out here?

I might just copy-paste both implementations into an existing benchmark harness to get you some raw numbers tomorrow.

icculus · 2024-09-18T23:42:39Z

I'm totally fine with changing out the hashtable implementation...the code was just meant to be Good Enough but definitely has room for improvement.

I haven't looked at the patch yet but the arguments in the PR description are valid.

This implementation is coming from an MIT-licensed program; do you have permission to relicense this piece under the zlib license?

Akaricchi · 2024-09-19T00:14:36Z

This implementation is coming from an MIT-licensed program; do you have permission to relicense this piece under the zlib license?

I wrote the original code and I'm the de-facto lead developer of that project, so I'm pretty sure I do :)

slouken · 2024-09-19T04:59:04Z

I might just copy-paste both implementations into an existing benchmark harness to get you some raw numbers tomorrow.

It's probably okay to add testhashtable, or something to one of the automated tests and #include "../src/SDL_hashtable.c". It would definitely be worthwhile to add something to the testautomation suite that added random values to the hashtable to verify correctness.

This is mostly ported from Taisei Project

Akaricchi · 2024-09-22T01:00:02Z

Didn't have time for a test or dedicated benchmark yet, but I've been comparing Vulkan performance against the default hashtable as @thatcosmonaut had been working on #10910. An experimental optimization that didn't make it into that patchset was using a hashtable to track bound resources.

Under a "normal" load (running a Taisei replay without the framerate limit) the performance initially seemed worse than a linear search through an array. With the new hashtable, it was on par for this workload. Under a "torture" load that's very heavy on resource tracking (another Taisei replay, but with sprite batching disabled), the hashtable had a significant performance boost. Swapping the original hashtable for the new one pushed it further ahead.

Here are some numbers from that test:

descriptor_pool_rewrite:            16505 frames in 61.03 sec ~= 270.43 FPS
bound_hashtable:                    16505 frames in 59.24 sec ~= 278.60 FPS
bound_hashtable + robinhood:        16505 frames in 54.10 sec ~= 305.07 FPS

Note that this is far from a pure hashtable benchmark, there is a ton of overhead here. Still, ~27 extra FPS just from swapping the hashtable implementation doesn't sound too bad.

We've decided to revisit that optimization once this PR is merged.

icculus · 2024-09-29T06:14:09Z

I wanted to poke at this first, so here's a naive test case.

I tried to find some convenient existing benchmark to slot this into, but my quick look at several were sort of a pain to get wired into, so here's a little brute-force instead:

#include <SDL3/SDL.h>

#define TOTAL_ITERATIONS 1
//#define MAX_INSERTS 10000

#define START_TIMING() do { start = SDL_GetTicks(); } while (false)
#define END_TIMING(what) do { what += SDL_GetTicks() - start; } while (false)

int main(int argc, char **argv)
{
    Sint64 counter;
    char *prev;
    char *ptr;

    Uint64 start = 0;
    Uint64 creation = 0;
    Uint64 insertion = 0;
    Uint64 good_lookup = 0;
    Uint64 bad_lookup = 0;
    Uint64 remove_keys = 0;
    Uint64 destruction = 0;

    SDL_Init(0);

    size_t datalen = 0;
    void *data = SDL_LoadFile("/usr/share/dict/american-english", &datalen);
    if (!data) {
        SDL_Log("Failed to load dictionary: %s", SDL_GetError());
        return 1;
    }

    #ifdef MAX_INSERTS
    prev = (char *) data;
    counter = 0;
    while ((ptr = (char *) SDL_strchr(prev, '\n')) != NULL) {
        if (++counter >= MAX_INSERTS) {
            *ptr = '\0';
            break;
        }
        prev = ptr + 1;
    }
    #endif

    for (int i = 0; i < TOTAL_ITERATIONS; i++) {
        START_TIMING();
        SDL_PropertiesID props = SDL_CreateProperties();
        END_TIMING(creation);

        //char lastchar = 0;

        prev = (char *) data;
        counter = 0;
        START_TIMING();
        while ((ptr = (char *) SDL_strchr(prev, '\n')) != NULL) {
            *ptr = '\0';
            //SDL_Log("INSERT: %s", prev);
            //if (*prev != lastchar) { lastchar = *prev; SDL_Log("INSERT: '%s'", prev); }
            const bool rc = SDL_SetNumberProperty(props, prev, counter);
            SDL_assert(rc == true);
            *ptr ='\n';
            prev = ptr + 1;
            counter++;
        }
        END_TIMING(insertion);

        prev = (char *) data;
        counter = 0;
        START_TIMING();
        while ((ptr = (char *) SDL_strchr(prev, '\n')) != NULL) {
            *ptr = '\0';
            //SDL_Log("GOOD LOOKUP: %s", prev);
            //if (*prev != lastchar) { lastchar = *prev; SDL_Log("GOOD LOOKUP: '%s'", prev); }
            const Sint64 rc = SDL_GetNumberProperty(props, prev, -1);
            SDL_assert(rc == counter);
            *ptr ='\n';
            prev = ptr + 1;
            counter++;
        }
        END_TIMING(good_lookup);

        prev = (char *) data;
        counter = 0;
        START_TIMING();
        while ((ptr = (char *) SDL_strchr(prev, '\n')) != NULL) {
            const char ch = *prev;
            *ptr = '\0';
            //SDL_Log("BAD LOOKUP: %s", prev);
            //if (*prev != lastchar) { lastchar = *prev; SDL_Log("BAD LOOKUP: '%s'", prev); }
            *prev = '*';  // so it doesn't match anything in the hash.
            const Sint64 rc = SDL_GetNumberProperty(props, prev, -1);
            SDL_assert(rc == -1);
            *prev = ch;
            *ptr ='\n';
            prev = ptr + 1;
            counter++;
        }
        END_TIMING(bad_lookup);

        prev = (char *) data;
        counter = 0;
        START_TIMING();
        while ((ptr = (char *) SDL_strchr(prev, '\n')) != NULL) {
            *ptr = '\0';
            //SDL_Log("REMOVE: %s", prev);
            //if (*prev != lastchar) { lastchar = *prev; SDL_Log("REMOVE: '%s'", prev); }
            SDL_ClearProperty(props, prev);
            *ptr ='\n';
            prev = ptr + 1;
            counter++;
        }
        END_TIMING(remove_keys);

        START_TIMING();
        SDL_DestroyProperties(props);
        END_TIMING(destruction);
    }

    SDL_Log("Creation time: %" SDL_PRIu64 "ms", creation / TOTAL_ITERATIONS);
    SDL_Log("Insertion time: %" SDL_PRIu64 "ms", insertion / TOTAL_ITERATIONS);
    SDL_Log("Good lookup time: %" SDL_PRIu64 "ms", good_lookup / TOTAL_ITERATIONS);
    SDL_Log("Bad lookup time: %" SDL_PRIu64 "ms", bad_lookup / TOTAL_ITERATIONS);
    SDL_Log("Remove keys time: %" SDL_PRIu64 "ms", remove_keys / TOTAL_ITERATIONS);
    SDL_Log("Destruction time: %" SDL_PRIu64 "ms", destruction / TOTAL_ITERATIONS);

    SDL_Quit();
    return 0;
}

It loads every word from Ubuntu's 24.04.1's /usr/share/dict/american-english, which is 104334 of them, some of them with non-ASCII chars in UTF-8 encoding. Then it times how long it takes to insert them all, then look them all up, then lookup the same amount of non-existing keys, then remove them all. It uses SDL_Properties for this just because it's a handy public API that is basically just a layer over the hashtable.

A single run through this with our current hashtable:

Creation time: 0ms
Insertion time: 54413ms
Good lookup time: 29820ms
Bad lookup time: 87701ms
Remove keys time: 30020ms
Destruction time: 0ms

This is brutal, because the current implementation doesn't grow or rehash the table, and SDL_Properties only asks for four buckets (which is fine, generally we aren't putting more than a few things in one of these tables), so basically all the buckets fill up and then you end up with unsorted linked-lists that are 20,000+ items long.

So fine, let's give SDL_Properties 16k buckets, to remove the obvious bottleneck:

Creation time: 0ms
Insertion time: 36ms
Good lookup time: 24ms
Bad lookup time: 49ms
Remove keys time: 32ms
Destruction time: 4ms

That's significantly better!

BUT.

Here's the same run with 904222c applied and just 4 buckets:

Creation time: 0ms
Insertion time: 24ms
Good lookup time: 8ms
Bad lookup time: 7ms
Remove keys time: 11ms
Destruction time: 0ms

Significantly faster and doesn't need tons of buckets. In fact, these numbers were identical with 16k buckets.

Let's merge.

icculus

I'd rather this not reference the C runtime's assert(), btw, but we can deal with that after merging.

slouken · 2024-09-29T11:11:26Z

Thank you!

sezero · 2024-09-29T11:24:14Z

A few sdl2-compat tests are segfault'ing in CI MSVC runners and I think it coincides with this patch's merge. Can someone have a look?

sezero · 2024-09-29T11:33:43Z

MSVC builds are failing in CI

slouken · 2024-09-29T11:35:22Z

Looking now...

sezero · 2024-09-29T11:53:41Z

MSVC build errors should be fixed by something like below. Don't know about sdl2-compat test failures, though.

diff --git a/src/SDL_hashtable.c b/src/SDL_hashtable.c
index da76b03..99552b2 100644
--- a/src/SDL_hashtable.c
+++ b/src/SDL_hashtable.c
@@ -314,7 +314,8 @@ bool SDL_InsertIntoHashTable(SDL_HashTable *restrict table, const void *key, con
         return false;
     }
 
-    return insert_item(&new_item, table->table, table->hash_mask, &table->max_probe_len);
+    item = insert_item(&new_item, table->table, table->hash_mask, &table->max_probe_len);
+    return item != NULL;
 }
 
 bool SDL_FindInHashTable(const SDL_HashTable *table, const void *key, const void **_value)
@@ -330,7 +331,7 @@ bool SDL_FindInHashTable(const SDL_HashTable *table, const void *key, const void
     i = find_first_item(table, key, hash);
     *_value = i ? i->value : NULL;
 
-    return i;
+    return i != NULL;
 }
 
 bool SDL_RemoveFromHashTable(SDL_HashTable *table, const void *key)

slouken · 2024-09-29T12:07:19Z

The build is fixed.

sezero · 2024-09-29T12:44:54Z

The sdl2-compat MSVC builds' tests no longer seg, either.

Akaricchi · 2024-09-29T12:51:31Z

Thanks for taking care of the benchmark!

DanielGibson · 2024-10-09T02:43:41Z

This doesn't build with older MSVC versions, because they don't support restrict (they do support __restrict though).
According to gcc.godbolt.org it must have started supporting it with 16.6 or 16.7 (they don't have 16.6, it fails with 16.5 but builds with 16.7): https://gcc.godbolt.org/z/dEPdT8z78
"16.x" are versions of VS2019, so not super-old.

I couldn't find anything regarding restrict on MSDN, only about __restrict and __declspec(restrict)

DanielGibson · 2024-10-09T02:51:15Z

#if defined(_MSC_VER) && _MSC_VER < 1927
// apparently MSVC only started supporting restrict (instead of just __restrict)
// with VS2019 16.6 or 16.7
#define restrict __restrict
#endif

makes it build with VS2017, but I get the following warnings:

2>D:\dev\SDL\src\SDL_hashtable.c(295): warning C4028: formal parameter 1 different from declaration
2>D:\dev\SDL\src\SDL_hashtable.c(478): warning C4028: formal parameter 1 different from declaration

Probably because in SDL_hashtable.c SDL_InsertIntoHashTable() and SDL_EmptyHashTable() use SDL_HashTable *restrict table, while in SDL_hashtable.h the function declarations are without restrict

slouken · 2024-10-09T03:25:06Z

We don't need the restrict keyword there. I'll remove that.

sezero · 2024-10-09T04:00:47Z

We can add (put back a revised version of) SDL_RESTRICT, use it in private code and allow users use it too. E.g.:

diff --git a/include/SDL3/SDL_begin_code.h b/include/SDL3/SDL_begin_code.h
index acf9928..b811f21 100644
--- a/include/SDL3/SDL_begin_code.h
+++ b/include/SDL3/SDL_begin_code.h
@@ -225,3 +225,15 @@
 #define SDL_ALLOC_SIZE2(p1, p2)
 #endif
 #endif /* SDL_ALLOC_SIZE2 not defined */
+
+#ifndef SDL_RESTRICT
+#if defined(__GNUC__) && (__GNUC__ + (__GNUC_MINOR__ >= 4) > 3) /* gcc >= 3.4 */
+#define SDL_RESTRICT __restrict__
+#elif defined(_MSC_VER) && _MSC_VER >= 1400 /* MS Visual Studio >= 2005. */
+#define SDL_RESTRICT __restrict
+#elif defined(__WATCOMC__) && (__WATCOMC__ >= 1250) && !defined(__cplusplus)
+#define SDL_RESTRICT __restrict
+#else
+#define SDL_RESTRICT
+#endif
+#endif /* SDL_RESTRICT not defined */

slouken · 2024-10-09T05:37:10Z

We could, but I looked at how it was used, and it wasn't actually helping anything, so I just removed it.

sezero · 2024-10-10T02:20:22Z

Hm, with restrict removed, I get the following 'uninitialized' warning from gcc-4.4. Nothing serious, just for the records.. (And, newer gcc don't warn.)

In file included from /tmp/SDL3/src/SDL_internal.h:240,
                 from /tmp/SDL3/build/CMakeFiles/SDL3-shared.dir/cmake_pch.h:4,
                 from <command-line>:0:
/tmp/SDL3/src/SDL_hashtable.h: In function ‘SDL_InsertIntoHashTable’:
/tmp/SDL3/src/SDL_hashtable.c:168: warning: ‘new_item.probe_len’ may be used uninitialized in this function
/tmp/SDL3/src/SDL_hashtable.c:305: note: ‘new_item.probe_len’ was declared here

slouken · 2024-10-10T14:57:44Z

Yeah, newer gcc doesn't warn because it's unconditionally assigned later in the function, but technically that's a valid warning. I'll fix it, thanks.

Akaricchi force-pushed the robin-hood-hashtable branch from e2816f2 to caf21d0 Compare September 18, 2024 21:58

slouken requested a review from icculus September 18, 2024 22:00

hashtable: reimplement as open-addressed robin hood hashtable

904222c

This is mostly ported from Taisei Project

Akaricchi force-pushed the robin-hood-hashtable branch from caf21d0 to 904222c Compare September 20, 2024 18:53

slouken added this to the 3.2.0 milestone Sep 26, 2024

icculus approved these changes Sep 29, 2024

View reviewed changes

slouken merged commit ba7b346 into libsdl-org:main Sep 29, 2024
38 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A better hashtable implementation #10897

A better hashtable implementation #10897

Akaricchi commented Sep 18, 2024 •

edited

Loading

slouken commented Sep 18, 2024

Akaricchi commented Sep 18, 2024

icculus commented Sep 18, 2024

Akaricchi commented Sep 19, 2024

slouken commented Sep 19, 2024

Akaricchi commented Sep 22, 2024

icculus commented Sep 29, 2024

icculus left a comment

slouken commented Sep 29, 2024

sezero commented Sep 29, 2024 •

edited

Loading

sezero commented Sep 29, 2024

slouken commented Sep 29, 2024

sezero commented Sep 29, 2024

slouken commented Sep 29, 2024

sezero commented Sep 29, 2024

Akaricchi commented Sep 29, 2024

DanielGibson commented Oct 9, 2024

DanielGibson commented Oct 9, 2024

slouken commented Oct 9, 2024

sezero commented Oct 9, 2024

slouken commented Oct 9, 2024

sezero commented Oct 10, 2024

slouken commented Oct 10, 2024

A better hashtable implementation #10897

A better hashtable implementation #10897

Conversation

Akaricchi commented Sep 18, 2024 • edited Loading

slouken commented Sep 18, 2024

Akaricchi commented Sep 18, 2024

icculus commented Sep 18, 2024

Akaricchi commented Sep 19, 2024

slouken commented Sep 19, 2024

Akaricchi commented Sep 22, 2024

icculus commented Sep 29, 2024

icculus left a comment

Choose a reason for hiding this comment

slouken commented Sep 29, 2024

sezero commented Sep 29, 2024 • edited Loading

sezero commented Sep 29, 2024

slouken commented Sep 29, 2024

sezero commented Sep 29, 2024

slouken commented Sep 29, 2024

sezero commented Sep 29, 2024

Akaricchi commented Sep 29, 2024

DanielGibson commented Oct 9, 2024

DanielGibson commented Oct 9, 2024

slouken commented Oct 9, 2024

sezero commented Oct 9, 2024

slouken commented Oct 9, 2024

sezero commented Oct 10, 2024

slouken commented Oct 10, 2024

Akaricchi commented Sep 18, 2024 •

edited

Loading

sezero commented Sep 29, 2024 •

edited

Loading