More memory efficient hash tables #16440

Closed
rohanp opened this issue May 22, 2017 · 10 comments
Labels
Enhancement Performance Memory or execution speed performance

Comments

@rohanp
Contributor

rohanp commented May 22, 2017

Currently hash tables are implemented with a Cython wrapper over the klib library. As far as I can tell, the array that klib uses to store the values is sparse. If large values are being stored in the hash table, this results in memory inefficiency, e.g.

complex128 values = [----------, laaaaarge1, ----------, ----------, laaaaarge2, ----------]

  • size = 128 * 6 = 768 bits

This inefficiency could be solved by storing the large values in a dense array, and instead storing the indexes to the values in the hash table:

uint16 indexes = [-, 0, -, -, 1, -]
complex128 values = [laaaaarge1, laaaaarge2]
  • size: 16 * 6 + 128 * 2 = 352 bits <-- 50% smaller

More generally, the space savings would be

(val_size - index_size) * n_buckets - n_vals * val_size
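Spelling that savings calculation out against the complex128 example above (a sketch; the names come from the proposal, sizes in bits):

```python
def savings_bits(val_size, index_size, n_buckets, n_vals):
    """Bits saved by packing values densely and keeping only small
    indexes in the (sparse) bucket array."""
    sparse_layout = val_size * n_buckets                       # values stored inline in buckets
    dense_layout = index_size * n_buckets + n_vals * val_size  # indexes + packed values
    return sparse_layout - dense_layout

# The example above: 6 buckets, 2 occupied, complex128 values, uint16 indexes.
print(savings_bits(val_size=128, index_size=16, n_buckets=6, n_vals=2))  # 416 (768 - 352)
```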

Because this would save memory, it would allow for larger hash tables, allowing for fewer collisions and better speed. This would likely outweigh any performance impairments from the additional array access (which is fast because the arrays are Cython arrays).

However, I am not sure what the values of val_size, n_vals, and n_buckets generally are for Pandas/klib and would appreciate any insight on whether this proposal would actually result in a performance improvement.

@chris-b1
Contributor

This is basically what python 3.6+ is doing? You're welcome to try this out, although modifying khash isn't exactly a trivial project!

xref wesm/pandas2#35 - pandas2 issue on hash tables, consider c++ options

In terms of reducing hash table memory usage, I suspect #14273 might be much lower hanging fruit.

@chris-b1 chris-b1 added the Performance Memory or execution speed performance label May 22, 2017
@rohanp
Contributor Author

rohanp commented May 22, 2017

Yes, that is where I got the idea. Would I have to modify khash, though? I was thinking to simply modify the Cython interface to khash (hashtable_class_helper.pxi.in) to send the indexes to khash while keeping the values on the Cython layer.

@jreback
Contributor

jreback commented May 22, 2017

@rohanp

As far as I can tell, the array in which klib uses to store the values is sparse.

This might be true for a dynamically grown hashtable, and I would for sure agree in that case, but we always pre-allocate a specific size. How sparse are they in that case?

In [2]: from pandas._libs import hashtable as ht

In [7]: t = ht.PyObjectHashTable(6)

In [8]: t.sizeof()
Out[8]: 160

In [9]: t = ht.Int64HashTable(6)

In [10]: t.sizeof()
Out[10]: 160

Note that I wrote .sizeof(), which is simply

    def sizeof(self, deep=False):
        """ return the size of my table in bytes """
        return self.table.n_buckets * (sizeof({{dtype}}_t) + # keys
                                       sizeof(size_t) + # vals
                                       sizeof(uint32_t)) # flags

which may not be right.
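For what it's worth, the 160-byte figure above is at least consistent with that arithmetic if the table was rounded up to 8 buckets: for Int64 keys each bucket would cost 8 (key) + 8 (size_t value) + 4 (flags) bytes. A sketch mirroring the .sizeof() formula (sizes assume a typical 64-bit platform):

```python
def approx_sizeof(n_buckets, key_size=8, val_size=8, flag_size=4):
    """Mirror of the .sizeof() arithmetic quoted above:
    bytes per bucket times bucket count."""
    return n_buckets * (key_size + val_size + flag_size)

print(approx_sizeof(8))   # 160, matching Out[8] and Out[10] above
print(approx_sizeof(16))  # 320
```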

@jreback
Contributor

jreback commented May 22, 2017

Actually, I stand corrected. It appears to allocate the next largest power of 2 for the buckets (I temporarily added n_buckets()).

In [1]: from pandas._libs import hashtable as ht

In [2]: t = ht.Int64HashTable(6)

In [3]: t.n_buckets()
Out[3]: 8

In [4]: t = ht.Int64HashTable(12)

In [5]: t.n_buckets()
Out[5]: 16

In [6]: t = ht.Int64HashTable(30)

In [7]: t.n_buckets()
Out[7]: 32

In [8]: t = ht.Int64HashTable(500)

In [9]: t.n_buckets()
Out[9]: 512

In [10]: t = ht.Int64HashTable(9)

In [11]: t.n_buckets()
Out[11]: 16

In [12]: t.sizeof()
Out[12]: 320

In [14]: t = ht.Int64HashTable(16)

In [15]: t.sizeof()
Out[15]: 320

so hash tables of size 9 and 16 cost the same.
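The rounding observed above matches a plain round-up-to-next-power-of-two (which is what khash's kroundup32 macro does in C); a Python sketch reproducing the session's numbers:

```python
def next_pow2(n):
    """Round n up to the next power of two (n >= 1)."""
    return 1 << (n - 1).bit_length()

for hint in (6, 12, 30, 500, 9, 16):
    print(hint, "->", next_pow2(hint))
# 6 -> 8, 12 -> 16, 30 -> 32, 500 -> 512, 9 -> 16, 16 -> 16
```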

@jreback
Contributor

jreback commented May 23, 2017

@rohanp you could try this. I think this might involve a fairly large change (in Cython), and you would have to measure the memory savings AND make sure perf doesn't degrade too much (you are now doing a double lookup, though it should not cost much since the 2nd access is an array indexing op, which is pretty fast).

@rohanp
Contributor Author

rohanp commented May 23, 2017

I don't quite understand where in our Cython interface we define the dtype of the values in the hash table. The __cinit__ only defines the dtype of the keys:

    def __cinit__(self, size_hint=1):
        self.table = kh_init_{{dtype}}()
        if size_hint is not None:
            kh_resize_{{dtype}}(self.table, size_hint)

Based on the Py_ssize_t in
cpdef set_item(self, {{dtype}}_t key, Py_ssize_t val):

it looks like the table is already configured to only store indexes. Could someone more familiar with the code please confirm?

@chris-b1
Contributor

The actual hash table definitions are C macro expansions, e.g., here for int64 keys:

KHASH_MAP_INIT_INT64(int64, size_t)

And, yes, the values in the hash tables are indexes (locations back into the original array). So currently I think it roughly looks like this (ignoring hash storage):

array = [1.1, 2.2, 3.3]

hash_table = {
     keys: [None, None, 2.2, None, 1.1, None, 3.3],
     vals: [None, None,   1, None,  0 , None,  2],
}

IIUC, the py 3.6 approach would be:

hash_table = {
   data: [[1.1, 0],
          [2.2, 1],
          [3.3, 2]],
   sparse_index:  [None, None, 1, None, 0, None, 2]
} 
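A minimal pure-Python sketch of that compact (CPython 3.6-style) layout, with lookups going through the sparse index into a dense entries array (class and method names are hypothetical, not pandas API):

```python
EMPTY = -1

class CompactTable:
    """Toy open-addressing table: sparse_index holds positions into
    the dense entries list; entries holds (key, value) pairs packed."""

    def __init__(self, n_buckets=8):
        self.sparse_index = [EMPTY] * n_buckets
        self.entries = []  # dense [(key, value), ...] in insertion order

    def _bucket(self, key):
        i = hash(key) % len(self.sparse_index)
        # linear probing; assumes the table never fills completely
        while (self.sparse_index[i] != EMPTY
               and self.entries[self.sparse_index[i]][0] != key):
            i = (i + 1) % len(self.sparse_index)
        return i

    def set_item(self, key, value):
        i = self._bucket(key)
        if self.sparse_index[i] == EMPTY:
            self.sparse_index[i] = len(self.entries)  # point at the dense slot
            self.entries.append((key, value))
        else:
            self.entries[self.sparse_index[i]] = (key, value)

    def get_item(self, key):
        i = self._bucket(key)
        if self.sparse_index[i] == EMPTY:
            raise KeyError(key)
        return self.entries[self.sparse_index[i]][1]

t = CompactTable()
for pos, k in enumerate([1.1, 2.2, 3.3]):
    t.set_item(k, pos)
print(t.get_item(2.2), t.entries)  # 1 [(1.1, 0), (2.2, 1), (3.3, 2)]
```

Only the small integer positions live in the sparse array; the wide (key, value) pairs stay packed, which is the memory saving the proposal is after.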

@rohanp
Contributor Author

rohanp commented May 23, 2017

Yes, but as the author of klib writes

Grouping key-value pairs or not. In the current implementation, keys and values are kept in separated arrays. This strategy will cause additional cache misses when keys and values are retrieved twice. Grouping key-value in a struct is more cache efficient. However, the good side of separating keys and values is this avoids waste of memory when key type and value type cannot be aligned well (e.g. key is an integer while value is a character). I would rather trade speed a bit for smaller memory. In addition, it is not hard to use a struct as a key in the current framework.

Grouping would only save memory if the keys and values are relatively close to each other in size. I suppose this would be good to implement for int64/float64 keys, but doing so would be fairly involved, as I would have to modify klib itself. I personally don't think it is worth the time, but if someone else wants to, feel free.

@jreback
Contributor

jreback commented May 23, 2017

@rohanp having separate keys/values is much easier implementation-wise. We just use keys of the appropriate dtype; the values are always Py_ssize_t, which keeps things much simpler. So we are only paying the power-of-two bucket cost once.

@mroeschke
Member

Looks like this never really took off, so going to close.
