More memory efficient hash tables #16440

Closed
rohanp opened this issue May 22, 2017 · 10 comments
Labels
Enhancement Performance Memory or execution speed performance

Comments

@rohanp
Contributor

rohanp commented May 22, 2017

Currently hash tables are implemented with a Cython wrapper over the klib library. As far as I can tell, the array that klib uses to store the values is sparse. If large values are being stored in the hash table, this results in memory inefficiency, e.g.

complex128 values = [----------, laaaaarge1, ----------, ----------, laaaaarge2, ----------]

  • size = 128 * 6 = 768 bits

This inefficiency could be solved by storing the large values in a dense array, and instead storing the indexes to the values in the hash table:

uint16 indexes = [-, 0, -, -, 1, -]
complex128 values = [laaaaarge1, laaaaarge2]
  • size: 16 * 6 + 128 * 2 = 352 bits <-- 50% smaller

More generally, the space savings would be

(val_size - index_size) * n_buckets - n_vals * val_size
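Spelling that savings calculation out against the complex128 example above (a sketch; the names come from the proposal, sizes in bits):

```python
def savings_bits(val_size, index_size, n_buckets, n_vals):
    """Bits saved by packing values densely and keeping only small
    indexes in the (sparse) bucket array."""
    sparse_layout = val_size * n_buckets                       # values stored inline in buckets
    dense_layout = index_size * n_buckets + n_vals * val_size  # indexes + packed values
    return sparse_layout - dense_layout

# The example above: 6 buckets, 2 occupied, complex128 values, uint16 indexes.
print(savings_bits(val_size=128, index_size=16, n_buckets=6, n_vals=2))  # 416 (768 - 352)
```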

Because this would save memory, it would allow for larger hash tables, allowing for fewer collisions and better speed. This would likely outweigh any performance impairments from the additional array access (which is fast because the arrays are Cython arrays).

However, I am not sure what the values of val_size, n_vals, and n_buckets generally are for Pandas/klib and would appreciate any insight on whether this proposal would actually result in a performance improvement.

@chris-b1
Contributor

This is basically what python 3.6+ is doing? You're welcome to try this out, although modifying khash isn't exactly a trivial project!

xref wesm/pandas2#35 - pandas2 issue on hash tables, consider c++ options

In terms of reducing hash table memory usage, I suspect #14273 might be much lower hanging fruit.

@chris-b1 chris-b1 added the Performance Memory or execution speed performance label May 22, 2017
@rohanp
Contributor Author

rohanp commented May 22, 2017

Yes, that is where I got the idea. Would I have to modify khash, though? I was thinking to simply modify the Cython interface to khash (hashtable_class_helper.pxi.in) to send the indexes to khash while keeping the values on the Cython layer.

@jreback
Contributor

jreback commented May 22, 2017

@rohanp

As far as I can tell, the array in which klib uses to store the values is sparse.

This might be true for a dynamically grown hashtable, and I would for sure agree in that case, but we always pre-allocate a specific size. How sparse are they in that case?

In [2]: from pandas._libs import hashtable as ht

In [7]: t = ht.PyObjectHashTable(6)

In [8]: t.sizeof()
Out[8]: 160

In [9]: t = ht.Int64HashTable(6)

In [10]: t.sizeof()
Out[10]: 160

Note that I wrote .sizeof(), which is simply

    def sizeof(self, deep=False):
        """ return the size of my table in bytes """
        return self.table.n_buckets * (sizeof({{dtype}}_t) + # keys
                                       sizeof(size_t) + # vals
                                       sizeof(uint32_t)) # flags

which may not be right.
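For what it's worth, the 160-byte figure above is at least consistent with that arithmetic if the table was rounded up to 8 buckets: for Int64 keys each bucket would cost 8 (key) + 8 (size_t value) + 4 (flags) bytes. A sketch mirroring the .sizeof() formula (sizes assume a typical 64-bit platform):

```python
def approx_sizeof(n_buckets, key_size=8, val_size=8, flag_size=4):
    """Mirror of the .sizeof() arithmetic quoted above:
    bytes per bucket times bucket count."""
    return n_buckets * (key_size + val_size + flag_size)

print(approx_sizeof(8))   # 160, matching Out[8] and Out[10] above
print(approx_sizeof(16))  # 320
```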

@jreback
Contributor

jreback commented May 22, 2017

Actually, I stand corrected. It appears to allocate the next largest power of 2 for the buckets (I temporarily added n_buckets()).

In [1]: from pandas._libs import hashtable as ht

In [2]: t = ht.Int64HashTable(6)

In [3]: t.n_buckets()
Out[3]: 8

In [4]: t = ht.Int64HashTable(12)

In [5]: t.n_buckets()
Out[5]: 16

In [6]: t = ht.Int64HashTable(30)

In [7]: t.n_buckets()
Out[7]: 32

In [8]: t = ht.Int64HashTable(500)

In [9]: t.n_buckets()
Out[9]: 512

In [10]: t = ht.Int64HashTable(9)

In [11]: t.n_buckets()
Out[11]: 16

In [12]: t.sizeof()
Out[12]: 320

In [14]: t = ht.Int64HashTable(16)

In [15]: t.sizeof()
Out[15]: 320

so hash tables of size 9 and 16 cost the same.
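The rounding observed above matches a plain round-up-to-next-power-of-two (which is what khash's kroundup32 macro does in C); a Python sketch reproducing the session's numbers:

```python
def next_pow2(n):
    """Round n up to the next power of two (n >= 1)."""
    return 1 << (n - 1).bit_length()

for hint in (6, 12, 30, 500, 9, 16):
    print(hint, "->", next_pow2(hint))
# 6 -> 8, 12 -> 16, 30 -> 32, 500 -> 512, 9 -> 16, 16 -> 16
```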

@jreback
Contributor

jreback commented May 23, 2017

@rohanp you could try this. I think this might involve a fairly large change (in Cython), and you would have to measure the memory savings AND make sure perf doesn't degrade too much (you are now doing a double lookup, though it should not cost much since the 2nd access is an array indexing op, which is pretty fast).

@rohanp
Contributor Author

rohanp commented May 23, 2017

I don't quite understand where in our Cython interface we define the dtype of the values in the hash table. The __cinit__ only defines the dtype of the keys:

    def __cinit__(self, size_hint=1):
        self.table = kh_init_{{dtype}}()
        if size_hint is not None:
            kh_resize_{{dtype}}(self.table, size_hint)

Based on the Py_ssize_t in
cpdef set_item(self, {{dtype}}_t key, Py_ssize_t val):

it looks like the table is already configured to only store indexes. Could someone more familiar with the code please confirm?

@chris-b1
Contributor

The actual hash table definitions are C macro expansions, e.g., here for int64 keys:

KHASH_MAP_INIT_INT64(int64, size_t)

And, yes, the values in the hash tables are indexes (locations back into the original array). So currently I think it roughly looks like this (ignoring hash storage):

array = [1.1, 2.2, 3.3]

hash_table = {
     keys: [None, None, 2.2, None, 1.1, None, 3.3],
     vals: [None, None,   1, None,  0 , None,  2],
}

IIUC, the py 3.6 approach would be:

hash_table = {
   data: [[1.1, 0],
          [2.2, 1],
          [3.3, 2]],
   sparse_index:  [None, None, 1, None, 0, None, 2]
} 
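A minimal pure-Python sketch of that compact (CPython 3.6-style) layout, with lookups going through the sparse index into a dense entries array (class and method names are hypothetical, not pandas API):

```python
EMPTY = -1

class CompactTable:
    """Toy open-addressing table: sparse_index holds positions into
    the dense entries list; entries holds (key, value) pairs packed."""

    def __init__(self, n_buckets=8):
        self.sparse_index = [EMPTY] * n_buckets
        self.entries = []  # dense [(key, value), ...] in insertion order

    def _bucket(self, key):
        i = hash(key) % len(self.sparse_index)
        # linear probing; assumes the table never fills completely
        while (self.sparse_index[i] != EMPTY
               and self.entries[self.sparse_index[i]][0] != key):
            i = (i + 1) % len(self.sparse_index)
        return i

    def set_item(self, key, value):
        i = self._bucket(key)
        if self.sparse_index[i] == EMPTY:
            self.sparse_index[i] = len(self.entries)  # point at the dense slot
            self.entries.append((key, value))
        else:
            self.entries[self.sparse_index[i]] = (key, value)

    def get_item(self, key):
        i = self._bucket(key)
        if self.sparse_index[i] == EMPTY:
            raise KeyError(key)
        return self.entries[self.sparse_index[i]][1]

t = CompactTable()
for pos, k in enumerate([1.1, 2.2, 3.3]):
    t.set_item(k, pos)
print(t.get_item(2.2), t.entries)  # 1 [(1.1, 0), (2.2, 1), (3.3, 2)]
```

Only the small integer positions live in the sparse array; the wide (key, value) pairs stay packed, which is the memory saving the proposal is after.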

@rohanp
Contributor Author

rohanp commented May 23, 2017

Yes, but as the author of klib writes

Grouping key-value pairs or not. In the current implementation, keys and values are kept in separated arrays. This strategy will cause additional cache misses when keys and values are retrieved twice. Grouping key-value in a struct is more cache efficient. However, the good side of separating keys and values is this avoids waste of memory when key type and value type cannot be aligned well (e.g. key is an integer while value is a character). I would rather trade speed a bit for smaller memory. In addition, it is not hard to use a struct as a key in the current framework.

Grouping would only save memory if the keys and values are relatively close to each other in size. I suppose this would be good to implement for int64/float64 keys, but doing so would be fairly involved, as I would have to modify klib itself. I personally don't think it is worth the time, but if someone else wants to, feel free.

@jreback
Contributor

jreback commented May 23, 2017

@rohanp having separate keys/values is much easier implementation-wise. We just use keys of the appropriate dtype; the values are always Py_ssize_t, which keeps things much simpler. So we are only paying the power-of-two bucket cost once.

@mroeschke
Member

Looks like this never really took off, so going to close.
