More memory efficient hash tables #16440
Comments
This is basically what Python 3.6+ is doing? You're welcome to try this out, although modifying khash isn't exactly a trivial project! xref wesm/pandas2#35 (pandas2 issue on hash tables, consider C++ options). In terms of reducing hash table memory usage, I suspect #14273 might be much lower-hanging fruit.
Yes, that is where I got the idea. Would I have to modify khash, though? I was thinking of simply modifying the Cython interface to khash.
This might be true for a dynamic hash table, and I would certainly agree in that case, but we always allocate a specific size. How are they in that case?
note that I wrote the
which may not be right. |
Actually I stand corrected. It appears to allocate the next largest power of 2 for the buckets (I temp added
so a hash table of size 9 and one of size 16 cost the same.
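To illustrate the rounding described above, here is a minimal sketch. It assumes the bucket count is simply the requested size rounded up to the next power of two; `next_power_of_two` is a hypothetical helper, not a khash or pandas function, and real resizing logic may also take the load factor into account.

```python
def next_power_of_two(n: int) -> int:
    """Round n up to the nearest power of two (n itself if it already is one)."""
    if n < 1:
        return 1
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so shifting 1 by that amount gives the smallest power of two >= n.
    return 1 << (n - 1).bit_length()


# Size hints of 9 and 16 land on the same bucket count under this assumption.
for size_hint in (9, 16, 17):
    print(size_hint, next_power_of_two(size_hint))
```

Under this assumption any size hint from 9 through 16 allocates the same number of buckets, which is why the two tables above cost the same.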
@rohanp you could try this. I think this might involve a fairly large change (in Cython), and you would have to measure the memory savings AND make sure performance doesn't degrade too much. You would now be doing a double lookup, though the cost should be small, since the second access is an array indexing op, which is pretty fast.
I don't quite understand where in our Cython interface we define the dtype of the values in the hash table. The
Based on the it looks like the table is already configured to only store indexes. Could someone more familiar with the code please confirm? |
The actual hash table definitions are C macro expansions, e.g., here for int64 keys: pandas/pandas/_libs/src/klib/khash.h Line 576 in 92372c7
And, yes, the values in the hash tables are indexes (locations back into the original array). So currently I think it roughly looks like this (ignoring hash storage)
IIUC, the py 3.6 approach would be:
Yes, but as the author of klib writes
Doing so would only save memory if the keys and values are relatively close to each other in size. I suppose this would be good to implement for int64/float64 keys, but doing so would be fairly involved, as I would have to modify klib itself. I personally don't think it is worth the time, but if someone else wants to, feel free.
@rohanp having separate keys/values is much easier impl wise. We just use keys of the appropriate dtype. the values are always |
Looks like this never really took off, so going to close.
Currently hash tables are implemented with a Cython wrapper over the klib library. As far as I can tell, the array that klib uses to store the values is sparse. If large values are being stored in the hash table, this results in memory inefficiency, e.g.
complex128 values = [----------, laaaaarge1, ----------, ----------, laaaaarge2, ----------]
This inefficiency could be solved by storing the large values in a dense array and storing only the indexes of those values in the hash table.
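A minimal pure-Python sketch of the layout proposed above, in the spirit of the compact dict in Python 3.6+: a sparse array of small integer slots plus dense key and value lists. `DenseValueTable` and all names in it are hypothetical and for illustration only; this is not how khash or the pandas Cython wrapper is written, and resizing is left out.

```python
from array import array

EMPTY = -1  # sentinel marking an unused slot in the sparse index array


class DenseValueTable:
    """Toy open-addressing table: sparse int slots, dense keys and values."""

    def __init__(self, n_buckets: int):
        # Assumption: n_buckets is a power of two, so we can mask instead of mod.
        assert n_buckets > 0 and n_buckets & (n_buckets - 1) == 0
        self.mask = n_buckets - 1
        self.slots = array("i", [EMPTY]) * n_buckets  # sparse, small ints only
        self.keys = []    # dense, in insertion order
        self.values = []  # dense, in insertion order (the "large" values)

    def _find_slot(self, key):
        # Linear probing; terminates because one slot is always kept empty.
        i = hash(key) & self.mask
        while self.slots[i] != EMPTY and self.keys[self.slots[i]] != key:
            i = (i + 1) & self.mask
        return i

    def put(self, key, value):
        i = self._find_slot(key)
        if self.slots[i] == EMPTY:
            if len(self.keys) >= self.mask:
                raise RuntimeError("table full; resizing is omitted in this sketch")
            self.slots[i] = len(self.keys)  # slot stores a position, not the value
            self.keys.append(key)
            self.values.append(complex(value))
        else:
            self.values[self.slots[i]] = complex(value)

    def get(self, key):
        i = self._find_slot(key)
        if self.slots[i] == EMPTY:
            raise KeyError(key)
        return self.values[self.slots[i]]  # the second, dense array access


table = DenseValueTable(8)
table.put("a", 1 + 2j)
table.put("b", 3)
print(table.get("a"), len(table.values), len(table.slots))
```

Only the slot array grows with the number of buckets; the large values occupy one entry per stored item, at the cost of one extra array access per lookup.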
More generally, the space savings would be `(val_size - index_size) * n_buckets - n_vals * val_size`.
Because this would save memory, it would allow for larger hash tables, allowing for fewer collisions and better speed. This would likely outweigh any performance impairments from the additional array access (which is fast because the arrays are Cython arrays).
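To make the savings expression above concrete, here is a small sketch that evaluates it. The sizes are illustrative assumptions (16-byte complex128 values, 8-byte indexes, 16 buckets), not measurements from pandas or klib, and `space_savings` is a hypothetical helper.

```python
def space_savings(val_size: int, index_size: int, n_buckets: int, n_vals: int) -> int:
    """Bytes saved by the dense layout relative to the sparse one."""
    sparse_bytes = n_buckets * val_size                       # values stored in every bucket
    dense_bytes = n_buckets * index_size + n_vals * val_size  # indexes in buckets + dense values
    # Algebraically equal to (val_size - index_size) * n_buckets - n_vals * val_size.
    return sparse_bytes - dense_bytes


# Assumed sizes: complex128 values (16 bytes), 8-byte indexes, 16 buckets.
for n_vals in (4, 8, 9):
    print(n_vals, space_savings(val_size=16, index_size=8, n_buckets=16, n_vals=n_vals))
```

The result is positive only while `n_vals / n_buckets < 1 - index_size / val_size`, which is 0.5 for these assumed sizes. Above that load the dense layout uses more memory, so the typical fill ratio matters for whether the proposal pays off.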
However, I am not sure what the values of `val_size`, `n_vals`, and `n_buckets` generally are for pandas/klib and would appreciate any insight on whether this proposal would actually result in a performance improvement.