Switch to different, stable hash algorithm in Bag #6723
Another, faster approach: don't support arbitrary objects, just explicitly support specific types, i.e. enums/int/float/string/tuple/bytes and presumably numpy numeric types. And probably 5 other types I'm forgetting about.
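A minimal sketch of what that allowlist approach could look like (the function names and the type-tag scheme are illustrative, not an existing dask API):

import hashlib
import struct
from enum import Enum

def stable_hash(x) -> str:
    h = hashlib.sha1()
    _update(h, x)
    return h.hexdigest()

def _update(h, x):
    # Single-byte tags keep values of different types from colliding
    # (1 vs "1" vs b"1"). numpy scalars could be added the same way.
    if isinstance(x, Enum):                   # before int: IntEnum is an int
        _update(h, (type(x).__name__, x.name))
    elif isinstance(x, bool):                 # before int: bool is an int
        h.update(b"b1" if x else b"b0")
    elif isinstance(x, int):
        h.update(b"i" + str(x).encode())
    elif isinstance(x, float):
        h.update(b"f" + struct.pack("<d", x))
    elif isinstance(x, str):
        h.update(b"s" + x.encode("utf-8"))
    elif isinstance(x, bytes):
        h.update(b"y" + x)
    elif isinstance(x, tuple):
        h.update(b"t%d" % len(x))
        for item in x:
            _update(h, item)
    else:
        # Fail loudly instead of hashing non-deterministically.
        raise TypeError(f"no stable hash for type {type(x)!r}")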
A different option would be to hijack the pickle infrastructure with a hash function. This has the benefit of supporting arbitrary Python types as long as they're pickleable (and all types used as results in dask should be pickleable), while also being ok on perf for large objects. The overhead here is mostly in setting up the pickler per call, so small objects (ints, small strings) would be slower, while nested objects should be about the same. There are faster hashing algorithms out there than those in hashlib, but even sha1 performs reasonably here:

In [16]: import cloudpickle, hashlib
In [17]: class HashFil:
    ...:     # File-like sink: bytes "written" here update a running hash
    ...:     # instead of being stored anywhere.
    ...:     def __init__(self):
    ...:         self.hash = hashlib.sha1()
    ...:     def write(self, buf):
    ...:         self.hash.update(buf)
    ...:         return len(buf)
    ...:     def buffer_callback(self, buf):
    ...:         # Protocol-5 out-of-band buffers get folded into the hash too.
    ...:         self.write(buf.raw())
    ...:
In [18]: def custom_hash(x):
    ...:     fil = HashFil()
    ...:     # Stream the pickle through the hashing "file"; protocol 5 is
    ...:     # needed for buffer_callback to receive out-of-band buffers.
    ...:     pickler = cloudpickle.CloudPickler(fil, protocol=5, buffer_callback=fil.buffer_callback)
    ...:     pickler.dump(x)
    ...:     return fil.hash.hexdigest()
    ...:
In [19]: %timeit custom_hash([1, 2, 3, "hello"])
3.56 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Perf could be further improved by special-casing a few common types (str, int, float, bytes, ...) and having a fast path for those.
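A sketch of what such a fast path might look like, layered on the custom_hash above (the tag bytes are illustrative, just to keep e.g. 1 and "1" from colliding):

import hashlib

def fast_hash(x):
    # Cheap tagged encodings for common atoms; everything else takes the
    # pickle-based slow path above.
    t = type(x)
    if t is str:
        return hashlib.sha1(b"s" + x.encode("utf-8")).hexdigest()
    if t is bytes:
        return hashlib.sha1(b"y" + x).hexdigest()
    if t is int:
        return hashlib.sha1(b"i" + str(x).encode()).hexdigest()
    if t is float:
        return hashlib.sha1(b"f" + x.hex().encode()).hexdigest()
    return custom_hash(x)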
The worry about pickle is that some implementations might use non-deterministic iteration. E.g. if it's just iterating over an internal dict... and then we're back to the same problem, albeit in a much smaller set of cases.
Although... dict iteration is actually deterministic these days, isn't it? So maybe it's fine.
Thanks all for your work on this so far! Reviving this thread as today @TheNeuralBit discovered that it is the root cause of apache/beam#29365 (thanks Brian!). The context there, and my motivations, are summarized in that issue.

So with that background out of the way, I'd love to re-open the discussion here: what do folks see as the most viable path forward? I would gladly contribute to or lead a PR on this topic (or defer to someone else who is motivated to do so), but of course it looks like the first step is developing a bit more consensus about the path forward.

P.S. @jacobtomlinson and I plan to present a bit on the DaskRunner at the upcoming Dask Demo Day, and I will update our draft presentation to include a mention of this issue.
Notes on joblib.hashing

joblib implements the recursive Python object traversal using Pickle, with a special case for NumPy: https://github.com/joblib/joblib/blob/master/joblib/hashing.py. Some problems:

- There is no guarantee that two semantically identical objects will pickle to the same bytes. They have special-cased code for dict and hash, but you can imagine a custom map or set type pickling itself inconsistently as regards order.
- The ordering they do for dicts sometimes depends on hash() 😢: https://github.com/joblib/joblib/blob/6310841f66352bbf958cc190a973adcca611f4c7/joblib/hashing.py#L144

Notes on the hash algorithm (separate from the traversal method)

joblib uses md5/sha, and those are likely somewhat slower (they're cryptographic, which isn't necessary here), but also might be fast enough. There are hash algorithms specifically designed for stability across machines and time, often designed for very fast hashing of larger amounts of data where the output will be used as a key in persistent storage or distributed systems. HighwayHash is a good example: https://github.com/google/highwayhash#versioning-and-stability ("input -> hash mapping will not change"); not sure if there's a seed, but if so you'd just need to set a fixed one. There are other alternatives as well.

Another, perhaps unworkable option: convert Python hash() to a stable hash

Given the seed, which you can extract from CPython internals, it *might* be possible mathematically to take the output of hash(obj) and turn it into a stable value by undoing the impact of the seed.

Another option: a deterministic/canonical serialization format

Some serialization formats are specifically designed to be deterministic over inputs, in order to allow cryptographic hashing/signatures for comparison. E.g. https://borsh.io/ is one, with a Python wrapper at https://pypi.org/project/borsh-python/. Unfortunately it requires a schema... so you'd need to find a self-describing (i.e. messages don't require a schema) deterministic serialization format. Deterministic CBOR is in a draft spec and has at least a Rust implementation: https://docs.rs/dcbor/latest/dcbor/, so that might be a possibility. For best performance what you *really* want is a *streaming* self-describing deterministic serializer; dcbor at least isn't streaming.

Given a deterministic serialization, you can then just use a hash function of your choice (md5, HighwayHash, whatever) to get the hash.
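As a rough illustration of that last idea, a minimal sketch using canonical JSON as a stand-in for a real deterministic format like dCBOR (it only covers JSON-representable types, and JSON has its own warts, e.g. around floats):

import hashlib
import json

def canonical_digest(obj) -> str:
    # Sorted keys + fixed separators give a deterministic byte encoding,
    # after which any hash function (md5, HighwayHash, ...) can be applied.
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()

# Same digest regardless of dict insertion order:
assert canonical_digest({"a": 1, "b": 2}) == canonical_digest({"b": 2, "a": 1})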
The deterministic serialization approach could be a workaround for the beam
runner. Beam has a library of coders for this.
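A hedged sketch of that workaround, assuming apache_beam is installed (the helper name is made up; the idea is to hash the coder's byte encoding rather than the object itself):

import hashlib
from apache_beam import coders

def beam_stable_hash(value) -> str:
    coder = coders.registry.get_coder(type(value))
    if not coder.is_deterministic():
        # Request a deterministic variant when the default coder
        # doesn't guarantee a stable encoding.
        coder = coder.as_deterministic_coder(step_label="stable-hash")
    return hashlib.sha1(coder.encode(value)).hexdigest()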
Finding a solution here is still worthwhile IMO, though it will presumably unfold over a somewhat longer timescale. In the meantime, I'm pleased to report that dask/distributed#8400 does unblock many use cases, including the linked Beam issue.
Thanks Charles!
I spoke a bit too soon here. As described at the bottom of apache/beam#29802 (comment), while dask/distributed#8400 did resolve non-deterministic hashing of strings, I subsequently came to understand that it does not resolve non-deterministic hashing of certain other types.
This happens to be the next blocker for me on my motivating Beam issue, but apart from that specific case, I believe this is sufficiently problematic as to warrant moving forward with a fix for this issue. In terms of a specific proposal, I am curious about others' thoughts on leveraging Dask's existing tokenization machinery.

There would also be an opportunity to fast-path types that are known to deterministically hash (e.g. ints); a sketch follows.
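To make that concrete, a minimal sketch, assuming the proposal refers to dask.base.tokenize (my reading, not confirmed above), with a fast path for ints:

from dask.base import tokenize

def grouping_key(x, npartitions: int) -> int:
    # Ints already hash deterministically across processes, so skip tokenize.
    if type(x) is int:
        return x % npartitions
    # tokenize() returns a hex string, stable for the types it canonicalizes.
    return int(tokenize(x), 16) % npartitions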
Ping for those following here: I've just opened my proposed fix for this issue, #10734, for review.
In #6640, it was pointed out that groupby() doesn't work on non-numerics. The issue:

1. hash() is used to group,
2. hash() gives different responses for different Python processes.

The solution was to set a hash seed.

Unfortunately, Distributed has the same issue (dask/distributed#4141) and it's harder to solve. E.g. distributed-worker without a nanny doesn't have a good way to set a consistent seed.
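To see the per-process randomization concretely, a quick demonstration (assuming PYTHONHASHSEED is not already pinned in the environment):

import subprocess
import sys

# Each new interpreter draws its own random seed for str/bytes hashing,
# so the two printed values will (almost certainly) differ.
cmd = [sys.executable, "-c", "print(hash('hello'))"]
for _ in range(2):
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())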
Given that, it seems best not to rely on hash().

A better approach might be a different hash algorithm. For example, https://deepdiff.readthedocs.io/en/latest/deephash.html.
deephash seems promising:
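A minimal sketch of DeepHash usage (illustrative, assuming the deepdiff package):

from deepdiff import DeepHash

obj = {"a": 1, "b": [2, 3]}
# DeepHash maps objects (and their children) to content-based digests,
# so the value is the same in every Python process.
print(DeepHash(obj)[obj])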
Downsides:

1. It's presumably slower than hash().
2. Anything that relies on __hash__ won't work.

That being said, bag has used hash() this way for 4 years, maybe more, and it's been broken with Distributed the whole time I assume, so probably that second downside is less relevant. Not sure about the performance aspect, though.