Dask requires consistent Python hashing #4141
Hrm, doing the same fix as we did in dask.multiprocessing would be a good first step here, and would probably solve 90% of the problem. Going beyond that, though, I'm not sure. We could consider alternative hashing functions. Are there ways to specify the hash seed programmatically somehow?
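For context, the dask.multiprocessing-style fix amounts to spawning worker processes with `PYTHONHASHSEED` pinned in their environment, so every interpreter agrees on str/bytes hashes. A minimal sketch (the helper name is mine, not Dask's actual code):

```python
# Sketch (not Dask's implementation): pin PYTHONHASHSEED in the environment
# before spawning a child interpreter, so its str/bytes hashing is consistent
# with every other child started the same way.
import os
import subprocess
import sys

def spawn_worker_with_fixed_hash_seed(code: str, seed: str = "6640") -> str:
    """Run `code` in a fresh interpreter with a pinned hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# With the seed pinned, two separate interpreters agree on hash("x"):
a = spawn_worker_with_fixed_hash_seed('print(hash("x"))')
b = spawn_worker_with_fixed_hash_seed('print(hash("x"))')
assert a == b
```

This only helps when the worker really is a subprocess, which is exactly the limitation discussed below for the non-nanny CLI case.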
There's no public API for setting it after startup, which makes sense given what it does. The only thing I can think of is something like

```python
import os
import sys

if os.environ.get("PYTHONHASHSEED") != "6640":
    os.environ["PYTHONHASHSEED"] = "6640"
    # Re-exec the interpreter so the new seed takes effect.
    os.execv(sys.executable, [sys.executable] + sys.argv)
```

in the startup code (which won't work on Windows...).
For the Distributed-specific issue, is non-nanny mode actually useful, or could it be dropped?

Stepping back to a fundamental solution, something like

```python
with set_hash_seed(6640):
    h = hash(obj)
```

is maybe possible, but there are no public APIs for it, so you have to be OK with munging private CPython internals. And it's global state, so it would impact other threads' hashing, which would e.g. break hash-based containers like dicts and sets in those threads. It might be better to get a PEP through so that this becomes possible in a future Python.

Another approach is a reimplemented hash algorithm, for example https://deepdiff.readthedocs.io/en/latest/deephash.html. The question here is how good it is at hashing arbitrary objects; that would need some digging/reading/testing. Basic sense of what it supports: https://github.com/seperman/deepdiff/blob/master/deepdiff/deephash.py#L429
deephash seems promising:

```python
from deepdiff import DeepHash

class CustomHash:
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return f"CustomHash({self.x})"

objects = [
    125,
    "lalala",
    (12, 17),
    CustomHash(17),
    CustomHash(17),
    CustomHash(25),
]

for o in objects:
    print(repr(o), "has hash", DeepHash(o)[o])
```

Results in:
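The same idea can be sketched without a library by taking a seed-independent digest over a canonical serialization. This is an illustration of the approach, not Dask's implementation: pickle output is only stable enough for simple builtin values within one Python version, and arbitrary objects would need the kind of canonical traversal deephash provides.

```python
# Sketch of a seed-independent content hash (an assumption for illustration,
# not Dask's or deepdiff's implementation).
import hashlib
import pickle

def stable_hash(obj) -> str:
    # hashlib digests don't depend on PYTHONHASHSEED; pickle is used here
    # only as a simple serialization for builtin values.
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()

print(stable_hash((12, 17)))
print(stable_hash("lalala"))
```

Unlike `hash()`, these digests are identical across interpreter runs and across machines for the same Python version.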
Moved discussion of the Dask-side fix to dask/dask#6723
There really ought to be a warning about this in the documentation. It would be one thing if this raised an exception and just broke your application, but instead it unexpectedly gives completely wrong results. If it's impossible to come up with a good hash function in two and a half years, maybe the solution is to just give
For those following this, noting that I've just revived the discussion over in dask/dask#6723 (comment).
Gradually understanding the problem/solution space here, I realized that this basic fix has not yet been implemented; I agree it would solve a large percentage of use cases. I will work up a PR for others to consider.
As per dask/dask#6640, Dask breaks if Python hashing is inconsistent across workers. This appears to be a bug in the Distributed backend as well:
hashing.py:
When run:
Solving this
The solution for Dask (dask/dask#6660) was to set PYTHONHASHSEED for worker processes, and a similar solution works for Distributed in some cases, e.g. with Client(). However, I'm pretty sure the distributed-worker CLI just runs the worker inline rather than in a subprocess, so by the time you're running Python code the hash seed has already been set and can't be changed. It could, I suppose, set the seed and then fork()+exec() Python again.
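To make the failure mode concrete, here is a small demonstration (mine, not from the report) that str hashing depends on PYTHONHASHSEED, so workers started with different or random seeds disagree:

```python
# Demonstration of the underlying issue: hash() over str/bytes is salted by
# PYTHONHASHSEED, so fresh interpreters only agree when the seed is pinned.
import os
import subprocess
import sys

def child_hash(seed: str) -> str:
    """Return hash("dask") computed in a fresh interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    return subprocess.run(
        [sys.executable, "-c", 'print(hash("dask"))'],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.strip()

# Pinned seed: every interpreter agrees.
assert child_hash("6640") == child_hash("6640")

# Different seeds virtually always disagree, which is what silently breaks
# task-key computation when workers don't share a seed.
print(child_hash("1"), child_hash("2"))
```

Non-str/bytes values such as ints hash consistently regardless of the seed; the problem is specific to the salted types and anything built on them.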