Constant hash value for None to aid reproducibility #99540
Comments
Thanks for the suggestion but this doesn't make sense. The default hash for every object is its object id. There is nothing special about |
If you're definitely not in the business of making hashes constant, why is the hash of the The reality is that there is no grand design behind the current behavior. It just happens that I can't imagine you've spent more than a few minutes thinking about this. Can I appeal your decision somewhere? |
Come to think of it, we could have the hash of |
Would it? Isn't that (in CPython) always 0? |
To my surprise it is. I was sure string hash calculations were always dependent on the hashing secret, but turns out the empty string isn't. It makes sense - the code explains it is done to avoid leaking information about the hash secret. So what we can do instead, is to hash some constant bytes, once, upon setting the hash secret, then cache that result and return it from None's hash function. Then, it is deterministic if PYTHONHASHSEED is set, but otherwise it's not, and it should be fine security-wise. |
That's not true. The contract is that That's not to say that making |
Funny, I used Python for so many years and never knew TIL. Anyway - I concur that it shoots down my argument from consistency entirely. I think the root cause of what I'm trying to fix is that, at some point, we started using One could argue a cleaner fix is to come up with an actual I don't know, if no one else thinks it's really a problem, the issue can stay closed |
I'm not sure, but won't immortal objects fix this? |
No. These are all implementation details. Code depending on them does so at its peril. Related to the OP: Python sets do not guarantee any form of stability. Code depending on that is already depending on something it should not because it never explicitly asked for whatever stability it depends on. Sets are fundamentally unordered. https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset |
The scenario does not involve code that depends on the specific iteration order for its correctness. It's useful for testing, debugging and research purposes. Again, this has nothing to do with assumptions in the code. The code merely assumes that it will traverse the items in the set in some order. No further assumptions are needed for correctness. |
You cannot run the same code on the same input multiple times and expect it to behave the same unless everything you've done in the code provides that guarantee. Python set types explicitly do not provide this guarantee. Every invocation of code iterating over a set will produce each value once in "some order". That order may or may not be the same as "some order" in another invocation. |
You just told me what the requirements are. Not what the actual behavior is. Can you bring a set x1 = tuple(s) you end up with |
Another related question: can you find a series of operations on a set, starting with its creation, that involves fixed data with fixed hashes, and ends with converting the set into a tuple, that will return a different result every time? Not hypothetically, but actual code that does this. |
This issue is closed and you are no longer discussing anything related to it. Please take it up on a discuss.python.org thread. |
The fact you think my questions are unrelated to the change is a strong indicator that you do not understand it. |
However, many objects have a specific and unchanging hash, the most obvious example being integers.
But it is possible to set the seed for that randomization so that multiple runs can produce the same results, which can be vital for debugging.
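A minimal sketch of that seeding behavior, assuming a CPython interpreter is available as `sys.executable`. The helper name `str_hash_in_fresh_interpreter` is hypothetical and only serves the illustration; it starts a fresh interpreter with a given `PYTHONHASHSEED` and reports the hash of a fixed string.

```python
import os
import subprocess
import sys


def str_hash_in_fresh_interpreter(seed: str) -> int:
    """Return hash('spam') as computed by a new interpreter process.

    Hypothetical helper: the seed is passed through the PYTHONHASHSEED
    environment variable, which controls str/bytes hash randomization.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('spam'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate runs with the same seed agree on the string hash.
first_run = str_hash_in_fresh_interpreter("123")
second_run = str_hash_in_fresh_interpreter("123")
print(first_run == second_run)
```

Without a fixed seed, the two runs would normally report different values, which is the instability the seed is meant to remove.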
Indeed, |
We still have to decide whether to use a I've updated my PR with an implementation that does make it depend on Personally, I will gain a tiny bit more from the PYTHONHASHSEED dependency, as it allows me to "fuzz" my program slightly better. But we have to consider the effect on all users. Some might be currently using |
Let's do a constant hash for now. |
…9541) Needed for ASLR builds of Python.
* main: pythongh-99540: Constant hash for _PyNone_Type to aid reproducibility (pythonGH-99541) pythongh-100039: enhance __signature__ to work with str and callables (pythonGH-100168) pythongh-99830: asyncio: Document returns of remove_{reader,writer} (python#100302) "Compound statement" docs: Fix with-statement step indexing (python#100286) pythonGH-90043: Handle NaNs in COMPARE_OP_FLOAT_JUMP (pythonGH-100278)
Feature or enhancement
Fix `hash(None)` to a constant value.
Pitch
(Updated 2022.11.18)
Under current behavior, the runtime leaks the ASLR offset, since the original address of the `None` singleton is fixed and `_Py_HashPointerRaw` is reversible. Admittedly, there are other similar objects, like `NotImplemented` or `Ellipsis`, that also have this problem and need to be similarly fixed.
Because of ASLR, `hash(None)` changes every run; that consequently means the hash of many useful "key" types changes every run, particularly tuples, NamedTuples and frozen dataclasses that have `Optional` fields.
The other source of hash value instability across runs in common "key" types, like str or Enum, can be fixed using the `PYTHONHASHSEED` environment var.
Other singletons commonly used as (or as part of) mapping keys, `True` and `False`, already have fixed hash values.
CPython's builtin set classes, as do all other non-concurrent hash tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the resulting set will be in the same observable state every time: not only will it have the same items (correctness), but they will also be retrieved from the set in the same order when iterated.
This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, suppose that, based on some mathematical properties of the input, we have computed a set of N valid choices, given each an integer score, and then pick the first choice that has the maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.
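The tie-breaking scenario above can be sketched as follows. This is only an illustration under stated assumptions: `build_choices`, `pick_best` and the scoring rule are hypothetical names, and the items are small integers so that their hashes are fixed. The replay behavior it demonstrates is the observed CPython behavior discussed in this issue, not a documented guarantee.

```python
def build_choices() -> set:
    """Replay one fixed sequence of set mutations (hypothetical data)."""
    choices = set()
    for n in (40, 7, 19, 64, 3):
        choices.add(n)
    choices.discard(19)
    return choices


def pick_best(choices, score):
    """Return the first item, in iteration order, with the maximal score.

    max() keeps the first maximal element it encounters, so ties are
    resolved by the iteration order of the container.
    """
    return max(choices, key=score)


def score(n: int) -> int:
    # Hypothetical scoring rule: odd numbers score 1, so 7 and 3 tie.
    return n % 2


first = pick_best(build_choices(), score)
second = pick_best(build_choices(), score)

# Same construction, same hashes: same iteration order, same tie-break.
print(first == second)
print(tuple(build_choices()) == tuple(build_choices()))
```

Which of the tied items wins is an implementation detail; the point is only that the winner is the same each time the same sequence is replayed with the same hashes.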
There is a counterargument that we should simply offer `StableSet` and `StableFrozenSet` that guarantee a specific order, the same way that `dict` does.
A few things to note about that:
`dict[T, None]`, there is a substantial perf overhead to that
My PR makes a small change to CPython, in `Objects/object.c`, that sets the `tp_hash` slot of `NoneType` to a function that simply returns a constant value.
Admittedly, determinism between runs isn't a concern that most users/programs care about. It is rather niche. However, I argue that still, there is no externalized cost to this change.
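To illustrate why the value of `hash(None)` matters for the "key" types mentioned in the pitch, here is a small sketch. The `Key` class is hypothetical. The equality with the tuple hash relies on how CPython's `dataclasses` module currently generates `__hash__` (it hashes a tuple of the fields), which is an implementation detail rather than a documented contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Key:
    """Hypothetical mapping key with an Optional field."""
    ident: int
    parent: Optional[int] = None


key = Key(1)

# In current CPython the generated __hash__ hashes the tuple of fields,
# so the key's hash is derived from hash(None) whenever parent is None.
matches_tuple_hash = hash(key) == hash((1, None))
print(matches_tuple_hash)

# Whatever value hash(None) has in this process is folded into every such
# key; if it varies between runs, so do these hashes and the iteration
# order of any set that contains the keys.
print(hash(None), hash(key))
```

With a constant `hash(None)`, and integer fields as in this sketch, the printed hashes would be expected to stay the same from run to run.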
Previous discussion
https://discuss.python.org/t/constant-hash-for-none/21110
Linked PRs