Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes dict_hash discrepancy #3195

Merged
merged 3 commits into from
Mar 3, 2023
Merged

Fixes dict_hash discrepancy #3195

merged 3 commits into from
Mar 3, 2023

Conversation

w4nderlust
Copy link
Collaborator

No description provided.

@w4nderlust w4nderlust requested a review from arnavgarg1 March 3, 2023 08:53
Copy link
Contributor

@justinxzhao justinxzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

"feature_proc_columns": {feature[PROC_COLUMN] for feature in features},
# creating a sorted list out of the dict because hash_dict requires all values
# of the dict to be ordered object to ensure the creation fo the same hash
"feature_proc_columns": sorted({feature[PROC_COLUMN] for feature in features}),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if there's a way we can test this. Something that would force the behavior of changing the insertion order of the set.

This looks promising:

By default, the hash() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I can put a test that exploits this together and add it to this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, added a test and verified it repros the issue without this fix, and with this fix succeeds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for looking into Python docs @tgaddair. It makes sense that the ordering of sets would cause some problems from the docs, so this is a good change!

Also pretty interesting that the salts are scoped to individual processes rather than across processes, but it makes sense if you think about it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the salts were the same between processes, then it wouldn't serve its purpose of making the hash function unpredictable (to an attacker). But if they were different within a process, then hash lookups wouldn't work ;). So it makes sense, just never considered that they designed their hash function with that exploit in mind.

@tgaddair tgaddair added release-0.7 bug Something isn't working labels Mar 3, 2023
Copy link
Contributor

@arnavgarg1 arnavgarg1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 this is great!

@tgaddair tgaddair merged commit 3844543 into master Mar 3, 2023
@tgaddair tgaddair deleted the fix_hash branch March 3, 2023 22:52
tgaddair added a commit that referenced this pull request Mar 4, 2023
Co-authored-by: Travis Addair <tgaddair@gmail.com>
tgaddair added a commit that referenced this pull request Mar 4, 2023
Co-authored-by: Travis Addair <tgaddair@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants