Fixes dict_hash discrepancy #3195

w4nderlust · 2023-03-03T08:53:36Z

No description provided.

justinxzhao

Nice!

tgaddair · 2023-03-03T18:09:42Z

ludwig/data/cache/util.py

-        "feature_proc_columns": {feature[PROC_COLUMN] for feature in features},
+        # creating a sorted list out of the set because hash_dict requires all values
+        # of the dict to be ordered objects to ensure the creation of the same hash
+        "feature_proc_columns": sorted({feature[PROC_COLUMN] for feature in features}),
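
To illustrate why the sorted list matters, here is a minimal sketch. It assumes a hasher that serializes the dict and digests the resulting string; `toy_hash_dict` and `cache_key` are hypothetical names used only for this example and are not Ludwig's actual `hash_dict`.

```python
import hashlib
import json


def toy_hash_dict(d: dict) -> str:
    """Hypothetical stand-in for a dict hasher: serialize, then digest."""
    # sort_keys orders the dict keys, but the elements of a list value are
    # written in the order given, so list values must already be stable.
    payload = json.dumps(d, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_key(proc_columns) -> str:
    """Build a key whose value depends only on which columns are present."""
    # sorted() turns the set into a list ordered by value, independent of
    # the set's iteration order.
    return toy_hash_dict({"feature_proc_columns": sorted(set(proc_columns))})


a = cache_key(["age_x1", "name_y2", "city_z3"])
b = cache_key(["city_z3", "age_x1", "name_y2", "age_x1"])
print(a == b)  # True: same columns, same key

# Without sorting, element order leaks into the digest.
print(toy_hash_dict({"k": ["a", "b"]}) == toy_hash_dict({"k": ["b", "a"]}))  # False
```

Sorting is one way to get a canonical order; any deterministic ordering of the values would serve the same purpose.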


I'm wondering if there's a way we can test this: something that would force the iteration order of the set to change.

This looks promising:

By default, the hash() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

Let me see if I can put a test that exploits this together and add it to this PR.

Okay, added a test and verified that it reproduces the issue without this fix and passes with it.
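
For readers who want to see the effect, here is a rough sketch of how such a check could be built. It is not the test added in this PR; the column names and the `run_with_seed` helper are made up for illustration. It starts fresh interpreters with different `PYTHONHASHSEED` values and compares the raw and sorted iteration order of a set of strings.

```python
import os
import subprocess
import sys
from typing import Tuple

# Child program: print a set of strings in raw iteration order, then sorted.
CHILD = (
    "cols = {'age', 'name', 'city', 'zip', 'income', 'label'}\n"
    "print(','.join(cols))\n"
    "print(','.join(sorted(cols)))\n"
)


def run_with_seed(seed: int) -> Tuple[str, str]:
    """Run CHILD in a fresh interpreter with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    lines = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return lines[0], lines[1]


results = [run_with_seed(seed) for seed in range(1, 6)]
raw_orders = {raw for raw, _ in results}
sorted_orders = {ordered for _, ordered in results}

print(len(raw_orders))     # usually more than 1: raw order follows the seed
print(len(sorted_orders))  # 1: sorting removes the dependence on the seed
```

The raw orders are not guaranteed to differ for any particular pair of seeds, which is why the sketch only relies on the sorted output being identical.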

Nice, thanks for looking into the Python docs @tgaddair. Based on the docs, it makes sense that the ordering of sets would cause problems here, so this is a good change!

Also pretty interesting that the salts are scoped to individual processes rather than shared across processes, but it makes sense if you think about it.

If the salts were the same between processes, then it wouldn't serve its purpose of making the hash function unpredictable (to an attacker). But if they were different within a process, then hash lookups wouldn't work ;). So it makes sense, just never considered that they designed their hash function with that exploit in mind.
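
A small sketch of that scoping, assuming only the standard library: `child_hash` is a hypothetical helper that asks a fresh interpreter for `hash('abc')` under a given `PYTHONHASHSEED`.

```python
import os
import subprocess
import sys


def child_hash(seed: str) -> int:
    """Return hash('abc') as computed by a fresh interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('abc'))"],
        env=env, capture_output=True, text=True, check=True,
    ).stdout
    return int(out)


# Within one process the salt is fixed, so repeated calls agree.
print(hash("abc") == hash("abc"))  # True

# The same explicit seed gives the same salt in separate processes.
print(child_hash("123") == child_hash("123"))  # True

# "random" picks a fresh salt per process, so these almost surely differ.
print(child_hash("random") == child_hash("random"))
```

Pinning `PYTHONHASHSEED` is useful for reproducing an ordering bug, but code that needs stable output across runs should not depend on it.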

arnavgarg1

🚢 this is great!

tests/ludwig/data/test_cache_util.py

Co-authored-by: Travis Addair <tgaddair@gmail.com>

Fixes dict_hash discrepancy

d1763b8

w4nderlust requested a review from arnavgarg1 March 3, 2023 08:53

tgaddair approved these changes Mar 3, 2023

justinxzhao approved these changes Mar 3, 2023

tgaddair reviewed Mar 3, 2023

Added determinism test

581b4ff

tgaddair added the release-0.7 and bug labels Mar 3, 2023

Update test_cache_util.py

eadcf6a

arnavgarg1 approved these changes Mar 3, 2023

arnavgarg1 reviewed Mar 3, 2023

tgaddair merged commit 3844543 into master Mar 3, 2023

tgaddair deleted the fix_hash branch March 3, 2023 22:52

tgaddair added a commit that referenced this pull request Mar 4, 2023

Fixes dict_hash discrepancy (#3195)

082be82

Co-authored-by: Travis Addair <tgaddair@gmail.com>

tgaddair added a commit that referenced this pull request Mar 4, 2023

Fixes dict_hash discrepancy (#3195)

8654153

Co-authored-by: Travis Addair <tgaddair@gmail.com>
