[superseded] Different Hashing to avoid Collisions. #11767

Closed
wants to merge 4 commits into from

timotheecour
Member

@timotheecour timotheecour commented Jul 17, 2019

  • fixes HashSet[uint64] slow insertion depending on values #11764

  • it turns out all primitive types of size >= 4 bytes (e.g. int32, pointer, uint, float) are affected by the bug, resulting in a 100x to 1000x slowdown depending on conditions;
    e.g. for pointer, the 3rd case is 1000x slower, and float is similar; this PR fixes that

  • also fixes another bug that would cause hash collisions for small floats (previous code was using x+1.0 which leads to 0 for small floats)
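
The small-float collision described in the last bullet can be sketched as follows. This is an illustration in Python, not the Nim code from the PR: it assumes the old hash simply reinterpreted the bits of x + 1.0 as an integer, and `float_bits`, `old_style_float_hash` and the 2^10-slot table are hypothetical names and sizes chosen for the example.

```python
import struct


def float_bits(value: float) -> int:
    """Reinterpret a 64-bit float as a signed 64-bit integer (its raw bits)."""
    return struct.unpack("<q", struct.pack("<d", value))[0]


def old_style_float_hash(x: float) -> int:
    # Assumed shape of the pre-PR hash: the raw bits of x + 1.0.
    return float_bits(x + 1.0)


# Every value below is smaller than half the spacing between 1.0 and the
# next representable double, so x + 1.0 rounds to exactly 1.0.
small_floats = [1e-17, 1e-20, 1e-100, 1e-300]
hashes = {old_style_float_hash(x) for x in small_floats}

# Hypothetical table with 2^10 slots: the slot is the hash masked to 10 bits.
mask = (1 << 10) - 1
slots = {h & mask for h in hashes}
```

Under these assumptions all four inputs share one hash value (the bit pattern of 1.0), and because the low bits of that pattern are all zero, they all land in slot 0.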

to reproduce, see the test cases and results here: https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129b.nim

note

as I also observed (see test case), the faster hashing PR #11203 introduced a 5X slowdown (2.229572 vs 0.424859) in some cases, e.g. using uint64 or uint32 and more (for the 3rd case, with let n = 100_000 * 10), even after my PR #11767 is applied (i.e. after hashing the input as a string): in other words, bytewiseHashing and murmur3 are 5X faster than the multibyte hash introduced in #11203

this is related to what I had observed in #11581, but adds the new observation that the multibyte extension to the jenkins hash is also affected for inputs smaller than oids, such as uint64 or even uint32. The same conclusion as in #11581 follows: we should adopt murmur3 (or at least reconsider the implementation of #11203), which always comes out the fastest; I have provided a pure Nim implementation and suggested how to make it work at compile time (via a VM register callback)

[EDIT] that 2nd point won't be observable after the latest commit, since the code now uses hashData(cast[pointer](unsafeAddr x), T.sizeof), which for some reason is implemented differently from hash*[A](aBuf: openArray[A], sPos, ePos: int), i.e. it doesn't use multibyte jenkins anymore; but the point remains that multibyte jenkins can still result in a 5x slowdown even for small (4B) inputs

@timotheecour timotheecour changed the title from "fix #11764: make sets 1000 times faster for pointer, int64, int etc" to "fix #11764: make sets (tables etc) 1000 times faster for pointer, int64, int etc" Jul 17, 2019
@timotheecour timotheecour marked this pull request as ready for review July 17, 2019 21:11
@krux02
Contributor

krux02 commented Jul 17, 2019

Can you please explain the problem, and what you did differently to fix it, before you claim that you made it 1000 times faster?

@timotheecour
Member Author

before the PR, the hash was hash(x) = x bitand (2^n - 1), which is a terrible hash: it ignores all high-order bits, resulting in lots of (trivial) collisions
after the PR, the string hash (based on a multibyte modification of the jenkins hash) is applied for all x with sizeof(x) >= 4
for sizeof(x) < 4 the preexisting identity hash was faster, so the fix checks for the sizeof(x) >= 4 criterion
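
The collision pattern described above can be sketched as follows. This is an illustration in Python, not the code from the PR: the table size (2^10 slots) and the function names are hypothetical, and the mixing step is the 64-bit finalizer from MurmurHash3, used here only as a stand-in for "a hash that looks at all bits" rather than the jenkins-based string hash the PR applies.

```python
SLOTS_LOG2 = 10                     # hypothetical table with 2^10 slots
MASK = (1 << SLOTS_LOG2) - 1
U64 = 0xFFFFFFFFFFFFFFFF


def identity_slot(x: int) -> int:
    # Identity hash followed by the table mask: only the low n bits matter.
    return x & MASK


def mixed_slot(x: int) -> int:
    # Stand-in mixer (MurmurHash3's 64-bit finalizer): every input bit
    # influences the low bits that the mask keeps.
    x &= U64
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & U64
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & U64
    x ^= x >> 33
    return x & MASK


# Keys that differ only in their high-order bits.
keys = [i << 32 for i in range(1000)]

identity_slots = {identity_slot(k) for k in keys}
mixed_slots = {mixed_slot(k) for k in keys}
```

With these keys `identity_slots` contains a single slot, so every insertion collides with all previous ones, while `mixed_slots` spreads the same keys over many slots. That is the kind of input for which the identity hash degrades set and table insertion.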

@Araq
Member

Araq commented Jul 18, 2019

@narimiran is working on a Murmur3 implementation.

@mratsim
Collaborator

mratsim commented Jul 18, 2019

Obviously we can't have fancy intrinsics in the VM but if it's not too complex CityHash or Daniel Lemire's CLHash can be considered:

@narimiran
Member

Obviously we can't have fancy intrinsics in the VM

I've already made Murmur3 work in the VM. The only remaining problem is JS backend.

@Varriount
Contributor

@timotheecour Wow! Nice catch!

@krux02 krux02 changed the title from "fix #11764: make sets (tables etc) 1000 times faster for pointer, int64, int etc" to "Different Hashing to avoid Collisions." Jul 18, 2019
@Araq
Member

Araq commented Oct 2, 2019

New hash algorithm is shipping with v1, closing.

@Araq Araq closed this Oct 2, 2019
@timotheecour timotheecour changed the title from "Different Hashing to avoid Collisions." to "[superseded] Different Hashing to avoid Collisions." Feb 19, 2020
@timotheecour
Member Author

superseded by #13418

@timotheecour timotheecour deleted the pr_fix_11764 branch February 19, 2020 21:08