Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Refactor hash table support #15725

Closed
asfimport opened this issue May 31, 2018 · 6 comments
Closed

[C++] Refactor hash table support #15725

asfimport opened this issue May 31, 2018 · 6 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented May 31, 2018

Currently our hash table support is scattered in several places:

  • compute/kernels/hash.cc

  • util/hash.h and util/hash.cc

  • builder.cc (in the DictionaryBuilder implementation)

    Perhaps we should have something like a type-parametered hash table class (perhaps backed by non-owned memory) with several primitives:

  • decide allocation size for a given number of items

  • lookup an item

  • insert an item

  • decide whether resizing is needed

  • resize to a new memory area

  • ...

Reporter: Antoine Pitrou / @pitrou
Assignee: Antoine Pitrou / @pitrou

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-2653. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
I agree with this, with the caveat that in my experience this type of code (hashing) is very performance sensitive. I've been surprised at the changes that can cause 30-50% change in execution time.

Luckily, we have plenty of other hash table implementations in other projects to look at to see how efficient our implementations are relative to comparables

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
I have already started on that. The benchmarks on DictionaryBuilder are positive: i.e. the rewritten hash support performs slightly better. I haven't tried to migrate kernels yet, as that part seems a bit messier currently.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
That's great to hear. AFAIK we aren't using any SSE4.2 intrinsics like crc32 for hashing yet, so based on what features the CPU supports we will eventually want to select a fast path vs. a compatible path

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Basically I'm using xxhash (the 64-bit variant). There's also a special case for integers.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Apparently SSE4.2 (which includes the CRC instructions) is supported on all recent x86 CPUs: https://en.wikipedia.org/wiki/SSE4#SSE4.2

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
Issue resolved by pull request 3005
#3005

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants