Performance improvements for BytesRefHash #8788
Codecov Report
@@ Coverage Diff @@
## main #8788 +/- ##
============================================
+ Coverage 71.01% 71.15% +0.14%
- Complexity 57420 57497 +77
============================================
Files 4778 4779 +1
Lines 270922 270995 +73
Branches 39585 39589 +4
============================================
+ Hits 192403 192840 +437
+ Misses 62338 61938 -400
- Partials 16181 16217 +36
@backslasht @dblock Need your inputs!
What do you think?
@ketanv3 - Would the benefit of CPU cache lines still apply when a large number of keys are used in
The recency criterion (i.e. correlated adds) isn't considered for BytesRefHash; it made more sense for LongHash, where similar timestamps across consecutive hits made that optimization possible. Only the PSL (probe sequence length) criterion is considered, which makes CPU caches more effective. Performance should be similar whether keys arrive in or out of order. Theoretically,
Since (1) carries more fingerprint bits (over 65K times as many possible fingerprint values), it's not a one-to-one comparison between the two. With the current byte-packing scheme, performance is balanced between the two. But we can experiment with
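To make the PSL criterion concrete, here is a minimal sketch of Robin Hood style linear probing, one well-known way to keep probe sequence lengths short and evenly spread. It is an illustration under stated assumptions, not the PR's implementation: the class `RobinHoodSketch`, its `long` keys, and the multiplicative mixing function are all hypothetical stand-ins, and whether the re-organizing variant discussed here matches this scheme exactly is not stated in the thread.

```java
// Hypothetical sketch: Robin Hood insertion over long keys with linear probing.
final class RobinHoodSketch {
    private final long[] keys;
    private final boolean[] used;
    private final int mask;

    RobinHoodSketch(int capacityPowerOfTwo) {
        keys = new long[capacityPowerOfTwo];
        used = new boolean[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Stand-in mixing function (assumption); any well-distributed hash would do.
    private int home(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32)) & mask;
    }

    // Probe sequence length: how far the resident of this slot sits from its home slot.
    private int psl(int slot) {
        return (slot - home(keys[slot])) & mask;
    }

    // Assumes the key is not already present and the table is not full.
    void insert(long key) {
        int slot = home(key);
        int dist = 0;
        while (used[slot]) {
            int residentDist = psl(slot);
            if (residentDist < dist) {
                // The resident is closer to home than we are: take its slot
                // and carry the resident forward instead.
                long displaced = keys[slot];
                keys[slot] = key;
                key = displaced;
                dist = residentDist;
            }
            slot = (slot + 1) & mask;
            dist++;
        }
        keys[slot] = key;
        used[slot] = true;
    }

    boolean contains(long key) {
        int slot = home(key);
        for (int dist = 0; used[slot]; dist++, slot = (slot + 1) & mask) {
            if (keys[slot] == key) return true;
            // Early exit: our key would have displaced this resident on insert.
            if (psl(slot) < dist) return false;
        }
        return false;
    }

    int maxPsl() {
        int max = 0;
        for (int slot = 0; slot <= mask; slot++) {
            if (used[slot]) max = Math.max(max, psl(slot));
        }
        return max;
    }

    public static void main(String[] args) {
        RobinHoodSketch table = new RobinHoodSketch(1024);
        for (long k = 1; k <= 500; k++) table.insert(k * 7919L);
        System.out.println("longest probe sequence: " + table.maxPsl());
    }
}
```

The swap on insert is the extra work paid by new keys; the early exit in `contains` is where lookups can benefit, since a miss or a late hit touches fewer slots.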
@backslasht Here's a follow-up: I wrote an alternative benchmark that highlights the overhead of "new inserts" and the improvement re-organizing brings to "subsequent lookups". I ran it for large table sizes (10K to 100K) interleaved across 50 hash tables.
On average, pure inserts were 8.68% slower and pure lookups were 3.47% faster when re-organizing. Inserts were about 30% more expensive than lookups, so a key needs to be looked up at least 4 times for the lookup improvements to compensate for the insertion overhead. Note: Take these numbers with a grain of salt; I saw a lot of variance while benchmarking these large tables. In hindsight, this overhead wasn't noticeable with
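The break-even figure can be checked with a little arithmetic. The sketch below is only a worked example using the averages quoted in the comment; the class and method names are hypothetical, and costs are expressed relative to one baseline lookup.

```java
// Worked break-even estimate; all costs are relative to one baseline lookup (cost 1.0).
final class ReorgBreakEven {
    /** Smallest whole number of lookups per key whose savings cover the extra insert cost. */
    static int minLookupsToBreakEven(double insertToLookupCostRatio,
                                     double insertSlowdown,
                                     double lookupSpeedup) {
        // Extra cost paid once per key when it is first inserted.
        double extraInsertCost = insertToLookupCostRatio * insertSlowdown;
        // Saving gained on each later lookup of the same key.
        double savingPerLookup = lookupSpeedup;
        return (int) Math.ceil(extraInsertCost / savingPerLookup);
    }

    public static void main(String[] args) {
        // 1.30: inserts ~30% more expensive than lookups.
        // 0.0868: inserts 8.68% slower; 0.0347: lookups 3.47% faster.
        System.out.println(minLookupsToBreakEven(1.30, 0.0868, 0.0347)); // prints 4
    }
}
```

With these inputs the extra insert cost is about 0.113 lookup units and each lookup saves about 0.035, so the ratio is roughly 3.25 and rounds up to 4.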
Thanks @ketanv3 for the experiments. I think we can go with Given the number of operations for each key is going to be 2 (1 insert and 1 lookup), the benefit
@backslasht To summarise:
Thanks for the detailed explanation @ketanv3. I agree, it boils down to the size of the table and the number of repeated keys where
@dblock Would like to know your thoughts too.
I am a bit out of my depth here - would love to see you and @backslasht agree on what should be a mergeable state of this PR, first. If you need someone to break a tie I can give it a try for sure!
On top of the encoding improvements (#9412), this change would further reduce the latency by a decent amount. These tests in particular have pretty small composite keys (< 20 bytes), but improvements would be larger for larger keys.
Hi @dblock, can you take a look now? Thank you!
Hi @reta @backslasht! I was profiling hot methods (for a different optimization) and noticed around 3.93% of CPU time being spent in the "reinsert" logic. I had deliberately removed the use of pre-computed hashes as it felt less useful with a faster hash function and fingerprinting. But it seems that adding it back can still make things faster, especially in pathological cases where the number of buckets (i.e. keys in the table) is large. We can expect another 5% reduction in latency compared to the table I shared above, and around a 10% reduction in latency compared to the current baseline.
I have made these minor changes. Please review the diff. Thank you!
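The idea of pre-computed hashes can be sketched as follows: store each key's hash once at insert time, then reuse it when the table grows so that reinserts never touch the key bytes again. This is a simplified illustration under stated assumptions, not the PR's code: `PrecomputedHashTable` is a hypothetical class, FNV-1a stands in for the real hash function, and the `hashCalls` counter exists only to make the effect visible.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: open addressing over byte[] keys with per-key stored hashes.
final class PrecomputedHashTable {
    private int[] table = new int[8];     // slot -> key id, or -1 when empty
    private long[] hashes = new long[8];  // key id -> hash computed once at insert time
    private final List<byte[]> keys = new ArrayList<>();
    int hashCalls = 0;                    // instrumentation only

    PrecomputedHashTable() {
        Arrays.fill(table, -1);
    }

    // FNV-1a, used here as a stand-in hash (assumption, not the PR's hash function).
    private long hash(byte[] key) {
        hashCalls++;
        long h = 0xcbf29ce484222325L;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Returns the id of the key, adding it first if it is absent. */
    int add(byte[] key) {
        long h = hash(key);
        int mask = table.length - 1;
        for (int slot = (int) h & mask; ; slot = (slot + 1) & mask) {
            int id = table[slot];
            if (id == -1) {
                id = keys.size();
                keys.add(key.clone());
                if (id == hashes.length) hashes = Arrays.copyOf(hashes, id * 2);
                hashes[id] = h;
                table[slot] = id;
                if (keys.size() * 2 > table.length) grow(); // keep load factor <= 0.5
                return id;
            }
            // Compare the cheap stored hash before the expensive byte comparison.
            if (hashes[id] == h && Arrays.equals(keys.get(id), key)) return id;
        }
    }

    // Reinserts every id using its stored hash; no key bytes are rehashed.
    private void grow() {
        int[] bigger = new int[table.length * 2];
        Arrays.fill(bigger, -1);
        int mask = bigger.length - 1;
        for (int id = 0; id < keys.size(); id++) {
            int slot = (int) hashes[id] & mask;
            while (bigger[slot] != -1) slot = (slot + 1) & mask;
            bigger[slot] = id;
        }
        table = bigger;
    }

    public static void main(String[] args) {
        PrecomputedHashTable t = new PrecomputedHashTable();
        for (int i = 0; i < 100; i++) t.add(("key-" + i).getBytes(StandardCharsets.UTF_8));
        System.out.println("hash calls for 100 adds: " + t.hashCalls);
    }
}
```

The trade-off is one extra `long` of memory per key in exchange for cheaper table growth, which matters most when the table holds many keys and grows several times.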
Looks good to me! thanks @ketanv3 |
Hi @dblock, did you get a chance to review this? |
Compatibility status:
Checks if related components are compatible with change 58d3394
Incompatible components: [https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git]
Skipped components
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]
* Performance improvements for BytesRefHash
* Replace BytesRefHash and clean up alternative implementations
* Added t1ha1 to replace xxh3 hash function
* Update t1ha1 to use unsignedMultiplyHigh on JDK 18 and above
* Add link to the reference implementation for t1ha1
* Annotate t1ha1 with @opensearch.internal
* Run spotless
* Add pre-computed hashes to speed up reinserts
* Refactor HashFunctionTestCase
Signed-off-by: Ketan Verma <ketan9495@gmail.com>
(cherry picked from commit 3a8bbe9)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Performance improvements for BytesRefHash, which include:
Related Issues
Meta issue: #8710
JMH Benchmarks
These results indicate the time to perform 1 million add(...) operations with a varying number of unique keys. These operations were repeated and interleaved between 20 hash tables in order to make caches less effective. The following implementations are being compared:
Peak improvement
Whether or not to minimize the longest probe sequence length
The following implementations are being compared:
(A) is simpler and marginally faster for insert-heavy workloads. It also packs 32 bits of fingerprint information, which allows 65,536 times as many distinct fingerprint values as (B), although false positives are already rare with (B). On the other hand, (B) provides marginally faster worst-case lookups due to better use of CPU cache lines.
Benchmarks show no appreciable performance difference between (A) and (B) since the latency is dominated by the time to copy the key.
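As an illustration of fingerprinting, the sketch below packs a 32-bit fingerprint and a 32-bit key id into one `long` slot, so most mismatches are rejected without reading the key bytes. The exact slot layouts of (A) and (B) are not shown here, so this layout is an assumption: the class `FingerprintSlot` and its methods are hypothetical, and the 16-bit width used for (B) in the ratio is only inferred from 65,536 = 2^16.

```java
// Hypothetical slot layout: upper 32 bits = fingerprint (top bits of the hash),
// lower 32 bits = key id. Not necessarily the layout used by (A) or (B).
final class FingerprintSlot {
    static long pack(long hash, int id) {
        return (hash & 0xFFFFFFFF00000000L) | (id & 0xFFFFFFFFL);
    }

    static int id(long slot) {
        return (int) slot; // lower 32 bits
    }

    // True when the stored fingerprint equals the top 32 bits of the probe hash.
    // A match still requires a full key comparison, since fingerprints can collide.
    static boolean fingerprintMatches(long slot, long hash) {
        return ((slot ^ hash) >>> 32) == 0L;
    }

    // How many times more distinct fingerprint values a wider fingerprint allows.
    static long valueRatio(int widerBits, int narrowerBits) {
        return 1L << (widerBits - narrowerBits);
    }

    public static void main(String[] args) {
        long hash = 0x1234567890ABCDEFL;
        long slot = pack(hash, 42);
        System.out.println(id(slot));                       // prints 42
        System.out.println(fingerprintMatches(slot, hash)); // prints true
        System.out.println(valueRatio(32, 16));             // prints 65536
    }
}
```

A wider fingerprint lowers the chance of a false positive per probe, but once false positives are already rare the remaining cost is dominated by other work, which is consistent with the benchmarks showing no appreciable difference.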
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.