Optimize readInts24 performance for DocIdsWriter #12527
Comments
I honestly don't know how big … can get. If this thing can be as large as ((Integer.MAX_VALUE - 1)/8)*3, it would be good to use some sort of statically limited buffer (or load 3 vectors at a time in each loop).
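A sketch of the bounded scratch-buffer idea, assuming Lucene's bulk DataInput#readLongs API is available; the names SCRATCH_LONGS and unpack8 are illustrative, not actual Lucene code:

```java
// Bulk-read a bounded chunk of longs with one readLongs() call, then
// unpack eight doc ids from every three longs on the heap.
private static final int SCRATCH_LONGS = 3 * 128; // fixed bound, reused across calls
private final long[] scratch = new long[SCRATCH_LONGS];

private void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
  int i = 0;
  while (count - i >= 8) {
    // each 3 longs hold 8 packed 24-bit values; never read past the scratch bound
    int longs = Math.min(((count - i) / 8) * 3, SCRATCH_LONGS);
    in.readLongs(scratch, 0, longs); // single bulk read per chunk
    for (int j = 0; j < longs; j += 3, i += 8) {
      unpack8(scratch, j, docIDs, i); // hypothetical helper: same shifts as the existing loop
    }
  }
  for (; i < count; i++) { // tail of fewer than 8 values, 3 bytes each
    docIDs[i] = (Short.toUnsignedInt(in.readShort()) << 8) | Byte.toUnsignedInt(in.readByte());
  }
}
```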
I ran the workload a few more times, and somehow the difference was not as much:

Without the patch:
Run 1:
Run 2:
Run 3:

With the change:
Run 1:
Run 2:
Run 3:
I like this idea of reducing possible IO overhead. But I tested it with …

It looks like …
@mikemccand - Thanks for sharing the numbers. This is a truly surprising result. Even though the impact of this small change is not positive, it is significant enough to explore areas of improvement here. I am thinking of trying out a couple of things below:
The main difference here is that by default Lucene uses MMapDirectory. Loading many small arrays instead of single longs is not a good idea, as copyMemory off-heap to heap is heavy due to JVM overhead. When using MMapDirectory, all the reads are plain memory accesses rather than syscalls.

I wonder why you see higher syscalls in OpenSearch; could it be because it uses hybridfs? Nowadays the use of hybridfs should be avoided, as the old "number of kernel mappings" limitations are no longer applicable since Java 19. So I disagree with changing this loop of reading longs; it would be better to figure out why it is slower for you.

Another issue: how did you measure the overhead of syscalls? Most profilers do it wrong on those lower levels. Never ever trust a profiler telling you that readLong() takes too long; it is in most cases a measurement error. Due to the use of variable handles in MMapDirectory, you easily see wrong results because the optimizer takes muuuuch longer to start inlining the reads.
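For example, to rule out hybridfs effects, one could force a plain MMapDirectory when opening the index. A minimal illustrative snippet (the index path is a placeholder, not from this thread):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.MMapDirectory;

// Open the index with plain MMapDirectory so reads are memory-mapped
// accesses rather than read() syscalls ("/path/to/index" is a placeholder).
try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
     DirectoryReader reader = DirectoryReader.open(dir)) {
  // run the range-query workload against `reader`
}
```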
@uschindler - Thanks for the explanation. That makes sense.
While the default is hybridfs in OpenSearch, it loads the compound files and point range files using MMapDirectory.

I will spend more time understanding this better, although I am still wondering why the …
Indeed the inconsistency between … Let's see what the silicon tells us -- I kicked off another run on …
OK I tested the "read into scratch array" approach from this comment:
It looks like … Maybe next we should try 4 …
Why 4? We can just do a single long for 2 ints. Although, I also doubt that we really need 32 bits to encode the docid deltas in a BKD leaf block in the benchmark. The following patch should work for achieving this:
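A minimal sketch of the two-ints-per-long idea, assuming a standard IndexInput and not necessarily the exact patch referenced above (the high/low halves shown assume a big-endian long layout and would swap for little-endian byte order):

```java
// Sketch only: decode two 32-bit doc ids per readLong() call.
static void readInts32(IndexInput in, int count, int[] docIDs) throws IOException {
  int i = 0;
  for (; i < count - 1; i += 2) {
    long l = in.readLong();
    docIDs[i] = (int) (l >>> 32); // high 32 bits
    docIDs[i + 1] = (int) l;      // low 32 bits
  }
  if (i < count) {
    docIDs[i] = in.readInt(); // odd tail
  }
}
```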
Description
While recently working on numeric range queries, I noticed readInts24 consuming significant CPU cycles. When I looked into the code, I noticed multiple consecutive invocations of readLong.
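For context, the hot loop reads three longs for every eight packed 24-bit doc ids, roughly like this (the exact bit layout may differ by Lucene version; shown here for illustration):

```java
// Shape of DocIdsWriter#readInts24's inner loop: three consecutive
// readLong() calls per eight 24-bit-packed doc ids.
for (i = 0; i < count - 7; i += 8) {
  long l1 = in.readLong();
  long l2 = in.readLong();
  long l3 = in.readLong();
  docIDs[i] = (int) (l1 >>> 40);
  docIDs[i + 1] = (int) (l1 >>> 16) & 0xffffff;
  docIDs[i + 2] = (int) (((l1 & 0xffff) << 8) | (l2 >>> 56));
  docIDs[i + 3] = (int) (l2 >>> 32) & 0xffffff;
  docIDs[i + 4] = (int) (l2 >>> 8) & 0xffffff;
  docIDs[i + 5] = (int) (((l2 & 0xff) << 16) | (l3 >>> 48));
  docIDs[i + 6] = (int) (l3 >>> 24) & 0xffffff;
  docIDs[i + 7] = (int) l3 & 0xffffff;
}
```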
Initially, it seemed that the overhead from multiple syscalls should not be much, but I tried a quick patch that reads all the longs together, and it seemed to help. Sharing the patch and numbers below (nyc_taxis range query):
Without this change:
With this change: