
LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil #541

Merged: 37 commits merged into apache:main on Feb 7, 2022

Conversation

@gf2121 (Contributor) commented Dec 14, 2021

Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields such as gender / age / status, they often use numbers to represent the values; ES will infer these fields as long, and ES uses the BKD tree as the index for long fields. When the data volume grows, building the result set for low-cardinality fields drives CPU usage and load very high.

This is a flame graph we obtained from the production environment:
flame.svg

It can be seen that almost all CPU time is spent in addAll. When we reindexed long to keyword, the cluster load and search latency dropped dramatically (we spent weeks reindexing all indices...). I know the ES documentation recommends keyword for term/terms queries and long for range queries, but there are always users who don't realize this and keep their SQL-database habits, or dynamic mapping picks the type for them. All in all, users won't expect such a big performance difference between long and keyword for low-cardinality fields. So from my point of view it makes sense to make the BKD tree work better for low/medium-cardinality fields.

As far as I can see, for low-cardinality fields, keyword has two advantages over long:

  1. The ForUtil used in keyword postings is much more efficient than BKD's delta-VInt, because of its batch reading (readLongs) and SIMD decoding.
  2. When the query term count is less than 16, TermsInSetQuery can lazily materialize its result set, so when another small result clause intersects with this low-cardinality condition, the low-cardinality field can avoid reading all doc IDs into memory.
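To illustrate the first point, here is a minimal, hypothetical sketch (not the actual postings or BKD code) contrasting the two decode styles: delta-VInt decoding carries a serial dependency through both the byte position and the cumulative sum, while fixed-width decoding is a loop of independent iterations that the JIT can auto-vectorize:

```java
class DecodeStyles {
  // Delta-VInt: each iteration depends on the previous byte position
  // and on the running prefix sum, which defeats vectorization.
  static int[] decodeDeltaVInt(byte[] buf, int count) {
    int[] docs = new int[count];
    int pos = 0, doc = 0;
    for (int i = 0; i < count; i++) {
      int v = 0, shift = 0;
      byte b;
      do {
        b = buf[pos++];
        v |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      doc += v; // cumulative sum
      docs[i] = doc;
    }
    return docs;
  }

  // Fixed-width (here: 16-bit big-endian deltas from a block minimum):
  // independent iterations, amenable to SIMD.
  static int[] decodeFixed16(byte[] buf, int count, int base) {
    int[] docs = new int[count];
    for (int i = 0; i < count; i++) {
      docs[i] = base + (((buf[2 * i] & 0xFF) << 8) | (buf[2 * i + 1] & 0xFF));
    }
    return docs;
  }
}
```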

This PR targets the first point. The basic idea is to use a 512-int ForUtil for the BKD doc ID codec. I benchmarked this optimization by indexing some random LongPoint values and querying them with PointInSetQuery.

Benchmark Result

| doc count | field cardinality | query points | baseline QPS | candidate QPS | diff |
|---|---|---|---|---|---|
| 100000000 | 32 | 1 | 51.44 | 148.26 | 188.22% |
| 100000000 | 32 | 2 | 26.8 | 101.88 | 280.15% |
| 100000000 | 32 | 4 | 14.04 | 53.52 | 281.20% |
| 100000000 | 32 | 8 | 7.04 | 28.54 | 305.40% |
| 100000000 | 32 | 16 | 3.54 | 14.61 | 312.71% |
| 100000000 | 128 | 1 | 110.56 | 350.26 | 216.81% |
| 100000000 | 128 | 8 | 16.6 | 89.81 | 441.02% |
| 100000000 | 128 | 16 | 8.45 | 48.07 | 468.88% |
| 100000000 | 128 | 32 | 4.2 | 25.35 | 503.57% |
| 100000000 | 128 | 64 | 2.13 | 13.02 | 511.27% |
| 100000000 | 1024 | 1 | 536.19 | 843.88 | 57.38% |
| 100000000 | 1024 | 8 | 109.71 | 251.89 | 129.60% |
| 100000000 | 1024 | 32 | 33.24 | 104.11 | 213.21% |
| 100000000 | 1024 | 128 | 8.87 | 30.47 | 243.52% |
| 100000000 | 1024 | 512 | 2.24 | 8.3 | 270.54% |
| 100000000 | 8192 | 1 | 3333.33 | 5000 | 50.00% |
| 100000000 | 8192 | 32 | 139.47 | 214.59 | 53.86% |
| 100000000 | 8192 | 128 | 54.59 | 109.23 | 100.09% |
| 100000000 | 8192 | 512 | 15.61 | 36.15 | 131.58% |
| 100000000 | 8192 | 2048 | 4.11 | 11.14 | 171.05% |
| 100000000 | 1048576 | 1 | 2597.4 | 3030.3 | 16.67% |
| 100000000 | 1048576 | 32 | 314.96 | 371.75 | 18.03% |
| 100000000 | 1048576 | 128 | 99.7 | 116.28 | 16.63% |
| 100000000 | 1048576 | 512 | 30.5 | 37.15 | 21.80% |
| 100000000 | 1048576 | 2048 | 10.38 | 12.3 | 18.50% |
| 100000000 | 8388608 | 1 | 2564.1 | 3174.6 | 23.81% |
| 100000000 | 8388608 | 32 | 196.27 | 238.95 | 21.75% |
| 100000000 | 8388608 | 128 | 55.36 | 68.03 | 22.89% |
| 100000000 | 8388608 | 512 | 15.58 | 19.24 | 23.49% |
| 100000000 | 8388608 | 2048 | 4.56 | 5.71 | 25.22% |

Index size is reduced for low-cardinality fields and flat for high-cardinality fields.

113M    index_100000000_doc_32_cardinality_baseline
114M    index_100000000_doc_32_cardinality_candidate

140M    index_100000000_doc_128_cardinality_baseline
133M    index_100000000_doc_128_cardinality_candidate

193M    index_100000000_doc_1024_cardinality_baseline
174M    index_100000000_doc_1024_cardinality_candidate

241M    index_100000000_doc_8192_cardinality_baseline
233M    index_100000000_doc_8192_cardinality_candidate

314M    index_100000000_doc_1048576_cardinality_baseline
315M    index_100000000_doc_1048576_cardinality_candidate

392M    index_100000000_doc_8388608_cardinality_baseline
391M    index_100000000_doc_8388608_cardinality_candidate

@gf2121 gf2121 marked this pull request as draft December 16, 2021 12:55
@gf2121 gf2121 marked this pull request as ready for review December 20, 2021 08:26
@jpountz (Contributor) commented Jan 5, 2022

Nice. I wonder if we need to specialize for so many numbers of bits per value like we do for postings, or if we should only specialize for a few numbers of bits per value that are both useful and fast, e.g. 0, 4, 8, 16, 24 and 32.
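If only a few widths are specialized, arbitrary bit counts could be rounded up to the nearest supported one, similar in spirit to DirectWriter's rounding. A hypothetical sketch of that option:

```java
class BpvRounding {
  // Widths worth specializing, per the suggestion above.
  static final int[] SUPPORTED = {0, 4, 8, 16, 24, 32};

  // Round an arbitrary bits-per-value up to the nearest specialized width.
  static int roundUp(int bitsPerValue) {
    for (int supported : SUPPORTED) {
      if (bitsPerValue <= supported) {
        return supported;
      }
    }
    throw new IllegalArgumentException("bpv out of range: " + bitsPerValue);
  }
}
```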

@gf2121 (Contributor, Author) commented Jan 6, 2022

Thanks @jpountz! I have two points to confirm:

  1. Do you mean specializing only a few numbers of bits per value (with other bpvs using decodeSlow), or only supporting those numbers when writing (rounding up bpvs like we do in DirectWriter)?

  2. This issue has the same problem as LUCENE-10319 ("make ForUtil#BLOCK_SIZE changeable", #545): constants with complex names make the code harder to read. Maybe we need to solve that first? I made some changes in that PR, hoping they make sense to you :)

@jpountz (Contributor) commented Jan 6, 2022

I was indeed thinking of only supporting these numbers of bits per value. For postings, numbers are always deltas, so we can generally expect them to be small. But for BKD trees small values tend to be the exception, so I don't think we should spend too much effort supporting many bits per value; let's focus only on the ones that matter:

  • 32 bits per value for large segments where the doc ID order is random.
  • 24 bits per value for medium segments (less than 2^24 docs) where the doc ID order is random.
  • 16 bits per value plus delta coding from the minimum doc ID in the block for the case where there is some clustering of doc IDs.
  • And maybe the bitset strategy you added recently already covers the other cases, like sorted indexes and values that exist in many docs, so we don't need the delta coding between consecutive doc IDs anymore, which is slow anyway due to the cumulative sum?
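A write-side selector following this scheme might look like the hedged sketch below (hypothetical names; the merged PR's exact conditions may differ). Since doc IDs are non-negative ints, plain signed comparisons suffice:

```java
class DocIdEncodingSketch {
  static final int BPV_16 = 16, BPV_24 = 24, BPV_32 = 32;

  // Pick the narrowest specialized width for a block of doc IDs.
  // 16-bit values would be stored as deltas from the block's minimum.
  static int pickBitsPerValue(int min, int max) {
    if (max - min <= 0xFFFF) {
      return BPV_16; // delta from min fits in 16 bits: clustered doc IDs
    } else if (max <= 0xFFFFFF) {
      return BPV_24; // absolute values fit in 24 bits: medium segments
    } else {
      return BPV_32; // large segments, random doc ID order
    }
  }
}
```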

@gf2121 (Contributor, Author) commented Jan 6, 2022

Thanks @jpountz for the great advice! I have implemented it and the code is simplified a lot.

for (int i = 0; i < 192; ++i) {
  tmp[i] = longs[i] << 8;
}
for (int i = 0; i < 64; i++) {
@gf2121 (Contributor, Author) commented Jan 6, 2022

This encoding logic is a bit different from the ForUtil we use in postings, as I want the remainder decoding to also trigger SIMD. Here is the JMH result:

Benchmark                            (bitsPerValue)  (byteOrder)   Mode  Cnt        Score         Error  Units
PackedIntsDecodeBenchmark.baseline               24           LE  thrpt    5  7756530.597 ± 1654468.198  ops/s
PackedIntsDecodeBenchmark.candidate              24           LE  thrpt    5  9681438.494 ± 2528482.525  ops/s
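For reference, the general idea of 24-bit packing is to fit every four values into three ints; a simplified, hypothetical sketch of a 4-into-3 layout (the PR's actual byte layout differs, as the snippet above shows):

```java
class Pack24 {
  // Pack four 24-bit values into three ints.
  static void pack(int a, int b, int c, int d, int[] out) {
    out[0] = a << 8 | b >>> 16;                // a[23..0] | b[23..16]
    out[1] = (b & 0xFFFF) << 16 | c >>> 8;     // b[15..0] | c[23..8]
    out[2] = (c & 0xFF) << 24 | d;             // c[7..0]  | d[23..0]
  }

  // Unpack three ints back into four 24-bit values.
  static void unpack(int[] in, int[] out) {
    out[0] = in[0] >>> 8;
    out[1] = (in[0] & 0xFF) << 16 | in[1] >>> 16;
    out[2] = (in[1] & 0xFFFF) << 8 | in[2] >>> 24;
    out[3] = in[2] & 0xFFFFFF;
  }
}
```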

return expandMask32((1L << bitsPerValue) - 1);
}

private static void expand16(long[] arr, final long base) {
@gf2121 (Contributor, Author) commented Jan 6, 2022
This is a bit tricky but indeed helps performance:

Benchmark                                  (bitsPerValue)  (byteOrder)   Mode  Cnt         Score        Error  Units
PackedIntsDecodeBenchmark.plusAfterExpand              16           LE  thrpt   20   7996083.715 ± 618197.203  ops/s
PackedIntsDecodeBenchmark.plusWhenExpand               16           LE  thrpt   20  10681376.808 ± 542945.909  ops/s
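The trick being measured is folding the base addition into the expansion loop instead of doing it in a second pass. A simplified sketch with two 16-bit lanes per int (hypothetical; the real expand16 pulls four 16-bit lanes out of longs):

```java
class Expand16Sketch {
  // plusAfterExpand: expand first, then add the base in a second pass.
  static void plusAfterExpand(int[] packed, int[] out, int base) {
    for (int i = 0; i < packed.length; i++) {
      out[2 * i] = packed[i] >>> 16;
      out[2 * i + 1] = packed[i] & 0xFFFF;
    }
    for (int i = 0; i < out.length; i++) {
      out[i] += base;
    }
  }

  // plusWhenExpand: fold the base into the expansion loop, saving a pass.
  static void plusWhenExpand(int[] packed, int[] out, int base) {
    for (int i = 0; i < packed.length; i++) {
      out[2 * i] = (packed[i] >>> 16) + base;
      out[2 * i + 1] = (packed[i] & 0xFFFF) + base;
    }
  }
}
```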

Contributor:

Thanks for creating these micro benchmarks.

@jpountz (Contributor) left a review:

Thanks @gf2121.

If you have the time, I'd be curious to see how it would compare with a similar approach where we would use an int[] to hold the doc IDs instead of longs. Postings started with longs because it helped do "fake" SIMD operations, e.g. summing up a single long would sum up two ints under the hood. But as our BKD tree doesn't need to perform prefix sums, it doesn't need such tricks. We could add a new IndexInput#readInts similar to IndexInput#readLongs and see how the impl that uses longs compares to the one that uses ints?
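A naive sketch of what such a bulk readInts could look like over a ByteBuffer (hypothetical; the real IndexInput implementations read from the underlying file and use optimized little-endian view buffers):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ReadIntsSketch {
  // Bulk-read `length` ints into dst[offset..offset+length), using the
  // buffer's configured byte order (Lucene uses little-endian).
  static void readInts(ByteBuffer in, int[] dst, int offset, int length) {
    for (int i = 0; i < length; i++) {
      dst[offset + i] = in.getInt();
    }
  }
}
```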


@gf2121 (Contributor, Author) commented Jan 23, 2022

@iverase Thank you very much for all your help debugging this!

> I had a look into the method and I could not spot anything strange. Moreover, I replaced the line:
>
> `guard.getInts(curIntBufferViews[position & 0x03].position(position >>> 2), dst, offset, length);`
>
> with (not reusing the int buffers):
>
> `IntBuffer intBuffer = curBuf.duplicate().order(ByteOrder.LITTLE_ENDIAN).position(position).asIntBuffer();`
> `guard.getInts(intBuffer, dst, offset, length);`

I found the bug based on this clue you provided and fixed it in this commit. I also added some more tests around readInts.

Very sorry for my negligence, and thanks again for your help!

@iverase (Contributor) left a review:

I think it would be nice to add more tests for readInts. It's probably enough to add the readInts counterparts of the tests in BaseDirectoryTestCase and BaseChunkedDirectoryTestCase.

Otherwise I just left a few comments, but I think this is a good change.

@@ -503,6 +538,8 @@ private void unsetBuffers() {
curBuf = null;
curBufIndex = 0;
curLongBufferViews = null;
curFloatBufferViews = null;
curIntBufferViews = null;
Contributor:

curFloatBufferViews does not belong to this PR. I wonder if we should open a separate issue for it, as it might lead to unknown bugs? What do you think @jpountz?

Contributor:

+1 to open a separate issue

private final int[] tmp;

BKDForUtil(int maxPointsInLeaf) {
tmp = new int[maxPointsInLeaf / 4 * 3];
Contributor:

I am a bit confused by this; can you add an explanation of why we chose this size? It would probably also be good to use parentheses to clarify the order of evaluation.
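For context, one plausible reading (an inference from the encode path, not the author's own explanation): 24-bit packing turns every 4 input ints into 3 output ints, so the scratch buffer needs `(maxPointsInLeaf / 4) * 3` slots:

```java
class TmpSizing {
  // 24 bits per value: every 4 input ints pack into 3 output ints
  // (4 * 24 bits == 3 * 32 bits), hence 3/4 of maxPointsInLeaf slots.
  static int tmpSize(int maxPointsInLeaf) {
    return (maxPointsInLeaf / 4) * 3;
  }
}
```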


final class BKDForUtil {

static final int BLOCK_SIZE = 512;
Contributor:

I think this is not needed any more; should we remove it?

@iverase (Contributor) left a review:

LGTM

@gf2121 (Contributor, Author) commented Jan 26, 2022

Hi @jpountz! I wonder if you'd like to take another look at this PR? I'll merge and backport it if you agree.

@jpountz (Contributor) left a review:

LGTM

We should look into improving testing in a follow-up; there is now little testing for the legacy vint logic.


@@ -169,6 +169,13 @@ public void readLongs(long[] dst, int offset, int length) throws IOException {
}
}

public void readInts(int[] dst, int offset, int length) throws IOException {
Contributor:

Can you add javadocs?

}
for (int i = 0; i < quarterLen; i++) {
final int longIdx = off + i + quarterLen3;
tmp[i] |= (ints[longIdx] >>> 16) & 0xFF;
Contributor:

I think we don't even need the mask since values use 24 bits?

Suggested change:
- tmp[i] |= (ints[longIdx] >>> 16) & 0xFF;
+ tmp[i] |= ints[longIdx] >>> 16;

out.writeVInt(doc - previous);
previous = doc;

if (Integer.toUnsignedLong(min2max) <= 0xFFFFL) {
Contributor:

since doc IDs are integers in [0, MAX_VALUE) we don't need to convert to an unsigned long (MAX_VALUE is not a legal doc ID)

Suggested change:
- if (Integer.toUnsignedLong(min2max) <= 0xFFFFL) {
+ if (min2max <= 0xFFFF) {

out.writeShort((short) (docIds[start + i] >>> 8));
out.writeByte((byte) docIds[start + i]);
}
if (Integer.toUnsignedLong(max) <= 0xFFFFFFL) {
Contributor:

Suggested change:
- if (Integer.toUnsignedLong(max) <= 0xFFFFFFL) {
+ if (max <= 0xFFFFFF) {

@gf2121 gf2121 merged commit 8c67a38 into apache:main Feb 7, 2022
gf2121 added a commit that referenced this pull request Feb 7, 2022
gf2121 pushed a commit to gf2121/lucene that referenced this pull request Feb 24, 2022
benwtrent pushed a commit that referenced this pull request May 10, 2024
Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields such as gender / age / status, they often use numbers to represent the values; ES will infer these fields as long, and ES uses the BKD tree as the index for long fields.

Just as #541 said, when the data volume grows, building the result set for low-cardinality fields drives CPU usage and load very high, even if we use a boolean query with filter clauses for low-cardinality fields.

One reason is that a ReentrantLock limits access to LRUQueryCache. The QPS and cost of these queries are often high, which often causes lock acquisition failures when accessing the cache, resulting in low concurrency.

So I replaced the ReentrantLock with a ReentrantReadWriteLock, using only the read lock when getting the cache for a query.
benwtrent pushed a commit that referenced this pull request May 14, 2024