
LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil #541

Merged: 37 commits merged into apache:main on Feb 7, 2022

Conversation

@gf2121 (Contributor) commented Dec 14, 2021

Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields such as gender / age / status, they often use numbers to represent the values; ES will infer these fields as long, and ES uses the BKD tree as the index for long fields. When the data volume grows, building the result set for low-cardinality fields drives CPU usage and load very high.

This is a flame graph we obtained from the production environment:
flame.svg

It can be seen that almost all CPU time is spent in addAll. When we reindexed long to keyword, the cluster load and search latency dropped dramatically (we spent weeks reindexing all indices...). I know the ES documentation recommends keyword for term/terms queries and long for range queries, but there are always users who don't realize this and keep their SQL-database habits, or dynamic mapping picks the type for them. All in all, users won't expect such a big performance difference between long and keyword for low-cardinality fields. So from my point of view it makes sense to make the BKD tree work better for low/medium-cardinality fields.

As far as I can see, for low-cardinality fields, keyword has two advantages over long:

  1. The ForUtil used in keyword postings is much more efficient than BKD's delta-VInt, because of its batch reading (readLongs) and SIMD decoding.
  2. When the query term count is less than 16, TermsInSetQuery can lazily materialize its result set, so when another small result clause intersects with this low-cardinality condition, the low-cardinality field can avoid reading all doc IDs into memory.
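To illustrate the first point, here is a minimal, hypothetical sketch (not the actual postings or BKD code) contrasting the two decode styles: delta-VInt decoding carries a serial dependency through both the byte position and the cumulative sum, while fixed-width decoding is a loop of independent iterations that the JIT can auto-vectorize:

```java
class DecodeStyles {
  // Delta-VInt: each iteration depends on the previous byte position
  // and on the running prefix sum, which defeats vectorization.
  static int[] decodeDeltaVInt(byte[] buf, int count) {
    int[] docs = new int[count];
    int pos = 0, doc = 0;
    for (int i = 0; i < count; i++) {
      int v = 0, shift = 0;
      byte b;
      do {
        b = buf[pos++];
        v |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      doc += v; // cumulative sum
      docs[i] = doc;
    }
    return docs;
  }

  // Fixed-width (here: 16-bit big-endian deltas from a block minimum):
  // independent iterations, amenable to SIMD.
  static int[] decodeFixed16(byte[] buf, int count, int base) {
    int[] docs = new int[count];
    for (int i = 0; i < count; i++) {
      docs[i] = base + (((buf[2 * i] & 0xFF) << 8) | (buf[2 * i + 1] & 0xFF));
    }
    return docs;
  }
}
```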

This PR targets the first point. The basic idea is to use a 512-int ForUtil for the BKD doc ID codec. I benchmarked this optimization by indexing some random LongPoint values and querying them with PointInSetQuery.

Benchmark Result

| doc count | field cardinality | query points | baseline QPS | candidate QPS | diff |
|---|---|---|---|---|---|
| 100000000 | 32 | 1 | 51.44 | 148.26 | 188.22% |
| 100000000 | 32 | 2 | 26.8 | 101.88 | 280.15% |
| 100000000 | 32 | 4 | 14.04 | 53.52 | 281.20% |
| 100000000 | 32 | 8 | 7.04 | 28.54 | 305.40% |
| 100000000 | 32 | 16 | 3.54 | 14.61 | 312.71% |
| 100000000 | 128 | 1 | 110.56 | 350.26 | 216.81% |
| 100000000 | 128 | 8 | 16.6 | 89.81 | 441.02% |
| 100000000 | 128 | 16 | 8.45 | 48.07 | 468.88% |
| 100000000 | 128 | 32 | 4.2 | 25.35 | 503.57% |
| 100000000 | 128 | 64 | 2.13 | 13.02 | 511.27% |
| 100000000 | 1024 | 1 | 536.19 | 843.88 | 57.38% |
| 100000000 | 1024 | 8 | 109.71 | 251.89 | 129.60% |
| 100000000 | 1024 | 32 | 33.24 | 104.11 | 213.21% |
| 100000000 | 1024 | 128 | 8.87 | 30.47 | 243.52% |
| 100000000 | 1024 | 512 | 2.24 | 8.3 | 270.54% |
| 100000000 | 8192 | 1 | 3333.33 | 5000 | 50.00% |
| 100000000 | 8192 | 32 | 139.47 | 214.59 | 53.86% |
| 100000000 | 8192 | 128 | 54.59 | 109.23 | 100.09% |
| 100000000 | 8192 | 512 | 15.61 | 36.15 | 131.58% |
| 100000000 | 8192 | 2048 | 4.11 | 11.14 | 171.05% |
| 100000000 | 1048576 | 1 | 2597.4 | 3030.3 | 16.67% |
| 100000000 | 1048576 | 32 | 314.96 | 371.75 | 18.03% |
| 100000000 | 1048576 | 128 | 99.7 | 116.28 | 16.63% |
| 100000000 | 1048576 | 512 | 30.5 | 37.15 | 21.80% |
| 100000000 | 1048576 | 2048 | 10.38 | 12.3 | 18.50% |
| 100000000 | 8388608 | 1 | 2564.1 | 3174.6 | 23.81% |
| 100000000 | 8388608 | 32 | 196.27 | 238.95 | 21.75% |
| 100000000 | 8388608 | 128 | 55.36 | 68.03 | 22.89% |
| 100000000 | 8388608 | 512 | 15.58 | 19.24 | 23.49% |
| 100000000 | 8388608 | 2048 | 4.56 | 5.71 | 25.22% |

Index size is reduced for low-cardinality fields and flat for high-cardinality fields.

113M    index_100000000_doc_32_cardinality_baseline
114M    index_100000000_doc_32_cardinality_candidate

140M    index_100000000_doc_128_cardinality_baseline
133M    index_100000000_doc_128_cardinality_candidate

193M    index_100000000_doc_1024_cardinality_baseline
174M    index_100000000_doc_1024_cardinality_candidate

241M    index_100000000_doc_8192_cardinality_baseline
233M    index_100000000_doc_8192_cardinality_candidate

314M    index_100000000_doc_1048576_cardinality_baseline
315M    index_100000000_doc_1048576_cardinality_candidate

392M    index_100000000_doc_8388608_cardinality_baseline
391M    index_100000000_doc_8388608_cardinality_candidate

@gf2121 gf2121 marked this pull request as draft December 16, 2021 12:55
@gf2121 gf2121 marked this pull request as ready for review December 20, 2021 08:26
@jpountz (Contributor) commented Jan 5, 2022

Nice. I wonder if we need to specialize for so many numbers of bits per value like we do for postings, or if we should only specialize for a few numbers of bits per value that are both useful and fast, e.g. 0, 4, 8, 16, 24 and 32.
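If only a few widths are specialized, arbitrary bit counts could be rounded up to the nearest supported one, similar in spirit to DirectWriter's rounding. A hypothetical sketch of that option:

```java
class BpvRounding {
  // Widths worth specializing, per the suggestion above.
  static final int[] SUPPORTED = {0, 4, 8, 16, 24, 32};

  // Round an arbitrary bits-per-value up to the nearest specialized width.
  static int roundUp(int bitsPerValue) {
    for (int supported : SUPPORTED) {
      if (bitsPerValue <= supported) {
        return supported;
      }
    }
    throw new IllegalArgumentException("bpv out of range: " + bitsPerValue);
  }
}
```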

@gf2121 (Contributor, Author) commented Jan 6, 2022

Thanks @jpountz! I have two points to confirm:

  1. Do you mean specializing only a few numbers of bits per value (with other bpvs using decodeSlow), or only supporting those numbers when writing (rounding up bpvs like we do in DirectWriter)?

  2. This issue has the same problem as LUCENE-10319 ("make ForUtil#BLOCK_SIZE changeable", #545): constants with complex names make the code harder to read. Maybe we need to solve that first? I made some changes in that PR, hoping they make sense to you :)

@jpountz (Contributor) commented Jan 6, 2022

I was indeed thinking of only supporting these numbers of bits per value. For postings, numbers are always deltas, so we can generally expect them to be small. But for BKD trees small values tend to be the exception, so I don't think we should spend too much effort supporting many bits per value; let's focus only on the ones that matter:

  • 32 bits per value for large segments where the doc ID order is random.
  • 24 bits per value for medium segments (less than 2^24 docs) where the doc ID order is random.
  • 16 bits per value plus delta coding from the minimum doc ID in the block for the case where there is some clustering of doc IDs.
  • And maybe the bitset strategy you added recently already covers the other cases, like sorted indexes and values that exist in many docs, so we don't need the delta coding between consecutive doc IDs anymore, which is slow anyway due to the cumulative sum?
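A write-side selector following this scheme might look like the hedged sketch below (hypothetical names; the merged PR's exact conditions may differ). Since doc IDs are non-negative ints, plain signed comparisons suffice:

```java
class DocIdEncodingSketch {
  static final int BPV_16 = 16, BPV_24 = 24, BPV_32 = 32;

  // Pick the narrowest specialized width for a block of doc IDs.
  // 16-bit values would be stored as deltas from the block's minimum.
  static int pickBitsPerValue(int min, int max) {
    if (max - min <= 0xFFFF) {
      return BPV_16; // delta from min fits in 16 bits: clustered doc IDs
    } else if (max <= 0xFFFFFF) {
      return BPV_24; // absolute values fit in 24 bits: medium segments
    } else {
      return BPV_32; // large segments, random doc ID order
    }
  }
}
```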

@gf2121 (Contributor, Author) commented Jan 6, 2022

Thanks @jpountz for the great advice! I have implemented it and the code is simplified a lot.

for (int i = 0; i < 192; ++i) {
  tmp[i] = longs[i] << 8;
}
for (int i = 0; i < 64; i++) {
@gf2121 (Contributor, Author) commented Jan 6, 2022

This encoding logic is a bit different from the ForUtil we use in postings, as I want the remainder decoding to also trigger SIMD. Here is the JMH result:

Benchmark                            (bitsPerValue)  (byteOrder)   Mode  Cnt        Score         Error  Units
PackedIntsDecodeBenchmark.baseline               24           LE  thrpt    5  7756530.597 ± 1654468.198  ops/s
PackedIntsDecodeBenchmark.candidate              24           LE  thrpt    5  9681438.494 ± 2528482.525  ops/s
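For reference, the general idea of 24-bit packing is to fit every four values into three ints; a simplified, hypothetical sketch of a 4-into-3 layout (the PR's actual byte layout differs, as the snippet above shows):

```java
class Pack24 {
  // Pack four 24-bit values into three ints.
  static void pack(int a, int b, int c, int d, int[] out) {
    out[0] = a << 8 | b >>> 16;                // a[23..0] | b[23..16]
    out[1] = (b & 0xFFFF) << 16 | c >>> 8;     // b[15..0] | c[23..8]
    out[2] = (c & 0xFF) << 24 | d;             // c[7..0]  | d[23..0]
  }

  // Unpack three ints back into four 24-bit values.
  static void unpack(int[] in, int[] out) {
    out[0] = in[0] >>> 8;
    out[1] = (in[0] & 0xFF) << 16 | in[1] >>> 16;
    out[2] = (in[1] & 0xFFFF) << 8 | in[2] >>> 24;
    out[3] = in[2] & 0xFFFFFF;
  }
}
```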

return expandMask32((1L << bitsPerValue) - 1);
}

private static void expand16(long[] arr, final long base) {
@gf2121 (Contributor, Author) commented Jan 6, 2022
This is a bit tricky but indeed helps performance:

Benchmark                                  (bitsPerValue)  (byteOrder)   Mode  Cnt         Score        Error  Units
PackedIntsDecodeBenchmark.plusAfterExpand              16           LE  thrpt   20   7996083.715 ± 618197.203  ops/s
PackedIntsDecodeBenchmark.plusWhenExpand               16           LE  thrpt   20  10681376.808 ± 542945.909  ops/s
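The trick being measured is folding the base addition into the expansion loop instead of doing it in a second pass. A simplified sketch with two 16-bit lanes per int (hypothetical; the real expand16 pulls four 16-bit lanes out of longs):

```java
class Expand16Sketch {
  // plusAfterExpand: expand first, then add the base in a second pass.
  static void plusAfterExpand(int[] packed, int[] out, int base) {
    for (int i = 0; i < packed.length; i++) {
      out[2 * i] = packed[i] >>> 16;
      out[2 * i + 1] = packed[i] & 0xFFFF;
    }
    for (int i = 0; i < out.length; i++) {
      out[i] += base;
    }
  }

  // plusWhenExpand: fold the base into the expansion loop, saving a pass.
  static void plusWhenExpand(int[] packed, int[] out, int base) {
    for (int i = 0; i < packed.length; i++) {
      out[2 * i] = (packed[i] >>> 16) + base;
      out[2 * i + 1] = (packed[i] & 0xFFFF) + base;
    }
  }
}
```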

Contributor:

Thanks for creating these micro benchmarks.

@jpountz (Contributor) left a review:

Thanks @gf2121.

If you have the time, I'd be curious to see how it would compare with a similar approach where we would use an int[] to hold the doc IDs instead of longs. Postings started with longs because it helped do "fake" SIMD operations, e.g. summing up a single long would sum up two ints under the hood. But as our BKD tree doesn't need to perform prefix sums, it doesn't need such tricks. We could add a new IndexInput#readInts similar to IndexInput#readLongs and see how the impl that uses longs compares to the one that uses ints?
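A naive sketch of what such a bulk readInts could look like over a ByteBuffer (hypothetical; the real IndexInput implementations read from the underlying file and use optimized little-endian view buffers):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ReadIntsSketch {
  // Bulk-read `length` ints into dst[offset..offset+length), using the
  // buffer's configured byte order (Lucene uses little-endian).
  static void readInts(ByteBuffer in, int[] dst, int offset, int length) {
    for (int i = 0; i < length; i++) {
      dst[offset + i] = in.getInt();
    }
  }
}
```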


@gf2121 (Contributor, Author) commented Jan 23, 2022

@iverase Thank you very much for all your help debugging this!

> I had a look into the method and I could not spot anything strange. Moreover, I replaced the line:
>
> `guard.getInts(curIntBufferViews[position & 0x03].position(position >>> 2), dst, offset, length);`
>
> with (not reusing the int buffers):
>
> `IntBuffer intBuffer = curBuf.duplicate().order(ByteOrder.LITTLE_ENDIAN).position(position).asIntBuffer();`
> `guard.getInts(intBuffer, dst, offset, length);`

I found the bug based on this clue you provided and fixed it in this commit. I also added some more tests around readInts.

Very sorry for my negligence, and thanks again for your help!

@iverase (Contributor) left a review:

I think it would be nice to add more tests for readInts. It's probably enough to add the readInts counterparts of the tests in BaseDirectoryTestCase and BaseChunkedDirectoryTestCase.

Otherwise I just left a few comments, but I think this is a good change.

@@ -503,6 +538,8 @@ private void unsetBuffers() {
curBuf = null;
curBufIndex = 0;
curLongBufferViews = null;
curFloatBufferViews = null;
curIntBufferViews = null;
Contributor:

curFloatBufferViews does not belong to this PR. I wonder if we should open a separate issue for it, as it might lead to unknown bugs? What do you think @jpountz?

Contributor:

+1 to open a separate issue

private final int[] tmp;

BKDForUtil(int maxPointsInLeaf) {
tmp = new int[maxPointsInLeaf / 4 * 3];
Contributor:

I am a bit confused by this; can you add an explanation of why we chose this size? It would probably also be good to use parentheses to clarify the order of evaluation.
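For context, one plausible reading (an inference from the encode path, not the author's own explanation): 24-bit packing turns every 4 input ints into 3 output ints, so the scratch buffer needs `(maxPointsInLeaf / 4) * 3` slots:

```java
class TmpSizing {
  // 24 bits per value: every 4 input ints pack into 3 output ints
  // (4 * 24 bits == 3 * 32 bits), hence 3/4 of maxPointsInLeaf slots.
  static int tmpSize(int maxPointsInLeaf) {
    return (maxPointsInLeaf / 4) * 3;
  }
}
```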


final class BKDForUtil {

static final int BLOCK_SIZE = 512;
Contributor:

I think this is not needed any more; should we remove it?

@iverase (Contributor) left a review:

LGTM

@gf2121 (Contributor, Author) commented Jan 26, 2022

Hi @jpountz! I wonder if you'd like to take another look at this PR? I'll merge and backport it if you agree.

@jpountz (Contributor) left a review:

LGTM

We should look into improving testing in a follow-up; there is now little testing for the legacy vint logic.


@@ -169,6 +169,13 @@ public void readLongs(long[] dst, int offset, int length) throws IOException {
}
}

public void readInts(int[] dst, int offset, int length) throws IOException {
Contributor:

Can you add javadocs?

}
for (int i = 0; i < quarterLen; i++) {
final int longIdx = off + i + quarterLen3;
tmp[i] |= (ints[longIdx] >>> 16) & 0xFF;
Contributor:

I think we don't even need the mask since values use 24 bits?

Suggested change:
- tmp[i] |= (ints[longIdx] >>> 16) & 0xFF;
+ tmp[i] |= ints[longIdx] >>> 16;

out.writeVInt(doc - previous);
previous = doc;

if (Integer.toUnsignedLong(min2max) <= 0xFFFFL) {
Contributor:

since doc IDs are integers in [0, MAX_VALUE) we don't need to convert to an unsigned long (MAX_VALUE is not a legal doc ID)

Suggested change:
- if (Integer.toUnsignedLong(min2max) <= 0xFFFFL) {
+ if (min2max <= 0xFFFF) {

out.writeShort((short) (docIds[start + i] >>> 8));
out.writeByte((byte) docIds[start + i]);
}
if (Integer.toUnsignedLong(max) <= 0xFFFFFFL) {
Contributor:

Suggested change:
- if (Integer.toUnsignedLong(max) <= 0xFFFFFFL) {
+ if (max <= 0xFFFFFF) {

@gf2121 gf2121 merged commit 8c67a38 into apache:main Feb 7, 2022
gf2121 added a commit that referenced this pull request Feb 7, 2022
gf2121 pushed a commit to gf2121/lucene that referenced this pull request Feb 24, 2022
benwtrent pushed a commit that referenced this pull request May 10, 2024
Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index low-cardinality fields such as gender / age / status, they often use numbers to represent the values; ES will infer these fields as long, and ES uses the BKD tree as the index for long fields.

Just as #541 said, when the data volume grows, building the result set for low-cardinality fields drives CPU usage and load very high, even if we use a boolean query with filter clauses for low-cardinality fields.

One reason is that a ReentrantLock limits access to LRUQueryCache. The QPS and cost of these queries are often high, which often causes lock acquisition failures when accessing the cache, resulting in low concurrency.

So I replaced the ReentrantLock with a ReentrantReadWriteLock, using only the read lock when getting the cache for a query.
benwtrent pushed a commit that referenced this pull request May 14, 2024