Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency #14241

jackluo923 · 2024-10-15T18:08:20Z

Summary

CLPMutableForwardIndexV2 is a comprehensive improvement over CLPMutableForwardIndexV1, offering enhancements in both performance and functionality. CLPMutableForwardIndexV2 is also compatible with CLPMutableForwardIndex and CLPMutableForwardIndexV2 is now the default CLP mutable forward index to be used within Pinot.

This PR is the first in a series of three PRs:

CLPMutableForwardIndexV2: Introduces the updated mutable forward index.
CLPForwardIndexV2: Implements an improved version of immutable forward index offering much higher compression ratio.
Codec + Config Support: Adds the necessary codec and config support to enable the toggle between V1 & V2 immutable forward indexes.

Compatibility

CLPMutableForwardIndexV2 is compatible with CLPMutableForwardIndex. A unit test validating this compatibility can be found in CLPMutableForwardIndexTest#testStringV2(). In the future, we may consider replacing CLPMutableForwardIndex with CLPMutableForwardIndexV2. However, keeping both implementations for now provides a safety net, allowing us to revert to the older version should any unforeseen issues arise with V2.

Details

Performance Improvement

The serialization/deserialization overhead is significantly lower compared to CLPMutableForwardIndexV1. With the upgrade to clp-ffi-java version 0.4.7, we can now directly access the raw byte[] content of CLP-encoded messages, rather than operating over boxed String objects for logtype and dictVar and Long objects for encodedVar. On the forward index storage side, the composite mutable forward index now exclusively stores primitive types. For decoding, we also work with raw byte and long arrays, eliminating the need to first encode the bytes into a String/Long before passing them to CLP's decoder to unbox and decode the log message.

Functionality Improvement

The new mutable forward index implements a composite index for string-typed columns with dynamic encoding options:

Pure CLP encoding when the dictionary cardinality is below a configurable threshold.
CLP encoding combined with a raw string forward index when the dictionary cardinality exceeds the threshold.

Initially, CLP encoding transforms an extremely high-cardinality log message string into three data columns:

Logtype (very low cardinality) - essentially an inferred format string of the log message
Dictionary variables (medium cardinality) - variables with both alphabets and numbers
Encoded variables (high cardinality) - pure fixed point and floating point numbers

The logtype and dictionary variables are dictionary-encoded, while the encoded variables are stored as longs. Notably, both encodedVarIds and encodedVars are multi-valued, but they are stored using a flattened single-value mutable forward index, along with a separate forward index to capture the end offsets for each multi-valued document. This approach is necessary because the maximum number of values per document is unknown during ingestion, unlike the existing multi-value forward index, which requires this information upfront. During the conversion from mutable to immutable forward index, the two single-value mutable indices are merged into a single immutable multi-valued forward index, as the max length is now known post ingestion (conversion time).

During ingestion, if the cardinality of either the logtypeId or dictVarID exceeds a predefined threshold, the ingestion mode switches to a raw bytes forward index for subsequent documents. Maintaining very large dictionaries is inefficient in Pinot due to memory and I/O constraints (memory-mapped). Switching to a raw bytes forward index helps avoid these issues. During reads, if the requested docId is in the raw forward index, the raw bytes are returned. Otherwise, the log type, dictionary variables, and encoded variables are decoded using the CLP decoder to return the original log message's bytes.

Notes:

Writes are strictly sequential, while reads can be performed randomly. The supported append operations are:

setString(int docId, String value) - Encodes the log message using CLP and invokes appendEncodedMessage(@NotNull EncodedMessage clpEncodedMessage)
appendEncodedMessage(@NotNull EncodedMessage clpEncodedMessage)

Future Work:

enable conversion between mutable -> immutable CLP forward index without re-encoding
enable columnar conversion between mutable -> immutable CLP forward index
utilize dynamic CLP encoding feature provided in CLPMutableForwardIndexV2

deemoliu · 2024-10-15T20:50:09Z

...main/java/org/apache/pinot/segment/local/realtime/impl/forward/CLPMutableForwardIndexV2.java

+  protected int _nextDocId = 0;
+  protected int _nextDictVarDocId = 0;
+  protected int _nextEncodedVarId = 0;
+  protected int _bytesRawFwdIndexDocIdStartOffset = Integer.MAX_VALUE;


is this only for CLPV2? can we add a comment on how this value is dynamic updated?

Yes, this is only used for CLPV2 (but might be used for CLPV3 in the future). Internally, we have classes which extends the functionality of this class, so some class member variables are marked as protected.

The value is updated at the end of appendEncodedMessage method call. I will add a comment to make this clear.

...main/java/org/apache/pinot/segment/local/realtime/impl/forward/CLPMutableForwardIndexV2.java

codecov-commenter · 2024-10-15T22:13:29Z

Codecov Report

Attention: Patch coverage is 75.00000% with 43 lines in your changes missing coverage. Please review.

Project coverage is 63.84%. Comparing base (59551e4) to head (7e2ef60).
Report is 1213 commits behind head on master.

Files with missing lines	Patch %	Lines
...ealtime/impl/forward/CLPMutableForwardIndexV2.java	77.71%	24 Missing and 13 partials ⚠️
.../local/segment/index/forward/ForwardIndexType.java	0.00%	4 Missing ⚠️
...verter/stats/MutableNoDictionaryColStatistics.java	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14241      +/-   ##
============================================
+ Coverage     61.75%   63.84%   +2.09%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2627     +191     
  Lines        133233   144845   +11612     
  Branches      20636    22162    +1526     
============================================
+ Hits          82274    92477   +10203     
- Misses        44911    45526     +615     
- Partials       6048     6842     +794

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.72% <75.00%> (+2.01%)`	⬆️
java-21	`63.72% <75.00%> (+2.10%)`	⬆️
skip-bytebuffers-false	`63.81% <75.00%> (+2.06%)`	⬆️
skip-bytebuffers-true	`63.63% <75.00%> (+35.90%)`	⬆️
temurin	`63.84% <75.00%> (+2.09%)`	⬆️
unittests	`63.84% <75.00%> (+2.09%)`	⬆️
unittests1	`55.37% <0.00%> (+8.48%)`	⬆️
unittests2	`34.44% <75.00%> (+6.71%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

deemoliu

LGTM

…rdIndex during mutable->immutable segment conversion.

…ted to Pinot null value handling.

…performance and efficiency apache#14241

Jackie-Jiang

For mutable index, v1 and v2 cannot co-exist. If we are not planning to make it configurable, let's just remove v1 and only keep one mutable index format

jackluo923 · 2024-10-23T13:59:41Z

For mutable index, v1 and v2 cannot co-exist. If we are not planning to make it configurable, let's just remove v1 and only keep one mutable index format

Sure, I will make the change in another PR.

jackluo923 added 2 commits October 16, 2024 02:05

Add the initial implementation of CLPMutableForwardIndexV2

cbc8e75

Upgraded clp-ffi.version from 0.4.6 to 0.4.7

8765713

jackluo923 mentioned this pull request Oct 15, 2024

Upgraded clp-ffi.version from 0.4.5 to 0.4.7 #14242

Merged

Updated comments

2846ce3

deemoliu requested review from deemoliu and Jackie-Jiang October 15, 2024 20:46

deemoliu reviewed Oct 15, 2024

View reviewed changes

...main/java/org/apache/pinot/segment/local/realtime/impl/forward/CLPMutableForwardIndexV2.java Outdated Show resolved Hide resolved

Address code review concerns

370d907

deemoliu approved these changes Oct 15, 2024

View reviewed changes

jackluo923 added 2 commits October 17, 2024 04:35

Enable CLPMutableForwardIndexV2 to be compatible with CLPMutableForwa…

c03c308

…rdIndex during mutable->immutable segment conversion.

Change Pinot to use CLPMutableForwardIndexV2 by default

16b0268

jackluo923 changed the title ~~Add the initial implementation of CLPMutableForwardIndexV2~~ Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency Oct 16, 2024

jackluo923 added 4 commits October 17, 2024 07:25

Fix integration issues related to dictionary encoding.

db9c47c

Fix left-over bugs introduced by class name changes.

93092d0

Fixed small bug in CLPMutableForwardIndexV2#appendEncodedMessage rela…

6900ee1

…ted to Pinot null value handling.

Fix style

6d69af3

jackluo923 added a commit to y-scope/pinot that referenced this pull request Oct 21, 2024

Add and use CLPMutableForwardIndexV2 by default to improve ingestion …

7ad777f

…performance and efficiency apache#14241

Merge branch 'apache:master' into master-clp-mutable-fwd-index-v2

7e2ef60

xiangfu0 added feature ingestion labels Oct 21, 2024

deemoliu approved these changes Oct 22, 2024

View reviewed changes

deemoliu merged commit 8774d32 into apache:master Oct 22, 2024
21 checks passed

Jackie-Jiang reviewed Oct 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency #14241

Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency #14241

jackluo923 commented Oct 15, 2024 •

edited

Loading

deemoliu Oct 15, 2024

jackluo923 Oct 15, 2024

jackluo923 Oct 15, 2024

codecov-commenter commented Oct 15, 2024 •

edited

Loading

deemoliu left a comment

Jackie-Jiang left a comment

jackluo923 commented Oct 23, 2024

Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency #14241

Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency #14241

Conversation

jackluo923 commented Oct 15, 2024 • edited Loading

Summary

Compatibility

Details

Performance Improvement

Functionality Improvement

Notes:

Future Work:

deemoliu Oct 15, 2024

Choose a reason for hiding this comment

jackluo923 Oct 15, 2024

Choose a reason for hiding this comment

jackluo923 Oct 15, 2024

Choose a reason for hiding this comment

codecov-commenter commented Oct 15, 2024 • edited Loading

Codecov Report

deemoliu left a comment

Choose a reason for hiding this comment

Jackie-Jiang left a comment

Choose a reason for hiding this comment

jackluo923 commented Oct 23, 2024

jackluo923 commented Oct 15, 2024 •

edited

Loading

codecov-commenter commented Oct 15, 2024 •

edited

Loading