Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency #14241

Merged
merged 11 commits into from
Oct 22, 2024

Conversation

jackluo923
Copy link
Contributor

@jackluo923 jackluo923 commented Oct 15, 2024

Summary

CLPMutableForwardIndexV2 is a comprehensive improvement over CLPMutableForwardIndexV1, offering enhancements in both performance and functionality. CLPMutableForwardIndexV2 is also compatible with CLPMutableForwardIndex and CLPMutableForwardIndexV2 is now the default CLP mutable forward index to be used within Pinot.

This PR is the first in a series of three PRs:

  1. CLPMutableForwardIndexV2: Introduces the updated mutable forward index.
  2. CLPForwardIndexV2: Implements an improved version of immutable forward index offering much higher compression ratio.
  3. Codec + Config Support: Adds the necessary codec and config support to enable the toggle between V1 & V2 immutable forward indexes.

Compatibility

CLPMutableForwardIndexV2 is compatible with CLPMutableForwardIndex. A unit test validating this compatibility can be found in CLPMutableForwardIndexTest#testStringV2(). In the future, we may consider replacing CLPMutableForwardIndex with CLPMutableForwardIndexV2. However, keeping both implementations for now provides a safety net, allowing us to revert to the older version should any unforeseen issues arise with V2.

Details

Performance Improvement

The serialization/deserialization overhead is significantly lower compared to CLPMutableForwardIndexV1. With the upgrade to clp-ffi-java version 0.4.7, we can now directly access the raw byte[] content of CLP-encoded messages, rather than operating over boxed String objects for logtype and dictVar and Long objects for encodedVar. On the forward index storage side, the composite mutable forward index now exclusively stores primitive types. For decoding, we also work with raw byte and long arrays, eliminating the need to first encode the bytes into a String/Long before passing them to CLP's decoder to unbox and decode the log message.

Functionality Improvement

The new mutable forward index implements a composite index for string-typed columns with dynamic encoding options:

  1. Pure CLP encoding when the dictionary cardinality is below a configurable threshold.
  2. CLP encoding combined with a raw string forward index when the dictionary cardinality exceeds the threshold.

Initially, CLP encoding transforms an extremely high-cardinality log message string into three data columns:

  • Logtype (very low cardinality) - essentially an inferred format string of the log message
  • Dictionary variables (medium cardinality) - variables with both alphabets and numbers
  • Encoded variables (high cardinality) - pure fixed point and floating point numbers

The logtype and dictionary variables are dictionary-encoded, while the encoded variables are stored as longs. Notably, both encodedVarIds and encodedVars are multi-valued, but they are stored using a flattened single-value mutable forward index, along with a separate forward index to capture the end offsets for each multi-valued document. This approach is necessary because the maximum number of values per document is unknown during ingestion, unlike the existing multi-value forward index, which requires this information upfront. During the conversion from mutable to immutable forward index, the two single-value mutable indices are merged into a single immutable multi-valued forward index, as the max length is now known post ingestion (conversion time).

During ingestion, if the cardinality of either the logtypeId or dictVarID exceeds a predefined threshold, the ingestion mode switches to a raw bytes forward index for subsequent documents. Maintaining very large dictionaries is inefficient in Pinot due to memory and I/O constraints (memory-mapped). Switching to a raw bytes forward index helps avoid these issues. During reads, if the requested docId is in the raw forward index, the raw bytes are returned. Otherwise, the log type, dictionary variables, and encoded variables are decoded using the CLP decoder to return the original log message's bytes.

Notes:

Writes are strictly sequential, while reads can be performed randomly. The supported append operations are:

  • setString(int docId, String value) - Encodes the log message using CLP and invokes appendEncodedMessage(@NotNull EncodedMessage clpEncodedMessage)
  • appendEncodedMessage(@NotNull EncodedMessage clpEncodedMessage)

Future Work:

  • enable conversion between mutable -> immutable CLP forward index without re-encoding
  • enable columnar conversion between mutable -> immutable CLP forward index
  • utilize dynamic CLP encoding feature provided in CLPMutableForwardIndexV2

protected int _nextDocId = 0;
protected int _nextDictVarDocId = 0;
protected int _nextEncodedVarId = 0;
protected int _bytesRawFwdIndexDocIdStartOffset = Integer.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this only for CLPV2? can we add a comment on how this value is dynamic updated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is only used for CLPV2 (but might be used for CLPV3 in the future). Internally, we have classes which extends the functionality of this class, so some class member variables are marked as protected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value is updated at the end of appendEncodedMessage method call. I will add a comment to make this clear.

@codecov-commenter
Copy link

codecov-commenter commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 43 lines in your changes missing coverage. Please review.

Project coverage is 63.84%. Comparing base (59551e4) to head (7e2ef60).
Report is 1213 commits behind head on master.

Files with missing lines Patch % Lines
...ealtime/impl/forward/CLPMutableForwardIndexV2.java 77.71% 24 Missing and 13 partials ⚠️
.../local/segment/index/forward/ForwardIndexType.java 0.00% 4 Missing ⚠️
...verter/stats/MutableNoDictionaryColStatistics.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14241      +/-   ##
============================================
+ Coverage     61.75%   63.84%   +2.09%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2627     +191     
  Lines        133233   144845   +11612     
  Branches      20636    22162    +1526     
============================================
+ Hits          82274    92477   +10203     
- Misses        44911    45526     +615     
- Partials       6048     6842     +794     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.72% <75.00%> (+2.01%) ⬆️
java-21 63.72% <75.00%> (+2.10%) ⬆️
skip-bytebuffers-false 63.81% <75.00%> (+2.06%) ⬆️
skip-bytebuffers-true 63.63% <75.00%> (+35.90%) ⬆️
temurin 63.84% <75.00%> (+2.09%) ⬆️
unittests 63.84% <75.00%> (+2.09%) ⬆️
unittests1 55.37% <0.00%> (+8.48%) ⬆️
unittests2 34.44% <75.00%> (+6.71%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@deemoliu deemoliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jackluo923 jackluo923 changed the title Add the initial implementation of CLPMutableForwardIndexV2 Add and use CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency Oct 16, 2024
jackluo923 added a commit to y-scope/pinot that referenced this pull request Oct 21, 2024
@deemoliu deemoliu merged commit 8774d32 into apache:master Oct 22, 2024
21 checks passed
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mutable index, v1 and v2 cannot co-exist. If we are not planning to make it configurable, let's just remove v1 and only keep one mutable index format

@jackluo923
Copy link
Contributor Author

For mutable index, v1 and v2 cannot co-exist. If we are not planning to make it configurable, let's just remove v1 and only keep one mutable index format

Sure, I will make the change in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants