Store index metadata file for Lucene text indexes #13948

Merged
merged 3 commits into from
Oct 7, 2024

@itschrispeck
Collaborator

itschrispeck commented Sep 6, 2024

Currently, the Lucene text index is written with a configured Analyzer and queried with a configured QueryParser. Updating either of these in the table config can cause unpredictable behavior: when segments are loaded they use the new config, even though they were built with the old one. This can produce incorrect query results.

For example,

  1. segment 1 created with Analyzer1 and QueryParserForAnalyzer1
  2. table config updated with Analyzer2 and QueryParserForAnalyzer2
  3. segment 2 created with Analyzer2 and QueryParserForAnalyzer2
  4. segment 1 loaded, Reader uses QueryParserForAnalyzer2 but index uses Analyzer1

Storing metadata per segment allows compatible QueryParser/Analyzer pairs to be maintained. With this change, step 4 becomes:

  4. segment 1 loaded, Reader uses QueryParserForAnalyzer1 and index uses Analyzer1

This allows seamless config updates on realtime tables without breaking queries against old segments (provided the QueryParser and Analyzer classes are managed correctly). In the future, TextIndexHandler can use this metadata to provide better reload/rebuild capabilities for the Lucene text index.
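
The per-segment behavior described above can be sketched with plain `java.util.Properties`. This is a minimal illustration, not the code in this PR: the class name, file name, and property keys are hypothetical, and falling back to the table config for segments that have no metadata file is an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.Properties;

// Hypothetical helper for illustration only; not Pinot's actual API.
final class TextIndexMetadataSketch {
  static final String FILE_NAME = "text.index.metadata.properties"; // assumed file name
  static final String ANALYZER_KEY = "analyzerClass";               // assumed key
  static final String QUERY_PARSER_KEY = "queryParserClass";        // assumed key

  private TextIndexMetadataSketch() {
  }

  /** Build time: record the Analyzer/QueryParser pair the index was built with. */
  static void write(File indexDir, String analyzerClass, String queryParserClass)
      throws IOException {
    Properties props = new Properties();
    props.setProperty(ANALYZER_KEY, analyzerClass);
    props.setProperty(QUERY_PARSER_KEY, queryParserClass);
    try (OutputStream out = Files.newOutputStream(new File(indexDir, FILE_NAME).toPath())) {
      props.store(out, "text index build-time config");
    }
  }

  /**
   * Load time: returns {analyzerClass, queryParserClass}. Values stored with the
   * segment take precedence over the current table config.
   */
  static String[] resolve(File indexDir, String tableAnalyzer, String tableQueryParser)
      throws IOException {
    File file = new File(indexDir, FILE_NAME);
    if (!file.exists()) {
      // Assumption: a segment without metadata falls back to the table config.
      return new String[] {tableAnalyzer, tableQueryParser};
    }
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(file.toPath())) {
      props.load(in);
    }
    return new String[] {
        props.getProperty(ANALYZER_KEY, tableAnalyzer),
        props.getProperty(QUERY_PARSER_KEY, tableQueryParser)
    };
  }
}
```

With this shape, segment 1 written with `Analyzer1` keeps resolving to `Analyzer1` and `QueryParserForAnalyzer1` even after the table config moves to `Analyzer2`.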

Testing: we've been using this internally for around half a year

tags: enhancement, docs

@itschrispeck added the enhancement, release-notes, and documentation labels and removed the release-notes label Sep 6, 2024
@codecov-commenter

codecov-commenter commented Sep 6, 2024

Codecov Report

Attention: Patch coverage is 94.44444% with 2 lines in your changes missing coverage. Please review.

Project coverage is 63.83%. Comparing base (59551e4) to head (ad63dd9).
Report is 1146 commits behind head on master.

Files with missing lines                                  Patch %   Lines
...ot/segment/local/segment/store/TextIndexUtils.java     93.33%    2 Missing ⚠️

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13948      +/-   ##
============================================
+ Coverage     61.75%   63.83%   +2.08%     
- Complexity      207     1530    +1323     
============================================
  Files          2436     2621     +185     
  Lines        133233   144069   +10836     
  Branches      20636    22034    +1398     
============================================
+ Hits          82274    91971    +9697     
- Misses        44911    45318     +407     
- Partials       6048     6780     +732     
Flag                      Coverage Δ
custom-integration1       100.00% <ø> (+99.99%) ⬆️
integration               100.00% <ø> (+99.99%) ⬆️
integration1              100.00% <ø> (+99.99%) ⬆️
integration2              0.00% <ø> (ø)
java-11                   63.82% <94.44%> (+2.11%) ⬆️
java-21                   55.37% <77.77%> (-6.25%) ⬇️
skip-bytebuffers-false    63.83% <94.44%> (+2.08%) ⬆️
skip-bytebuffers-true     55.33% <77.77%> (+27.60%) ⬆️
temurin                   63.83% <94.44%> (+2.08%) ⬆️
unittests                 63.83% <94.44%> (+2.08%) ⬆️
unittests1                55.55% <77.77%> (+8.66%) ⬆️
unittests2                34.34% <94.44%> (+6.61%) ⬆️

Flags with carried forward coverage won't be shown.

@deemoliu
Contributor

Thanks @itschrispeck for the contribution and detailed information!
I have a question that might be unrelated, but I’d appreciate your expertise. Is it possible to rebuild the segment1 indexes into Analyzer2 and QueryParserForAnalyzer2 format? If possible, how expensive is it?

@itschrispeck
Collaborator Author

> I have a question that might be unrelated, but I’d appreciate your expertise. Is it possible to rebuild the segment1 indexes into Analyzer2 and QueryParserForAnalyzer2 format? If possible, how expensive is it?

This diff would be a prerequisite for that, but the functionality itself would still have to be added in TextIndexHandler, as mentioned. Essentially, we'd read the Lucene index metadata added by this diff and decide whether to overwrite the existing index. The cost is mostly determined by the choice of Analyzer, but in general a Lucene index build is expensive relative to Pinot's other indexes.

I have not added the functionality in this PR since internally we take advantage of the fact that it is not currently possible. When that functionality is added, we'll add some config to enable/disable it as well.
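
As a rough illustration of the decision described above (read the stored metadata, then decide whether to overwrite the index), the check might look like the sketch below. This is not implemented in this PR; the class name, property keys, and the `rebuildEnabled` flag are all hypothetical, and treating a QueryParser change as grounds for a rebuild is an assumption.

```java
import java.util.Properties;

// Hypothetical sketch; key names and the enable flag are assumptions, not Pinot config.
final class TextIndexRebuildCheckSketch {
  static final String ANALYZER_KEY = "analyzerClass";
  static final String QUERY_PARSER_KEY = "queryParserClass";

  private TextIndexRebuildCheckSketch() {
  }

  /**
   * Returns true when the metadata stored with the segment no longer matches the
   * table config and rebuilding has been explicitly enabled.
   */
  static boolean shouldRebuild(Properties stored, String tableAnalyzer,
      String tableQueryParser, boolean rebuildEnabled) {
    if (!rebuildEnabled || stored == null) {
      // Assumption: without stored metadata there is nothing to compare against.
      return false;
    }
    boolean analyzerChanged = !tableAnalyzer.equals(stored.getProperty(ANALYZER_KEY));
    boolean parserChanged = !tableQueryParser.equals(stored.getProperty(QUERY_PARSER_KEY));
    return analyzerChanged || parserChanged;
  }
}
```

Keeping the flag as an explicit parameter mirrors the intent to make rebuilds opt-in, since an unconditional rebuild would be costly.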

CommonsConfigurationUtils.saveToFile(properties, propertiesFile);
}

public static TextIndexConfig getUpdatedConfigFromPropertiesFile(File file, TextIndexConfig config)
Contributor

Add javadoc to public methods.

}

return (Constructor<QueryParserBase>) queryParserClass.getConstructor(String.class, Analyzer.class);
}

public static void writeConfigToPropertiesFile(File indexDir, TextIndexConfig config) {
Contributor

Add javadoc to public methods.

Contributor

@chenboat chenboat left a comment

Please address minor comments. LGTM.

@@ -300,6 +300,12 @@ public static Constructor<QueryParserBase> getQueryParserWithStringAndAnalyzerTy
return (Constructor<QueryParserBase>) queryParserClass.getConstructor(String.class, Analyzer.class);
}

/**
* Writes the config to the properties file.
Contributor

Can we be more specific on the fields which are written to the properties files?

@chenboat chenboat merged commit 0b7dae6 into apache:master Oct 7, 2024
18 of 21 checks passed
@swaminathanmanish
Contributor

@chenboat, @itschrispeck - We ran into an NPE with this change. When not all fields are populated in the TextIndexConfig, the newly added getUpdatedConfigFromPropertiesFile reaches the constructor below, which throws the NPE.
(cc @shounakmk219)

public AbstractBuilder(TextIndexConfig other) {
  _fstType = other._fstType;
  _enableQueryCache = other._enableQueryCache;
  _useANDForMultiTermQueries = other._useANDForMultiTermQueries;
  _stopWordsInclude = new ArrayList<>(other._stopWordsInclude);
  _stopWordsExclude = new ArrayList<>(other._stopWordsExclude);
  _luceneUseCompoundFile = other._luceneUseCompoundFile;
  _luceneMaxBufferSizeMB = other._luceneMaxBufferSizeMB;
  _luceneAnalyzerClass = other._luceneAnalyzerClass;
  _enablePrefixSuffixMatchingInPhraseQueries = other._enablePrefixSuffixMatchingInPhraseQueries;
  _reuseMutableIndex = other._reuseMutableIndex;
  _luceneNRTCachingDirectoryMaxBufferSizeMB = other._luceneNRTCachingDirectoryMaxBufferSizeMB;
}

java.lang.RuntimeException: java.lang.NullPointerException: Cannot invoke "java.util.Collection.toArray()" because "c" is null
    at org.apache.pinot.segment.local.segment.index.readers.text.LuceneTextIndexReader.<init>(LuceneTextIndexReader.java:109) ~[startree-pinot-all-1.3.0-ST.33.2-jar-with-dependencies.jar:1.3.0-ST.33.2-2c89967358e92fa326338467332730af48c1d353]
    at org.apache.pinot.segment.local.segment.index.text.TextIndexType$ReaderFactory.createIndexReader(TextIndexType.java:184) ~[startree-pinot-all-1.3.0-ST.33.2-jar-with-dependencies.jar:1.3.0-ST.33.2-2c89967358e92fa326338467332730af48c1d353]
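
The message in the trace matches what `new ArrayList<>(collection)` throws when `collection` is null, which is what the two stop-word copies in the constructor would do if those lists were never populated. One possible guard is a null-tolerant copy, sketched below. This is an assumption for illustration, not necessarily what the eventual fix does, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of TextIndexConfig.
final class NullSafeCopySketch {
  private NullSafeCopySketch() {
  }

  /** Copies the list, or returns null when the source field was never populated. */
  static <T> List<T> copyOrNull(List<T> source) {
    return source == null ? null : new ArrayList<>(source);
  }

  /** Demonstrates the failure mode: copying a null collection throws an NPE. */
  static boolean unguardedCopyThrows() {
    List<String> unset = null;
    try {
      new ArrayList<>(unset);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }
}
```

Whether an unset list should stay null or become an empty list is a separate design choice; the sketch preserves null so the copy stays faithful to the source.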

@shounakmk219
Collaborator

Raised a fix #14616

6 participants