Fix bug "synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer" #19248
Conversation
❌ Gradle check result for 5c8fbe6: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
It only fails in CI; I cannot reproduce it locally with the same seeds.
❌ Gradle check result for 267f48e: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
gaobinlong left a comment:
The DCO check failed, please amend your commits with '-s' to include the sign-off info. A changelog entry is also needed.
```diff
      List<TokenFilterFactory> previousTokenFilters,
-     Function<String, TokenFilterFactory> allFilters
+     Function<String, TokenFilterFactory> allFilters,
+     Function<String, Analyzer> analyzersBuiltSoFar
```
This change is breaking because the method is public, do we have another solution?
Hi @gaobinlong, do you have time to finish the review?
Thanks @laminelam, looks like a similar issue from the past: #16263 (comment)?
❌ Gradle check result for 48cc847: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Yes, I think they are related. This fix should handle both situations.
andrross left a comment:
@prudhvigodithi Can you review this PR since you've worked on a similar issue in the past?
Review comment on server/src/main/java/org/opensearch/index/analysis/AnalysisRegistry.java (outdated, resolved).
❌ Gradle check result for 49bb330: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
@prudhvigodithi Friendly ping to review this code, thanks! @laminelam Can you fix up the conflicts by rebasing, the changelog failure by adding an appropriate entry to the changelog, and the DCO failure by ensuring all commits are signed?
Commits:
- "…whitespace or classic tokenizer in synonym_analyzer" bug opensearch-project#18037: add 'analyzersBuiltSoFar' to getChainAwareTokenFilterFactory to build custom analyzers that depend on other (already built) analyzers. The analyzers are built following the order of precedence specified in the settings. Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>
- "…rgs)" Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>
- Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>
- Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>
Force-pushed 49bb330 to 7f3e8fd (compare).
❌ Gradle check result for 7f3e8fd: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
```diff
  ) throws IOException {
      Settings defaultSettings = Settings.builder().put(IndexMetadata.SETTING_VERSION_CREATED, settings.getIndexVersionCreated()).build();
-     Map<String, T> factories = new HashMap<>();
+     Map<String, T> factories = new LinkedHashMap<>(); // keep insertion order
```
This change is to keep the insertion order from buildAnalyzerFactories.
This is for the factories we build from the previously sorted analyzers: we sort the analyzers, then iterate through them and insert them into the factories map.
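Why the switch from `HashMap` to `LinkedHashMap` matters can be shown with a small standalone sketch (the analyzer names here are hypothetical, not the actual factory names from the PR):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InsertionOrderDemo {
    // Fill the given map with hypothetical analyzer names in a
    // deliberate, non-alphabetical order and return the key order seen.
    static List<String> keyOrder(Map<String, Integer> map) {
        map.put("synonym_analyzer", 1);
        map.put("base_analyzer", 2);
        map.put("another_analyzer", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap iterates in insertion order, so a carefully
        // sorted build order survives the round-trip through the map
        System.out.println(keyOrder(new LinkedHashMap<>()));
        // [synonym_analyzer, base_analyzer, another_analyzer]

        // HashMap makes no ordering guarantee at all; iteration order
        // may differ from insertion order
        System.out.println(keyOrder(new HashMap<>()));
    }
}
```

With a plain `HashMap`, the sorting done before insertion would be silently discarded on iteration, which is exactly what the one-line change prevents.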
```diff
+ // Some analyzers depend on others that need to be built first
+ // Sort by 'order', default to Integer.MAX_VALUE
+ List<Map.Entry<String, Settings>> sortedEntries = analyzersSettings.entrySet().stream().sorted((a, b) -> {
```
So please correct me if I'm wrong: we are introducing a new setting, `order`, in this PR, which does not exist today (https://docs.opensearch.org/latest/analyzers/index-analyzers/). Now the analyzers are built based on this new `order` key:

```json
"analyzer": {
  "test_analyzer": {
    "order": 2,
    "type": "custom",
    "char_filter": ["custom_pattern_replace"],
    "tokenizer": "whitespace",
    "filter": [
      "custom_ascii_folding",
      "lowercase",
      "custom_word_delimiter",
      "custom_synonym_graph-replacement_filter",
      "custom_pattern_replace_filter",
      "flatten_graph"
    ]
  },
  "no_split_synonym_analyzer": {
    "order": 1,
    "type": "custom",
    "tokenizer": "whitespace"
  }
}
```

If the `order` is not there, do we sort alphabetically?
@prudhvigodithi Yes, that's correct; we are introducing this new option to give the user the ability to decide which analyzer to build first. If analyzer A depends on B (as in the synonym_graph filter scenario), we instruct OpenSearch to build B before A.
I know this is the case and the special handling is only required for synonym_graph. Is there a way we can infer the order of analyzers from the natural order in which the user declared them? I'm not against adding a new setting, but I want to see if this bug can be solved without one.
@prudhvigodithi I understand your concerns and already thought about that, but the Settings "map" is backed by an internal map (a TreeMap) that sorts keys in natural order, so the declaration order is lost. See this comment in the Settings.Builder class:

```java
// we use a sorted map for consistent serialization when using getAsMap()
private final Map<String, Object> map = new TreeMap<>();
```
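The consequence of that `TreeMap` can be demonstrated directly (analyzer names hypothetical; this is a plain-Java illustration, not OpenSearch code):

```java
import java.util.Map;
import java.util.TreeMap;

public class TreeMapOrderDemo {
    public static void main(String[] args) {
        // Keys inserted in the order a user might declare their analyzers...
        Map<String, Object> map = new TreeMap<>();
        map.put("test_analyzer", "declared first");
        map.put("no_split_synonym_analyzer", "declared second");

        // ...but a TreeMap iterates in natural (alphabetical) key order,
        // so the user's declaration order is unrecoverable here.
        System.out.println(map.keySet());
        // [no_split_synonym_analyzer, test_analyzer]
    }
}
```

Since declaration order never survives the trip through the settings map, it cannot be inferred later, which is the argument for an explicit `order` setting.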
Added a small comment on the original issue #18037 (comment). I did look at the PR at a high level and added some comments; will take a look at it again. Thanks @laminelam.
This PR fixes the "synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer" bug.
I investigated the issue and it looks like there are 2 causes:
`analysisRegistry.getAnalyzer(synonymAnalyzerName);` is not enough, because it only looks into the built-in and pre-built analyzers; the ones from the settings are not there.
The solution is two-fold:
1. Fail safe instead of fail fast when building the analyzers. Right now, if an analyzer fails to build for some reason, the whole building process fails with an exception.
2. Build the analyzers that others depend on first. Synonym custom analyzers may depend on another analyzer that has to be built first.
The PR adds logic to:
Resolves #18037
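The two-fold approach described above (fail safe, and build dependencies first) can be sketched in simplified form. This is a hypothetical stand-in for the registry code, not the PR's actual implementation: each builder throws if its dependency is missing, and the loop defers it instead of aborting:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class FailSafeBuildDemo {
    // Build each "analyzer" with a builder that may fail if a dependency
    // is not ready yet; instead of failing fast, defer the failed one and
    // retry after the others have been built (fail safe).
    static Map<String, String> buildAll(Map<String, Function<Map<String, String>, String>> builders) {
        Map<String, String> built = new LinkedHashMap<>();
        Deque<String> pending = new ArrayDeque<>(builders.keySet());
        int retriesLeft = pending.size() * pending.size(); // crude cycle guard
        while (!pending.isEmpty() && retriesLeft-- > 0) {
            String name = pending.removeFirst();
            try {
                built.put(name, builders.get(name).apply(built));
            } catch (IllegalStateException missingDependency) {
                pending.addLast(name); // defer instead of failing the whole build
            }
        }
        return built;
    }

    public static void main(String[] args) {
        Map<String, Function<Map<String, String>, String>> builders = new LinkedHashMap<>();
        // "a" depends on "b" having been built already
        builders.put("a", done -> {
            if (!done.containsKey("b")) throw new IllegalStateException("b not built yet");
            return "a(built after " + done.get("b") + ")";
        });
        builders.put("b", done -> "b");
        System.out.println(buildAll(builders).keySet());
        // [b, a]
    }
}
```

Even though "a" is declared first and fails on the first attempt, the build completes once "b" exists, rather than aborting the whole analysis chain.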
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.