Fix bug "synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer" #19248

laminelam · 2025-09-07T17:29:24Z

This PR fixes the bug "synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer".

I investigated the issue, and there appear to be two causes:

When building the analyzers, if one fails for any reason, an exception is thrown and the whole process stops, so the customSynonymAnalyzer never gets instantiated.
In addition, if the customSynonymAnalyzer depends on another analyzer that has not been built (and registered) yet, the process fails as well.

analysisRegistry.getAnalyzer(synonymAnalyzerName);

This lookup is not enough because it only searches the built-in and pre-built analyzers. The ones defined in the index settings are not there.

The solution is two-fold:

Fail safe instead of fail fast when building the analyzers.
Build the analyzers that others depend on first.

Fail safe instead of fail fast when building the analyzers:
Right now, if an analyzer fails to build for any reason, the whole build process fails with an exception.
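A minimal sketch of the fail-safe idea, assuming nothing about the actual AnalysisRegistry code: each analyzer is built inside its own try/catch, and failures are collected instead of aborting the loop. The class FailSafeBuild, the Result holder, and the string stand-ins for settings and analyzers are all hypothetical. The PR description does not say how failures are reported afterwards, so the sketch only records them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the analyzer build loop; this is not OpenSearch code.
final class FailSafeBuild {

    /** Outcome of one build pass: what was built, and what failed along with its cause. */
    static final class Result<T> {
        final Map<String, T> built = new LinkedHashMap<>();
        final Map<String, Exception> failures = new LinkedHashMap<>();
    }

    /** Builds every entry, recording a failure instead of stopping at the first one. */
    static <S, T> Result<T> buildAll(Map<String, S> settingsByName, Function<S, T> builder) {
        Result<T> result = new Result<>();
        for (Map.Entry<String, S> entry : settingsByName.entrySet()) {
            try {
                result.built.put(entry.getKey(), builder.apply(entry.getValue()));
            } catch (RuntimeException e) {
                // Fail safe: remember the failure and keep building the remaining entries.
                result.failures.put(entry.getKey(), e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("broken", "");          // assumed to be an invalid definition
        settings.put("ok", "whitespace");
        Result<String> result = buildAll(settings, tokenizer -> {
            if (tokenizer.isEmpty()) {
                throw new IllegalArgumentException("missing tokenizer");
            }
            return "analyzer[" + tokenizer + "]";
        });
        System.out.println("built=" + result.built.keySet() + " failed=" + result.failures.keySet());
    }
}
```

With this shape, an analyzer that appears after a broken one still gets built, which is the property the description asks for.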

Build the analyzers that others depend on first:
A custom synonym analyzer may depend on another analyzer, which has to be built first.
The PR adds logic to:

add an optional "order" attribute that defines the build precedence between analyzers
add 'analyzersBuiltSoFar' to getChainAwareTokenFilterFactory to pass in the already built analyzers needed by the one being built
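The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: Definition is a hypothetical, simplified analyzer definition, analyzers are represented as strings, the dependency link between the two analyzers is assumed, and the alphabetical tie-break for equal "order" values is also an assumption. Only the "order" attribute, its Integer.MAX_VALUE default, and the analyzersBuiltSoFar name come from this PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of "sort by order, then build with access to what is already built".
final class OrderedAnalyzerBuild {

    /** Simplified analyzer definition; real index settings carry much more. */
    static final class Definition {
        final String name;
        final Integer order;     // null when the user did not set "order"
        final String dependsOn;  // name of a required analyzer, or null

        Definition(String name, Integer order, String dependsOn) {
            this.name = name;
            this.order = order;
            this.dependsOn = dependsOn;
        }

        int effectiveOrder() {
            return order == null ? Integer.MAX_VALUE : order;
        }
    }

    static Map<String, String> build(List<Definition> definitions) {
        List<Definition> sorted = new ArrayList<>(definitions);
        // Lower "order" builds first; the name tie-break is an assumption of this sketch.
        sorted.sort(Comparator.comparingInt(Definition::effectiveOrder).thenComparing(d -> d.name));

        Map<String, String> built = new LinkedHashMap<>();
        // Lookup handed to each build step so it can resolve analyzers built earlier.
        Function<String, String> analyzersBuiltSoFar = built::get;

        for (Definition d : sorted) {
            String description = d.name;
            if (d.dependsOn != null) {
                String dependency = analyzersBuiltSoFar.apply(d.dependsOn);
                if (dependency == null) {
                    throw new IllegalStateException(d.name + " needs " + d.dependsOn + ", which is not built yet");
                }
                description = d.name + " -> " + dependency;
            }
            built.put(d.name, description);
        }
        return built;
    }

    public static void main(String[] args) {
        List<Definition> definitions = List.of(
            new Definition("test_analyzer", 2, "no_split_synonym_analyzer"),
            new Definition("no_split_synonym_analyzer", 1, null)
        );
        System.out.println(build(definitions).keySet());
    }
}
```

If the dependent analyzer were given the lower order, the lookup would return null and the sketch would fail with an explicit message, which is the situation the "order" attribute lets the user avoid.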

Resolves #18037

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2025-09-07T18:39:34Z

❌ Gradle check result for 5c8fbe6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

laminelam · 2025-09-07T21:13:30Z

❌ Gradle check result for 5c8fbe6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

These tests only fail in CI; I cannot reproduce the failures locally with the same seeds:

org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=search.aggregation/10_histogram/histogram profiler}

org.opensearch.remotemigration.RemoteMigrationIndexMetadataUpdateIT.testIndexSettingsUpdateAfterIndexMovedToRemoteThroughAllocationExclude

org.opensearch.indices.replication.RemoteStoreReplicationSourceTests.testGetMergedSegmentFilesDownloadTimeout

github-actions · 2025-09-07T22:26:54Z

❌ Gradle check result for 267f48e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

gaobinlong

The DCO check failed; please amend your commit with '-s' to include the sign-off info. A changelog entry is also needed.

gaobinlong · 2025-09-17T08:15:54Z

...lysis-common/src/main/java/org/opensearch/analysis/common/MultiplexerTokenFilterFactory.java

        List<TokenFilterFactory> previousTokenFilters,
-        Function<String, TokenFilterFactory> allFilters
+        Function<String, TokenFilterFactory> allFilters,
+        Function<String, Analyzer> analyzersBuiltSoFar


This change is breaking because the method is public. Do we have another solution?
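One backward-compatible shape, sketched under assumptions: leave the existing 4-argument method untouched and add the 5-argument variant as a default method that delegates to it. A later commit in this PR is titled "Maintain the two versions of getChainAwareTokenFilterFactory (4 & 5 a…", which suggests a similar direction, but the code below is not taken from the PR. FilterFactory is a hypothetical stand-in interface, the first two parameters are assumed, and analyzers are represented as strings.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: add an overload without breaking existing implementers or callers.
final class OverloadCompat {

    interface FilterFactory {
        String name();

        // Existing 4-argument method, left unchanged.
        default FilterFactory getChainAwareTokenFilterFactory(
            String tokenizer,
            List<String> charFilters,
            List<FilterFactory> previousTokenFilters,
            Function<String, FilterFactory> allFilters
        ) {
            return this;
        }

        // New 5-argument overload; by default it ignores the extra lookup and delegates.
        default FilterFactory getChainAwareTokenFilterFactory(
            String tokenizer,
            List<String> charFilters,
            List<FilterFactory> previousTokenFilters,
            Function<String, FilterFactory> allFilters,
            Function<String, String> analyzersBuiltSoFar
        ) {
            return getChainAwareTokenFilterFactory(tokenizer, charFilters, previousTokenFilters, allFilters);
        }
    }

    static FilterFactory named(String name) {
        return () -> name;
    }

    /** Written against the old API: overrides only the 4-argument method. */
    static final class LegacyFilter implements FilterFactory {
        public String name() {
            return "legacy";
        }

        @Override
        public FilterFactory getChainAwareTokenFilterFactory(
            String tokenizer,
            List<String> charFilters,
            List<FilterFactory> previousTokenFilters,
            Function<String, FilterFactory> allFilters
        ) {
            return named("legacy(after " + previousTokenFilters.size() + " filters)");
        }
    }

    /** Needs an already built analyzer: overrides the 5-argument method. */
    static final class SynonymFilter implements FilterFactory {
        public String name() {
            return "synonym";
        }

        @Override
        public FilterFactory getChainAwareTokenFilterFactory(
            String tokenizer,
            List<String> charFilters,
            List<FilterFactory> previousTokenFilters,
            Function<String, FilterFactory> allFilters,
            Function<String, String> analyzersBuiltSoFar
        ) {
            return named("synonym(using " + analyzersBuiltSoFar.apply("no_split_synonym_analyzer") + ")");
        }
    }

    public static void main(String[] args) {
        Function<String, FilterFactory> allFilters = name -> null;
        Function<String, String> analyzers = name -> "analyzer:" + name;
        for (FilterFactory f : List.of(new LegacyFilter(), new SynonymFilter())) {
            // The registry can always call the 5-argument form; legacy filters still work.
            System.out.println(f.getChainAwareTokenFilterFactory("whitespace", List.of(), List.of(), allFilters, analyzers).name());
        }
    }
}
```

The trade-off is a wider interface with two near-identical methods, in exchange for not forcing every existing implementation to change its signature.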

Hi @gaobinlong
Do you have time to finish the review?

prudhvigodithi · 2025-09-22T01:38:04Z

Thanks @laminelam looks like similar issue from past #16263 (comment) ?

github-actions · 2025-09-22T01:52:00Z

❌ Gradle check result for 48cc847: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

laminelam · 2025-10-13T15:25:21Z

Thanks @laminelam looks like similar issue from past #16263 (comment) ?

Yes, I think they are related. This fix should handle both situations.

andrross

@prudhvigodithi Can you review this PR since you've worked on a similar issue in the past?

server/src/main/java/org/opensearch/index/analysis/AnalysisRegistry.java

github-actions · 2025-10-27T15:05:23Z

❌ Gradle check result for 49bb330: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

andrross · 2025-11-21T22:23:38Z

@prudhvigodithi Friendly ping to review this code, thanks!

@laminelam Can you fix up the conflicts by rebasing, the changelog failure by adding an appropriate entry to the changelog, and the DCO failure by ensuring all commits are signed?

…whitespace or classic tokenizer in synonym_analyzer" bug opensearch-project#18037 add 'analyzersBuiltSoFar' to getChainAwareTokenFilterFactory to build custom analyzers depending on other (already built) analyzers The analyzers are built following the order of precedence specified in the settings Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>

…rgs) Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>

Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>

github-actions · 2025-11-22T16:45:02Z

❌ Gradle check result for 7f3e8fd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

prudhvigodithi · 2025-11-22T21:32:31Z

server/src/main/java/org/opensearch/index/analysis/AnalysisRegistry.java

    ) throws IOException {
        Settings defaultSettings = Settings.builder().put(IndexMetadata.SETTING_VERSION_CREATED, settings.getIndexVersionCreated()).build();
-        Map<String, T> factories = new HashMap<>();
+        Map<String, T> factories = new LinkedHashMap<>(); // keep insertion order


This change is to keep the insertion order from buildAnalyzerFactories.

This is for the factories we build from the previously sorted analyzers.
We sort the analyzers, then iterate through them and insert them into the factories map.
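A small illustration of why the map type matters here, assuming only standard java.util behavior: LinkedHashMap iterates in insertion order, so the sorted build order survives, whereas HashMap gives no ordering guarantee. The class name and the string stand-ins for factories are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the sorted order is preserved because the map keeps insertion order.
final class FactoryOrder {

    static List<String> iterationOrder(List<String> sortedAnalyzerNames) {
        Map<String, String> factories = new LinkedHashMap<>(); // keeps insertion order
        for (String name : sortedAnalyzerNames) {
            factories.put(name, "factory:" + name);
        }
        return new ArrayList<>(factories.keySet());
    }

    public static void main(String[] args) {
        System.out.println(iterationOrder(List.of("no_split_synonym_analyzer", "test_analyzer")));
    }
}
```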

prudhvigodithi · 2025-11-22T21:55:00Z

server/src/main/java/org/opensearch/index/analysis/AnalysisRegistry.java

+
+        // Some analyzers depend on others that need to be built first
+        // Sort by 'order', default to Integer.MAX_VALUE
+        List<Map.Entry<String, Settings>> sortedEntries = analyzersSettings.entrySet().stream().sorted((a, b) -> {


So please correct me if I'm wrong: this PR introduces a new setting, order, which does not exist today https://docs.opensearch.org/latest/analyzers/index-analyzers/.
The analyzers are now built based on this new order key.

  "analyzer": {
    "test_analyzer": {
      "order": 2,
      "type": "custom",
      "char_filter": [ "custom_pattern_replace" ],
      "tokenizer": "whitespace",
      "filter": [
        "custom_ascii_folding",
        "lowercase",
        "custom_word_delimiter",
        "custom_synonym_graph-replacement_filter",
        "custom_pattern_replace_filter",
        "flatten_graph"
      ]
    },
    "no_split_synonym_analyzer": {
      "order": 1,
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}

If the order is not there, sort alphabetically?

@prudhvigodithi
Yes, that's correct. We are introducing this new option to let the user decide which analyzer to build first.
If analyzer A depends on B (as in the synonym_graph filter scenario), we instruct OpenSearch to build B before A.

I understand that this special handling is only required for synonym_graph. Is there a way to infer the build order from the order in which the user declared the analyzers? I'm not against adding a new setting, but I want to see if this bug can be solved without one.

@prudhvigodithi
I understand your concerns and had already thought about that, but the Settings "map" is backed by an internal TreeMap, which sorts keys by their natural order, so the declaration order is not retained. See this comment in the Settings.Builder class:

// we use a sorted map for consistent serialization when using getAsMap()
private final Map<String, Object> map = new TreeMap<>();
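To illustrate the point with plain java.util types (not the Settings class itself): once keys go into a TreeMap, they come back in natural, alphabetical order, whatever order they were declared in. The class and the analyzer names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap-backed store cannot reproduce declaration order.
final class DeclarationOrder {

    static List<String> keysAsStored(List<String> declaredNames) {
        Map<String, Object> map = new TreeMap<>(); // sorted by key, like the field quoted above
        for (String name : declaredNames) {
            map.put(name, Boolean.TRUE);
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // Declared dependency-first, but read back alphabetically.
        System.out.println(keysAsStored(List.of("synonym_base", "main_analyzer")));
    }
}
```

Here "synonym_base" is declared first but is read back after "main_analyzer", so the declaration order cannot serve as the build order without an explicit hint such as "order".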

prudhvigodithi · 2025-11-22T22:00:59Z

Added a small comment on the original issue #18037 (comment).

I did look at the PR at high level and added some comments, will take a look at it again. Thanks @laminelam.

laminelam requested review from a team, Bukhtawar, CEHENKLE, Rishikesh1159, anasalkouz, andrross, ashking94, cwperks, dbwiddis, gbbafna, jed326, kotwanikunal, mch2, msfroh, owaiskazi19, reta, sachinpkale, saratvemulapalli, shwetathareja and sohami as code owners September 7, 2025 17:29

github-actions bot added the bug and Indexing labels Sep 7, 2025

laminelam mentioned this pull request Sep 7, 2025

[BUG] synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer – similar to #16263 #18037

Open

gaobinlong reviewed Sep 17, 2025


opensearch-ci-bot mentioned this pull request Sep 21, 2025

[AUTOCUT] Gradle Check Flaky Test Report for IngestFromKinesisIT #17678

Open

andrross reviewed Oct 20, 2025

server/src/main/java/org/opensearch/index/analysis/AnalysisRegistry.java (outdated)

This was referenced Oct 27, 2025

[AUTOCUT] Gradle Check Flaky Test Report for IndexServiceTests #14407

Open

[AUTOCUT] Gradle Check Flaky Test Report for VerifyVersionConstantsIT #14585

Open

laminelam requested review from andrross and gaobinlong November 2, 2025 12:13

Lamine Idjeraoui added 4 commits November 22, 2025 08:52

Maintain the two versions of getChainAwareTokenFilterFactory (4 & 5 a…

9f2c269

…rgs) Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>

change order's default value to Integer.MAX_VALUE

90f64b8

Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>

update change log

7f3e8fd

Signed-off-by: Lamine Idjeraoui <lidjeraoui@apple.com>

laminelam force-pushed the feature/synonym_graph_bug_97 branch from 49bb330 to 7f3e8fd Compare November 22, 2025 15:34

opensearch-ci-bot mentioned this pull request Nov 22, 2025

[AUTOCUT] Gradle Check Flaky Test Report for ShardsLimitAllocationDeciderIT #19726

Open

prudhvigodithi reviewed Nov 22, 2025


4 participants
