
Support merging object-type fields when fetching the schema from the index#3653

Merged
penghuo merged 12 commits into opensearch-project:main from xinyual:mergeTwoObjects
Jun 9, 2025

Conversation

Contributor

@xinyual xinyual commented May 22, 2025

Description

This PR supports merging object-type fields when fetching the schema from several indices. For example, suppose we have

PUT demo1
{
  "mappings": {
    "properties": {
      "machine": {
        "properties": {
          "os1": {
            "type": "text"
          },
          "ram1": {
            "type": "long"
          }
        }
      }
    }
  }
}

And

PUT demo2
{
  "mappings": {
    "properties": {
      "machine": {
        "properties": {
          "os2": {
            "type": "text"
          },
          "ram2": {
            "type": "long"
          }
        }
      }
    }
  }
}

With this change, the following query is supported: source=demo1, demo2 | fields machine.os1, machine.os2
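
To make the intended result concrete, here is a minimal sketch of the merge semantics for demo1 and demo2. It is not the PR's implementation: mappings are modeled as plain nested maps (an assumed simplification), and MappingMergeSketch, deepMerge, and machine are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MappingMergeSketch {

  /**
   * Merges source into target. A nested Map stands for an object field and is
   * merged recursively; any other value stands for a leaf type, where the
   * later index wins.
   */
  @SuppressWarnings("unchecked")
  static void deepMerge(Map<String, Object> target, Map<String, Object> source) {
    for (Map.Entry<String, Object> entry : source.entrySet()) {
      String key = entry.getKey();
      Object incoming = entry.getValue();
      Object existing = target.get(key);
      if (incoming instanceof Map) {
        Map<String, Object> merged;
        if (existing instanceof Map) {
          merged = (Map<String, Object>) existing;
        } else {
          // Start from a fresh map so the source mapping is never mutated.
          merged = new LinkedHashMap<>();
          target.put(key, merged);
        }
        deepMerge(merged, (Map<String, Object>) incoming);
      } else {
        target.put(key, incoming);
      }
    }
  }

  /** Builds a mapping shaped like demo1/demo2: one "machine" object with two sub-fields. */
  static Map<String, Object> machine(String osField, String ramField) {
    Map<String, Object> properties = new LinkedHashMap<>();
    properties.put(osField, "text");
    properties.put(ramField, "long");
    Map<String, Object> mapping = new LinkedHashMap<>();
    mapping.put("machine", properties);
    return mapping;
  }

  public static void main(String[] args) {
    Map<String, Object> merged = new LinkedHashMap<>();
    deepMerge(merged, machine("os1", "ram1")); // demo1
    deepMerge(merged, machine("os2", "ram2")); // demo2
    // prints {machine={os1=text, ram1=long, os2=text, ram2=long}}
    System.out.println(merged);
  }
}
```

After the merge, machine exposes the sub-fields of both indices, which is what lets a single query reference machine.os1 and machine.os2 together.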

I also benchmarked the merge with different numbers of indices and nesting depths. The average time of one merge operation is reported below:

|          | indices=120 | indices=1000 |
|----------|-------------|--------------|
| depth=15 | 0.103ms     | 0.951ms      |
| depth=5  | 0.041ms     | 0.329ms      |

You can also run the benchmark yourself using MergeArrayAndObjectMapBenchmark with different arguments.

Related Issues

Resolves #3625

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

xinyual added 3 commits May 22, 2025 13:42
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Comment on lines +39 to +40
loadIndex(Index.MERGE_TEST_1);
loadIndex(Index.MERGE_TEST_2);
Collaborator


Comment on lines 194 to 201
private Boolean checkWhetherToMerge(OpenSearchDataType first, OpenSearchDataType second) {
if (first.getExprCoreType() == second.getExprCoreType()
&& (first.getExprCoreType() == ExprCoreType.STRUCT
|| first.getExprCoreType() == ExprCoreType.ARRAY)) {
return true;
}
return false;
}
Collaborator


Add a MergeRule abstraction.

  • For basic data types, the rule before this PR is Latest; after this PR, it is noChange.
  • For advanced types, the rule is DeepMerge.

Contributor Author


I added a MergeRule utility, but I am not sure whether it meets your requirement. For basic data types, I still keep the latest data type according to the order of the indices.

Collaborator


Your code looks good.

We may support other rule types later, for instance a WideningMergeRule, so my earlier thought was that each rule defines its own match and mergeInto methods:

  public static void merge(Map<String, OpenSearchDataType> target, Map<String, OpenSearchDataType> source) {
    for (Map.Entry<String, OpenSearchDataType> entry : source.entrySet()) {
      String key = entry.getKey();
      OpenSearchDataType sourceValue = entry.getValue();
      OpenSearchDataType targetValue = target.get(key);
      RuleSelectorChain.selectRule(sourceValue, targetValue).mergeInto(key, sourceValue, target);
    }
  }
  
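public class RuleSelectorChain {

  // Selectors are tried in order; the first one that yields a rule wins.
  private static final List<MergeRuleSelector> RULE_SELECTORS = List.of(
      new DeepMergeSelector(),
      new LatestSelector()
  );

  public static MergeRule selectRule(OpenSearchDataType source, OpenSearchDataType target) {
    if (target == null) {
      // Nothing to merge with yet, so the incoming type is taken as-is.
      return new LatestWinsRule();
    }

    return RULE_SELECTORS.stream()
        .map(selector -> selector.select(source, target))
        .filter(Optional::isPresent)
        .map(Optional::get)
        .findFirst()
        .orElseGet(LatestWinsRule::new);  // default is a rule, not a selector
  }
}

public class DeepMergeSelector implements MergeRuleSelector {
  @Override
  public Optional<MergeRule> select(OpenSearchDataType source, OpenSearchDataType target) {
    // Deep-merge only when both sides share the same STRUCT or ARRAY core type.
    boolean mergeable = source.getExprCoreType() == target.getExprCoreType()
        && (source.getExprCoreType() == ExprCoreType.STRUCT
            || source.getExprCoreType() == ExprCoreType.ARRAY);
    return mergeable ? Optional.of(new DeepMergeRule()) : Optional.empty();
  }
}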

public class DeepMergeRule implements MergeRule {
  @Override
  public void mergeInto(String key, OpenSearchDataType source, Map<String, OpenSearchDataType> target) {
    OpenSearchDataType existing = target.get(key);
    merge(existing.getProperties(), source.getProperties());
    target.put(key, existing);
  }
}
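
For readers who want to run the idea end to end, the following is a self-contained sketch of the selector-chain pattern described above. It does not use the plugin's classes: FieldType is a hypothetical stand-in for OpenSearchDataType, the type names "object" and "nested" stand in for the STRUCT and ARRAY core types, and all other names are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MergeRuleSketch {

  /** Hypothetical stand-in for OpenSearchDataType: a type name plus optional sub-fields. */
  static final class FieldType {
    final String type;
    final Map<String, FieldType> properties = new LinkedHashMap<>();

    FieldType(String type) {
      this.type = type;
    }

    FieldType with(String name, FieldType child) {
      properties.put(name, child);
      return this;
    }
  }

  interface MergeRule {
    void mergeInto(String key, FieldType source, Map<String, FieldType> target);
  }

  interface MergeRuleSelector {
    Optional<MergeRule> select(FieldType source, FieldType target);
  }

  // Latest wins: the incoming type replaces whatever was there.
  // Note: this shares the source instance; real code may need a defensive copy.
  static final MergeRule LATEST_WINS = (key, source, target) -> target.put(key, source);

  // Deep merge: keep the existing container and merge the sub-fields into it.
  static final MergeRule DEEP_MERGE =
      (key, source, target) -> merge(target.get(key).properties, source.properties);

  static Optional<MergeRule> deepMergeSelector(FieldType source, FieldType target) {
    boolean mergeable = target != null
        && source.type.equals(target.type)
        && (source.type.equals("object") || source.type.equals("nested"));
    return mergeable ? Optional.of(DEEP_MERGE) : Optional.empty();
  }

  static final List<MergeRuleSelector> SELECTORS =
      List.<MergeRuleSelector>of(MergeRuleSketch::deepMergeSelector);

  static MergeRule selectRule(FieldType source, FieldType target) {
    return SELECTORS.stream()
        .map(selector -> selector.select(source, target))
        .filter(Optional::isPresent)
        .map(Optional::get)
        .findFirst()
        .orElse(LATEST_WINS); // default when no selector matches
  }

  static void merge(Map<String, FieldType> target, Map<String, FieldType> source) {
    for (Map.Entry<String, FieldType> entry : source.entrySet()) {
      selectRule(entry.getValue(), target.get(entry.getKey()))
          .mergeInto(entry.getKey(), entry.getValue(), target);
    }
  }

  /** Builds a one-field index mapping shaped like demo1/demo2. */
  static Map<String, FieldType> index(String osField, String ramField) {
    Map<String, FieldType> mapping = new LinkedHashMap<>();
    mapping.put("machine", new FieldType("object")
        .with(osField, new FieldType("text"))
        .with(ramField, new FieldType("long")));
    return mapping;
  }

  public static void main(String[] args) {
    Map<String, FieldType> merged = new LinkedHashMap<>();
    merge(merged, index("os1", "ram1"));
    merge(merged, index("os2", "ram2"));
    // prints [os1, ram1, os2, ram2]
    System.out.println(merged.get("machine").properties.keySet());
  }
}
```

Under this design, adding a new rule such as the WideningMergeRule mentioned above would mean adding one more selector to the list, without touching the merge loop.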

Contributor Author


Cool, I will refactor the code along the lines of your suggestion.

Contributor Author


I have refactored the code and added an interface with two implementations. Please take a look.


if (target.containsKey(key) && checkWhetherToMerge(value, target.get(key))) {
OpenSearchDataType merged = target.get(key);
mergeObjectAndArrayInsideMap(merged.getProperties(), value.getProperties());
Collaborator


  1. Add a performance test for deep nesting, e.g., 10+ levels and 100/1000 indices. You can leverage the existing benchmarks in the repo.
  2. Based on the test results:
    a. consider a depth limit setting
    b. document merging limitations

Contributor Author


I have added a benchmark for it. I tried a depth of 15 with 120 indices. The result is:
Benchmark  Mode   Cnt     Score     Error  Units
testMerge  thrpt   25  9619.794 ± 393.331  ops/s
What is the minimum expected throughput for this merge operation?
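
As a quick sanity check on that number, throughput can be converted to an average time per operation by taking its reciprocal. The sketch below does only that arithmetic; ThroughputToLatency and millisPerOp are hypothetical names, and the calculation assumes single-threaded throughput, so it says nothing about latency under concurrent load.

```java
import java.util.Locale;

public class ThroughputToLatency {

  /** Converts operations per second into average milliseconds per operation. */
  static double millisPerOp(double opsPerSecond) {
    return 1000.0 / opsPerSecond;
  }

  public static void main(String[] args) {
    // 9619.794 ops/s is the score reported above for depth 15 and 120 indices.
    System.out.println(String.format(Locale.ROOT, "%.3f ms/op", millisPerOp(9619.794)));
  }
}
```

The result is roughly 0.104 ms per merge, which is in line with the 0.103ms figure listed in the PR description for the same depth and index count.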

Collaborator


Could you publish the test results in the PR description?
How long does it take to merge a depth of 15 across 120 indices? What about 1000 indices?

Contributor Author


Sure, I have added the results to the description. Let me know if you want more data.

Collaborator


I ran a load test with 1000 indices. The results show that when the number of concurrent requests increased to 64, latency increased to 12s.

Next steps:

  1. Could you double-check the load test results and update the PR description?
  2. Profile OpenSearch. If the major latency contributor is the getIndexMapping API, open an issue in the core repo.
  3. Update the PPL "Inconsistent Field Types across indices" section with the test results.

xinyual added 5 commits May 23, 2025 12:46
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>


xinyual added 2 commits May 28, 2025 11:34
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
public void testMerge() {
Map<String, OpenSearchDataType> finalResult = new HashMap<>();
for (Map<String, OpenSearchDataType> map : candidateMaps) {
OpenSearchDescribeIndexRequest.mergeObjectAndArrayInsideMap(finalResult, map);
Collaborator


mergeObjectAndArrayInsideMap does not exist.

penghuo previously approved these changes Jun 4, 2025
Collaborator

@penghuo penghuo left a comment


@xinyual Please create an issue to track the pressure test and support for merge limitations.

LantaoJin previously approved these changes Jun 6, 2025
@LantaoJin
Member

@xinyual could you resolve conflicts?

Signed-off-by: xinyual <xinyual@amazon.com>
@xinyual xinyual dismissed stale reviews from LantaoJin and penghuo via 2cc1b8a June 9, 2025 02:50
Signed-off-by: xinyual <xinyual@amazon.com>
@penghuo penghuo merged commit ed507d7 into opensearch-project:main Jun 9, 2025
22 checks passed
ahkcs pushed a commit to ahkcs/sql that referenced this pull request Jun 10, 2025
…index (opensearch-project#3653)

Signed-off-by: xinyual <xinyual@amazon.com>

Signed-off-by: Kai Huang <ahkcs@amazon.com>
(cherry picked from commit ed507d7)
penghuo pushed a commit that referenced this pull request Jun 16, 2025
…index (#3653)

* merge object/array

Signed-off-by: xinyual <xinyual@amazon.com>

* simplified code

Signed-off-by: xinyual <xinyual@amazon.com>

* apply spotless

Signed-off-by: xinyual <xinyual@amazon.com>

* fix IT by adding fields

Signed-off-by: xinyual <xinyual@amazon.com>

* revert to hashmap

Signed-off-by: xinyual <xinyual@amazon.com>

* filter one indices case

Signed-off-by: xinyual <xinyual@amazon.com>

* add ut and merge rules

Signed-off-by: xinyual <xinyual@amazon.com>

* add benchmark test

Signed-off-by: xinyual <xinyual@amazon.com>

* revert change

Signed-off-by: xinyual <xinyual@amazon.com>

* refactor merge rules

Signed-off-by: xinyual <xinyual@amazon.com>

* fix IT

Signed-off-by: xinyual <xinyual@amazon.com>

---------

Signed-off-by: xinyual <xinyual@amazon.com>
@anasalkouz
Member

Is this backported to 3.0 or 3.1?


Labels

enhancement New feature or request
