Add FlatObject FieldMapper #6507

Merged
29 commits merged on Apr 7, 2023

@mingshl
Contributor

mingshl commented Feb 28, 2023

Description

To fulfill #1018, we store the entire nested object as a String. A flat_object field creates exactly two internal Lucene StringFields ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field contains.

Background

The current field types available for JSON objects in OpenSearch are object, nested and join. However, these field types can create an individual indexable field for each component, resulting in a large number of fields. When dealing with complex JSON objects, mapping many individual fields can cause a mapping explosion, with heavy RAM usage and thrashing. Additionally, storing a large number of fields can take up valuable space, and migrating indexes from the system can be a difficult task.

To address this problem, we are introducing a new field type called flat_object. This field type flattens the object in the mapping, meaning that its subfields are not indexed as separate fields. Instead, the values and their paths are stored in two string fields, no matter how complex the JSON object may be. The values within the JSON object can still be accessed through the flat field using standard dot path notation in DSL and SQL, but the individual subfields are not indexed separately for faster lookup. This provides a more efficient way of handling complex JSON objects and should help to improve performance.
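As a rough illustration of what declaring such a field might look like, the sketch below builds a create-index request body as a plain Python dict. The type name `flat_object` follows the rename made in a later commit of this PR; the field name `catalog` is only illustrative, and no client call is made.

```python
import json

# Hypothetical create-index body: one field, "catalog", mapped as flat_object.
# The field name is illustrative; only the type name comes from this PR.
create_index_body = {
    "mappings": {
        "properties": {
            "catalog": {"type": "flat_object"}
        }
    }
}

# The JSON that would be sent as the body of a create-index request.
print(json.dumps(create_index_body, indent=2))
```

Note that the mapping holds a single entry for `catalog`, however many subfields the indexed documents contain.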

High Level Design

  • The FlatObjectFieldMapper stores the entire nested JSON object as a String. A flat_object field creates exactly two internal Lucene StringFields ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field has.
    catalog {
        ._value
        ._valueAndPath
    }

    • value: a keyword field that contains the leaf values of all subfields, allowing for efficient searching of leaf values without specifying the path (e.g., catalog = 'Mike').

    • valueAndPath: a keyword field that contains the path to each leaf value together with its value, enabling efficient searching when the query includes the path to the leaf (e.g., catalog.author.given = 'Mike').
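The two internal fields described above can be pictured with a small sketch. This is not the mapper's code: the function name `flatten` and the `path=value` string encoding are assumptions chosen for illustration, and the sample document is made up.

```python
from typing import Any, List, Tuple


def flatten(obj: Any, path: str) -> Tuple[List[str], List[str]]:
    """Collect leaf values and 'path=value' strings from a nested JSON-like object."""
    values: List[str] = []
    values_and_paths: List[str] = []
    if isinstance(obj, dict):
        for key, child in obj.items():
            v, vp = flatten(child, f"{path}.{key}")
            values += v
            values_and_paths += vp
    elif isinstance(obj, list):
        # Array elements share the path of the enclosing key.
        for child in obj:
            v, vp = flatten(child, path)
            values += v
            values_and_paths += vp
    else:
        # Leaf: record the bare value and the (assumed) path=value form.
        values.append(str(obj))
        values_and_paths.append(f"{path}={obj}")
    return values, values_and_paths


doc = {"title": "Guide", "author": {"given": "Mike", "surname": "Smith"}}
values, values_and_paths = flatten(doc, "catalog")
print(values)
print(values_and_paths)
```

In this sketch a search for `Mike` without a path would consult `values`, while a search on `catalog.author.given` would consult `values_and_paths`.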

Supported Queries:

Term query
Terms query
Terms_set query
Prefix query
Range query
Match query
Multi_match query
Query_string query
Simple_query_string query
Exists query
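For a sense of how a few of these queries might be written against a flat field, here is a minimal sketch of request bodies as Python dicts. The field name `catalog` and the values reuse the examples from the description; the bodies use the standard short forms of the term and prefix queries, and nothing is sent to a cluster.

```python
# Term query on the field name alone: looks for the leaf value anywhere in the object.
query_by_value = {"query": {"term": {"catalog": "Mike"}}}

# Term query using the dot path to one specific leaf.
query_by_path = {"query": {"term": {"catalog.author.given": "Mike"}}}

# Prefix query through a dot path.
query_by_prefix = {"query": {"prefix": {"catalog.author.given": "Mi"}}}

for body in (query_by_value, query_by_path, query_by_prefix):
    print(body)
```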

Performance Evaluation:

FlatObject can upload documents with more than 1000 fields, whereas dynamic mapping cannot go above 1000 fields. FlatObject supports searching values without a dot path (searching with the global field name), whereas dynamic mapping requires searching with the exact dot path.
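To make the field-count contrast concrete, the sketch below counts distinct leaf paths in a made-up document. It is a simplification: it treats each distinct leaf path as one mapped field under dynamic mapping and uses the 1000 figure quoted above as a plain constant; the helper name `leaf_paths` is hypothetical.

```python
# Figure quoted in the description, treated here as a plain constant.
FIELD_LIMIT = 1000


def leaf_paths(obj, prefix=""):
    """Return the set of distinct dot paths that lead to leaf values."""
    if isinstance(obj, dict):
        children = [(f"{prefix}.{k}" if prefix else k, v) for k, v in obj.items()]
    elif isinstance(obj, list):
        # Array elements share the path of the enclosing key.
        children = [(prefix, v) for v in obj]
    else:
        return {prefix}
    paths = set()
    for child_prefix, child in children:
        paths |= leaf_paths(child, child_prefix)
    return paths


# Hypothetical document with 1001 distinct keys under one object.
doc = {"catalog": {f"attr_{i}": i for i in range(1001)}}

dynamic_fields = len(leaf_paths(doc))  # one mapped field per distinct leaf path
flat_object_fields = len(doc)          # only the top-level "catalog" field is mapped
print(dynamic_fields > FIELD_LIMIT, flat_object_fields)
```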

Over 100 test runs, flatObject takes 15% more time to create an index and is 6% to 9% slower when searching with a dot path, depending on the complexity of the nested JSON. However, flatObject takes 6% less time to upload documents.

| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| FlatObjectMappingBenchmark.CreateDynamicIndex | avgt | 100 | 222.730 | ± 11.891 | ms/op |
| FlatObjectMappingBenchmark.CreateFlatObjectIndex | avgt | 100 | 256.178 | ± 9.294 | ms/op |
| FlatObjectMappingBenchmark.indexDynamicMapping | avgt | 100 | 337.405 | ± 15.383 | ms/op |
| FlatObjectMappingBenchmark.indexFlatObjectMapping | avgt | 100 | 316.860 | ± 17.534 | ms/op |
| FlatObjectMappingBenchmark.searchDynamicMappingWithOneHundredNestedJSON | avgt | 100 | 281.482 | ± 11.176 | ms/op |
| FlatObjectMappingBenchmark.searchFlatObjectMappingInValueWithOneHundredNestedJSON | avgt | 100 | 308.626 | ± 11.710 | ms/op |
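The percentages quoted above can be checked against the benchmark scores with a few lines of arithmetic. The sketch copies the average scores from the benchmark output and ignores the reported error margins; the dictionary keys are labels chosen here, not benchmark names.

```python
# Average times in ms/op, copied from the benchmark output above.
scores = {
    "create_index": {"dynamic": 222.730, "flat_object": 256.178},
    "index_docs": {"dynamic": 337.405, "flat_object": 316.860},
    "search": {"dynamic": 281.482, "flat_object": 308.626},
}


def relative_change(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


changes = {
    name: relative_change(s["dynamic"], s["flat_object"])
    for name, s in scores.items()
}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

This gives roughly +15.0% for index creation, -6.1% for indexing documents, and +9.6% for the one search case listed.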

Limitation and Future Development:

  • Enable searching in Painless scripts; this will require directing the fielddata builder to fetch doc values from the two string fields in memory.
  • Open up parameter settings, such as normalizer, docValues, ignoreAbove, nullValue, similarity, and depthlimit.
  • Enable wildcard queries.

Issues Resolved

#1018

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=pit/10_basic/Delete all}

@mingshl
Contributor Author

mingshl commented Apr 6, 2023

@reta This FlatObjectFieldMapper uses Lucene.KEYWORD_ANALYZER by default. If you look at the methods that initialize the FlatObjectFieldType here and here, the rest of the public FlatObjectFieldType are used in different test files, and they are useful.

If you look at the KeywordFieldMapper, it has a similar design; the only difference is that it extends ParametrizedFieldMapper, so it accepts open parameters for the normalizer, but it also uses the same Lucene.KEYWORD_ANALYZER.

So it shouldn't be a surprise, because this field mapper only uses KEYWORD_ANALYZER, which only does exact matching and does not allow tokenizers.
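The exact-match behavior mentioned here can be pictured with a toy model. The functions below are stand-ins written for illustration, not Lucene code: a keyword-style analyzer is modeled as emitting the whole input as one token, and a tokenizing analyzer as splitting on whitespace.

```python
def keyword_tokens(text):
    """Toy stand-in for a keyword analyzer: the whole input is one token."""
    return [text]


def whitespace_tokens(text):
    """Toy stand-in for a tokenizing analyzer: split on whitespace."""
    return text.split()


def term_matches(term, tokens):
    """A term matches only if it equals one of the indexed tokens."""
    return term in tokens


stored = "Mike Smith"
print(term_matches("Mike", keyword_tokens(stored)))        # False: not the full value
print(term_matches("Mike Smith", keyword_tokens(stored)))  # True: exact match
print(term_matches("Mike", whitespace_tokens(stored)))     # True: tokenized match
```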

I hope this helps.

@reta
Collaborator

reta commented Apr 6, 2023

@mingshl I am confused now

So it shouldn't be a surprise, because this field mapper only uses KEYWORD_ANALYZER, which only does exact matching and does not allow tokenizers.

That's fine for now - no questions, but you said that previously

In FlatObject FieldMapper, we are intended to open parameters in the future,

So would these parameters include the possibility to specify a normalizer? If yes, reindexing will be required.

@mingshl
Contributor Author

mingshl commented Apr 6, 2023

@reta Yes, the first PR uses default parameters and Lucene.KEYWORD_ANALYZER. We don't have any open parameters; that is to say, normalizers cannot be used in this first release. The same goes for other parameters, such as docValue, ignoreAbove, and nullValue. And yes, changing any of these parameters will require users to re-index; that's why the open parameters need a wider discussion, and it would be better to add them all together in a follow-up PR.

@reta
Collaborator

reta commented Apr 6, 2023

@reta Yes, the first PR uses default parameters and Lucene.KEYWORD_ANALYZER. We don't have any open parameters; that is to say, normalizers cannot be used in this first release. The same goes for other parameters, such as docValue, ignoreAbove, and nullValue. And yes, changing any of these parameters will require users to re-index; that's why the open parameters need a wider discussion, and it would be better to add them all together in a follow-up PR.

All set, I just wanted to highlight that

@mingshl
Contributor Author

mingshl commented Apr 6, 2023

@reta Yes, the first PR uses default parameters and Lucene.KEYWORD_ANALYZER. We don't have any open parameters; that is to say, normalizers cannot be used in this first release. The same goes for other parameters, such as docValue, ignoreAbove, and nullValue. And yes, changing any of these parameters will require users to re-index; that's why the open parameters need a wider discussion, and it would be better to add them all together in a follow-up PR.

All set, I just wanted to highlight that

Thank you!!! I also added "normalizers" as one of the examples of open parameters in the PR description.

@mingshl mingshl mentioned this pull request Apr 6, 2023
lukas-vlcek and others added 2 commits April 6, 2023 22:22
Mapping parameters are not allowed in the initial version. This commit adds a test to demonstrate that trying to specify index/search analyzer for the flat_object field will fail.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
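As a sketch of the kind of mapping that test targets, the dict below adds an analyzer to a flat_object field. The field name `catalog`, the choice of the `standard` analyzer, and the `ALLOWED_KEYS` set are assumptions for illustration; the real validation happens in the server, and its error message is not reproduced here.

```python
# Assumption: in the initial version only "type" is accepted for a flat_object field.
ALLOWED_KEYS = {"type"}

# Hypothetical mapping that tries to set an analyzer on a flat_object field.
mapping_with_analyzer = {
    "mappings": {
        "properties": {
            "catalog": {"type": "flat_object", "analyzer": "standard"}
        }
    }
}

field = mapping_with_analyzer["mappings"]["properties"]["catalog"]
unsupported = sorted(set(field) - ALLOWED_KEYS)
print(unsupported)  # parameters expected to be rejected per the commit message
```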
@github-actions
Contributor

github-actions bot commented Apr 6, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      2 org.opensearch.cluster.service.MasterServiceTests.classMethod
      2 org.opensearch.cluster.service.MasterServiceTests.classMethod
      2 org.opensearch.cluster.service.MasterServiceTests.classMethod
      1 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes
      1 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes
      1 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes

@github-actions
Contributor

github-actions bot commented Apr 6, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationAllocationIT.testSingleIndexShardAllocation

@reta
Collaborator

reta commented Apr 6, 2023

LGTM @mingshl , I have nothing to add (taking into account the limitations and future work), thank you!

@dblock dblock merged commit 75bb3ef into opensearch-project:main Apr 7, 2023
@dblock
Member

dblock commented Apr 7, 2023

I merged it, great work @mingshl!

@dblock dblock added the backport 2.x Backport to 2.x branch label Apr 7, 2023
@dblock
Member

dblock commented Apr 7, 2023

@mingshl What's next for this feature? Do you have a list of GH issues/improvements as next steps written up?

opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 7, 2023
* Add FlatObject FieldMapper

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* resolve import package for HttpHost

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Dynamic Create FlatObjectFieldType for dotpath field

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Rename flat-object to flat_object and fix CI tests

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Organized package

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* resolved compile error

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* organize package

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Add integration tests and remove benchmark

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* fix IT tests

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Skip IT tests before 2.7.0

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Revert "Skip IT tests before 2.7.0"

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Add more IT tests for supported queries

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Removed license head and add tests for wildcard query

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Add tests for array, nested-array, number and float

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Upgrade FlatObjectFieldMapperTests to  MapperTestCase

- Upgrading `FlatObjectFieldMapperTests` from `MapperServiceTestCase` to `MapperTestCase`. The `MapperTestCase` explicitly forces us to:
	- Test parameter updates (empty now)
	- Explicitly specify if the field supports Meta and Boost (if not, relevant tests are automatically skipped)
- Test also the substring fields
- Add new test `testMapperServiceHasParser` to verify the new `flat_object` field mapper is present in mapper service registry
- Remove duplicated test and assertions methods
- Removed `testExistsQueryDocValuesDisabledWithNorms` as this test was not adding much. We shall reintroduce similar test later if we decide that we want to support ExistsQuery and what to do if DocValue are disabled and Norms enabled.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>

* Add exist query in FlatObjectFieldMapperTests and FlatObjectFieldDataTests

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Add IT tests for painless query in testDocValues

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Add filter script (Painless) test for flat_object

While it is not possible to use flat_object field in scripting filter context to access doc values (like `doc[<flat_object>.<field_x>]`) it is possible to call `doc[<flat_object>].size()` to get number of fields inside the flat_object field.

- Reorganize flat_object yaml tests into sections:
  - Mappings
  - Supported
  - Unsupported
- Scripting (Painless) yamlRest tests need to go into lang-painless module

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>

* Removed Normalizer

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* removed unused codes

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* Remove non-relevant Javadoc from DynamicKeyFieldMapper

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>

* Improve flat_object scripting test

Make it more obvious what the `doc[<flat_field>].size()` number represents.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>

* Add test for mapping parameters

Mapping parameters are not allowed in the initial version. This commit adds a test to demonstrate that trying to specify index/search analyzer for the flat_object field will fail.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>

* remove IndexAnalyzer from Builder

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* remove IndexAnalyzer from Builder

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

---------

Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Co-authored-by: Lukáš Vlček <lukas.vlcek@aiven.io>
(cherry picked from commit 75bb3ef)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@tlfeng tlfeng added Search Search query, autocomplete ...etc v2.7.0 labels Apr 7, 2023
tlfeng pushed a commit that referenced this pull request Apr 10, 2023
To fulfill issue #1018, we store the entire nested object as a String. A `flat_object` field creates exactly two internal Lucene [StringField](https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/document/StringField.html)s ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field contains.

- value: a keyword field that contains the leaf values of all subfields, allowing for efficient searching of leaf values without specifying the path (e.g., catalog = 'Mike').
- valueAndPath: a keyword field that contains the path to each leaf value together with its value, enabling efficient searching when the query includes the path to the leaf (e.g., catalog.author.given = 'Mike').

Limitation and Future Development:
- Enable searching in Painless scripts; this will require directing the fielddata builder to fetch doc values from the two string fields in memory.
- Open up parameter settings, such as normalizer, docValues, ignoreAbove, nullValue, similarity, and depthlimit.
- Enable wildcard queries.

(cherry picked from commit 75bb3ef)

Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Co-authored-by: Lukáš Vlček <lukas.vlcek@aiven.io>
@mingshl
Contributor Author

mingshl commented Apr 13, 2023

@mingshl What's next for this feature? Do you have a list of GH issues/improvements as next steps written up?

Created new issues for future enhancement:
#7138
#7137
#7136

@macohen
Contributor

macohen commented Apr 14, 2023

@mingshl awesome! Can you label those three issues "Search" and then close this if you think there's no more work to be done here, please?

austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Apr 28, 2023
Labels
backport 2.x Backport to 2.x branch Search Search query, autocomplete ...etc v2.7.0