Add FlatObject FieldMapper #6507
Gradle Check (Jenkins) Run Completed with:
|
@reta The FlatObjectFieldMapper uses Lucene.KEYWORD_ANALYZER by default. If you look at the methods that instantiate the FlatObjectFieldType, here and here, the rest of the public FlatObjectFieldType code is used in different test files and is useful. The KeywordFieldMapper has a similar design; the only difference is that it extends ParametrizedFieldMapper, so it accepts open parameters such as a normalizer, but it also uses the same Lucene.KEYWORD_ANALYZER. So it shouldn't be a surprise: this field mapper only uses KEYWORD_ANALYZER, which only does exact matching and does not allow tokenizers. I hope this helps.
@mingshl I am confused now
That's fine for now - no questions, but you said that previously
So would these parameters include the possibility to specify a normalizer? If yes, reindexing will be required.
@reta Yes, the first PR uses default parameters and Lucene.KEYWORD_ANALYZER. We don't have any open parameters; that is to say, normalizers cannot be used in this first release, and the same goes for other parameters such as docValue, ignoreAbove, and nullValue. And yes, changing any one of these parameters will require users to re-index. That's why the open parameters need a wider discussion, and we had better add them all together in a follow-up PR.
All set, I just wanted to highlight that
Thank you!!! I also added "normalizers" as one example of open parameters in the PR description.
server/src/main/java/org/opensearch/index/mapper/FlatObjectFieldMapper.java
Mapping parameters are not allowed in the initial version. This commit adds a test to demonstrate that trying to specify index/search analyzer for the flat_object field will fail. Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
rest-api-spec/src/main/resources/rest-api-spec/test/index/90_flat_object.yml
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
LGTM @mingshl, I have nothing to add (taking into account the limitations and future work), thank you!
I merged it, great work @mingshl!
@mingshl What's next for this feature? Do you have a list of GH issues/improvements as next steps written up?
* Add FlatObject FieldMapper
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* resolve import package for HttpHost
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Dynamic Create FlatObjectFieldType for dotpath field
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Rename flat-object to flat_object and fix CI tests
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Organized package
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* resolved compile error
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* organize package
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Add integration tests and remove benchmark
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* fix IT tests
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Skip IT tests before 2.7.0
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Revert "Skip IT tests before 2.7.0"
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Add more IT tests for supported queries
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Removed license head and add tests for wildcard query
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Add tests for array, nested-arrary, number and float
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Upgrade FlatObjectFieldMapperTests to MapperTestCase
  - Upgrading `FlatObjectFieldMapperTests` from `MapperServiceTestCase` to `MapperTestCase`. The `MapperTestCase` explicitly forces us to:
    - Test parameter updates (empty now)
    - Explicitly specify if the field supports Meta and Boost (if not, relevant tests are automatically skipped)
    - Test also the substring fields
  - Add new test `testMapperServiceHasParser` to verify the new `flat_object` field mapper is present in mapper service registry
  - Remove duplicated test and assertions methods
  - Removed `testExistsQueryDocValuesDisabledWithNorms` as this test was not adding much. We shall reintroduce similar test later if we decide that we want to support ExistsQuery and what to do if DocValue are disabled and Norms enabled.
  Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
* Add exist query in FlatObjectFieldMapperTests and FlatObjectFieldDataTests
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Add IT tests for painless query in testDocValues
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Add filter script (Painless) test for flat_object
  While it is not possible to use flat_object field in scripting filter context to access doc values (like `doc[<flat_object>.<field_x>]`) it is possible to call `doc[<flat_object>].size()` to get number of fields inside the flat_object field.
  - Reorganize flat_object yaml tests into sections:
    - Mappings
    - Supported
    - Unsupported
  - Scripting (Painless) yamlRest tests need to go into lang-painless module
  Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
* Removed Normalizer
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* removed unused codes
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Remove non-relevant Javadoc from DynamicKeyFieldMapper
  Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
* Improve flat_object scripting test
  Make it more obvious what the `doc[<flat_field>].size()` number represents.
  Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
* Add test for mapping parameters
  Mapping parameters are not allowed in the initial version. This commit adds a test to demonstrate that trying to specify index/search analyzer for the flat_object field will fail.
  Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
* remove IndexAnalyzer from Builder
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* remove IndexAnalyzer from Builder
  Signed-off-by: Mingshi Liu <mingshl@amazon.com>
---------
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Co-authored-by: Lukáš Vlček <lukas.vlcek@aiven.io>
(cherry picked from commit 75bb3ef)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
To fulfill issue #1018, we implement the approach of storing the entire nested object as a String. A `flat_object` creates exactly two internal Lucene [StringField](https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/document/StringField.html) fields ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field contains.
- value: a keyword field that contains the leaf values of all subfields, allowing for efficient searching of leaf values without specifying the path (e.g. catalog = 'Mike').
- valueAndPath: a keyword field that contains the path to each leaf value together with its value, enabling efficient searching when the query includes the path to the leaf (e.g. catalog.author.given = 'Mike').
Limitation and Future Development:
- enable searching in PainlessScript; we will need to direct the fielddatabuilder to fetch docvalues within the two stringfields in memory
- open parameters setting, such as normalizer, docValues, ignoreAbove, nullValue, similarity, and depthlimit
- enable wildcard query
(cherry picked from commit 75bb3ef)
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Co-authored-by: Lukáš Vlček <lukas.vlcek@aiven.io>
@mingshl awesome! Can you label those three issues "Search" and then close this if you think there's no more work to be done here please?
Description
To fulfill #1018, we implement the approach of storing the entire nested object as a String. A flat_object creates exactly two internal Lucene StringField fields ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field contains.
Background
The current object field types available for JSON objects in OpenSearch are object, nested, and join. However, these field types can create individual indexable fields for each component, resulting in a large number of fields. When dealing with complex JSON objects, mapping many individual fields can lead to heavy RAM usage and thrashing, ultimately causing a mapping explosion. Additionally, storing a large number of fields can take up valuable space, and migrating indexes from the system can be a difficult task.
To address this problem, we are introducing a new field type called flat_object. This new field type flattens the object in the mapping, meaning that its components are not indexed as individual fields. Instead, the values and their paths are stored in two string fields, no matter how complex the JSON object may be. The values within the JSON object can still be accessed in the flat field using standard dot path notation in DSL and SQL, but they are not indexed as separate fields for faster lookup. This provides a more efficient way of handling complex JSON objects and ultimately helps to improve performance.
High Level Design
The FlatObjectFieldMapper stores the entire nested JSON object as a String. A flat_object field creates exactly two internal Lucene StringField fields ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field has.
catalog { ._value :
._valueAndPath }
value: a keyword field that contains the leaf values of all subfields, allowing for efficient searching of leaf values without specifying the path (e.g. catalog = 'Mike').
valueAndPath: a keyword field that contains the path to each leaf value together with its value, enabling efficient searching when the query includes the path to the leaf (e.g. catalog.author.given = 'Mike').
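The two-field layout described above can be illustrated with a minimal sketch. This is not the FlatObjectFieldMapper implementation: the helper names (`iter_leaves`, `flatten`), the sample document, and the `path=value` term format are all assumptions made for illustration only.

```python
from typing import Any, Iterator, List, Tuple


def iter_leaves(node: Any, path: str) -> Iterator[Tuple[str, str]]:
    """Yield (dot_path, value_as_string) for every leaf under `node`."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaves(child, f"{path}.{key}")
    elif isinstance(node, list):
        # Assumption: array elements share the path of the array itself.
        for child in node:
            yield from iter_leaves(child, path)
    else:
        yield path, str(node)


def flatten(field_name: str, obj: dict) -> Tuple[List[str], List[str]]:
    """Return the terms for the two sub-fields of a flat field.

    The "path=value" encoding is a hypothetical format, not taken from the PR.
    """
    values: List[str] = []
    values_and_paths: List[str] = []
    for path, value in iter_leaves(obj, field_name):
        values.append(value)
        values_and_paths.append(f"{path}={value}")
    return values, values_and_paths


# Hypothetical document stored under a flat field named "catalog".
catalog = {"author": {"given": "Mike", "surname": "Doe"}, "tags": ["search", "lucene"]}
values, values_and_paths = flatten("catalog", catalog)

print(values)            # terms for "._value"
print(values_and_paths)  # terms for "._valueAndPath"
```

However many leaves the object has, only two term lists are produced, which mirrors the claim that the number of Lucene fields stays at two.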
Supported Query:
Term query
Terms query
Terms set query
Prefix query
Range query
Match query
Multi_match query
Query_string query
Simple_query_string query
Exists query
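As a rough sketch of how two of the supported queries might be phrased, the snippet below builds request bodies as plain Python dictionaries. The field name `catalog` and the value 'Mike' come from the examples above; the sample document and the variable names are hypothetical, and no client library or cluster is involved.

```python
import json

# Mapping that declares "catalog" as a flat_object field, with no extra
# mapping parameters (they are not allowed in the initial version).
index_body = {"mappings": {"properties": {"catalog": {"type": "flat_object"}}}}

# Hypothetical document to index.
document = {"catalog": {"author": {"given": "Mike"}}}

# Term query that includes the dot path to the leaf.
query_by_path = {"query": {"term": {"catalog.author.given": "Mike"}}}

# Term query on the flat field itself, matching a leaf value without its path.
query_by_value = {"query": {"term": {"catalog": "Mike"}}}

for name, body in [
    ("create index", index_body),
    ("index document", document),
    ("search by path", query_by_path),
    ("search by value", query_by_value),
]:
    print(f"{name}: {json.dumps(body)}")
```

Both queries use exact matching, consistent with the keyword analyzer discussed in the conversation.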
Performance Evaluation:
FlatObject can upload documents with more than 1000 fields, while dynamic mapping cannot go above 1000 fields. FlatObject supports searching values without a dot path (searching with the global field name), while dynamic mapping requires searching with the exact dot path.
Over 100 runs, flatObject takes 15% more time to create an index and is 6% to 9% slower when searching with a dot path, depending on the complexity of the nested JSON. However, flatObject takes 6% less time to upload documents.
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| FlatObjectMappingBenchmark.CreateDynamicIndex | avgt | 100 | 222.730 | ± 11.891 | ms/op |
| FlatObjectMappingBenchmark.CreateFlatObjectIndex | avgt | 100 | 256.178 | ± 9.294 | ms/op |
| FlatObjectMappingBenchmark.indexDynamicMapping | avgt | 100 | 337.405 | ± 15.383 | ms/op |
| FlatObjectMappingBenchmark.indexFlatObjectMapping | avgt | 100 | 316.860 | ± 17.534 | ms/op |
| FlatObjectMappingBenchmark.searchDynamicMappingWithOneHundredNestedJSON | avgt | 100 | 281.482 | ± 11.176 | ms/op |
| FlatObjectMappingBenchmark.searchFlatObjectMappingInValueWithOneHundredNestedJSON | avgt | 100 | 308.626 | ± 11.710 | ms/op |
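The percentages quoted above can be checked against the reported scores with a short calculation. The scores are copied from the benchmark output; the dictionary keys and the helper name `relative_change_pct` are just labels chosen for this sketch.

```python
# (dynamic mapping score, flat_object score) in ms/op, from the benchmark output.
scores_ms = {
    "create index": (222.730, 256.178),
    "index documents": (337.405, 316.860),
    "search, 100 nested JSON": (281.482, 308.626),
}


def relative_change_pct(dynamic: float, flat: float) -> float:
    """Percentage change of the flat_object score relative to dynamic mapping."""
    return (flat - dynamic) / dynamic * 100.0


changes = {
    name: round(relative_change_pct(dynamic, flat), 1)
    for name, (dynamic, flat) in scores_ms.items()
}

for name, pct in changes.items():
    direction = "slower" if pct > 0 else "faster"
    print(f"{name}: {abs(pct)}% {direction} with flat_object")
```

Positive values mean flat_object took longer. Note that the reported error intervals for the two document-indexing benchmarks overlap, so that difference should be read with some caution.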
Limitation and Future Development:
Issues Resolved
#1018
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.