Add support for 'flattened object' fields. #42541

Merged
merged 33 commits into from
Jun 28, 2019

@jtibshirani jtibshirani commented May 24, 2019

This PR merges the object-fields feature branch. All commits on the branch have been individually code reviewed as part of earlier PRs.

Before merging, there are a few open issues to resolve:

  • The field type is currently marked 'experimental'. I've started an internal discussion to see if we can remove this tag, since the feature is useful in its current form and we don't expect huge changes in its API.
  • There is some chance we want to revise the type name again -- I reopened an internal issue to ask for feedback.
  • I will push a commit with some tweaks to the documentation (I've gained more insight on the field type from performance profiling and discussions with the Kibana team).

Original issue: #25312
Meta-issue tracking design + implementation: #33003

@jtibshirani jtibshirani added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 v7.3.0 labels May 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@jpountz jpountz left a comment

I really like how queries and aggregations work as if fields had been mapped on their own. However, this is not the case for stored fields, which makes me wonder whether we should leave it unsupported for now.

@@ -82,6 +84,8 @@ include::types/date.asciidoc[]

include::types/date_nanos.asciidoc[]

include::types/embedded-json.asciidoc[]

Contributor

nit: could we move it next to the object and keyword fields that it relates to?

Contributor Author

I think these includes are just alphabetized. For the actual links to individual field types, I put it under "Specialised datatypes" to encourage users to think through whether it's appropriate for their data.

- Only one field mapping is created for the whole object, which can help
prevent a <<mapping-limit-settings, mappings explosion>> due to a large
number of field mappings.
- An embedded JSON field may take up less space in the index, as only one underlying
Contributor

Should we skip this one as this is not what your tests suggested? #33003 (comment)

Contributor Author

@jtibshirani jtibshirani May 29, 2019

I agree -- I actually have a TODO to rework this docs page, I will ping you for another look when that's done.

keywords. When sorting, this implies that values are compared lexicographically.

Finally, because of the way leaf values are stored in the index, the null
character `\0` is not allowed to appear in the keys of the JSON object.
Contributor

This comment might be a bit misleading since the null character is not allowed anyway?

Contributor Author

Thanks, I had forgotten these weren't allowed by default.

==== Stored fields

If the <<mapping-store,`store`>> option is enabled, the entire JSON object will
be stored in pretty-printed format. It can be retrieved through the top-level
Contributor

Why do we pretty-print?

for (int i = 0; i < field.length(); ++i) {
if (field.charAt(i) == '.') {
numDots++;
}
Contributor

nit: maybe use String#indexOf(String) which is an intrinsic and might make this a bit faster

Contributor Author

👍
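
For reference, the `String#indexOf` suggestion above could look roughly like the sketch below. The names `DotCounter` and `countDots` are hypothetical and this is not the code that landed in `FieldTypeLookup#fieldDepth`; it also uses the `indexOf(int, int)` overload rather than `indexOf(String)`.

```java
final class DotCounter {
    /**
     * Counts '.' separators by jumping between matches with String#indexOf
     * instead of inspecting every character with charAt.
     */
    static int countDots(String field) {
        int numDots = 0;
        // indexOf returns -1 once no further '.' exists at or after the start index.
        int dotIndex = field.indexOf('.');
        while (dotIndex >= 0) {
            numDots++;
            dotIndex = field.indexOf('.', dotIndex + 1);
        }
        return numDots;
    }
}
```

Whether this is measurably faster than the `charAt` loop would need to be benchmarked; the sketch only shows the shape of the change.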

@@ -36,15 +38,24 @@
final CopyOnWriteHashMap<String, MappedFieldType> fullNameToFieldType;
private final CopyOnWriteHashMap<String, String> aliasToConcreteName;

private final CopyOnWriteHashMap<String, JsonFieldMapper> fullNameToJsonMapper;
Contributor

nit, it'd be slightly cleaner to me if we referred to the field type rather than mapper here, since the type is supposed to be about read logic while the mapper is about write logic

Contributor Author

@jtibshirani jtibshirani May 29, 2019

I had actually tried this, but found it cleaner to use JsonFieldMapper here compared to the other option, where we use RootJsonFieldType and create the KeyedJsonFieldType objects using it. I found it nice that JsonFieldMapper contained consistent pairs of methods (fieldType() and keyedFieldType(), name() and keyedFieldName()). To me the mapper is acting in its role as 'field type provider'.


public static final String CONTENT_TYPE = "embedded_json";
public static final NamedAnalyzer WHITESPACE_ANALYZER = new NamedAnalyzer(
"whitespace", AnalyzerScope.INDEX, new WhitespaceAnalyzer());
Contributor

having an analyzer shared across indices with the INDEX scope feels wrong

Contributor Author

That makes sense -- I'll move the construction of this analyzer to JsonFieldMapper.Builder#build to match what we do in KeywordFieldMapper.

jtibshirani commented May 29, 2019

> However, this is not the case for stored fields, which makes me wonder whether we should leave it unsupported for now.

Looking at this again, I also find the behavior around stored fields to be a bit unintuitive:

  • Only the root field is stored; it is not possible to load the keyed fields through stored_fields.
  • We store the whole JSON input, which differs from what is indexed and stored in docvalues.
  • We don't preserve the original formatting because we parse the JSON block, then reconstruct it to create the stored field.

I guess the main use case would be if a user wanted to retrieve the field in a search, but using source filtering is too expensive. I would be okay leaving it unsupported for now because the design is not that clean. @colings86 and @romseygeek checking since you reviewed this earlier -- what do you think about removing support for stored fields?

colings86 commented May 29, 2019 via email

* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
* Add tests for the supported query types.
* Disallow unbounded range queries on keyed JSON fields.
* Make sure MappedFieldType#hasDocValues always returns false.
* Add documentation for JSON fields.

We now track the maximum depth of any JSON field, which allows the JSON field
lookup to be short-circuited as soon as that depth is reached. This helps
prevent slow lookups when the user is searching over a very deep field that is
not in the mappings.
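
A rough sketch of the short-circuit described above, assuming the lookup walks the dot-separated prefixes of the queried field name from shortest to longest. `KeyedFieldLookupSketch`, `register`, and `resolve` are hypothetical names and do not mirror the actual `FieldTypeLookup` code.

```java
import java.util.HashSet;
import java.util.Set;

final class KeyedFieldLookupSketch {
    // Names of registered JSON root fields (hypothetical stand-in for the real mappings).
    private final Set<String> rootFields = new HashSet<>();
    // Maximum depth (number of path segments) of any registered root field.
    private int maxRootDepth = 0;

    void register(String rootField) {
        rootFields.add(rootField);
        maxRootDepth = Math.max(maxRootDepth, depth(rootField));
    }

    private static int depth(String field) {
        int depth = 1;
        for (int dot = field.indexOf('.'); dot >= 0; dot = field.indexOf('.', dot + 1)) {
            depth++;
        }
        return depth;
    }

    /** Returns {rootField, key} if some prefix of the name is a root field, else null. */
    String[] resolve(String field) {
        int dot = -1;
        // Prefixes deeper than maxRootDepth cannot match, so the loop stops there
        // no matter how many segments the queried name has.
        for (int prefixDepth = 1; prefixDepth <= maxRootDepth; prefixDepth++) {
            dot = field.indexOf('.', dot + 1);
            if (dot < 0) {
                return null; // no further prefixes to try
            }
            String prefix = field.substring(0, dot);
            if (rootFields.contains(prefix)) {
                return new String[] { prefix, field.substring(dot + 1) };
            }
        }
        return null;
    }
}
```

With only `labels` registered, resolving `a.b.c.d.e.f` checks a single prefix (`a`) before giving up, rather than one prefix per segment.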

When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type `KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
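
To make the `key\0value` scheme concrete, here is a minimal sketch that uses plain Java collections in place of Lucene doc values. `KeyedValuesSketch`, `flatten`, and `valuesForKey` are hypothetical names, and joining nested keys with dots is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class KeyedValuesSketch {
    static final char SEPARATOR = '\0';

    /** Walks a nested object and records, per leaf, the raw value and a key\0value token. */
    @SuppressWarnings("unchecked")
    static void flatten(String path, Map<String, Object> object,
                        TreeSet<String> rootValues, TreeSet<String> keyedValues) {
        for (Map.Entry<String, Object> entry : object.entrySet()) {
            String key = path.isEmpty() ? entry.getKey() : path + "." + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                flatten(key, (Map<String, Object>) value, rootValues, keyedValues);
            } else {
                rootValues.add(String.valueOf(value));     // visible through the root field
                keyedValues.add(key + SEPARATOR + value);  // visible through the keyed field
            }
        }
    }

    /** Returns the values stored under one key by filtering on the key\0 prefix. */
    static List<String> valuesForKey(TreeSet<String> keyedValues, String key) {
        String prefix = key + SEPARATOR;
        List<String> result = new ArrayList<>();
        // Tokens sharing a prefix are contiguous in sorted order, so start at the
        // prefix and stop at the first token that no longer matches.
        for (String token : keyedValues.tailSet(prefix)) {
            if (!token.startsWith(prefix)) {
                break;
            }
            result.add(token.substring(prefix.length()));
        }
        return result;
    }
}
```

Because values come back in sorted string order, sorting and `terms`-style bucketing over a single key fall out of the same prefix scan.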

One concern around the name `json` is that because the entire document is JSON,
new users may see this field and think that they should always use it. We
thought that a more verbose name like `embedded_json` would help convey that the
field type has a special, non-obvious purpose.

This commit updates documentation references to `embedded_json`, but leaves the
`JsonField` naming in the code to avoid very long class names.

This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the
range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and
sorting options to be supported.
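
The ordinal shift can be illustrated with a sorted array standing in for the segment's term dictionary. This is a sketch under that assumption; `KeyedOrdinalsSketch` and its methods are hypothetical and are not the `KeyedJsonAtomicFieldData` implementation.

```java
final class KeyedOrdinalsSketch {
    private final String[] sortedTerms; // all key\0value terms, sorted; index = ordinal
    private final long minOrd;          // first ordinal whose term starts with key\0
    private final long maxOrd;          // last such ordinal (minOrd - 1 if there is none)

    KeyedOrdinalsSketch(String[] sortedTerms, String key) {
        this.sortedTerms = sortedTerms;
        this.minOrd = firstAtLeast(sortedTerms, key + '\0');
        // '\1' is the next character after the separator, so key + '\1' is an
        // exclusive upper bound for every term with the key\0 prefix.
        this.maxOrd = firstAtLeast(sortedTerms, key + '\1') - 1;
    }

    long valueCount() {
        return Math.max(0, maxOrd - minOrd + 1);
    }

    /** Maps an ordinal of the underlying field into the range [0, maxOrd - minOrd]. */
    long toKeyedOrd(long ord) {
        if (ord < minOrd || ord > maxOrd) {
            throw new IllegalArgumentException("ordinal " + ord + " does not belong to this key");
        }
        return ord - minOrd;
    }

    /** Looks up the value (without the key\0 prefix) for a shifted ordinal. */
    String lookup(long keyedOrd) {
        String term = sortedTerms[(int) (keyedOrd + minOrd)];
        return term.substring(term.indexOf('\0') + 1);
    }

    // Binary search for the index of the first term >= target.
    private static int firstAtLeast(String[] terms, String target) {
        int lo = 0;
        int hi = terms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (terms[mid].compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Consumers that size arrays by value count or assume ordinals start at zero can then treat the keyed field like an ordinary keyword field.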

As discussed in #41220, I opted not to support
`KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity.
The one place this affects is the 'low cardinality' optimization for terms
aggregations, which now needs to be disabled for keyed JSON fields.

It was fairly difficult to incorporate this change, and I have a couple
follow-up refactors in mind to help simplify the global ordinals code. (I will
likely wait until this feature branch is merged though before opening PRs on
master).

…r. (#41319)

The index warmer iterates through all field types when determining the fields
for which global ordinals should be loaded. Previously, keyed JSON field types
were not returned from FieldTypeLookup#iterator, so their eager_global_ordinals
setting would be ignored. This PR fixes the issue by including keyed JSON fields
in FieldTypeLookup#iterator.

In an earlier iteration of the design, it made sense to disallow these query
types on the root JSON field. It should now be fine to allow them.

@jtibshirani jtibshirani changed the title Add support for embedded JSON ('queryable object') fields. Add support for embedded JSON fields. May 29, 2019

* Don't explicitly mention that '\0' is not allowed in keys.
* Use String#indexOf in FieldTypeLookup#fieldDepth.
* Construct the whitespace analyzer once per field mapper.
* Remove comment about saving space.
* Emphasize the similarity to keyword fields.
* Line wrap at 80 characters.

@jtibshirani jtibshirani changed the title Add support for embedded JSON fields. Add support for 'flattened object' fields. Jun 7, 2019
@jtibshirani
Contributor Author

@jpountz @colings86 this is ready for another look. The last commit since you reviewed is 7237b2b ('Remove the experimental tag.')

@jpountz jpountz left a comment

Apart from minor comments, it looks good to me.

whitespace when building a query for this field. Accepts `true` or `false`
(default).

<<mapping-store,`store`>>::
Contributor

didn't we remove it?

Contributor Author

Oops, thanks I missed this.

index: flat_object_test
body:
mappings:
dynamic: false
Contributor

this is irrelevant to the test, isn't it?

Contributor Author

👍 removed.

* A helper class for {@link FlatObjectFieldMapper} parses a JSON object
* and produces a pair of indexable fields for each leaf value.
*/
public class FlatObjectFieldParser {
Contributor

can it be pkg-private?

Contributor Author

👍

@@ -1757,5 +1757,5 @@ public void testFieldAliasesForMetaFields() throws Exception {

DocumentField field = hit.getFields().get("id-alias");
assertThat(field.getValue().toString(), equalTo("1"));
}
}
Contributor

maybe undo changes to this file since they are unrelated

Contributor Author

👍

Previously, if multiple `embedded_json` fields were added at once, only the last
one would be registered with `FieldTypeLookup`. This bug was uncovered when
trying out different scenarios for performance benchmarking.

This PR pulls `FlatObjectFieldMapper` into its own `MapperPlugin`. To do so it
introduces a new interface `DynamicKeyFieldMapper` with the method
`keyedFieldType(String key)`, which gives the opportunity to return a special
field type for a subfield.
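
As a simplified illustration of that extension point, a mapper could expose one field type for the root field and derive a separate one per key. All types below (`FieldTypeSketch`, `DynamicKeyFieldMapperSketch`, `FlattenedMapperSketch`) are hypothetical stand-ins; the real interface works with Elasticsearch's own field type classes.

```java
/** Hypothetical stand-in for a mapped field type; carries only what the sketch needs. */
final class FieldTypeSketch {
    final String rootField;
    final String key; // null for the root field itself

    FieldTypeSketch(String rootField, String key) {
        this.rootField = rootField;
        this.key = key;
    }
}

/** Simplified version of the idea: a mapper that can hand out a field type per subfield key. */
interface DynamicKeyFieldMapperSketch {
    FieldTypeSketch keyedFieldType(String key);
}

final class FlattenedMapperSketch implements DynamicKeyFieldMapperSketch {
    private final String name;

    FlattenedMapperSketch(String name) {
        this.name = name;
    }

    /** Field type used when the whole object is queried through the root field. */
    FieldTypeSketch fieldType() {
        return new FieldTypeSketch(name, null);
    }

    /** Field type used when a single key inside the object is queried. */
    @Override
    public FieldTypeSketch keyedFieldType(String key) {
        return new FieldTypeSketch(name, key);
    }
}
```

A lookup that fails to find a concrete mapping for `labels.release` could then ask the mapper registered under `labels` for `keyedFieldType("release")`, without core code needing to know about the flattened type.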

This PR pulls the `flattened` mapper plugin into the xpack directory as its own
feature.

@jtibshirani jtibshirani merged commit f3317eb into master Jun 28, 2019
@jtibshirani jtibshirani deleted the object-fields branch June 28, 2019 12:33
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Jun 28, 2019
This commit merges the `object-fields` feature branch. The new 'flattened
object' field type allows an entire JSON object to be indexed into a field, and
provides limited search functionality over the field's contents.
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types v7.3.0 v8.0.0-alpha1