-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add search 'fields' option to support high-level field retrieval. #60100
Conversation
Pinging @elastic/es-search (:Search/Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a minor thing when I glanced through the list. But it looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a big PR and all individual commits have been reviewed individually so I only looked at the high level changes. I really like how this new API can simplify the retrieval of fields in general. I left one comment regarding the handling of arrays and I wonder if we should discuss a better name than fields
for the API (I don't have a better name either :( ) but other than that the changes look good to me.
* @param format an optional format string used when formatting values, for example a date format. | ||
* @return a list a standardized field values. | ||
*/ | ||
public List<?> lookupValues(SourceLookup lookup, @Nullable String format) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the returned list should be sorted similarly to what we do for doc_values
or if we should preserve the original order in the document's source ? I am saying this since we may want to return results from doc_values in the future but that would be inconsistent with the current way of retrieving fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought that the order of the returned list was "undefined". They come back in the sort order as the _source
now, but that is more an accident of the implementation than anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my experience, users don't find that doc values behavior to be useful and are sometimes quite confused by it. One example is #29110, and I've also seen discuss issues asking why doc values don't preserve the order of the array. I wonder if sorting the results by value will make it harder for users to switch to the API from source filtering, in cases where they care about returning the original list order.
This API can also return structured values like numeric ranges or geo points, which don't have an obvious default sort order. It could be harder to understand the behavior if arrays sometimes came back in sorted order, sometimes not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the safest thing may be to document that the order of the returned results is undefined. Maybe in a follow up we could add a parameter to preserve the order of the original document? If someone asks for it. That'd just force us to _source
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
harder for users to switch to the API from source filtering
Agreed, I was more seeing this from users that switch to the API from docvalues_field
but I can see how it could be confusing coming from source filtering. We don't have clear plans to retrieve values from doc_values in this API so I don't have strong feeling here. Let's keep it undefined for now and we can discuss again if we add something that uses doc values under the hood ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to what @nik9000 said, his comment appeared after I posted mine so I am glad we came to the same conclusion ;).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the safest thing may be to document that the order of the returned results is undefined.
Let's keep it undefined for now and we can discuss again if we add something that uses doc values under the hood ?
This works for me, I'll add a note to the docs that users shouldn't rely on the order of the values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, multiple race conditions here :) Sounds like we are all on the same page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three way race!
@jimczi thanks for looking it over.
I discussed with others earlier and we were happy with the |
That sounds good to me. I wasn't aware of the discussion but I am happy with the outcome. Thanks for explaining. |
Currently the phase just looks up each field name in the _source and returns its values in the 'fields' section of the response. There are several aspects that need improvement -- this PR just lays out the initial class structure and tests.
This commit pulls out a FieldValueRetriever object, which retrieves specific fields given a document's source. The new object makes it easier to unit test the logic, and will help keep FetchFieldsPhase from growing too complex as we add more functionality.
This commit adds the capability to `FieldTypeLookup` to retrieve a field's paths in the _source. When retrieving a field's values, we consult these source paths to make sure we load the relevant values. This allows us to handle requests for field aliases and multi-fields. We also retrieve values that were copied into the field through copy_to. To me this is what users would expect out of the API, and it's consistent with what comes back from `docvalues_fields` and `stored_fields`. However it does add some complexity, and was not something flagged as important from any of the clients I spoke to about this API. I'm looking for feedback on this point.
This PR adds new method `FieldMapper#lookupValues(SourceLookup)` that extracts and parses the source values. This lets us return values like numbers and dates in a consistent format, and also handle special data types like `constant_keyword`. The `lookupValues` method calls into `parseSourceValue`, which mappers can override to specify how values should be parsed.
The new `format` option allows for passing a custom date format: ``` POST logs-*/_search { "fields": [ "file.*", { "field": "event.timestamp", "format": "epoch_millis" }, ... ] } ``` Other API notes: * We use the same syntax as `docvalue_fields` for consistency. Under the hood, both `fields` and `docvalue_fields` use the same `FieldAndFormat` object to share serialization logic. * Only `date` and `date_range` fields support formatting currently.
For keyword-style fields, if the source value is larger than `ignore_above` then we don't retrieve the field. In particular, the field is treated as if the value didn't exist.
This PR updates FetchFieldsPhase to override hitExecute instead of hitsExecute (plural). This way, we can make sure that the stored fields (including _source) are only loaded once per hit as part of FetchPhase.
This avoids unnecessary lookups, since metadata fields don't have _source values.
…58623) This PR adds a version of `XContentMapValues.extractValue` that accepts a default value to return in place of 'null'. It then uses this method when looking up source values to return the configured `null_value` instead of 'null' when retrieving fields.
This PR adds docs for the `fields` parameter. We now present `fields` as the preferred way to load specific fields in a search, with `docvalue_fields` and `stored_fields` as other options to look into. Source filtering is no longer featured prominently, and its section is moved to the end.
As we discussed in the meta-issue, when returning `keyword` in the fields retrieval API, we'll apply their `normalizer`. This decision is not a clear-cut one, and we'll validate it with internal users before merging the feature branch.
Although we accept a variety of formats during indexing, spatial data is returned in a single consistent format. This is GeoJSON by default, but well-known text is also supported by passing the option 'format: wkt'. Note that points (in addition to shapes) are returned in GeoJSON by default. The reasoning is that this gives better consistency, and is the most convenient format for most REST API users.
We don't actually support 'fields' as a URL parameter.
568d248
to
840cd11
Compare
I rebased the existing commits on top of latest master, and pushed two new changes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Hello @jtibshirani , the YAML test for this feature uses a As the old joke goes, Anyone who uses YAML long enough will eventually get burned when attempting to abbreviate Norway :) It's the case with The Go client integration tests are failing because of this now. Usually, I skip these tests, because it appears the Go YAML parser is the only one being so strict, but in this case, I would like to run them. Is it possible to change the key names? As a side note, there's a bunch of tests which use |
fields
parameter to support high-level field retrieval.
This feature adds a new
fields
parameter to the search request, whichconsults both the document
_source
and the mappings to fetch fields in aconsistent way.
The PR merges the
field-retrieval
feature branch. All commits on the branchhave been individually code reviewed as part of earlier PRs.
After the PR is merged, I plan to work with other teams to integrate with the
feature. The meta issue contains a small number of 'open questions' that I will
gather feedback on (which might lead to some small follow-up changes).
Some potential improvements planned for follow-ups:
FieldMapper#lookupValues
toMappedFieldType
. (?)_size
.inner_hits
.Original issue: #49028
Meta issue tracking design + implementation: #55363