Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add search 'fields' option to support high-level field retrieval. #60100

Merged
merged 14 commits into from
Jul 27, 2020

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Jul 23, 2020

This feature adds a new fields parameter to the search request, which
consults both the document _source and the mappings to fetch fields in a
consistent way.

The PR merges the field-retrieval feature branch. All commits on the branch
have been individually code reviewed as part of earlier PRs.

After the PR is merged, I plan to work with other teams to integrate with the
feature. The meta issue contains a small number of 'open questions' that I will
gather feedback on (which might lead to some small follow-up changes).

Some potential improvements planned for follow-ups:

  • Move FieldMapper#lookupValues to MappedFieldType. (?)
  • Handle meta fields like _size.
  • Make use of more efficient source parsing.
  • Support the API in inner_hits.

Original issue: #49028
Meta issue tracking design + implementation: #55363

@jtibshirani jtibshirani added >feature :Search/Search Search-related issues that do not fall into other categories v7.10.0 v8.0.0 labels Jul 23, 2020
@jtibshirani jtibshirani marked this pull request as ready for review July 23, 2020 03:50
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jul 23, 2020
@jtibshirani jtibshirani requested review from nik9000 and jimczi July 23, 2020 19:44
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a minor thing when I glanced through the list. But it looks good to me!

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a big PR and all individual commits have been reviewed individually so I only looked at the high level changes. I really like how this new API can simplify the retrieval of fields in general. I left one comment regarding the handling of arrays and I wonder if we should discuss a better name than fields for the API (I don't have a better name either :( ) but other than that the changes look good to me.

* @param format an optional format string used when formatting values, for example a date format.
* @return a list a standardized field values.
*/
public List<?> lookupValues(SourceLookup lookup, @Nullable String format) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the returned list should be sorted similarly to what we do for doc_values or if we should preserve the original order in the document's source ? I am saying this since we may want to return results from doc_values in the future but that would be inconsistent with the current way of retrieving fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought that the order of the returned list was "undefined". They come back in the sort order as the _source now, but that is more an accident of the implementation than anything.

Copy link
Contributor Author

@jtibshirani jtibshirani Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, users don't find that doc values behavior to be useful and are sometimes quite confused by it. One example is #29110, and I've also seen discuss issues asking why doc values don't preserve the order of the array. I wonder if sorting the results by value will make it harder for users to switch to the API from source filtering, in cases where they care about returning the original list order.

This API can also return structured values like numeric ranges or geo points, which don't have an obvious default sort order. It could be harder to understand the behavior if arrays sometimes came back in sorted order, sometimes not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the safest thing may be to document that the order of the returned results is undefined. Maybe in a follow up we could add a parameter to preserve the order of the original document? If someone asks for it. That'd just force us to _source.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

harder for users to switch to the API from source filtering

Agreed, I was more seeing this from users that switch to the API from docvalues_field but I can see how it could be confusing coming from source filtering. We don't have clear plans to retrieve values from doc_values in this API so I don't have strong feeling here. Let's keep it undefined for now and we can discuss again if we add something that uses doc values under the hood ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what @nik9000 said, his comment appeared after I posted mine so I am glad we came to the same conclusion ;).

Copy link
Contributor Author

@jtibshirani jtibshirani Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the safest thing may be to document that the order of the returned results is undefined.

Let's keep it undefined for now and we can discuss again if we add something that uses doc values under the hood ?

This works for me, I'll add a note to the docs that users shouldn't rely on the order of the values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, multiple race conditions here :) Sounds like we are all on the same page.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three way race!

@jtibshirani
Copy link
Contributor Author

@jimczi thanks for looking it over.

I wonder if we should discuss a better name than fields for the API ...

I discussed with others earlier and we were happy with the fields naming: it's simple and there's a nice symmetry between the request parameter name and the response key fields. We also determined it won't conflict with search-time field definitions that we have planned for runtime fields. We were originally thinking of naming that section fields, but now plan to include 'mappings' in the name like runtime_mappings. I'm happy to re-raise the discussion in an upcoming meeting though if you think it'd be helpful.

@jimczi
Copy link
Contributor

jimczi commented Jul 27, 2020

We were originally thinking of naming that section fields, but now plan to include 'mappings' in the name like runtime_mappings. I'm happy to re-raise the discussion in an upcoming meeting though if you think it'd be helpful.

That sounds good to me. I wasn't aware of the discussion but I am happy with the outcome. Thanks for explaining.

Currently the phase just looks up each field name in the _source and returns its
values in the 'fields' section of the response. There are several aspects that
need improvement -- this PR just lays out the initial class structure and tests.
This commit pulls out a FieldValueRetriever object, which retrieves specific
fields given a document's source. The new object makes it easier to unit test
the logic, and will help keep FetchFieldsPhase from growing too complex as we
add more functionality.
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.
This PR adds new method `FieldMapper#lookupValues(SourceLookup)` that extracts
and parses the source values. This lets us return values like numbers and dates
in a consistent format, and also handle special data types like
`constant_keyword`. The `lookupValues` method calls into `parseSourceValue`,
which mappers can override to specify how values should be parsed.
The new `format` option allows for passing a custom date format:

```
POST logs-*/_search
{
  "fields": [
    "file.*",
    {
      "field": "event.timestamp",
      "format": "epoch_millis"
    },
    ...
  ]
}
```

Other API notes:
* We use the same syntax as `docvalue_fields` for consistency. Under the hood,
both `fields` and `docvalue_fields` use the same `FieldAndFormat` object to
share serialization logic.
* Only `date` and `date_range` fields support formatting currently.
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
This PR updates FetchFieldsPhase to override hitExecute instead of hitsExecute
(plural). This way, we can make sure that the stored fields (including _source)
are only loaded once per hit as part of FetchPhase.
This avoids unnecessary lookups, since metadata fields don't have _source
values.
…58623)

This PR adds a version of `XContentMapValues.extractValue` that accepts a
default value to return in place of 'null'. It then uses this method when
looking up source values to return the configured `null_value` instead of
'null' when retrieving fields.
This PR adds docs for the `fields` parameter. We now present `fields` as the
preferred way to load specific fields in a search, with `docvalue_fields` and
`stored_fields` as other options to look into. Source filtering is no longer
featured prominently, and its section is moved to the end.
As we discussed in the meta-issue, when returning `keyword` in the fields
retrieval API, we'll apply their `normalizer`. This decision is not a clear-cut
one, and we'll validate it with internal users before merging the feature
branch.
Although we accept a variety of formats during indexing, spatial data is
returned in a single consistent format. This is GeoJSON by default, but
well-known text is also supported by passing the option 'format: wkt'.

Note that points (in addition to shapes) are returned in GeoJSON by default. The
reasoning is that this gives better consistency, and is the most convenient
format for most REST API users.
We don't actually support 'fields' as a URL parameter.
@jtibshirani
Copy link
Contributor Author

I rebased the existing commits on top of latest master, and pushed two new changes:

  • Don't include a fields URL parameter in the the REST spec.
  • Add a note to docs (and javadoc) that array values are not guaranteed to be in a particular order.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jtibshirani jtibshirani merged commit 8a89d95 into master Jul 27, 2020
@jtibshirani jtibshirani deleted the field-retrieval branch July 27, 2020 20:25
@nik9000
Copy link
Member

nik9000 commented Jul 27, 2020

@karmi
Copy link
Contributor

karmi commented Jul 30, 2020

Hello @jtibshirani , the YAML test for this feature uses a x and y as key names.

As the old joke goes, Anyone who uses YAML long enough will eventually get burned when attempting to abbreviate Norway :) It's the case with y, as well, because in some YAML parsers, y and yes are automatically converted into boolean true, which respects the YAML spec...

The Go client integration tests are failing because of this now. Usually, I skip these tests, because it appears the Go YAML parser is the only one being so strict, but in this case, I would like to run them. Is it possible to change the key names?

As a side note, there's a bunch of tests which use x and y as keys / properties, maybe it would be worth it to add a note to documentation that this breaks strict YAML parsers?

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Jul 30, 2020

Hi @karmi, will do! I can easily move away from using x and y in the tests.

Update: I opened a fix in #60471.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.10.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants