Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doc values support for JSON fields. #40069

Merged
merged 17 commits into from
Apr 1, 2019

Conversation

jtibshirani
Copy link
Contributor

This is a 'work in progress' PR: I am new to the doc values/ aggregations area, and was hoping to get some early feedback on the approach.

When doc_values are enabled, we now add two SortedSetDocValuesFields for each token: one containing the raw value, and another with key\0value. The root JSON field uses the standard SortedSetDVOrdinalsIndexFieldData. For keyed fields, this PR introduces a new type KeyedJsonIndexFieldData that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like terms.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.

@jtibshirani jtibshirani added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types labels Mar 14, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@jtibshirani
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/default-distro

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great @jtibshirani, I left some general comments regarding the implementation but I think that it's the right approach.

return false;
}

private boolean accepted(long ord) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ordinals are sorted it should be possible to compute the range of ordinals that belongs to a single prefix (field). You can perform a binary search to find the first and last ordinals that contains the field prefix and use this information to remove the need to call lookupOrd on each value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, will try that out!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this more closely, and wanted to check I understand your suggestion: when creating a new KeyedJsonAtomicFieldData, we can do a one-time calculation to figure out the relevant range of ordinals. If the set of documents considered during the search is very restricted, this might be slightly more expensive than the current approach, but in general it should cut down a lot on lookups/ bytes comparisons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the idea yes, we'll need to recompute the min/max for each query (since we want to cache the fielddata only once for all fields) but since it's a binary search it shouldn't be too expensive.

if (hasDocValues()) {
return new DocValuesFieldExistsQuery(name());
} else {
return new TermQuery(new Term(FieldNamesFieldMapper.NAME, name()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not related to this pr but KeyedJsonFieldType#existsQuery should also use the _field_names field instead of a prefix query ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will give this some thought and follow up in a separate issue/ PR.

MultiValueMode sortMode,
XFieldComparatorSource.Nested nested,
boolean reverse) {
return delegate.sortField(missingValue, sortMode, nested, reverse);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SortField will use the original doc values to perform the sort so you'll need to create a custom SortedSetSortField that uses a KeyedJsonDocValues in order to filter the values that should not participate in the sort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a closer look at SortedSetDVOrdinalsIndexFieldData to understand how this should be implemented, and was trying to figure out whether this optimization was relevant: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/fielddata/plain/SortedSetDVOrdinalsIndexFieldData.java#L76-L79. Would you be able to give a little bit of background on this logic (originally introduced in #23827)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an optimization but a change that was needed to handle index sorting. Index sorting at the index writer level accepts the Lucene's FieldSort only and since we compare index sort with the query sort to activate early termination we needed to make the query sort compatible. This shouldn't be an issue for the json field since this should not be allowed to use it for index sorting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thanks!

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! One nit, other than that I think @jimczi covered suggestions I would make


@Override
public Collection<Accountable> getChildResources() {
return Collections.emptyList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be delegated as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Mar 18, 2019

Thanks @jimczi and @romseygeek for reviewing. I'll remove the WIP label and will work on the changes -- will ping you when it's ready for another look.

@jtibshirani
Copy link
Contributor Author

@jimczi @romseygeek this is now ready for another review. There is no rush though, as I'll be focused on another project for the rest of the week.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great, I left some minor comments and an idea to optimize the memory needed for aggregations that we discussed with Adrien earlier (remapping of ordinals).

} catch (IOException e) {
throw new RuntimeException(e);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no match for this key (minOrd==-1) you can directly return DocValues#emptySortedSet ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 it might help simplify KeyedJsonDocValues a bit as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is certainly cleaner.

It is possible to sort on a `json` field, as well as perform simple
keyword-style aggregations such as `terms`. As with queries, there is no
special support for numerics -- all values in the JSON object are treated as
keywords. WHen sorting, this implies that values are compared lexicographically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHen -> When

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

this.key = key;
this.delegate = delegate;
this.minOrd = minOrd;
this.maxOrd = maxOrd;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small optimization that could save some memory in the terms aggregation is to remap the documents ordinals to [0, (maxOrd-minOrd)] during the collection. With global ordinals the terms aggregation allocates one big array based on the size reported by getValueCount so if you have a lot inner fields with lots of different values we'll allocate much more than what we really need. The remapping should happen in lookupOrd (to retrieve the original ordinal) and nextOrd (to remap based on minOrd), getValueCount can return maxOrd-minOrd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some suggestions but this looks great in general.

} catch (IOException e) {
throw new RuntimeException(e);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 it might help simplify KeyedJsonDocValues a bit as well

}
}
return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc values already have an optimized way to find the first term that has a prefix via their terms enum, so we could do something like that. In general the underlying implementation does a binary search as well, but often it is a bit more efficient eg. by working directly on the compressed data.

TermsEnum te = sortedSetDocValues.termsEnum();
if (te.seekCeil(prefix) != SeekStatus.END && StringHelper.startsWith(te.term(), prefix)) {
  return te.ord();
} else {
  return -1;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SortedSetDocValues it looks like this will call into SortedDocValues#lookupTerm, which is very similar to the current implementation. I find the current approach cleaner in that findMinOrd and findMaxOrd follow the same template, and that comparisons happen only on the prefixes. Maybe I could keep it as is for now, but see if it makes a difference when I run some benchmarks (on the TODO list)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me.

if (delegate.advanceExact(target)) {
for (long ord = delegate.nextOrd(); ord != NO_MORE_ORDS; ord = delegate.nextOrd()) {
if (minOrd <= ord && ord <= maxOrd) {
boolean advanced = delegate.advanceExact(target);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calls to advanceExact() are often followed by calls to nextOrd(). With such an implementation, we will iterate ords of the underlying doc values twice. Could we somehow cache the first ord that is greater than or equal to minOrd so that we don't have to call advanceExact again on the underlying doc values here and the first call to nextOrd() will return the cached first matching ord?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice idea, will try it out.


long result = -1;
while (low <= high) {
long mid = (low + high) / 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it's impossible here because we are dealing with longs, a good practice when implementing binary search with signed indexes is to use unsigned shifts or unsigned division to avoid issues in case of overflow, ie. (low + high) >>> 1, or Long.divideUnsigned(low + high, 2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

minOrd = findMinOrd(keyBytes, delegate);
maxOrd = minOrd != -1 ? findMaxOrd(keyBytes, delegate) : -1;
} catch (IOException e) {
throw new RuntimeException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UncheckedIOException would be a better fit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

this.key = key;
this.delegate = delegate;
this.minOrd = minOrd;
this.maxOrd = maxOrd;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.

*
* Note that this method can only be called on ordinals returned from {@link #nextOrd()}.
* Otherwise it may attempt to look up values that do not share the correct prefix, which
* can result in undefined behavior or an error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fail with out-of-range ords, some features are going to call this method on out-of-range ords, such as if you pass min_doc_count: 0 to a terms agg. For the record, rebasing ordinals would address this issue (but has some challenges as highlighted in another comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably document this limitation.

Copy link
Contributor Author

@jtibshirani jtibshirani Mar 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to know, I wasn't aware that we scanned the full ordinals array in some places. I added an explicit check for out-of-range ordinals.

I started to add documentation around the aggregations features that were not supported, but it was hard to justify and came off as unintuitive. I'm going to look into rebasing the global ordinals in a follow-up PR to try to solve the issue (and if that's not possible will brainstorm some compromise/ explanation).

@jtibshirani
Copy link
Contributor Author

This PR is now ready for another look. I had to force-push the branch because of CI failures -- the first commit since you reviewed is 7058eeb.

I looked into 'rebasing' the ordinals to lie in the range [0, (maxOrd-minOrd)], and as @jpountz anticipated it's quite tricky due to the changes needed to KeyedJsonIndexFieldData#getOrdinalMap. Here are the changes I'm planning for follow-up PRs:

  • Try to rebase the ordinals for keyed JSON fields to start at 0.
  • Fix or document the issues around scanning through the whole ordinals array (for example setting min_doc_count: 0 fails). This is related to the above work, as it wouldn't be an issue if the ordinals always laid in the range [0, (maxOrd-minOrd)].
  • I also realized that eager_global_ordinals is only applied to the root JSON field, and is being skipped for the keyed field data. I plan to address this in a follow-up, since it might require a small refactor to JsonFieldMapper.

@jtibshirani
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1
@elasticmachine run elasticsearch-ci/2
@elasticmachine run elasticsearch-ci/bwc

Now that we error on out-of-range ordinals, supplying a missing sort value no
longer works. This will be addressed in a follow-up PR.
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks great, thanks @jtibshirani
I agree that getOrdinalMap requires some refactoring so let's discuss the remapping in a follow up.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the caching of the first matching ord looks good to me.

}

long ord = delegate.nextOrd();
if (ord != NO_MORE_ORDS && ord <= maxOrd) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we add an assert ord >= minOrd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

public boolean advanceExact(int target) throws IOException {
if (delegate.advanceExact(target)) {
for (long ord = delegate.nextOrd(); ord != NO_MORE_ORDS; ord = delegate.nextOrd()) {
if (minOrd <= ord && ord <= maxOrd) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could actually break the loop if ord > maxOrd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jtibshirani
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@jtibshirani jtibshirani merged commit 80b0c08 into elastic:object-fields Apr 1, 2019
@jtibshirani jtibshirani deleted the json-doc-values branch April 1, 2019 22:49
jtibshirani added a commit that referenced this pull request Apr 10, 2019
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
jtibshirani added a commit that referenced this pull request Apr 17, 2019
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
jtibshirani added a commit that referenced this pull request May 1, 2019
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
jtibshirani added a commit that referenced this pull request May 24, 2019
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
jtibshirani added a commit that referenced this pull request May 29, 2019
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants