-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc values support for JSON fields. #40069
Add doc values support for JSON fields. #40069
Conversation
Pinging @elastic/es-search |
@elasticmachine run elasticsearch-ci/default-distro |
a0f1f77
to
b2e0b99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks great @jtibshirani, I left some general comments regarding the implementation but I think that it's the right approach.
return false; | ||
} | ||
|
||
private boolean accepted(long ord) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since ordinals are sorted it should be possible to compute the range of ordinals that belongs to a single prefix (field). You can perform a binary search to find the first and last ordinals that contains the field prefix and use this information to remove the need to call lookupOrd
on each value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, will try that out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this more closely, and wanted to check I understand your suggestion: when creating a new KeyedJsonAtomicFieldData
, we can do a one-time calculation to figure out the relevant range of ordinals. If the set of documents considered during the search is very restricted, this might be slightly more expensive than the current approach, but in general it should cut down a lot on lookups/ bytes comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the idea yes, we'll need to recompute the min/max for each query (since we want to cache the fielddata only once for all fields) but since it's a binary search it shouldn't be too expensive.
if (hasDocValues()) { | ||
return new DocValuesFieldExistsQuery(name()); | ||
} else { | ||
return new TermQuery(new Term(FieldNamesFieldMapper.NAME, name())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not related to this pr but KeyedJsonFieldType#existsQuery
should also use the _field_names
field instead of a prefix query ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will give this some thought and follow up in a separate issue/ PR.
MultiValueMode sortMode, | ||
XFieldComparatorSource.Nested nested, | ||
boolean reverse) { | ||
return delegate.sortField(missingValue, sortMode, nested, reverse); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SortField
will use the original doc values to perform the sort so you'll need to create a custom SortedSetSortField
that uses a KeyedJsonDocValues
in order to filter the values that should not participate in the sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a closer look at SortedSetDVOrdinalsIndexFieldData
to understand how this should be implemented, and was trying to figure out whether this optimization was relevant: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/fielddata/plain/SortedSetDVOrdinalsIndexFieldData.java#L76-L79. Would you be able to give a little bit of background on this logic (originally introduced in #23827)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an optimization but a change that was needed to handle index sorting. Index sorting at the index writer level accepts the Lucene's FieldSort only and since we compare index sort with the query sort to activate early termination we needed to make the query sort compatible. This shouldn't be an issue for the json
field since this should not be allowed to use it for index sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks!
server/src/test/java/org/elasticsearch/index/fielddata/IndexFieldDataServiceTests.java
Show resolved
Hide resolved
6a2375b
to
5a83e1f
Compare
b2e0b99
to
ff4ce21
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! One nit, other than that I think @jimczi covered suggestions I would make
|
||
@Override | ||
public Collection<Accountable> getChildResources() { | ||
return Collections.emptyList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be delegated as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @jimczi and @romseygeek for reviewing. I'll remove the WIP label and will work on the changes -- will ping you when it's ready for another look. |
5a83e1f
to
577b7f6
Compare
c41fd03
to
24ff0e2
Compare
577b7f6
to
1093962
Compare
5e06859
to
b6ae960
Compare
1093962
to
3e35ccd
Compare
b6ae960
to
6da1e1d
Compare
@jimczi @romseygeek this is now ready for another review. There is no rush though, as I'll be focused on another project for the rest of the week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks great, I left some minor comments and an idea to optimize the memory needed for aggregations that we discussed with Adrien earlier (remapping of ordinals).
} catch (IOException e) { | ||
throw new RuntimeException(e); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no match for this key (minOrd==-1
) you can directly return DocValues#emptySortedSet
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 it might help simplify KeyedJsonDocValues a bit as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is certainly cleaner.
It is possible to sort on a `json` field, as well as perform simple | ||
keyword-style aggregations such as `terms`. As with queries, there is no | ||
special support for numerics -- all values in the JSON object are treated as | ||
keywords. WHen sorting, this implies that values are compared lexicographically. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WHen -> When
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
this.key = key; | ||
this.delegate = delegate; | ||
this.minOrd = minOrd; | ||
this.maxOrd = maxOrd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small optimization that could save some memory in the terms
aggregation is to remap the documents ordinals to [0, (maxOrd-minOrd)]
during the collection. With global ordinals the terms
aggregation allocates one big array based on the size reported by getValueCount
so if you have a lot inner fields with lots of different values we'll allocate much more than what we really need. The remapping should happen in lookupOrd
(to retrieve the original ordinal) and nextOrd
(to remap based on minOrd), getValueCount
can return maxOrd-minOrd
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some suggestions but this looks great in general.
} catch (IOException e) { | ||
throw new RuntimeException(e); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 it might help simplify KeyedJsonDocValues a bit as well
} | ||
} | ||
return result; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc values already have an optimized way to find the first term that has a prefix via their terms enum, so we could do something like that. In general the underlying implementation does a binary search as well, but often it is a bit more efficient eg. by working directly on the compressed data.
TermsEnum te = sortedSetDocValues.termsEnum();
if (te.seekCeil(prefix) != SeekStatus.END && StringHelper.startsWith(te.term(), prefix)) {
return te.ord();
} else {
return -1;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For SortedSetDocValues
it looks like this will call into SortedDocValues#lookupTerm
, which is very similar to the current implementation. I find the current approach cleaner in that findMinOrd
and findMaxOrd
follow the same template, and that comparisons happen only on the prefixes. Maybe I could keep it as is for now, but see if it makes a difference when I run some benchmarks (on the TODO list)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me.
if (delegate.advanceExact(target)) { | ||
for (long ord = delegate.nextOrd(); ord != NO_MORE_ORDS; ord = delegate.nextOrd()) { | ||
if (minOrd <= ord && ord <= maxOrd) { | ||
boolean advanced = delegate.advanceExact(target); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calls to advanceExact() are often followed by calls to nextOrd(). With such an implementation, we will iterate ords of the underlying doc values twice. Could we somehow cache the first ord that is greater than or equal to minOrd
so that we don't have to call advanceExact
again on the underlying doc values here and the first call to nextOrd() will return the cached first matching ord?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice idea, will try it out.
|
||
long result = -1; | ||
while (low <= high) { | ||
long mid = (low + high) / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though it's impossible here because we are dealing with longs, a good practice when implementing binary search with signed indexes is to use unsigned shifts or unsigned division to avoid issues in case of overflow, ie. (low + high) >>> 1
, or Long.divideUnsigned(low + high, 2)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
minOrd = findMinOrd(keyBytes, delegate); | ||
maxOrd = minOrd != -1 ? findMaxOrd(keyBytes, delegate) : -1; | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UncheckedIOException would be a better fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
this.key = key; | ||
this.delegate = delegate; | ||
this.minOrd = minOrd; | ||
this.maxOrd = maxOrd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.
* | ||
* Note that this method can only be called on ordinals returned from {@link #nextOrd()}. | ||
* Otherwise it may attempt to look up values that do not share the correct prefix, which | ||
* can result in undefined behavior or an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should fail with out-of-range ords, some features are going to call this method on out-of-range ords, such as if you pass min_doc_count: 0
to a terms
agg. For the record, rebasing ordinals would address this issue (but has some challenges as highlighted in another comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably document this limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good to know, I wasn't aware that we scanned the full ordinals array in some places. I added an explicit check for out-of-range ordinals.
I started to add documentation around the aggregations features that were not supported, but it was hard to justify and came off as unintuitive. I'm going to look into rebasing the global ordinals in a follow-up PR to try to solve the issue (and if that's not possible will brainstorm some compromise/ explanation).
3e35ccd
to
4ff0edd
Compare
…rch. This strategy reduces the number of lookups + comparisons needed when filtering the prefixed doc values.
1cc1387
to
54ac640
Compare
This PR is now ready for another look. I had to force-push the branch because of CI failures -- the first commit since you reviewed is I looked into 'rebasing' the ordinals to lie in the range
|
@elasticmachine run elasticsearch-ci/1 |
Now that we error on out-of-range ordinals, supplying a missing sort value no longer works. This will be addressed in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks great, thanks @jtibshirani
I agree that getOrdinalMap
requires some refactoring so let's discuss the remapping in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, the caching of the first matching ord looks good to me.
} | ||
|
||
long ord = delegate.nextOrd(); | ||
if (ord != NO_MORE_ORDS && ord <= maxOrd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we add an assert ord >= minOrd
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
public boolean advanceExact(int target) throws IOException { | ||
if (delegate.advanceExact(target)) { | ||
for (long ord = delegate.nextOrd(); ord != NO_MORE_ORDS; ord = delegate.nextOrd()) { | ||
if (minOrd <= ord && ord <= maxOrd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we could actually break the loop if ord > maxOrd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@elasticmachine run elasticsearch-ci/bwc |
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
This is a 'work in progress' PR: I am new to the doc values/ aggregations area, and was hoping to get some early feedback on the approach.
When
doc_values
are enabled, we now add twoSortedSetDocValuesFields
for each token: one containing the rawvalue
, and another withkey\0value
. The root JSON field uses the standardSortedSetDVOrdinalsIndexFieldData
. For keyed fields, this PR introduces a new typeKeyedJsonIndexFieldData
that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations liketerms
.One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.