-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc_count field mapper #58339
Add doc_count field mapper #58339
Conversation
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
Pinging @elastic/es-search (:Search/Mapping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this a metadata field _doc_count
instead of a standard field type? It could be a nice fit, because it's more of an internal field rather than something we expect users to configure. Also we don't want multiple fields of type doc_count
to be defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although my mapper-skills are relatively poor so take it with a grain of salt. :)
Left a few comments about validation/tests but not critical (and might not be relevant). Exciting!
|
||
[IMPORTANT] | ||
======== | ||
* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random thought while reading the restrictions, is it possible to define _doc_count
as an object? We should forbid that as well if it isn't already... but i suspect the current restrictions prevent it from being an object too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an assertion that the input is a VALUE_NUMBER
in the parseCreateField()
method.
elasticsearch/server/src/main/java/org/elasticsearch/index/mapper/DocCountFieldMapper.java
Line 113 in 7b7ca43
XContentParserUtils.ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, parser.currentToken(), parser); |
Is there anything else that should be added?
.../org/elasticsearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
Outdated
Show resolved
Hide resolved
import static java.util.Collections.singleton; | ||
|
||
|
||
public class FieldBasedDocCountProviderTests extends AggregatorTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to filter or aggregate on the _doc_counts
field itself...I'm guessing it's not possible? Should we add some tests to make sure that's handled politely if a user tries to search/agg it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, it's common to have a test case like DocCountFieldTypeTests
that checks query and field data construction works (or throws an error) as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added test DocCountFieldTypeTests
that tests range
, term
and exists
queries.
Also, DocCountFieldType
extends the MappedFieldType
class that by default throws an exception when calling the fielddataBuilder()
. Does this cover aggregations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me overall, I left some minor comments.
server/src/main/java/org/elasticsearch/index/mapper/DocCountFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/DocCountFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/DocCountFieldMapper.java
Outdated
Show resolved
Hide resolved
Cleaned up/simplified DocCountProvider class
@elasticmachine run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, I had one last comment.
try { | ||
docCountValues = DocValues.getNumeric(ctx.reader(), DocCountFieldMapper.NAME); | ||
} catch (IOException e) { | ||
docCountValues = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should be swallowing an IOException
and treating it as if there was no doc count field. What case does this logic address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I felt it was better to swallow the error and fallback to doc_count=1
. Now, it feels more transparent to rethrow the exception.
I am not sure what can go wrong there and I don't have a strong opinion about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should propagate it, an IOException
could indicate something serious and it's best not to silently continue on. Perhaps this method could just declare throws IOException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too.
This PR is closed and opened as new PR (#64503) against master branch. |
import java.io.IOException; | ||
|
||
/** | ||
* An implementation of a doc_count provider that reads the value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Javadoc still reflects having an interface
} | ||
} | ||
|
||
public void setLeafReaderContext(LeafReaderContext ctx) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this just a constructor argument? Under what conditions would it be correct to use a DocCountProvider
without setting the LeafReaderContext
?
Bucket aggregations compute bucket
doc_count
values by incrementing thedoc_count
by 1 for every document collected in the bucket.When using summary fields (such as
aggregate_metric_double
) one field may represent more than one document. To provide this functionality we have implemented a new field mapper (nameddoc_count
field mapper). This field is a positive integer representing the number of documents aggregated in a single summary field.Bucket aggregations will check if a field of type
doc_count
exists in a document and will take this value into consideration when computing doc counts.