Add doc_count field mapper #58339
Closed
csoulios wants to merge 53 commits into elastic:feature/aggregate-metrics from csoulios:doc_count-field-mapper
Changes from 51 commits
Commits
4b5fab3 Initial version of doc_count field mapper (csoulios)
cd515b3 added tests (csoulios)
655e112 Build fixes (csoulios)
db13d83 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
191d793 Added tests for doc_count fieldmapper (csoulios)
5f81bee doc count tests (csoulios)
dab8219 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
ecdc603 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
520ac9a Resolve conflicts after merge from master (csoulios)
676ffc6 Added yaml test for doc_count field type (csoulios)
7c7139c Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
d3b9c45 Minor changes to test (csoulios)
c36ecac Fix issue with not-registering field mapper (csoulios)
4dca391 Simplify terms agg test (csoulios)
912d943 Add doc_count provider in the buckets aggregator (csoulios)
be46a00 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
c0f23ae Initialize doc_count provider once (csoulios)
f7b43c1 Added tests for FieldBasedDocCountProvider (csoulios)
5e1b96a Added more tests to DocCountFieldMapper (csoulios)
80d832b Fixed NPE at AggregatorTestCase (csoulios)
1e8b472 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
e24d680 Updated branch to fix build after merge (csoulios)
74c727b Added validation for single doc_count field (csoulios)
cd2c84d Added version skips to fix broken tests (csoulios)
91246eb Added documentation for doc_count (csoulios)
77aa346 Changes to address review comments: (csoulios)
39c43a0 Use _doc_count as Lucene field for doc count (csoulios)
8ca3fbc Minor change: field rename (csoulios)
83929cb Minor change to yml test. (csoulios)
848fc77 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
0a1731d Fix errors from merge (csoulios)
82f092a Converted _doc_count to metadata field type (csoulios)
ba92359 Throw an error if parsed value is not a number (csoulios)
cb61366 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
522c385 Make _doc_count field a metadata field (csoulios)
df2a2eb Fixed broken tests (csoulios)
838436f Fix bug in low cardinality ordinal terms aggs (csoulios)
4a92c80 Update docs that _doc_count is a metadata field (csoulios)
5d6d037 Fix broken ML tests (csoulios)
0ff6fe1 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
23e4b30 Fix errors after merge (csoulios)
b258653 Addressed review comments (csoulios)
f5ed1df Addressed reviewer comments (csoulios)
2fcdcf6 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
4138d16 Added DocCountFieldTypeTests (csoulios)
5d38b7f Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
654847e Fix errors after merge (csoulios)
7b7ca43 Make composite agg respect _doc_count field (csoulios)
ce44e87 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
5621c44 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
1d969a1 DocCountProvider rethrows IOException instead of swallowing it (csoulios)
cb05034 Set familyTypeName of _doc_count to integer (csoulios)
d7d80f4 Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper (csoulios)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
[[mapping-doc-count-field]] | ||
=== `_doc_count` field | ||
++++ | ||
<titleabbrev>_doc_count</titleabbrev> | ||
++++ | ||
|
||
Bucket aggregations always return a field named `doc_count` showing the number of documents that were aggregated and partitioned | ||
in each bucket. Computing `doc_count` is simple: the value is incremented by 1 for every document collected | ||
in each bucket. | ||
|
||
While this simple approach is effective when computing aggregations over individual documents, it fails to accurately represent | ||
documents that store pre-aggregated data (such as `histogram` or `aggregate_metric_double` fields), because one summary field may | ||
represent multiple documents. | ||
|
||
To allow for correct computation of the number of documents when working with pre-aggregated data, we have introduced a | ||
metadata field type named `_doc_count`. `_doc_count` must always be a positive integer representing the number of documents | ||
aggregated in a single summary field. | ||
|
||
When field `_doc_count` is added to a document, all bucket aggregations will respect its value and increment the bucket `doc_count` | ||
by the value of the field. If a document does not contain any `_doc_count` field, `_doc_count = 1` is implied by default. | ||
|
||
[IMPORTANT] | ||
======== | ||
* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed. | ||
* If a document contains no `_doc_count` fields, aggregators will increment by 1, which is the default behavior. | ||
======== | ||
|
||
[[mapping-doc-count-field-example]] | ||
==== Example | ||
|
||
The following <<indices-create-index, create index>> API request creates a new index with the following field mappings: | ||
|
||
* `my_histogram`, a `histogram` field used to store percentile data | ||
* `my_text`, a `keyword` field used to store a title for the histogram | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
PUT my_index | ||
{ | ||
"mappings" : { | ||
"properties" : { | ||
"my_histogram" : { | ||
"type" : "histogram" | ||
}, | ||
"my_text" : { | ||
"type" : "keyword" | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
|
||
The following <<docs-index_,index>> API requests store pre-aggregated data for | ||
two histograms: `histogram_1` and `histogram_2`. | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
PUT my_index/_doc/1 | ||
{ | ||
"my_text" : "histogram_1", | ||
"my_histogram" : { | ||
"values" : [0.1, 0.2, 0.3, 0.4, 0.5], | ||
"counts" : [3, 7, 23, 12, 6] | ||
}, | ||
"_doc_count": 45 <1> | ||
} | ||
|
||
PUT my_index/_doc/2 | ||
{ | ||
"my_text" : "histogram_2", | ||
"my_histogram" : { | ||
"values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], | ||
"counts" : [8, 17, 8, 7, 6, 2] | ||
}, | ||
"_doc_count": 62 <1> | ||
} | ||
-------------------------------------------------- | ||
<1> Field `_doc_count` must be a positive integer storing the number of documents aggregated to produce each histogram. | ||
|
||
If we run the following <<search-aggregations-bucket-terms-aggregation, terms aggregation>> on `my_index`: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
GET /_search | ||
{ | ||
"aggs" : { | ||
"histogram_titles" : { | ||
"terms" : { "field" : "my_text" } | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
|
||
We will get the following response: | ||
|
||
[source,console-result] | ||
-------------------------------------------------- | ||
{ | ||
... | ||
"aggregations" : { | ||
"histogram_titles" : { | ||
"doc_count_error_upper_bound": 0, | ||
"sum_other_doc_count": 0, | ||
"buckets" : [ | ||
{ | ||
"key" : "histogram_2", | ||
"doc_count" : 62 | ||
}, | ||
{ | ||
"key" : "histogram_1", | ||
"doc_count" : 45 | ||
} | ||
] | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TESTRESPONSE[skip:test not setup] |
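As an illustration of the counting rule documented above, here is a minimal Python sketch. It is not the Elasticsearch implementation: the helper name `terms_doc_counts` is hypothetical, documents are assumed to be plain dictionaries, and a missing `_doc_count` is treated as 1, as the documentation states.

```python
from collections import defaultdict


def terms_doc_counts(docs, field):
    """Count documents per term value, honoring an optional `_doc_count`.

    Hypothetical helper for illustration only.
    """
    buckets = defaultdict(int)
    for doc in docs:
        # A document without `_doc_count` contributes 1 to its bucket.
        buckets[doc[field]] += doc.get("_doc_count", 1)
    return dict(buckets)


# The two pre-aggregated histogram documents from the example above.
docs = [
    {"my_text": "histogram_1", "_doc_count": 45},
    {"my_text": "histogram_2", "_doc_count": 62},
]
counts = terms_doc_counts(docs, "my_text")
print(counts)
```

With the two example documents, each bucket's count equals the `_doc_count` of its single document, which matches the `doc_count` values in the response shown above.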
150 changes: 150 additions & 0 deletions
...api-spec/src/main/resources/rest-api-spec/test/search.aggregation/370_doc_count_field.yml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,150 @@ | ||
setup: | ||
  - do: | ||
      indices.create: | ||
        index: test_1 | ||
        body: | ||
          settings: | ||
            number_of_replicas: 0 | ||
          mappings: | ||
            properties: | ||
              str: | ||
                type: keyword | ||
              number: | ||
                type: integer | ||
|
||
  - do: | ||
      bulk: | ||
        index: test_1 | ||
        refresh: true | ||
        body: | ||
          - '{"index": {}}' | ||
          - '{"_doc_count": 10, "str": "abc", "number" : 500, "unmapped": "abc" }' | ||
          - '{"index": {}}' | ||
          - '{"_doc_count": 5, "str": "xyz", "number" : 100, "unmapped": "xyz" }' | ||
          - '{"index": {}}' | ||
          - '{"_doc_count": 7, "str": "foo", "number" : 100, "unmapped": "foo" }' | ||
          - '{"index": {}}' | ||
          - '{"_doc_count": 1, "str": "foo", "number" : 200, "unmapped": "foo" }' | ||
          - '{"index": {}}' | ||
          - '{"str": "abc", "number" : 500, "unmapped": "abc" }' | ||
|
||
--- | ||
"Test numeric terms agg with doc_count": | ||
  - skip: | ||
      version: " - 7.99.99" | ||
      reason: "Doc count fields are only implemented in 8.0" | ||
|
||
  - do: | ||
      search: | ||
        rest_total_hits_as_int: true | ||
        body: { "size" : 0, "aggs" : { "num_terms" : { "terms" : { "field" : "number" } } } } | ||
|
||
  - match: { hits.total: 5 } | ||
  - length: { aggregations.num_terms.buckets: 3 } | ||
  - match: { aggregations.num_terms.buckets.0.key: 100 } | ||
  - match: { aggregations.num_terms.buckets.0.doc_count: 12 } | ||
  - match: { aggregations.num_terms.buckets.1.key: 500 } | ||
  - match: { aggregations.num_terms.buckets.1.doc_count: 11 } | ||
  - match: { aggregations.num_terms.buckets.2.key: 200 } | ||
  - match: { aggregations.num_terms.buckets.2.doc_count: 1 } | ||
|
||
|
||
--- | ||
"Test keyword terms agg with doc_count": | ||
  - skip: | ||
      version: " - 7.99.99" | ||
      reason: "Doc count fields are only implemented in 8.0" | ||
  - do: | ||
      search: | ||
        rest_total_hits_as_int: true | ||
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str" } } } } | ||
|
||
  - match: { hits.total: 5 } | ||
  - length: { aggregations.str_terms.buckets: 3 } | ||
  - match: { aggregations.str_terms.buckets.0.key: "abc" } | ||
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 } | ||
  - match: { aggregations.str_terms.buckets.1.key: "foo" } | ||
  - match: { aggregations.str_terms.buckets.1.doc_count: 8 } | ||
  - match: { aggregations.str_terms.buckets.2.key: "xyz" } | ||
  - match: { aggregations.str_terms.buckets.2.doc_count: 5 } | ||
|
||
--- | ||
|
||
"Test unmapped string terms agg with doc_count": | ||
  - skip: | ||
      version: " - 7.99.99" | ||
      reason: "Doc count fields are only implemented in 8.0" | ||
  - do: | ||
      bulk: | ||
        index: test_2 | ||
        refresh: true | ||
        body: | ||
          - '{"index": {}}' | ||
          - '{"_doc_count": 10, "str": "abc" }' | ||
          - '{"index": {}}' | ||
          - '{"str": "abc" }' | ||
  - do: | ||
      search: | ||
        index: test_2 | ||
        rest_total_hits_as_int: true | ||
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str.keyword" } } } } | ||
|
||
  - match: { hits.total: 2 } | ||
  - length: { aggregations.str_terms.buckets: 1 } | ||
  - match: { aggregations.str_terms.buckets.0.key: "abc" } | ||
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 } | ||
|
||
--- | ||
"Test composite str_terms agg with doc_count": | ||
  - skip: | ||
      version: " - 7.99.99" | ||
      reason: "Doc count fields are only implemented in 8.0" | ||
  - do: | ||
      search: | ||
        rest_total_hits_as_int: true | ||
        body: { "size" : 0, "aggs" : | ||
          { "composite_agg" : { "composite" : | ||
              { | ||
                "sources": ["str_terms": { "terms": { "field": "str" } }] | ||
              } | ||
            } | ||
          } | ||
        } | ||
|
||
  - match: { hits.total: 5 } | ||
  - length: { aggregations.composite_agg.buckets: 3 } | ||
  - match: { aggregations.composite_agg.buckets.0.key.str_terms: "abc" } | ||
  - match: { aggregations.composite_agg.buckets.0.doc_count: 11 } | ||
  - match: { aggregations.composite_agg.buckets.1.key.str_terms: "foo" } | ||
  - match: { aggregations.composite_agg.buckets.1.doc_count: 8 } | ||
  - match: { aggregations.composite_agg.buckets.2.key.str_terms: "xyz" } | ||
  - match: { aggregations.composite_agg.buckets.2.doc_count: 5 } | ||
|
||
|
||
--- | ||
"Test composite num_terms agg with doc_count": | ||
  - skip: | ||
      version: " - 7.99.99" | ||
      reason: "Doc count fields are only implemented in 8.0" | ||
  - do: | ||
      search: | ||
        rest_total_hits_as_int: true | ||
        body: { "size" : 0, "aggs" : | ||
          { "composite_agg" : | ||
            { "composite" : | ||
              { | ||
                "sources": ["num_terms" : { "terms" : { "field" : "number" } }] | ||
              } | ||
            } | ||
          } | ||
        } | ||
|
||
  - match: { hits.total: 5 } | ||
  - length: { aggregations.composite_agg.buckets: 3 } | ||
  - match: { aggregations.composite_agg.buckets.0.key.num_terms: 100 } | ||
  - match: { aggregations.composite_agg.buckets.0.doc_count: 12 } | ||
  - match: { aggregations.composite_agg.buckets.1.key.num_terms: 200 } | ||
  - match: { aggregations.composite_agg.buckets.1.doc_count: 1 } | ||
  - match: { aggregations.composite_agg.buckets.2.key.num_terms: 500 } | ||
  - match: { aggregations.composite_agg.buckets.2.doc_count: 11 } | ||
|
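The expected values in the YAML tests above can be re-derived by hand. The following Python sketch does so for the `test_1` fixture. All function names are hypothetical, and the two orderings are assumptions about default behavior: terms buckets sorted by descending `doc_count` with ties broken by ascending key, and composite buckets sorted by ascending key.

```python
from collections import Counter

# The five documents indexed into test_1 by the setup section.
DOCS = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
    {"_doc_count": 1, "str": "foo", "number": 200},
    {"str": "abc", "number": 500},  # no _doc_count, so it counts as 1
]


def bucket_counts(docs, field):
    """Sum `_doc_count` (default 1) per distinct value of `field`."""
    counts = Counter()
    for doc in docs:
        counts[doc[field]] += doc.get("_doc_count", 1)
    return counts


def terms_order(counts):
    """Assumed terms ordering: doc_count descending, then key ascending."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def composite_order(counts):
    """Assumed composite ordering: key ascending."""
    return sorted(counts.items())


print(terms_order(bucket_counts(DOCS, "number")))
print(composite_order(bucket_counts(DOCS, "number")))
```

Note that `hits.total` stays at 5 in every test: `_doc_count` changes bucket counts only, not the number of matching documents.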
Random thought while reading the restrictions, is it possible to define `_doc_count` as an object? We should forbid that as well if it isn't already... but I suspect the current restrictions prevent it from being an object too.
There is an assertion that the input is a `VALUE_NUMBER` in the `parseCreateField()` method.
elasticsearch/server/src/main/java/org/elasticsearch/index/mapper/DocCountFieldMapper.java
Line 113 in 7b7ca43
Is there anything else that should be added?
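To make the restriction discussed in this thread concrete, here is a small Python sketch of the kind of check involved. It is not the code in `DocCountFieldMapper`: `parse_doc_count` is a hypothetical function, and rejecting non-integral numbers is an assumption based on the documentation's "positive integer" wording rather than on the mapper's actual behavior.

```python
import numbers


def parse_doc_count(value):
    """Return `value` if it is a single positive integer, else raise.

    Hypothetical validation for illustration only.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, numbers.Integral):
        raise ValueError(f"_doc_count must be a single number, got {value!r}")
    if value <= 0:
        raise ValueError(f"_doc_count must be positive, got {value!r}")
    return int(value)


# Objects and arrays are rejected because they are not numbers.
for candidate in (45, {"value": 3}, [1, 2], 0):
    try:
        print(candidate, "->", parse_doc_count(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```

Under this sketch, a check that only accepts a numeric token already excludes objects and arrays, which is the point made in the reply above.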