Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doc_count field mapper #58339

Conversation

csoulios
Copy link
Contributor

Bucket aggregations compute bucket doc_count values by incrementing the doc_count by 1 for every document collected in the bucket.

When using summary fields (such as aggregate_metric_double) one field may represent more than one document. To provide this functionality we have implemented a new field mapper (named doc_count field mapper). This field is a positive integer representing the number of documents aggregated in a single summary field.

Bucket aggregations will check if a field of type doc_count exists in a document and will take this value into consideration when computing doc counts.

@jtibshirani jtibshirani added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Jun 25, 2020
@talevy talevy added the Team:Deployment Management Meta label for Management Experience - Deployment Management team label Jul 6, 2020
@csoulios csoulios marked this pull request as ready for review July 9, 2020 16:24
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 9, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jul 9, 2020
@csoulios csoulios removed the WIP label Jul 9, 2020
@csoulios csoulios requested review from polyfractal and talevy July 9, 2020 16:31
Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a metadata field _doc_count instead of a standard field type? It could be a nice fit, because it's more of an internal field rather than something we expect users to configure. Also we don't want multiple fields of type doc_count to be defined.

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although my mapper-skills are relatively poor so take it with a grain of salt. :)

Left a few comments about validation/tests but not critical (and might not be relevant). Exciting!


[IMPORTANT]
========
* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random thought while reading the restrictions, is it possible to define _doc_count as an object? We should forbid that as well if it isn't already... but i suspect the current restrictions prevent it from being an object too.

Copy link
Contributor Author

@csoulios csoulios Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an assertion that the input is a VALUE_NUMBER in the parseCreateField() method.

XContentParserUtils.ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, parser.currentToken(), parser);

Is there anything else that should be added?

import static java.util.Collections.singleton;


public class FieldBasedDocCountProviderTests extends AggregatorTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to filter or aggregate on the _doc_counts field itself...I'm guessing it's not possible? Should we add some tests to make sure that's handled politely if a user tries to search/agg it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, it's common to have a test case like DocCountFieldTypeTests that checks query and field data construction works (or throws an error) as expected.

Copy link
Contributor Author

@csoulios csoulios Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added test DocCountFieldTypeTests that tests range, term and exists queries.

Also, DocCountFieldType extends the MappedFieldType class that by default throws an exception when calling the fielddataBuilder(). Does this cover aggregations?

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me overall, I left some minor comments.

@csoulios csoulios requested a review from jimczi October 27, 2020 15:49
@csoulios
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

@csoulios csoulios requested a review from not-napoleon October 27, 2020 17:11
Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, I had one last comment.

try {
docCountValues = DocValues.getNumeric(ctx.reader(), DocCountFieldMapper.NAME);
} catch (IOException e) {
docCountValues = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should be swallowing an IOException and treating it as if there was no doc count field. What case does this logic address?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I felt it was better to swallow the error and fallback to doc_count=1. Now, it feels more transparent to rethrow the exception.

I am not sure what can go wrong there and I don't have a strong opinion about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should propagate it, an IOException could indicate something serious and it's best not to silently continue on. Perhaps this method could just declare throws IOException.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too.

@csoulios
Copy link
Contributor Author

csoulios commented Nov 2, 2020

This PR is closed and opened as new PR (#64503) against master branch.

@csoulios csoulios closed this Nov 2, 2020
import java.io.IOException;

/**
* An implementation of a doc_count provider that reads the value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Javadoc still reflects having an interface

}
}

public void setLeafReaderContext(LeafReaderContext ctx) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this just a constructor argument? Under what conditions would it be correct to use a DocCountProvider without setting the LeafReaderContext?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Deployment Management Meta label for Management Experience - Deployment Management team Team:Search Meta label for search team v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants