Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve terms aggregation documentation and add concurrent segment search caveats #6355

Merged
merged 4 commits into from
Feb 7, 2024

Conversation

jed326
Copy link
Contributor

@jed326 jed326 commented Feb 5, 2024

Description

Improve terms aggregation documentation by adding details for missing parameters as well as add caveats for concurrent segment search.

Issues Resolved

Resolves: #6129
Resolves: #5046

Relates:

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…arch caveats

Signed-off-by: Jay Deng <jayd0104@gmail.com>
Copy link
Collaborator

@kolchfa-aws kolchfa-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @jed326! Left some comments that should be quick to resolve. Then we can move the PR to final editorial review.

_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Jay Deng <jayd0104@gmail.com>

Separately, the `shard_min_doc_count` parameter is used to filter out the unique terms a shard returns back to the coordinator with less than `shard_min_doc_count` results.

When using concurrent segment search, the `shard_min_doc_count` parameter is not applied to each segment slice.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to mention this since it will be applied at shard level as part of reduce in concurrent search case ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good to explicitly call it out here since it's not obvious that shard_size would be applied at the slice level but shard_min_doc_count is not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can link to opensearch-project/OpenSearch#11847 here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolchfa-aws would you mind adding the link to the GitHub issue to the docs here for me? I don't want to push a new commit and mess up the suggestions from @natebower below. Thanks!

Copy link
Collaborator

@natebower natebower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 @kolchfa-aws Please see my comments/changes and tag me when complete so that I can verify the revisions to lines 68, 84, and 90. Thanks!

_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved

Additionally, the coordinating node that’s responsible for the aggregation prompts each shard for its top unique terms. The number of buckets returned by each shard is controlled by the `shard_size` parameter. This parameter is distinct from the `size` parameter and exists as a mechanism to increase the accuracy of the bucket document counts.

For example, imagine a scenario where the `size` and `shard_size` parameters are both 3. The `terms` aggregation requests each shard for its top 3 unique terms. The coordinating node takes each of the results and aggregates them to compute the final result. If a shard has an object that’s not part of the top 3, then it won't show up in the response. However, increasing the `shard_size` value for this request allows each shard to return a larger number of unique terms, increasing the likelihood that the coordinating node has the full picture of all results.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First sentence: "in which the size and shard_size parameters both have a value of 3"? Last sentence: "has the full picture of all results" is idiomatic. Please revise and tag me on the revision.

_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved

There are two collect modes available: `depth_first` and `breadth_first`. The `depth_first` collection mode expands all branches of the aggregation tree in a depth-first manner and only does pruning after the expansion is done.

However, when using nested terms aggregations, the cardinality of the number of buckets returned is multiplied by the cardinality of the field at each level of nesting, making it easy to see combinatorial explosion in the bucket count as you nest aggregations.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "terms" be in code font?

_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
_aggregations/bucket/terms.md Outdated Show resolved Hide resolved
@hdhalter hdhalter added 3 - Tech review PR: Tech review in progress release-notes PR: Include this PR in the automated release notes v2.12.0 labels Feb 7, 2024
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws
Copy link
Collaborator

@natebower I addressed your comments. Please review again when you get a chance.


## The `min_doc_count` and `shard_min_doc_count` parameters

You can use the `min_doc_count` parameter to filter out any unique terms with fewer than `min_doc_count` results. The `min_doc_count` threshold is applied only after merging the results retrieved from all of the shards. Each shard is unaware of the global document count for a given term. If there is a significant difference in the top `shard_size` globally frequent terms and the top terms local to a shard, you may receive unexpected results when using the `min_doc_count` parameter.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

significant difference "between"?

Copy link
Collaborator

@natebower natebower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolchfa-aws Just one final comment. Otherwise, LGTM!

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws kolchfa-aws merged commit 9d3a738 into opensearch-project:main Feb 7, 2024
2 of 3 checks passed
@hdhalter hdhalter removed the 3 - Tech review PR: Tech review in progress label Feb 13, 2024
@hdhalter hdhalter added the 3 - Done Issue is done/complete label Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Done Issue is done/complete release-notes PR: Include this PR in the automated release notes v2.12.0
Projects
None yet
5 participants