-
Notifications
You must be signed in to change notification settings - Fork 500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve terms aggregation documentation and add concurrent segment search caveats #6355
Improve terms aggregation documentation and add concurrent segment search caveats #6355
Conversation
…arch caveats Signed-off-by: Jay Deng <jayd0104@gmail.com>
cdae78e
to
3829e29
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @jed326! Left some comments that should be quick to resolve. Then we can move the PR to final editorial review.
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: Jay Deng <jayd0104@gmail.com>
_aggregations/bucket/terms.md
Outdated
|
||
Separately, the `shard_min_doc_count` parameter is used to filter out the unique terms a shard returns back to the coordinator with less than `shard_min_doc_count` results. | ||
|
||
When using concurrent segment search, the `shard_min_doc_count` parameter is not applied to each segment slice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to mention this since it will be applied at shard level as part of reduce in concurrent search case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's good to explicitly call it out here since it's not obvious that shard_size
would be applied at the slice level but shard_min_doc_count
is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can link to opensearch-project/OpenSearch#11847 here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolchfa-aws would you mind adding the link to the GitHub issue to the docs here for me? I don't want to push a new commit and mess up the suggestions from @natebower below. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jed326 @kolchfa-aws Please see my comments/changes and tag me when complete so that I can verify the revisions to lines 68, 84, and 90. Thanks!
_aggregations/bucket/terms.md
Outdated
|
||
Additionally, the coordinating node that’s responsible for the aggregation prompts each shard for its top unique terms. The number of buckets returned by each shard is controlled by the `shard_size` parameter. This parameter is distinct from the `size` parameter and exists as a mechanism to increase the accuracy of the bucket document counts. | ||
|
||
For example, imagine a scenario where the `size` and `shard_size` parameters are both 3. The `terms` aggregation requests each shard for its top 3 unique terms. The coordinating node takes each of the results and aggregates them to compute the final result. If a shard has an object that’s not part of the top 3, then it won't show up in the response. However, increasing the `shard_size` value for this request allows each shard to return a larger number of unique terms, increasing the likelihood that the coordinating node has the full picture of all results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First sentence: "in which the size
and shard_size
parameters both have a value of 3"? Last sentence: "has the full picture of all results" is idiomatic. Please revise and tag me on the revision.
_aggregations/bucket/terms.md
Outdated
|
||
There are two collect modes available: `depth_first` and `breadth_first`. The `depth_first` collection mode expands all branches of the aggregation tree in a depth-first manner and only does pruning after the expansion is done. | ||
|
||
However, when using nested terms aggregations, the cardinality of the number of buckets returned is multiplied by the cardinality of the field at each level of nesting, making it easy to see combinatorial explosion in the bucket count as you nest aggregations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should "terms" be in code font?
Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@natebower I addressed your comments. Please review again when you get a chance. |
_aggregations/bucket/terms.md
Outdated
|
||
## The `min_doc_count` and `shard_min_doc_count` parameters | ||
|
||
You can use the `min_doc_count` parameter to filter out any unique terms with fewer than `min_doc_count` results. The `min_doc_count` threshold is applied only after merging the results retrieved from all of the shards. Each shard is unaware of the global document count for a given term. If there is a significant difference in the top `shard_size` globally frequent terms and the top terms local to a shard, you may receive unexpected results when using the `min_doc_count` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
significant difference "between"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolchfa-aws Just one final comment. Otherwise, LGTM!
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Description
Improve terms aggregation documentation by adding details for missing parameters as well as add caveats for concurrent segment search.
Issues Resolved
Resolves: #6129
Resolves: #5046
Relates:
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.