Use collector manager for search when necessary #45829
Conversation
When we optimize sort, we sort segments by their min/max values. As a collector expects to receive segments in order, we cannot use a single collector for the sorted segments. Thus for such a case we use a collector manager, where a dedicated collector is created for every segment. TODO: on the Lucene side we need to make the collector accept a minimum competitive score.
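Conceptually, the manager side looks roughly like the following hand-rolled sketch (hypothetical class, not the actual Elasticsearch code; a later commit in this PR switches to Lucene's shared `TopFieldCollector` manager so collectors can also exchange the minimum competitive score):

```java
import java.util.Collection;

import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopFieldCollector;
import org.apache.lucene.search.TopFieldDocs;

class PerSegmentTopDocsManager implements CollectorManager<TopFieldCollector, TopFieldDocs> {
    private final Sort sort;
    private final int numHits;

    PerSegmentTopDocsManager(Sort sort, int numHits) {
        this.sort = sort;
        this.numHits = numHits;
    }

    @Override
    public TopFieldCollector newCollector() {
        // a fresh collector per segment/slice; Integer.MAX_VALUE as the
        // total-hits threshold disables early termination in this sketch
        return TopFieldCollector.create(sort, numHits, Integer.MAX_VALUE);
    }

    @Override
    public TopFieldDocs reduce(Collection<TopFieldCollector> collectors) {
        TopFieldDocs[] perCollector = new TopFieldDocs[collectors.size()];
        int i = 0;
        for (TopFieldCollector collector : collectors) {
            perCollector[i++] = collector.topDocs();
        }
        // merge the per-collector top hits back into one globally ranked list
        return TopDocs.merge(sort, numHits, perCollector);
    }
}
```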
Pinging @elastic/es-search
it looks good @mayya-sharipova, I left some comments
```
// the leaves in any order. This is needed since we reorder
// the leaves based on the minimum value in each segment.
newSortFields[newSortFields.length - 1] = SortField.FIELD_DOC;
newFormats[newSortFields.length - 1] = DocValueFormat.RAW;
```
we still need to tiebreak by global doc id because we reorder the leaves in `searchWithCollectorManager`?
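For reference, such a tiebreaker is just one extra trailing sort field; a minimal sketch using Lucene's `Sort` API (hypothetical helper, not the actual PR code):

```java
import java.util.Arrays;

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Append a doc-id tiebreaker to an existing sort. Without it, hits that
// compare equal on all sort fields come back in whatever order their
// segments happen to be visited.
static Sort withDocTiebreak(Sort original) {
    SortField[] fields = original.getSort();
    SortField[] extended = Arrays.copyOf(fields, fields.length + 1);
    extended[extended.length - 1] = SortField.FIELD_DOC;
    return new Sort(extended);
}
```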
```diff
@@ -470,7 +533,7 @@ static void restoreTopFieldDocs(QuerySearchResult result, SortAndFormats origina
     TopDocs topDocs = result.topDocs().topDocs;
     for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
         FieldDoc fieldDoc = (FieldDoc) scoreDoc;
-        fieldDoc.fields = Arrays.copyOfRange(fieldDoc.fields, 1, fieldDoc.fields.length-1);
+        fieldDoc.fields = Arrays.copyOfRange(fieldDoc.fields, 1, fieldDoc.fields.length);
```
we still need to tiebreak by global doc id because we reorder the leaves in `searchWithCollectorManager`?
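To make the off-by-one concrete, a tiny illustration (made-up values) of what the two upper bounds keep:

```java
import java.util.Arrays;

Object[] fields = { 0.93f, 1546300800000L, 42 }; // [injected score, user field, trailing tiebreak]

// old code: drops BOTH the injected score and the trailing tiebreak value
Object[] broken = Arrays.copyOfRange(fields, 1, fields.length - 1);
// broken   == { 1546300800000L }

// fixed code: drops only the injected score at index 0
Object[] restored = Arrays.copyOfRange(fields, 1, fields.length);
// restored == { 1546300800000L, 42 }
```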
```
QuerySearchResult queryResult = searchContext.queryResult();
try {
    Weight weight = searcher.createWeight(searcher.rewrite(query), queryCollector.scoreMode(), 1f);
    searcher.search(searcher.getIndexReader().leaves(), weight, queryCollector);
```
you can replace with `searcher.search(query, queryCollector);`?
```diff
 @Override
-protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {
+public void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {
```
We don't need to change the visibility, we can also make `searchLeaf` private since it is only used internally
> We don't need to change the visibility

Because there is already a method with the same signature in its superclass `IndexSearcher` that is `protected`, if I don't make this method `protected` or `public`, Java will complain with the message "attempting to assign weaker access privileges". I will make this method `protected`.

> also make searchLeaf private

I did. I also had to modify `SearchCancellationTests::testCancellableCollector`, as it was not able to use `searchLeaf` anymore.
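A minimal illustration of the compiler rule in question: an override may widen access, but never narrow it.

```java
class Base {
    protected void search() {}
}

class Sub extends Base {
    @Override
    public void search() {}   // OK: protected -> public widens access

    // The following would NOT compile:
    // "search() in Sub cannot override search() in Base;
    //  attempting to assign weaker access privileges; was protected"
    //
    // @Override
    // void search() {}       // package-private is weaker than protected
}
```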
```
if (searchContext.trackTotalHitsUpTo() != SearchContext.TRACK_TOTAL_HITS_DISABLED) {
    hitCount = shortcutTotalHitCount(reader, query);
    if (searchContext.trackTotalHitsUpTo() == Integer.MAX_VALUE) {
        scoreMode = ScoreMode.COMPLETE; // TODO: not sure if scoreMode should always be TOP_SCORES
```
we already check this here, so I don't think it is needed. We shouldn't run with the optimization if the score mode is COMPLETE.
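A hedged sketch of the trade-off being discussed (illustrative only, not the actual Elasticsearch code): `TOP_SCORES` lets collectors raise a minimum competitive score and skip non-competitive documents, which is only sound when an exact count of every matching document is not required.

```java
import org.apache.lucene.search.ScoreMode;

ScoreMode scoreMode = trackTotalHitsUpTo == Integer.MAX_VALUE
    ? ScoreMode.COMPLETE     // exact total hit count: must visit every match
    : ScoreMode.TOP_SCORES;  // top hits only: may skip once the queue is competitive
```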
@jimczi Thanks for the review. I have addressed your comments in the second commit.
LGTM
* Optimize sort on numeric long and date fields (#39770)

  Optimize sort on numeric long and date fields, when the system property `es.search.long_sort_optimized` is true.

* Skip optimization if the index has duplicate data (#43121)

  Skip sort optimization if the index has 50% or more data with the same value. When an index has a lot of docs with the same value, sort optimization doesn't make sense, as DistanceFeatureQuery will produce the same scores for these docs, and Lucene will use the second sort to tie-break. This could be slower than the usual sorting.

* Sort leaves on search according to the primary numeric sort field (#44021)

  This change pre-sorts the index reader leaves (segments) prior to search when the primary sort is a numeric field eligible for the distance feature optimization (a rough sketch of this idea follows the commit list below). It also adds a tiebreaker on `_doc` to the rewritten sort in order to bypass the fact that leaves will be collected in a random order. I ran this patch on the http_logs benchmark and the results are very promising:

  ```
  | 50th percentile latency       | desc_sort_timestamp | 220.706  | 136544   | 136324   | ms    |
  | 90th percentile latency       | desc_sort_timestamp | 244.847  | 162084   | 161839   | ms    |
  | 99th percentile latency       | desc_sort_timestamp | 316.627  | 172005   | 171688   | ms    |
  | 100th percentile latency      | desc_sort_timestamp | 335.306  | 173325   | 172989   | ms    |
  | 50th percentile service time  | desc_sort_timestamp | 218.369  | 1968.11  | 1749.74  | ms    |
  | 90th percentile service time  | desc_sort_timestamp | 244.182  | 2447.2   | 2203.02  | ms    |
  | 99th percentile service time  | desc_sort_timestamp | 313.176  | 2950.85  | 2637.67  | ms    |
  | 100th percentile service time | desc_sort_timestamp | 332.924  | 2959.38  | 2626.45  | ms    |
  | error rate                    | desc_sort_timestamp | 0        | 0        | 0        | %     |
  | Min Throughput                | asc_sort_timestamp  | 0.801824 | 0.800855 | -0.00097 | ops/s |
  | Median Throughput             | asc_sort_timestamp  | 0.802595 | 0.801104 | -0.00149 | ops/s |
  | Max Throughput                | asc_sort_timestamp  | 0.803282 | 0.801351 | -0.00193 | ops/s |
  | 50th percentile latency       | asc_sort_timestamp  | 220.761  | 824.098  | 603.336  | ms    |
  | 90th percentile latency       | asc_sort_timestamp  | 251.741  | 853.984  | 602.243  | ms    |
  | 99th percentile latency       | asc_sort_timestamp  | 368.761  | 893.943  | 525.182  | ms    |
  | 100th percentile latency      | asc_sort_timestamp  | 431.042  | 908.85   | 477.808  | ms    |
  | 50th percentile service time  | asc_sort_timestamp  | 218.547  | 820.757  | 602.211  | ms    |
  | 90th percentile service time  | asc_sort_timestamp  | 249.578  | 849.886  | 600.308  | ms    |
  | 99th percentile service time  | asc_sort_timestamp  | 366.317  | 888.894  | 522.577  | ms    |
  | 100th percentile service time | asc_sort_timestamp  | 430.952  | 908.401  | 477.45   | ms    |
  | error rate                    | asc_sort_timestamp  | 0        | 0        | 0        | %     |
  ```

  So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note that I indexed the http_logs with a single client in order to simulate real time-based indices where documents are indexed in their timestamp order.

  Relates #37043

* Remove nested collector in docs response

  As we don't use cancellableCollector anymore, it should be removed from the expected docs response.

* Use collector manager for search when necessary (#45829)

  When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we cannot use a single collector for sorted segments. Thus for such a case, we use a collector manager, where a dedicated collector is created for every segment.
* Use shared TopFieldCollector manager

  Use a shared TopFieldCollector manager for sort optimization. This collector manager is able to exchange the minimum competitive score between collectors.

* Correct calculation of avg value to avoid overflow

* Optimize calculating if index has duplicate data
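As a rough illustration of the leaf pre-sorting idea from #44021 (a hedged sketch; `minTimestamp` is a hypothetical helper, and it assumes the sort field is indexed as a one-dimensional long point):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;

// Order segments by the minimum value of the primary sort field so the
// most competitive segments for an ascending sort are visited first.
List<LeafReaderContext> leaves = new ArrayList<>(searcher.getIndexReader().leaves());
leaves.sort(Comparator.comparingLong(ctx -> minTimestamp(ctx, "timestamp")));

static long minTimestamp(LeafReaderContext ctx, String field) {
    try {
        PointValues points = ctx.reader().getPointValues(field);
        if (points == null) {
            return Long.MAX_VALUE; // segment has no values; visit it last
        }
        // decode the 1-D min packed value of the long point field
        return LongPoint.decodeDimension(points.getMinPackedValue(), 0);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
```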