Use collector manager for search when necessary #45829

mayya-sharipova · 2019-08-22T09:45:56Z

When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
Thus for such a case, we use collectorManager,
where for every segment a dedicated collector will be created.

TODO: on the Lucene side we need to make collector to accept
min competitive score

When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we can not use a single collector for sorted segments. Thus for such a case, we use collectorManager, where for every segment a dedicated collector will be created. TODO: on the Lucene side we need to make collector to except min competitive score

elasticmachine · 2019-08-22T09:50:23Z

Pinging @elastic/es-search

jimczi

it looks good @mayya-sharipova , I left some comments

jimczi · 2019-08-22T10:19:01Z

server/src/main/java/org/elasticsearch/search/query/QueryPhase.java

-                    // the leaves in any order. This is needed since we reorder
-                    // the leaves based on the minimum value in each segment.
-                    newSortFields[newSortFields.length-1] = SortField.FIELD_DOC;
-                    newFormats[newSortFields.length-1] = DocValueFormat.RAW;


we still need to tiebreak by global doc id because we reorder the leaves in searchWithCollectorManager ?

jimczi · 2019-08-22T10:19:08Z

server/src/main/java/org/elasticsearch/search/query/QueryPhase.java

@@ -470,7 +533,7 @@ static void restoreTopFieldDocs(QuerySearchResult result, SortAndFormats origina
        TopDocs topDocs = result.topDocs().topDocs;
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            FieldDoc fieldDoc = (FieldDoc) scoreDoc;
-            fieldDoc.fields = Arrays.copyOfRange(fieldDoc.fields, 1, fieldDoc.fields.length-1);
+            fieldDoc.fields = Arrays.copyOfRange(fieldDoc.fields, 1, fieldDoc.fields.length);


we still need to tiebreak by global doc id because we reorder the leaves in searchWithCollectorManager ?

jimczi · 2019-08-22T10:57:35Z

server/src/main/java/org/elasticsearch/search/query/QueryPhase.java

+        QuerySearchResult queryResult = searchContext.queryResult();
+        try {
+            Weight weight = searcher.createWeight(searcher.rewrite(query), queryCollector.scoreMode(), 1f);
+            searcher.search(searcher.getIndexReader().leaves(), weight, queryCollector);


you can replace with searcher.search(query, queryCollector); ?

jimczi · 2019-08-22T10:59:17Z

server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java

    @Override
-    protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {
+    public void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {


We don't need to change the visibility, we can also make searchLeaf private since it is only used internally

We don't need to change the visibility

Because there is already a method with the same signature in its superclass IndexSearcher that is protected, if I don't make this method protected or public, Java will complain with a message "attempting to assign weaker access privileges". I will make this method protected

also make searchLeaf private

I did. I also have to modify SearchCancellationTests::testCancellableCollector as it was not able to use searchLeaf anymore.

jimczi · 2019-08-22T11:07:36Z

server/src/main/java/org/elasticsearch/search/query/QueryPhase.java

+        if (searchContext.trackTotalHitsUpTo() != SearchContext.TRACK_TOTAL_HITS_DISABLED) {
+            hitCount = shortcutTotalHitCount(reader, query);
+            if (searchContext.trackTotalHitsUpTo() == Integer.MAX_VALUE) {
+                scoreMode = ScoreMode.COMPLETE; //TODO: not sure if scoreMode should always be TOP_SCORES


we already check this here so I don't think it is needed. We shouldn't run with the optim if the score mode is COMPLETE.

mayya-sharipova · 2019-08-28T20:01:01Z

@jimczi Thanks for the review. I have addressed your comments in the second commit.

jimczi

LGTM

* Optimize sort on numeric long and date fields (#39770) Optimize sort on numeric long and date fields, when the system property `es.search.long_sort_optimized` is true. * Skip optimization if the index has duplicate data (#43121) Skip sort optimization if the index has 50% or more data with the same value. When index has a lot of docs with the same value, sort optimization doesn't make sense, as DistanceFeatureQuery will produce same scores for these docs, and Lucene will use the second sort to tie-break. This could be slower than usual sorting. * Sort leaves on search according to the primary numeric sort field (#44021) This change pre-sort the index reader leaves (segment) prior to search when the primary sort is a numeric field eligible to the distance feature optimization. It also adds a tie breaker on `_doc` to the rewritten sort in order to bypass the fact that leaves will be collected in a random order. I ran this patch on the http_logs benchmark and the results are very promising: ``` | 50th percentile latency | desc_sort_timestamp | 220.706 | 136544 | 136324 | ms | | 90th percentile latency | desc_sort_timestamp | 244.847 | 162084 | 161839 | ms | | 99th percentile latency | desc_sort_timestamp | 316.627 | 172005 | 171688 | ms | | 100th percentile latency | desc_sort_timestamp | 335.306 | 173325 | 172989 | ms | | 50th percentile service time | desc_sort_timestamp | 218.369 | 1968.11 | 1749.74 | ms | | 90th percentile service time | desc_sort_timestamp | 244.182 | 2447.2 | 2203.02 | ms | | 99th percentile service time | desc_sort_timestamp | 313.176 | 2950.85 | 2637.67 | ms | | 100th percentile service time | desc_sort_timestamp | 332.924 | 2959.38 | 2626.45 | ms | | error rate | desc_sort_timestamp | 0 | 0 | 0 | % | | Min Throughput | asc_sort_timestamp | 0.801824 | 0.800855 | -0.00097 | ops/s | | Median Throughput | asc_sort_timestamp | 0.802595 | 0.801104 | -0.00149 | ops/s | | Max Throughput | asc_sort_timestamp | 0.803282 | 0.801351 | -0.00193 | ops/s | | 50th percentile latency | asc_sort_timestamp | 220.761 | 824.098 | 603.336 | ms | | 90th percentile latency | asc_sort_timestamp | 251.741 | 853.984 | 602.243 | ms | | 99th percentile latency | asc_sort_timestamp | 368.761 | 893.943 | 525.182 | ms | | 100th percentile latency | asc_sort_timestamp | 431.042 | 908.85 | 477.808 | ms | | 50th percentile service time | asc_sort_timestamp | 218.547 | 820.757 | 602.211 | ms | | 90th percentile service time | asc_sort_timestamp | 249.578 | 849.886 | 600.308 | ms | | 99th percentile service time | asc_sort_timestamp | 366.317 | 888.894 | 522.577 | ms | | 100th percentile service time | asc_sort_timestamp | 430.952 | 908.401 | 477.45 | ms | | error rate | asc_sort_timestamp | 0 | 0 | 0 | % | ``` So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note that I indexed the http_logs with a single client in order to simulate real time-based indices where document are indexed in their timestamp order. Relates #37043 * Remove nested collector in docs response As we don't use cancellableCollector anymore, it should be removed from the expected docs response. * Use collector manager for search when necessary (#45829) When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we can not use a single collector for sorted segments. Thus for such a case, we use collectorManager, where for every segment a dedicated collector will be created. * Use shared TopFieldCollector manager Use shared TopFieldCollector manager for sort optimization. This collector manager is able to exchange minimum competitive score between collectors * Correct calculation of avg value to avoid overflow * Optimize calculating if index has duplicate data

mayya-sharipova requested a review from jimczi August 22, 2019 09:46

mayya-sharipova added the :Search/Search Search-related issues that do not fall into other categories label Aug 22, 2019

jimczi reviewed Aug 22, 2019

View reviewed changes

Address Jim's comments

8e054be

Correct check style

816b243

jimczi approved these changes Aug 30, 2019

View reviewed changes

mayya-sharipova merged commit 81971ac into elastic:long_sort_optimization Aug 30, 2019

mayya-sharipova deleted the search-collector-manager branch August 30, 2019 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use collector manager for search when necessary #45829

Use collector manager for search when necessary #45829

mayya-sharipova commented Aug 22, 2019

elasticmachine commented Aug 22, 2019

jimczi left a comment

jimczi Aug 22, 2019

jimczi Aug 22, 2019

jimczi Aug 22, 2019

jimczi Aug 22, 2019

mayya-sharipova Aug 28, 2019 •

edited

Loading

jimczi Aug 22, 2019

mayya-sharipova commented Aug 28, 2019

jimczi left a comment

Use collector manager for search when necessary #45829

Use collector manager for search when necessary #45829

Conversation

mayya-sharipova commented Aug 22, 2019

elasticmachine commented Aug 22, 2019

jimczi left a comment

Choose a reason for hiding this comment

jimczi Aug 22, 2019

Choose a reason for hiding this comment

jimczi Aug 22, 2019

Choose a reason for hiding this comment

jimczi Aug 22, 2019

Choose a reason for hiding this comment

jimczi Aug 22, 2019

Choose a reason for hiding this comment

mayya-sharipova Aug 28, 2019 • edited Loading

Choose a reason for hiding this comment

jimczi Aug 22, 2019

Choose a reason for hiding this comment

mayya-sharipova commented Aug 28, 2019

jimczi left a comment

Choose a reason for hiding this comment

mayya-sharipova Aug 28, 2019 •

edited

Loading