Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use collector manager for search when necessary #45829

Conversation

mayya-sharipova
Copy link
Contributor

When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
Thus for such a case, we use collectorManager,
where for every segment a dedicated collector will be created.

TODO: on the Lucene side we need to make collector to accept
min competitive score

When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
Thus for such a case, we use collectorManager,
where for every segment a dedicated collector will be created.

TODO: on the Lucene side we need to make collector to except
min competitive score
@mayya-sharipova mayya-sharipova requested a review from jimczi August 22, 2019 09:46
@mayya-sharipova mayya-sharipova added the :Search/Search Search-related issues that do not fall into other categories label Aug 22, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good @mayya-sharipova , I left some comments

// the leaves in any order. This is needed since we reorder
// the leaves based on the minimum value in each segment.
newSortFields[newSortFields.length-1] = SortField.FIELD_DOC;
newFormats[newSortFields.length-1] = DocValueFormat.RAW;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still need to tiebreak by global doc id because we reorder the leaves in searchWithCollectorManager ?

@@ -470,7 +533,7 @@ static void restoreTopFieldDocs(QuerySearchResult result, SortAndFormats origina
TopDocs topDocs = result.topDocs().topDocs;
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
FieldDoc fieldDoc = (FieldDoc) scoreDoc;
fieldDoc.fields = Arrays.copyOfRange(fieldDoc.fields, 1, fieldDoc.fields.length-1);
fieldDoc.fields = Arrays.copyOfRange(fieldDoc.fields, 1, fieldDoc.fields.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still need to tiebreak by global doc id because we reorder the leaves in searchWithCollectorManager ?

QuerySearchResult queryResult = searchContext.queryResult();
try {
Weight weight = searcher.createWeight(searcher.rewrite(query), queryCollector.scoreMode(), 1f);
searcher.search(searcher.getIndexReader().leaves(), weight, queryCollector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can replace with searcher.search(query, queryCollector); ?

@Override
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {
public void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to change the visibility, we can also make searchLeaf private since it is only used internally

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to change the visibility

Because there is already a method with the same signature in its superclass IndexSearcher that is protected, if I don't make this method protected or public, Java will complain with a message "attempting to assign weaker access privileges". I will make this method protected

also make searchLeaf private

I did. I also have to modify SearchCancellationTests::testCancellableCollector as it was not able to use searchLeaf anymore.

if (searchContext.trackTotalHitsUpTo() != SearchContext.TRACK_TOTAL_HITS_DISABLED) {
hitCount = shortcutTotalHitCount(reader, query);
if (searchContext.trackTotalHitsUpTo() == Integer.MAX_VALUE) {
scoreMode = ScoreMode.COMPLETE; //TODO: not sure if scoreMode should always be TOP_SCORES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already check this here so I don't think it is needed. We shouldn't run with the optim if the score mode is COMPLETE.

@mayya-sharipova
Copy link
Contributor Author

@jimczi Thanks for the review. I have addressed your comments in the second commit.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mayya-sharipova mayya-sharipova merged commit 81971ac into elastic:long_sort_optimization Aug 30, 2019
@mayya-sharipova mayya-sharipova deleted the search-collector-manager branch August 30, 2019 18:45
mayya-sharipova added a commit that referenced this pull request Nov 26, 2019
* Optimize sort on numeric long and date fields (#39770)

Optimize sort on numeric long and date fields, when 
the system property `es.search.long_sort_optimized` is true.

* Skip optimization if the index has duplicate data (#43121)

Skip sort optimization if the index has 50% or more data
with the same value.
When index has a lot of docs with the same value, sort
optimization doesn't make sense, as DistanceFeatureQuery
will produce same scores for these docs, and Lucene
will use the second sort to tie-break. This could be slower
than usual sorting.

* Sort leaves on search according to the primary numeric sort field (#44021)

This change pre-sort the index reader leaves (segment) prior to search
when the primary sort is a numeric field eligible to the distance feature
optimization. It also adds a tie breaker on `_doc` to the rewritten sort
in order to bypass the fact that leaves will be collected in a random order.
I ran this patch on the http_logs benchmark and the results are very promising:

```
|                                       50th percentile latency | desc_sort_timestamp |    220.706 |      136544 |   136324 |     ms |
|                                       90th percentile latency | desc_sort_timestamp |    244.847 |      162084 |   161839 |     ms |
|                                       99th percentile latency | desc_sort_timestamp |    316.627 |      172005 |   171688 |     ms |
|                                      100th percentile latency | desc_sort_timestamp |    335.306 |      173325 |   172989 |     ms |
|                                  50th percentile service time | desc_sort_timestamp |    218.369 |     1968.11 |  1749.74 |     ms |
|                                  90th percentile service time | desc_sort_timestamp |    244.182 |      2447.2 |  2203.02 |     ms |
|                                  99th percentile service time | desc_sort_timestamp |    313.176 |     2950.85 |  2637.67 |     ms |
|                                 100th percentile service time | desc_sort_timestamp |    332.924 |     2959.38 |  2626.45 |     ms |
|                                                    error rate | desc_sort_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput |  asc_sort_timestamp |   0.801824 |    0.800855 | -0.00097 |  ops/s |
|                                             Median Throughput |  asc_sort_timestamp |   0.802595 |    0.801104 | -0.00149 |  ops/s |
|                                                Max Throughput |  asc_sort_timestamp |   0.803282 |    0.801351 | -0.00193 |  ops/s |
|                                       50th percentile latency |  asc_sort_timestamp |    220.761 |     824.098 |  603.336 |     ms |
|                                       90th percentile latency |  asc_sort_timestamp |    251.741 |     853.984 |  602.243 |     ms |
|                                       99th percentile latency |  asc_sort_timestamp |    368.761 |     893.943 |  525.182 |     ms |
|                                      100th percentile latency |  asc_sort_timestamp |    431.042 |      908.85 |  477.808 |     ms |
|                                  50th percentile service time |  asc_sort_timestamp |    218.547 |     820.757 |  602.211 |     ms |
|                                  90th percentile service time |  asc_sort_timestamp |    249.578 |     849.886 |  600.308 |     ms |
|                                  99th percentile service time |  asc_sort_timestamp |    366.317 |     888.894 |  522.577 |     ms |
|                                 100th percentile service time |  asc_sort_timestamp |    430.952 |     908.401 |   477.45 |     ms |
|                                                    error rate |  asc_sort_timestamp |          0 |           0 |        0 |      % |
```

So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note
that I indexed the http_logs with a single client in order to simulate real time-based indices
where document are indexed in their timestamp order.

Relates #37043

* Remove nested collector in docs response

As we don't use cancellableCollector anymore, it should be removed from
the expected docs response.

* Use collector manager for search when necessary (#45829)

When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
Thus for such a case, we use collectorManager,
where for every segment a dedicated collector will be created.

* Use shared TopFieldCollector manager

Use shared TopFieldCollector manager for sort optimization.
This collector manager is able to exchange minimum competitive
score between collectors

* Correct calculation of avg value to avoid overflow

* Optimize calculating if index has duplicate data
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants