Searches against a large number of unavailable shards result in very large responses #90622
Pinging @elastic/es-search (Team:Search)
It seems that originally we did not record these failures at all, but the behavior was changed in #64337. So a good middle ground seems to be, as you said, not returning the stack trace for this class of failures. Sound fair?
Well, we can keep including unavailable shards as shard failures, but adjusting their exception serialization is a bit more complicated. We will need to discuss the best way to approach this. Some silly ideas:
…ailable (#91365) When many shards are unavailable, we repeatedly store the exact same stack trace and exception; the only difference is the exception message. This commit fixes this by slightly modifying the created exception so that, when a shard is unavailable, it neither provides a stack trace nor prints its stack trace as a "reason". Closes #90622
…ailable (#91365) (#92907) When many shards are unavailable, we repeatedly store the exact same stack trace and exception; the only difference is the exception message. This commit fixes this by slightly modifying the created exception so that, when a shard is unavailable, it neither provides a stack trace nor prints its stack trace as a "reason". Closes #90622
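The commits above describe creating the unavailable-shard exception without a stack trace. In plain Java, one common way to do that is the protected four-argument `Throwable` constructor, whose `writableStackTrace=false` flag skips capturing the stack. The sketch below is a standalone illustration under that assumption, not the actual change in #91365; `ShardUnavailableDemo` and `ShardNotAvailableException` here are hypothetical names.

```java
public class ShardUnavailableDemo {
    /**
     * Hypothetical exception for an unavailable shard. Passing
     * writableStackTrace=false to the RuntimeException constructor means no
     * stack is captured, so each instance carries only its message.
     */
    static final class ShardNotAvailableException extends RuntimeException {
        ShardNotAvailableException(String shardId) {
            // (message, cause, enableSuppression, writableStackTrace)
            super(shardId + " shard not available", null, false, false);
        }
    }

    public static void main(String[] args) {
        ShardNotAvailableException e = new ShardNotAvailableException("[logs-1][3]");
        System.out.println(e.getMessage());                          // [logs-1][3] shard not available
        System.out.println("stack frames: " + e.getStackTrace().length); // stack frames: 0
    }
}
```

Skipping stack capture also avoids the cost of walking the stack each time one of these exceptions is created, which matters when thousands are built for a single search.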
Searching a large number of unavailable shards through, e.g., the `*` pattern while a large cluster is recovering from a full restart (or similar) leads to extremely large responses containing an exception for each shard. This example shows a 375M on-heap response being returned for ~25k unavailable shards. Over the wire, this serializes to a similar size, so we'd see 700M+ peak heap usage for a single search request, when the valid search response in a green cluster might be much smaller than this.
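As a rough check, assuming "375M" means about 375 MB, 375,000,000 bytes / 25,000 shards is about 15,000 bytes (~15 KB) per shard failure, far more than a one-line message would need. The hypothetical standalone program below (not Elasticsearch code) renders one exception with and without its stack trace and scales each by the shard count, to show where the per-entry bulk comes from. Its stack is only one frame deep, so it understates the gap; real search code paths are much deeper.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class FailureSizeDemo {
    // Renders an exception including its full stack trace, as a verbose "reason" might.
    static String withStackTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }

    // Renders only the exception class name and message, with no stack trace.
    static String messageOnly(Throwable t) {
        return t.toString();
    }

    public static void main(String[] args) {
        int shards = 25_000;
        // Hypothetical stand-in for one per-shard "shard not available" failure.
        Exception e = new IllegalStateException("[index-0][0] shard not available");
        long full = (long) withStackTrace(e).length() * shards;
        long brief = (long) messageOnly(e).length() * shards;
        System.out.println("with stack trace: " + full + " chars");
        System.out.println("message only:     " + brief + " chars");
    }
}
```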
It seems to me that we could mostly resolve this by not returning the stack trace in the specific case of the shard-not-available exception. The stack trace doesn't seem valuable to users, and we made a similar fix for unavailable shards in snapshot state responses. Thoughts?