[BUG] Faster search queries might see a timeout discrepancy in search response due to cached time #2000
Comments
So in this case is the value of "took" plain incorrect?
@dblock there are two problems
due to the relative times being cached at 200ms. Yes, the actual took time looks wrong here.
Caching time at any rate seems problematic. There must be a good reason it was introduced at all, though?
I think System#nanoTime is a slightly expensive system call, and long-running queries could probably live with the inaccuracy. That's my understanding of the motivation behind the cache.
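To make the trade-off concrete, here is a minimal sketch of a cached clock. It is an illustration only, not the code in ThreadPool.java: the class name `CachedClock`, its methods, and the hard-coded 200ms sleep are assumptions made for the example.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch: readers get a cheap volatile read instead of calling
// System.nanoTime() themselves, at the cost of a value that can be stale by
// up to one refresh interval.
final class CachedClock {
    private final LongSupplier nanoSource;   // e.g. System::nanoTime
    private volatile long cachedMillis;

    CachedClock(LongSupplier nanoSource) {
        this.nanoSource = nanoSource;
        refresh();
    }

    // Meant to be called periodically by a background thread.
    void refresh() {
        cachedMillis = TimeUnit.NANOSECONDS.toMillis(nanoSource.getAsLong());
    }

    // Cheap read; does not touch the underlying clock.
    long relativeTimeInMillis() {
        return cachedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        CachedClock clock = new CachedClock(System::nanoTime);
        Thread updater = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(200);       // assumed refresh interval
                    clock.refresh();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        updater.setDaemon(true);
        updater.start();

        long before = clock.relativeTimeInMillis();
        Thread.sleep(50);                    // a "fast query"
        long after = clock.relativeTimeInMillis();
        System.out.println("cached delta after ~50ms of real time: " + (after - before) + "ms");
    }
}
```

With a refresh every 200ms, a query that really takes 50ms will usually observe a delta of 0ms, and occasionally a much larger one if a refresh happens to land in between.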
Sounds like we use
System.currentTimeMillis is based on physical clock time, which can give negative deltas if clocks go back due to NTP. System.nanoTime uses a monotonically increasing time, which is the right choice for measuring elapsed times.
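As a small illustration of the point above, elapsed time can be measured by subtracting two System.nanoTime readings. The class and method names here are made up for the example.

```java
import java.util.concurrent.TimeUnit;

final class ElapsedTimeDemo {
    // Always subtract the two readings rather than comparing them directly;
    // the difference stays correct even if the raw counter wraps around.
    static long elapsedMillis(long startNanos, long endNanos) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(50);                    // stand-in for the work being timed
        long tookMillis = elapsedMillis(start, System.nanoTime());
        System.out.println("took=" + tookMillis + "ms");
    }
}
```

The same pattern with System.currentTimeMillis would be exposed to wall-clock adjustments between the two readings.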
Agree with @Bukhtawar. [1] https://shipilev.net/blog/2014/nanotrusting-nanotime/#_building_performance_models
Yes, of course. I should have read the documentation. Thx.
I'm looking into this.
Root cause: Right now, when the timeout cancellation Runnable is created in QueryPhase.java, it uses
These premature false-positive timeouts can happen when
E.g.
To fix this, only
Steps to consistently reproduce:
and after using
Describe the bug
With the search timeout set to 200ms, a user can end up seeing the response below, which seems inconsistent: the timeout itself was set at 200ms, so there should be no way for the took time to be below the timeout while the query still times out.
The major issue could be wrong timeouts being enforced, either prematurely or too late, based on the estimated time intervals. For example, a query might time out at 0ms or at 400ms for a 200ms timeout.
This happens due to the elapsed time computation, which uses an optimization for System#nanoTime that caches the time and refreshes it every 200ms by default, based on the setting thread_pool.estimated_time_interval.
Some latency-sensitive queries might see a discrepancy based on these defaults.
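The 0ms and 400ms extremes can be reproduced with a small deterministic model. This is a simplified sketch, not OpenSearch code: it assumes the cached clock refreshes exactly at multiples of 200ms, that the timeout check runs every millisecond, and that all names are hypothetical. It compares two ways of taking the start time.

```java
final class CachedTimeoutDemo {
    static final long INTERVAL_MS = 200;  // assumed cache refresh interval
    static final long TIMEOUT_MS = 200;   // requested search timeout

    // Value a cached clock would report at the given real time:
    // the time of the most recent refresh.
    static long cachedTime(long actualMillis) {
        return (actualMillis / INTERVAL_MS) * INTERVAL_MS;
    }

    // Real elapsed ms before the check fires when BOTH the start time and
    // the current time are read from the cached clock.
    static long firesAfterCachedStart(long actualStart) {
        long start = cachedTime(actualStart);
        for (long now = actualStart; ; now++) {
            if (cachedTime(now) - start >= TIMEOUT_MS) {
                return now - actualStart;
            }
        }
    }

    // Real elapsed ms before the check fires when the start time is precise
    // but the current time is read from the cached clock.
    static long firesAfterPreciseStart(long actualStart) {
        for (long now = actualStart; ; now++) {
            if (cachedTime(now) - actualStart >= TIMEOUT_MS) {
                return now - actualStart;
            }
        }
    }

    public static void main(String[] args) {
        // Query starts just before a refresh: times out after 1ms of real time.
        System.out.println(firesAfterCachedStart(199));  // prints 1
        // Query starts just after a refresh: times out after 399ms of real time.
        System.out.println(firesAfterPreciseStart(1));   // prints 399
        // Query starts exactly on a refresh: both variants fire at 200ms.
        System.out.println(firesAfterCachedStart(0));    // prints 200
        System.out.println(firesAfterPreciseStart(0));   // prints 200
    }
}
```

In this model, a stale start time makes the timeout fire early (anywhere from 1ms to 200ms), while a precise start combined with a stale current time makes it fire late (anywhere from 200ms to 399ms).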
We need to check what a reasonable default for the estimated time interval is, based on JMH benchmarks. Today it exists as a static value, with essentially no documentation on what to expect from a search timeout. We could make it dynamic, with reasonable defaults, and let users choose the interval within appropriate limits.
OpenSearch/server/src/main/java/org/opensearch/threadpool/ThreadPool.java
Line 632 in 996d33a