Increase index.search.idle.after setting default from 30s to 10 minutes (or more) #9707
Comments
Hi @msfroh, +1 to increase the default setting, but why 10 mins? Do you have data to prove that 10 mins is the right/best timeout value?
Absolutely not! It's another totally arbitrary limit that should be debated, ideally based on data. In some ways, I would almost prefer a situation where there is no default in order to force users to think about what's right for their use-case, but I realize that's probably not helpful. (In particular, the current 30 second idle time mostly seems to be hurting the users who just want to use the out-of-the-box defaults.) I think of the problem as follows:
I had an absolutely terrible idea for an alternative solution: We can keep track of how frequently a given index receives search requests while idle. If the index receives more than N requests to idle shards within time period T (where N and T are configurable but have arbitrary default values), then we disable the shard idle behavior for the index.
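The "N requests within T" heuristic above can be sketched as a sliding-window counter. This is a minimal illustration, not OpenSearch code; the class and method names, and the defaults, are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of the idea above: count searches that land on an idle shard, and
 * disable shard-idle for the index once more than N such hits occur within a
 * sliding window of T milliseconds. Hypothetical names, not an actual API.
 */
class IdleHitTracker {
    private final int maxIdleHits;          // N
    private final long windowMillis;        // T
    private final Deque<Long> idleHits = new ArrayDeque<>();
    private boolean shardIdleDisabled = false;

    IdleHitTracker(int maxIdleHits, long windowMillis) {
        this.maxIdleHits = maxIdleHits;
        this.windowMillis = windowMillis;
    }

    /** Record a search that hit an idle shard at time nowMillis. */
    void onIdleSearch(long nowMillis) {
        if (shardIdleDisabled) return;
        idleHits.addLast(nowMillis);
        // Evict hits that have fallen out of the sliding window.
        while (!idleHits.isEmpty() && nowMillis - idleHits.peekFirst() > windowMillis) {
            idleHits.removeFirst();
        }
        if (idleHits.size() > maxIdleHits) {
            // This index clearly gets searched while "idle": stop idling it.
            shardIdleDisabled = true;
        }
    }

    boolean isShardIdleDisabled() {
        return shardIdleDisabled;
    }
}
```

One design question this surfaces: whether the disable decision should be permanent (as here) or should itself decay, so an index can become idle-eligible again after a long quiet period.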
I like the "terrible" idea. Here are a few more ideas to throw into the open if we don't want to increase the idle time...
This probably warrants its own issue, but I'm curious to hear comments on all these.
I was just talking with @ruai0511 and we came up with another possible option. What if shard idle didn't stop refreshes altogether, but rather just made them more sparse? Right now, during the first 30 seconds that a shard (with all default settings) exists, it refreshes every second. If it doesn't see any searches within those 30 seconds, refreshes stop altogether. When a search request comes in, the shard does a blocking refresh, then goes back to refreshing every second for the next 30 seconds (assuming no more search requests come).

What if, instead, being idle for 30 seconds doubled the implicit refresh interval? So, after the first 30 seconds of 1-second refreshes, we back off to 2-second refreshes, then 4 seconds, then 8, ... up to a max of (say) 64 seconds. If a search request comes in, we drop the implicit refresh interval back to 1 second. (Note that this is only the implicit refresh interval -- if the user has explicitly set a refresh interval, we continue to honor that and there is no shard idle behavior, just like now.)

The one major risk that I could see is in the scenario where e.g. someone is sending logs in a "write-only" workload, then something bad happens, so they decide to search their logs, but they don't see logs from the past 64 seconds on that first search. (Of course, that first search will drop the refresh interval back to 1 second, so a followup search would probably find the relevant results.) If they don't do a followup search, it could be a nasty surprise. Of course, there are workarounds -- explicit refresh before search, sending another search request, etc.
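The backoff schedule described above can be sketched as a small state machine. This is an illustrative sketch only; the class name, method names, and constants are assumptions, not an OpenSearch implementation.

```java
/**
 * Sketch of the exponentially decaying refresh schedule described above:
 * refresh every 1s while the shard has seen a search in the last 30s, then
 * double the implicit interval after each refresh (2s, 4s, 8s, ...) up to a
 * 64s cap, resetting to 1s whenever a search arrives. All constants are the
 * illustrative values from the discussion, not real settings.
 */
class DecayingRefreshSchedule {
    private static final long BASE_INTERVAL_MS = 1_000;
    private static final long MAX_INTERVAL_MS = 64_000;
    private static final long IDLE_THRESHOLD_MS = 30_000;

    private long currentIntervalMs = BASE_INTERVAL_MS;
    private long lastSearchMs = 0;

    /** A search request arrived: drop back to the fast refresh rate. */
    void onSearch(long nowMs) {
        lastSearchMs = nowMs;
        currentIntervalMs = BASE_INTERVAL_MS;
    }

    /** Called after each refresh; returns the interval until the next one. */
    long nextIntervalMs(long nowMs) {
        if (nowMs - lastSearchMs >= IDLE_THRESHOLD_MS) {
            // No recent searches: back off 1s -> 2s -> 4s ... capped at 64s.
            currentIntervalMs = Math.min(currentIntervalMs * 2, MAX_INTERVAL_MS);
        }
        return currentIntervalMs;
    }
}
```

With these constants, maximum staleness is bounded at 64 seconds rather than unbounded (as with today's shard idle), which is the trade-off being weighed above.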
I think we can open this as a separate issue and tackle it there, or tackle it here and abandon the need to bump the default up to 10 mins. But circling back to the tackling options, we should see what the intended goal is in the first place. If it is just tracking at a cluster level, then counters for hitting an idle shard would suffice. If it is to do an adaptive idle policy, then the question is whether you want to do an "ideal idle" policy threshold discovery. The former tells us what to look for if we see oddly high but rare latencies. The latter ensures the system adaptively learns, and we (hopefully) never have to worry about finding that balance between ingesting and refreshing.
The discussion I had with @ruai0511 was mostly about "What is search idle really trying to solve?" Essentially, if you have a "write-only" logging use-case (or an overnight rebuild), the default 1s refresh will:
So -- frequent refresh when indexing only means small segments and wasted I/O. I would be curious to try benchmarking an indexing workload where, instead of the existing shard idle behavior, we do the exponentially-decaying refresh rate, and see if there's any noticeable impact on the indexing speed. My hunch is that refreshing every minute (or minute and 4 seconds) would be infrequent enough to have little impact on indexing performance under load.
We integrated Lucene's merge-on-refresh policy a while back; that should help with the "write-only" logging use-case (and the like), right?
Seems worth trying.
Merge-on-refresh would deal with the small segments, but I believe we still pay the merge cost. I'm not 100% certain, but I think it's still better to write larger segments in the first place, up to a point. (Eventually, you're going to bump into the RAM buffer limit and will end up flushing anyway.) Back when I was working on Amazon Product Search, @mikemccand tried disabling explicit commits during the index build (since we used a "rebuild offline" model) and it had no real impact on the index build time. So, that's where I get the "up to a point" reasoning: while I would believe that flushing every second hurts indexing throughput, I have one anecdata point to suggest that there's no difference between flushing every minute and disabling explicit flushes altogether.
With the exponentially decaying refresh rate, I assume we'd still force a refresh when a search request hits a shard in the increased-interval state, in order to keep the same staleness guarantees, right?
That would just (more or less) bring back the existing idle shard behavior, though, which is exactly what this issue is trying to address. I covered the staleness problem above:
IMO, if someone wants a staleness guarantee, they should either explicitly set the refresh interval (disabling the shard idle behavior) or issue an explicit refresh before they search. |
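For reference, both workarounds mentioned above are available today through the index settings and refresh APIs. The index name `my-index` is illustrative:

```
# Explicitly set the refresh interval (this disables shard idle behavior):
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}

# Or issue an explicit refresh before searching:
POST /my-index/_refresh
```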
Sorry, missed that! I think the idea of an adaptive idle policy with a bounded max staleness is interesting. It does become challenging to make it the default due to the potential nasty surprise you mentioned. I'm on board with benchmarking it and potentially adding it as an option (and maybe making it the default in a future major version). I also intuitively agree that the default 30 second idle timeout seems far too aggressive, and I would be on board with changing that.
I feel like we still don't have a great solution to the current problem, where real users who have low levels of traffic end up with search requests that spike in latency because of the existing default behavior. We can increase the default 30 second idle timeout to reduce the number of users who are impacted, but I don't know what the new default should be -- 1 minute? 2 minutes? 5 minutes? 10? I don't have a good suggestion other than picking a different arbitrary value. |
Personally, enabling shard idle by default feels like the wrong choice. Disabling it will give you more consistent and predictable behavior. It's only in the case that you have a natural pattern of bulk loads with literally zero search traffic that it really makes a lot of sense. This is a super unsatisfying suggestion, but can we improve documentation and/or highlight this issue in some of the getting started/setup guides (e.g. https://opensearch.org/docs/latest/install-and-configure/configuration/)? Specifically, I'm suggesting we add some content around choosing the right refresh interval and enabling/disabling shard idle as appropriate for the workload.
I agree with @andrross on this. The search.idle setting is used to increase indexing performance, and users can set it explicitly rather than having it as an implicit default. I think setting it to any arbitrary number may not solve this and comes down to the same problem. Even with an implicit adaptive refresh interval, it might again cause surprises and confusion if users are unaware of it.
Is your feature request related to a problem? Please describe.
Elasticsearch 7.0 introduced a "search idle" feature (elastic/elasticsearch#27500) to avoid refreshing an index that isn't receiving any search traffic. This helps remove the unnecessary effort of refreshing during large bulk load operations. For example, for a "rebuild the index overnight and serve traffic during the day" use-case, it's apparently a big help.
Unfortunately, I've seen at least a few cases where users end up with shards going idle and then block on refresh on their next query:
Describe the solution you'd like
In my opinion, the default 30 second shard idle timeout is far too aggressive.
For the "big overnight re-index job" use-case or "index logs constantly and only search them when something breaks" use-case, not seeing any query traffic for 10 minutes should still be a fine threshold -- sure you're doing unnecessary refreshes for an extra 9.5 minutes, but that's not likely to be too cost prohibitive.
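Until any default changes, the timeout can already be raised per index, since `index.search.idle.after` is a dynamic index setting. The index name `my-index` is illustrative:

```
PUT /my-index/_settings
{
  "index.search.idle.after": "10m"
}
```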
Describe alternatives you've considered
We could change the default behavior for a search on idle shards to a background refresh, rather than blocking the first search(es). Searches could run quickly using the last (pre-idle) IndexReader. Unfortunately, that would be a major change for users who might be surprised by (potentially very) stale results.

We also have a workaround where users who search and update their index all the time (but sometimes have sparser search traffic) can disable the search idle feature altogether by explicitly setting index.refresh_interval. In my opinion, it's still a good idea to do that, but I'd like the default behavior to be less aggressive.

Additional context
N/A