Watcher can deadlock, resulting in an inability to index any documents. #41390
Pinging @elastic/es-core-features
jakelandis
added a commit
to jakelandis/elasticsearch
that referenced
this issue
Apr 22, 2019
This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes elastic#41390
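The hand-off described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Elasticsearch change: the class name `RejectionHandoffSketch`, the three plain `ExecutorService` pools standing in for the `watcher`, `write`, and generic thread pools, and the `slowHistoryWrite` task are all hypothetical. The point is only that when the rejection handling is forked to another pool, the `write` thread returns immediately even while the shared monitor is held.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RejectionHandoffSketch {

    /** Returns true if the task on the "write" pool finished while the monitor was still held. */
    static boolean writeThreadStaysFree() {
        // Hypothetical stand-ins for the watcher, write, and generic thread pools.
        ExecutorService watcher = Executors.newSingleThreadExecutor();
        ExecutorService write = Executors.newSingleThreadExecutor();
        ExecutorService generic = Executors.newCachedThreadPool();
        CountDownLatch entered = new CountDownLatch(1);   // signals that the monitor is held
        CountDownLatch response = new CountDownLatch(1);  // simulates the pending index response
        Object monitor = new Object();

        // A history write that holds the shared monitor until the "response" arrives.
        Runnable slowHistoryWrite = () -> {
            synchronized (monitor) {
                entered.countDown();
                try {
                    response.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };

        try {
            // A watcher thread takes the monitor and waits for its response.
            watcher.execute(slowHistoryWrite);
            entered.await();

            // On the write thread, pretend a rejection was just caught. Instead of
            // writing in place, fork the handling to the generic pool.
            Future<?> onWriteThread = write.submit(() -> generic.execute(slowHistoryWrite));

            // The write-thread task only enqueues work, so it completes promptly.
            onWriteThread.get(5, TimeUnit.SECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            response.countDown(); // let the blocked history writes finish
            watcher.shutdown();
            write.shutdown();
            generic.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("write thread stayed free: " + writeThreadStaysFree());
    }
}
```

The blocked work still exists in this sketch, but it now occupies a generic thread rather than a `write` thread, so index requests are no longer queued behind it.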
jakelandis
added a commit
that referenced
this issue
Apr 29, 2019
…1418) * Fix Watcher deadlock that can cause in-abilty to index documents. This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes #41390
jakelandis
added a commit
to jakelandis/elasticsearch
that referenced
this issue
Apr 30, 2019
…astic#41418) * Fix Watcher deadlock that can cause in-abilty to index documents. This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes elastic#41390
This was referenced Apr 30, 2019
jakelandis
added a commit
to jakelandis/elasticsearch
that referenced
this issue
Apr 30, 2019
…astic#41418) This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes elastic#41390
jakelandis
added a commit
that referenced
this issue
Apr 30, 2019
…1418) (#41690) This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes #41390 * includes changes to satisfy 6.x
jakelandis
added a commit
that referenced
this issue
Apr 30, 2019
…1418) (#41685) This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes #41390
jakelandis
added a commit
that referenced
this issue
Apr 30, 2019
…1418) (#41684) This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes #41390
akhil10x5
pushed a commit
to akhil10x5/elasticsearch
that referenced
this issue
May 2, 2019
…astic#41418) * Fix Watcher deadlock that can cause in-abilty to index documents. This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes elastic#41390
gurkankaymak
pushed a commit
to gurkankaymak/elasticsearch
that referenced
this issue
May 27, 2019
…astic#41418) * Fix Watcher deadlock that can cause in-abilty to index documents. This commit removes the usage of the `BulkProcessor` to write history documents and delete triggered watches on a `EsRejectedExecutionException`. Since the exception could be handled on the write thread, the write thread can be blocked waiting on watcher threads (due to a synchronous method). This is problematic since those watcher threads can be blocked waiting on write threads. This commit also moves the handling of the exception to the generic threadpool to avoid submitting write requests from the write thread pool. fixes elastic#41390
good, thanks
Removing the version label to avoid confusion. This bug is possible as of 6.5.0 and fixed as of 6.8.0.
Note: to clear the deadlock when it occurs:
Keep in mind that any watches queued for execution in `.triggered_watches` will not execute after the index is deleted.
If the `watcher` thread pool and its associated queue are full, the `write` thread pool can fill with blocked index requests. This eventually backs up the `write` thread pool and its associated queue, resulting in the rejection of all index requests.
This is most likely to happen in monitoring clusters that monitor many remote clusters and thus have many watches. However, it can happen on any type of cluster if the `watcher` thread pool and its associated queue are full.
The culprit is here
When the `watcher` queue is full, the Watch is rejected. Once the Watch is rejected, Watcher attempts to write a history record so that we know the Watch was rejected. To write the history document, Watcher needs a spot in the `write` queue. As implemented, the write task is blocking and completely synchronous, so one write needs to complete before another history document can be written. Specifically, it blocks in a synchronized method here. This means that only one thread using this BulkProcessor can write at a time; the rest of the threads are "Blocked" by the `synchronized` method [1]. Normally this means that Watcher only blocks Watcher, since the blocked threads are all from the `watcher` thread pool. So far, we are OK: Watcher's ability to write history is throttled to one document at a time, and pending writes end up in the `watcher` queue and get processed one at a time.
The problems start when the `watcher` queue is full: Watcher starts to reject and tries to write the history document from the current thread, which can be a thread from the `write` thread pool. This can happen because the normal history write happens on the response of the .triggered-watch write. The response is not threaded, and the reject exception is handled by the `write` thread. This is where things go bad.
Now we have one `watcher` thread waiting for a response from the index call [3], and all the other `watcher` threads are "Blocked" by that one `watcher` thread. If the `watcher` queue fills up, rejections start happening, and threads from the `write` pool start trying to write the history document (on the `write` thread) and also get "Blocked" by the same synchronized method. This is because the exception handler uses the same bulkProcessor instance to write the document on exception. So threads from the `write` pool get caught up in the `synchronized` method that backed up the `watcher` queue, and we can have threads in the `write` thread pool that are blocked by the same synchronous Watcher writes. This backs up the `write` queue, which will eventually fill up and start rejecting. Now the `watcher` writes can't happen because the `write` queue is rejecting, and the writes can't happen because `watcher` threads are blocking `write` threads. 💥 Deadlock. I am sure there are some cases where it doesn't completely deadlock, but rather heavily throttles the ability to write documents.
You can reproduce this behavior by adding `Thread.sleep(1000)` in the BulkRequestHandler just after the semaphore acquisition here and running a lot of watches on a fast schedule [4].
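The blocking pattern described above can be illustrated with a small, self-contained sketch. It does not use any Elasticsearch code: `WatcherDeadlockSketch`, `SharedWriter`, and the two tiny executors are hypothetical stand-ins for the shared BulkProcessor and the `watcher` and `write` thread pools, and the pool and queue sizes are chosen only to make the rejection easy to trigger. The sketch shows a `write` thread that catches a rejection, handles it in place, and ends up `BLOCKED` on the monitor held by a `watcher` thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class WatcherDeadlockSketch {

    /** Hypothetical stand-in for the shared bulk processor: one synchronized entry point. */
    static final class SharedWriter {
        final CountDownLatch entered = new CountDownLatch(1);
        final CountDownLatch response = new CountDownLatch(1);

        synchronized void add(String doc) throws InterruptedException {
            entered.countDown(); // the monitor is now held by the calling thread
            response.await();    // simulate waiting for the index response
        }
    }

    interface Interruptible {
        void run() throws InterruptedException;
    }

    static void quietly(Interruptible body) {
        try {
            body.run();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the write-pool thread was observed BLOCKED on the shared monitor. */
    static boolean writeThreadBlocksOnWatcherMonitor() {
        SharedWriter writer = new SharedWriter();
        // One watcher thread and a queue of one, so the third task is rejected.
        ThreadPoolExecutor watcher =
            new ThreadPoolExecutor(1, 1, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        ExecutorService write = Executors.newSingleThreadExecutor();
        AtomicReference<Thread> writeThread = new AtomicReference<>();
        try {
            // 1. The only watcher thread takes the monitor and waits for a response.
            watcher.execute(() -> quietly(() -> writer.add("history")));
            writer.entered.await();
            // 2. A second history write fills the watcher queue.
            watcher.execute(() -> quietly(() -> writer.add("history")));
            // 3. A write thread tries to hand work to the watcher pool, is rejected,
            //    and handles the rejection in place through the same synchronized method.
            write.execute(() -> {
                writeThread.set(Thread.currentThread());
                try {
                    watcher.execute(() -> { });
                } catch (RejectedExecutionException e) {
                    quietly(() -> writer.add("rejected-history"));
                }
            });
            // Poll until the write thread is seen waiting for the monitor.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (System.nanoTime() < deadline) {
                Thread t = writeThread.get();
                if (t != null && t.getState() == Thread.State.BLOCKED) {
                    return true;
                }
                Thread.sleep(10);
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            writer.response.countDown(); // release every waiter so the pools can drain
            watcher.shutdown();
            write.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("write thread blocked: " + writeThreadBlocksOnWatcherMonitor());
    }
}
```

In this sketch the latch is released at the end so the program terminates. In the scenario described in the issue, the "response" itself depends on the `write` pool making progress, which is what closes the cycle.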
There are more concurrency issues here, such as the `synchronized` method trumping the semaphore lock downstream (negating the ability to use the bulk processor concurrently) and the fact that the deletion of triggered watches is synchronous with the write of the history document. Those will be addressed separately.
This issue is only applicable to 6.5+, as of #32490. Increasing `xpack.watcher.bulk.actions` can help but won't prevent this issue entirely.
[1]
[2]
[3]
[4]