
Watcher can deadlock resulting in the inability to index any documents. #41390

Closed

jakelandis opened this issue Apr 19, 2019 · 4 comments · Fixed by #41418
Comments

@jakelandis
Contributor

jakelandis commented Apr 19, 2019

If the watcher thread pool and its associated queue are full, the write thread pool can fill with index requests that are blocked, which will eventually back up the write thread pool and its associated queue, resulting in complete rejection of all index requests.

This is most likely to happen in monitoring clusters that monitor many remote clusters and thus have many watches. However, it can happen on any type of cluster if the Watcher thread pool and its associated queue are full.

The culprit is here

When the watcher queue is full, the Watch is rejected. Once the Watch is rejected, it attempts to write a history record so we know that the Watch was rejected. When Watcher writes the history document, it needs a spot in the write queue. The way Watcher is implemented, the write task is blocking and completely synchronous, so the write needs to complete before another history document can be written. Specifically, it blocks in a synchronized method here. This means that only one thread that uses this BulkProcessor can write at a time; the rest of the threads will be "Blocked" by the synchronized method [1]. Normally this means that Watcher is only blocking Watcher, since the blocked threads are all from the watcher thread pool. So far, we are OK... Watcher's ability to write the history is throttled to one at a time, and pending writes end up in the watcher queue and get processed one at a time. The problems start when the watcher queue is full: it starts to reject and tries to write the history document from the current thread... which can be a thread from the write thread pool. This can happen because the normal history write happens on the response of the .triggered-watch write; the response is not handled on a separate thread, and the rejection exception is handled by the write thread.

...so this is where things go bad. Now we have one watcher thread waiting for a response from the index call [3], and all the other watcher threads are "Blocked" by that one watcher thread. If the watcher queue fills up, rejections start happening and threads from the write pool now start trying to write the history document (on the write thread) and also get "Blocked" by the same synchronized method [2]. This is because the exception handler uses the same BulkProcessor instance to write the document on exception. So the threads from the write pool get caught up in the synchronized method that backed up the watcher queue, and we end up with threads in the write thread pool blocked by the same synchronous Watcher writes. This backs up the write queue, which will eventually fill up and start rejecting. Now the watcher writes can't happen because the write queue is rejecting, and the writes can't happen because watcher threads are blocking write threads. 💥 Deadlock. I am sure there are some cases where it doesn't completely deadlock, but rather heavily throttles the ability to write documents.
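To make the shape of the problem concrete, here is a minimal, self-contained sketch of the pattern in plain Java. This is not Elasticsearch code: `SharedBulkWriter`, the pool sizes, and the task names are stand-ins for the shared BulkProcessor, the watcher pool, and the write pool. One synchronized add() guards the shared writer, the flush it triggers needs a write-pool thread, and the rejection handler calls back into that same writer from a write-pool thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WatcherDeadlockSketch {

    /** Stand-in for the single shared BulkProcessor: one monitor guards every add(). */
    static class SharedBulkWriter {
        private final ExecutorService writePool;
        private final List<String> buffer = new ArrayList<>();

        SharedBulkWriter(ExecutorService writePool) {
            this.writePool = writePool;
        }

        /**
         * Like the synchronized add/flush path described above: the caller holds the
         * monitor while the flush it triggered waits on a write-pool thread, so every
         * other caller of add(), from any pool, blocks on this monitor.
         */
        synchronized void add(String doc) throws Exception {
            buffer.add(doc);
            Future<?> bulk = writePool.submit(buffer::clear); // "index" the buffered docs
            bulk.get(); // monitor is held while waiting for the write pool
        }
    }

    public static void main(String[] args) {
        ExecutorService watcherPool = Executors.newFixedThreadPool(2);
        ExecutorService writePool = Executors.newFixedThreadPool(1);
        SharedBulkWriter history = new SharedBulkWriter(writePool);

        // Watcher threads writing history records merely serialize on the monitor;
        // on its own this is just throttling, as described above.
        for (int i = 0; i < 2; i++) {
            watcherPool.submit(() -> {
                try { history.add("watch-history"); } catch (Exception ignored) { }
            });
        }

        // The failure mode: a rejection is handled on a *write-pool* thread, and that
        // handler calls back into the same shared writer. Depending on timing, either the
        // write thread blocks on the monitor held by a watcher thread that is waiting on
        // the write pool, or it grabs the monitor and then waits on its own busy pool.
        // Either way nothing can make progress: the program never exits, and a thread
        // dump shows threads BLOCKED on the monitor and WAITING on the bulk future.
        writePool.submit(() -> {
            try { history.add("rejected-watch-history"); } catch (Exception ignored) { }
        });
    }
}
```

The real pools are larger and the queues are bounded, but the cycle is the same once rejections are handled inline on write threads.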

You can reproduce this behavior by adding Thread.sleep(1000) in the BulkRequestHandler just after the semaphore acquisition here and running a lot of watches on a fast schedule [4].

There are more concurrency issues here, such as the synchronized method trumping the semaphore lock downstream (negating the ability to use the bulk processor concurrently) and the fact that deletion of triggered watches is synchronous with the write of the history document... but those will be addressed separately.

This issue is only applicable to 6.5+ as of #32490. Increasing xpack.watcher.bulk.actions can help but won't prevent this issue entirely.

[1]

"elasticsearch[instance-0000000012][watcher][T#9]" daemon prio=5 tid=861 BLOCKED
    at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:320)
       local variable: org.elasticsearch.action.index.IndexRequest#22
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:305)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:301)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:287)
    at org.elasticsearch.xpack.watcher.history.HistoryStore.put(HistoryStore.java:49)
       local variable: org.elasticsearch.common.xcontent.XContentBuilder#204
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:314)
       local variable: java.lang.String#354386
       local variable: org.elasticsearch.xpack.watcher.execution.TriggeredExecutionContext#47
       local variable: org.elasticsearch.xpack.core.watcher.history.WatchRecord$ExceptionWatchRecord#2
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$5(ExecutionService.java:394)
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$$Lambda$3807.run(<unknown string>)
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:543)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$2006#851
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable#460
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       local variable: java.util.concurrent.ThreadPoolExecutor$Worker#35
    at java.lang.Thread.run(Thread.java:748)

[2]

"elasticsearch[instance-0000000012][write][T#2]" daemon prio=5 tid=193 BLOCKED
    at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:320)
       local variable: org.elasticsearch.action.index.IndexRequest#428
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:305)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:301)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:287)
    at org.elasticsearch.xpack.watcher.history.HistoryStore.put(HistoryStore.java:49)
       local variable: org.elasticsearch.common.xcontent.XContentBuilder#212
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.executeAsync(ExecutionService.java:402)
       local variable: org.elasticsearch.xpack.watcher.execution.TriggeredExecutionContext#959
       local variable: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException#20
       local variable: org.elasticsearch.xpack.watcher.execution.TriggeredWatch#96
       local variable: org.elasticsearch.xpack.core.watcher.history.WatchRecord$MessageWatchRecord#1
       local variable: java.lang.String#2849348
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.executeTriggeredWatches(ExecutionService.java:260)
       local variable: org.elasticsearch.action.bulk.BulkItemResponse#7
       local variable: org.elasticsearch.action.bulk.BulkResponse#7
       local variable: org.elasticsearch.common.collect.Tuple#77849
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$processEventsAsync$0(ExecutionService.java:199)
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$$Lambda$3805.accept(<unknown string>)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
       local variable: org.elasticsearch.action.ActionListener$1#310814
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$1870#93643
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$1870#93644
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:450)
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onResponse(TransportBulkAction.java:431)
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onResponse(TransportBulkAction.java:420)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$1870#93645
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishOnSuccess(TransportReplicationAction.java:970)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleResponse(TransportReplicationAction.java:889)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleResponse(TransportReplicationAction.java:873)
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1093)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$1870#93646
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1174)
       local variable: org.elasticsearch.transport.TransportService$DirectResponseChannel#48
       local variable: org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler#50
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1154)
    at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$2.onResponse(TransportReplicationAction.java:467)
       local variable: org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$2#1
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$2.onResponse(TransportReplicationAction.java:445)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryResult.respond(TransportReplicationAction.java:532)
    at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.respondIfPossible(TransportWriteAction.java:169)
    at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.respond(TransportWriteAction.java:159)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$runWithPrimaryShardReference$2(TransportReplicationAction.java:423)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$3721.accept(<unknown string>)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
       local variable: org.elasticsearch.action.ActionListener$1#357575
    at org.elasticsearch.action.support.replication.ReplicationOperation.finish(ReplicationOperation.java:257)
    at org.elasticsearch.action.support.replication.ReplicationOperation.decPendingAndFinishIfNeeded(ReplicationOperation.java:238)
    at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:128)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:425)
       local variable: org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference#1
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:371)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$3719.accept(<unknown string>)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
       local variable: org.elasticsearch.action.ActionListener$1#357576
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:273)
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:240)
    at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2352)
    at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:988)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:370)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       local variable: org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction#1
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:325)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:312)
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:140)
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:299)
       local variable: org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1#41
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$$Lambda$2911.accept(<unknown string>)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
       local variable: org.elasticsearch.action.ActionListener$1#357577
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$2(ServerTransportFilter.java:155)
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile$$Lambda$2917.accept(<unknown string>)
    at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.maybeRun(AuthorizationUtils.java:177)
       local variable: org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer#216
    at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.setRunAsRoles(AuthorizationUtils.java:171)
    at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.authorize(AuthorizationUtils.java:159)
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$3(ServerTransportFilter.java:157)
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile$$Lambda$2913.accept(<unknown string>)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60)
       local variable: org.elasticsearch.action.ActionListener$1#357578
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:177)
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator$$Lambda$2915.accept(<unknown string>)
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$4(AuthenticationService.java:210)
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator$$Lambda$2916.run(<unknown string>)
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:221)
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:175)
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:134)
    at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:104)
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:131)
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:306)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$1870#93647
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)
       local variable: org.elasticsearch.action.support.replication.ReplicationTask#115
    at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:687)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable#236
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       local variable: java.util.concurrent.ThreadPoolExecutor$Worker#28
    at java.lang.Thread.run(Thread.java:748)

[3]

"elasticsearch[instance-0000000012][watcher][T#6]" daemon prio=5 tid=858 WAITING
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
       local variable: java.util.concurrent.CountDownLatch$Sync#176
       local variable: java.util.concurrent.locks.AbstractQueuedSynchronizer$Node#1653
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:85)
       local variable: org.elasticsearch.action.bulk.BulkRequestHandler#1
       local variable: org.elasticsearch.action.bulk.BulkRequest#22
       local variable: org.elasticsearch.action.bulk.BulkRequestHandler$$Lambda$3862#1
    at org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:374)
    at org.elasticsearch.action.bulk.BulkProcessor.executeIfNeeded(BulkProcessor.java:365)
    at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:322)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:305)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:301)
    at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:287)
    at org.elasticsearch.xpack.watcher.history.HistoryStore.put(HistoryStore.java:49)
       local variable: org.elasticsearch.common.xcontent.XContentBuilder#206
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:314)
       local variable: java.lang.String#354418
       local variable: org.elasticsearch.xpack.watcher.execution.TriggeredExecutionContext#41
       local variable: org.elasticsearch.xpack.core.watcher.history.WatchRecord$ExceptionWatchRecord#5
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$5(ExecutionService.java:394)
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$$Lambda$3807.run(<unknown string>)
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:543)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$2006#854
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       local variable: org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable#463
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       local variable: java.util.concurrent.ThreadPoolExecutor$Worker#41
    at java.lang.Thread.run(Thread.java:748)

[4]

for i in {1..20}; do
  curl -XPUT "http://localhost:9200/_xpack/watcher/watch/foo$i" -H 'Content-Type: application/json' -d'
  {
    "trigger": { "schedule": { "interval": "1s" } },
    "input":   { "simple": { "name": "Hello there" } },
    "actions": {
      "logit": {
        "logging": { "level": "info", "text": "The payload is: {{ctx.payload}}" }
      }
    }
  }'
done
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Apr 22, 2019
This commit removes the usage of the `BulkProcessor` to write history documents
and delete triggered watches on an `EsRejectedExecutionException`. Since the
exception could be handled on the write thread, the write thread can be blocked
waiting on watcher threads (due to a synchronous method). This is problematic
since those watcher threads can be blocked waiting on write threads.

This commit also moves the handling of the exception to the generic threadpool
to avoid submitting write requests from the write thread pool.

fixes elastic#41390
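In other words, the core of the fix is a thread hop. A rough sketch of the idea follows; the names are illustrative only (`genericPool`, `HistoryStore`, and `onExecutionRejected` are stand-ins, not the actual #41418 change):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RejectionHandlingSketch {

    /** Stand-in for Watcher's history store; put() ultimately issues an index request. */
    interface HistoryStore {
        void put(String watchRecordJson);
    }

    // Stand-in for the generic thread pool; in Elasticsearch this would be an existing,
    // non-write executor rather than a pool created here.
    private final ExecutorService genericPool = Executors.newCachedThreadPool();
    private final HistoryStore historyStore;

    public RejectionHandlingSketch(HistoryStore historyStore) {
        this.historyStore = historyStore;
    }

    /** Called when a watch execution is rejected with an EsRejectedExecutionException. */
    public void onExecutionRejected(String watchRecordJson) {
        // Before the fix: the history write ran inline on the calling thread, which could
        // be a write-pool thread handling a bulk response, and it went through the shared
        // synchronized BulkProcessor.
        // After the fix (in spirit): hand the write off to a generic executor, so no write
        // request is ever submitted from, or blocked on, a write-pool thread.
        genericPool.execute(() -> historyStore.put(watchRecordJson));
    }
}
```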
jakelandis added a commit that referenced this issue Apr 29, 2019
…1418)

jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Apr 30, 2019
…astic#41418)

jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Apr 30, 2019
…astic#41418)

jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Apr 30, 2019
…astic#41418)

jakelandis added a commit that referenced this issue Apr 30, 2019
…1418) (#41690)

* includes changes to satisfy 6.x

jakelandis added a commit that referenced this issue Apr 30, 2019
…1418) (#41685)

jakelandis added a commit that referenced this issue Apr 30, 2019
…1418) (#41684)

akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this issue May 2, 2019
…astic#41418)

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
…astic#41418)
@AlreadyGo

good, thanks

jakelandis removed the v6.5.0 label Nov 8, 2019
@jakelandis
Contributor Author

Removing the version label to avoid confusion. This bug is possible as of 6.5.0 and fixed as of 6.8.0.

@nerophon

Note: to clear the deadlock when it occurs:

  1. Delete .triggered_watches (it will be recreated)
  2. Restart affected deadlocked nodes

Keep in mind any watches queued for execution in .triggered_watches will not execute after deleting the index.
