Deadlock in FileSettingsService #92812

Closed
thecoop opened this issue Jan 11, 2023 · 3 comments · Fixed by #92856
Labels: blocker, >bug, :Core/Infra/Settings (Settings infrastructure and APIs), Team:Core/Infra (Meta label for core/infra team), v8.6.1

Comments


thecoop commented Jan 11, 2023

The FileSettingsService blocks while waiting for some async calls to happen. On a newly-elected master, this blocking happens on the cluster applier thread, preventing the master service from completing its current publication:

"elasticsearch[elasticsearch-sample-es-default-2][clusterApplierService#updateTask][T#1]" #42 [178] daemon prio=5 os_prio=0 cpu=470.04ms elapsed=391.31s tid=0x00007f3b9b9498f0 nid=178 waiting on condition  [0x00007f3b0b3f2000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x00000000d3fb7288> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:221)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@19.0.1/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@19.0.1/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.get(java.base@19.0.1/CompletableFuture.java:2072)
	at org.elasticsearch.reservedstate.service.FileSettingsService.startWatcher(org.elasticsearch.server@8.6.0/FileSettingsService.java:248)
	- locked <0x000000008318c648> (a org.elasticsearch.reservedstate.service.FileSettingsService)
	at org.elasticsearch.reservedstate.service.FileSettingsService.startIfMaster(org.elasticsearch.server@8.6.0/FileSettingsService.java:157)
	at org.elasticsearch.reservedstate.service.FileSettingsService.clusterChanged(org.elasticsearch.server@8.6.0/FileSettingsService.java:151)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(org.elasticsearch.server@8.6.0/ClusterApplierService.java:558)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(org.elasticsearch.server@8.6.0/ClusterApplierService.java:544)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(org.elasticsearch.server@8.6.0/ClusterApplierService.java:504)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(org.elasticsearch.server@8.6.0/ClusterApplierService.java:428)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(org.elasticsearch.server@8.6.0/ClusterApplierService.java:154)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(org.elasticsearch.server@8.6.0/ThreadContext.java:850)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(org.elasticsearch.server@8.6.0/PrioritizedEsThreadPoolExecutor.java:257)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(org.elasticsearch.server@8.6.0/PrioritizedEsThreadPoolExecutor.java:223)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

The trouble is, one of the things on which we're waiting is another cluster state update:

stateService.process(NAMESPACE, parsedState, (e) -> completeProcessing(e, completion));

This update can never complete, because it needs the blocked applier thread to unblock, completing the current publication.
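For illustration only (this is not Elasticsearch code), here is a minimal sketch of the same pattern: a task on a single-threaded executor submits follow-up work to that same executor and then blocks on the result, so the follow-up can never run and the blocking call never returns.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of the deadlock pattern described above (not Elasticsearch code).
public class SelfBlockingApplierSketch {
    public static void main(String[] args) {
        ExecutorService applier = Executors.newSingleThreadExecutor();
        applier.submit(() -> {
            // Analogue of startWatcher() running on the cluster applier thread:
            // submit follow-up work (the cluster state update) to the same executor...
            Future<?> update = applier.submit(() -> System.out.println("state update applied"));
            try {
                // ...then block waiting for it. The only thread that could run the
                // follow-up is the one parked here, so this never returns.
                update.get();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        // The process hangs: the single executor thread is permanently parked in get().
    }
}
```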

Relates elastic/cloud-on-k8s#6303

Workaround

Remove the settings file(s) and restart the master.

thecoop added the >bug, >regression, and :Core/Infra/Settings (Settings infrastructure and APIs) labels on Jan 11, 2023
elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label on Jan 11, 2023
@elasticsearchmachine

Pinging @elastic/es-core-infra (Team:Core/Infra)

DaveCTurner changed the title from "Deadlock on 8.6.0 upgrade" to "Deadlock in FileSettingsService" on Jan 11, 2023

thecoop commented Jan 11, 2023

The deadlock (on the master thread) is: startWatcher -> FileSettingsService.processFileSettings -> stateService.process -> clusterService.submitStateUpdateTask -> masterService.submitStateUpdateTask -> taskBatcher.submitTask -> threadExecutor.execute on the master threadpool, but the master threadpool is blocked on get() waiting for that task to complete.
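A rough sketch of the general way out of this pattern (not the actual change in #92856): react to the future's completion with a callback instead of blocking the thread on get(). The method and callback names below are hypothetical stand-ins, loosely based on the names in the chain above.

```java
import java.util.concurrent.CompletableFuture;

// Sketch only: startWatcher(), processFileSettings(), onSettingsApplied() and
// onSettingsError() are hypothetical stand-ins, not the real FileSettingsService API.
public class NonBlockingWatcherSketch {

    void startWatcher() {
        CompletableFuture<Void> processed = processFileSettings();
        // React asynchronously instead of calling processed.get(): the applier thread
        // returns immediately, the in-flight publication can complete, and the submitted
        // cluster state update can eventually run.
        processed.whenComplete((ignored, e) -> {
            if (e != null) {
                onSettingsError(e);
            } else {
                onSettingsApplied();
            }
        });
    }

    CompletableFuture<Void> processFileSettings() {
        // Placeholder: would complete when the submitted cluster state update is applied.
        return CompletableFuture.completedFuture(null);
    }

    void onSettingsApplied() {}

    void onSettingsError(Throwable e) {}
}
```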

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jan 11, 2023

thecoop commented Jan 11, 2023

@DaveCTurner has worked out why this didn't happen in our test cases...

The relevant code path is only hit when we set the initial state timeout to zero (which we believe ECK has)¹. The FileSettingsServiceIT tests don't do this; they use the default (30s).

This is the patch to reproduce the issue in the integration tests:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java b/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java
index 60c04da7ad0..4e279a363ae 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java
@@ -34,6 +34,7 @@ import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicLong;

 import static org.elasticsearch.indices.recovery.RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING;
+import static org.elasticsearch.node.Node.INITIAL_STATE_TIMEOUT_SETTING;
 import static org.elasticsearch.test.NodeRoles.dataOnlyNode;
 import static org.hamcrest.Matchers.allOf;
 import static org.hamcrest.Matchers.containsString;
@@ -189,7 +190,9 @@ public class FileSettingsServiceIT extends ESIntegTestCase {
     public void testReservedStatePersistsOnRestart() throws Exception {
         internalCluster().setBootstrapMasterNodeIndex(0);
         logger.info("--> start master node");
-        final String masterNode = internalCluster().startMasterOnlyNode();
+        final String masterNode = internalCluster().startMasterOnlyNode(
+            Settings.builder().put(INITIAL_STATE_TIMEOUT_SETTING.getKey(), "0s").build()
+        );
         assertMasterNode(internalCluster().masterClient(), masterNode);
         var savedClusterState = setupClusterStateListener(masterNode);
Footnotes

  1. edit to add: or it takes >30s to elect a master at startup, which is possible if e.g. not all the nodes start up at once
