Deadlock in FileSettingsService #92812

Closed
thecoop opened this issue Jan 11, 2023 · 3 comments · Fixed by #92856
Labels: blocker, >bug, :Core/Infra/Settings (Settings infrastructure and APIs), Team:Core/Infra (Meta label for core/infra team), v8.6.1

Comments


thecoop commented Jan 11, 2023

The FileSettingsService blocks while waiting for some async calls to happen. On a newly-elected master, this blocking happens on the cluster applier thread, preventing the master service from completing its current publication:

"elasticsearch[elasticsearch-sample-es-default-2][clusterApplierService#updateTask][T#1]" #42 [178] daemon prio=5 os_prio=0 cpu=470.04ms elapsed=391.31s tid=0x00007f3b9b9498f0 nid=178 waiting on condition  [0x00007f3b0b3f2000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x00000000d3fb7288> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:221)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@19.0.1/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@19.0.1/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.get(java.base@19.0.1/CompletableFuture.java:2072)
	at org.elasticsearch.reservedstate.service.FileSettingsService.startWatcher(org.elasticsearch.server@8.6.0/FileSettingsService.java:248)
	- locked <0x000000008318c648> (a org.elasticsearch.reservedstate.service.FileSettingsService)
	at org.elasticsearch.reservedstate.service.FileSettingsService.startIfMaster(org.elasticsearch.server@8.6.0/FileSettingsService.java:157)
	at org.elasticsearch.reservedstate.service.FileSettingsService.clusterChanged(org.elasticsearch.server@8.6.0/FileSettingsService.java:151)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(org.elasticsearch.server@8.6.0/ClusterApplierService.java:558)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(org.elasticsearch.server@8.6.0/ClusterApplierService.java:544)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(org.elasticsearch.server@8.6.0/ClusterApplierService.java:504)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(org.elasticsearch.server@8.6.0/ClusterApplierService.java:428)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(org.elasticsearch.server@8.6.0/ClusterApplierService.java:154)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(org.elasticsearch.server@8.6.0/ThreadContext.java:850)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(org.elasticsearch.server@8.6.0/PrioritizedEsThreadPoolExecutor.java:257)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(org.elasticsearch.server@8.6.0/PrioritizedEsThreadPoolExecutor.java:223)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

The trouble is, one of the things on which we're waiting is another cluster state update:

stateService.process(NAMESPACE, parsedState, (e) -> completeProcessing(e, completion));

This update can never complete, because it needs the blocked applier thread to unblock, completing the current publication.
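For illustration only (this is not Elasticsearch code), here is a minimal sketch of the same pattern: a task on a single-threaded executor submits follow-up work to that same executor and then blocks on the result, so the follow-up can never run and the blocking call never returns.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of the deadlock pattern described above (not Elasticsearch code).
public class SelfBlockingApplierSketch {
    public static void main(String[] args) {
        ExecutorService applier = Executors.newSingleThreadExecutor();
        applier.submit(() -> {
            // Analogue of startWatcher() running on the cluster applier thread:
            // submit follow-up work (the cluster state update) to the same executor...
            Future<?> update = applier.submit(() -> System.out.println("state update applied"));
            try {
                // ...then block waiting for it. The only thread that could run the
                // follow-up is the one parked here, so this never returns.
                update.get();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        // The process hangs: the single executor thread is permanently parked in get().
    }
}
```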

Relates elastic/cloud-on-k8s#6303

Workaround

Remove the settings file(s) and restart the master.

thecoop added the >bug, >regression, and :Core/Infra/Settings (Settings infrastructure and APIs) labels on Jan 11, 2023
elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label on Jan 11, 2023
@elasticsearchmachine

Pinging @elastic/es-core-infra (Team:Core/Infra)

DaveCTurner changed the title from "Deadlock on 8.6.0 upgrade" to "Deadlock in FileSettingsService" on Jan 11, 2023

thecoop commented Jan 11, 2023

The deadlock (on the master thread) is: startWatcher -> FileSettingsService.processFileSettings -> stateService.process -> clusterService.submitStateUpdateTask -> masterService.submitStateUpdateTask -> taskBatcher.submitTask -> threadExecutor.execute on the master threadpool, but the master threadpool is blocked on get() waiting for that task to complete.
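A rough sketch of the general way out of this pattern (not the actual change in #92856): react to the future's completion with a callback instead of blocking the thread on get(). The method and callback names below are hypothetical stand-ins, loosely based on the names in the chain above.

```java
import java.util.concurrent.CompletableFuture;

// Sketch only: startWatcher(), processFileSettings(), onSettingsApplied() and
// onSettingsError() are hypothetical stand-ins, not the real FileSettingsService API.
public class NonBlockingWatcherSketch {

    void startWatcher() {
        CompletableFuture<Void> processed = processFileSettings();
        // React asynchronously instead of calling processed.get(): the applier thread
        // returns immediately, the in-flight publication can complete, and the submitted
        // cluster state update can eventually run.
        processed.whenComplete((ignored, e) -> {
            if (e != null) {
                onSettingsError(e);
            } else {
                onSettingsApplied();
            }
        });
    }

    CompletableFuture<Void> processFileSettings() {
        // Placeholder: would complete when the submitted cluster state update is applied.
        return CompletableFuture.completedFuture(null);
    }

    void onSettingsApplied() {}

    void onSettingsError(Throwable e) {}
}
```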

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jan 11, 2023

thecoop commented Jan 11, 2023

@DaveCTurner has worked out why this didn't happen in our test cases...

The relevant code path is only hit when we set the initial state timeout to zero (which we believe ECK has)¹. The FileSettingsServiceIT tests don't do this; they use the default (30s).

This is the patch to reproduce the issue in the integration tests:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java b/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java
index 60c04da7ad0..4e279a363ae 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/reservedstate/service/FileSettingsServiceIT.java
@@ -34,6 +34,7 @@ import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicLong;

 import static org.elasticsearch.indices.recovery.RecoverySettings.INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING;
+import static org.elasticsearch.node.Node.INITIAL_STATE_TIMEOUT_SETTING;
 import static org.elasticsearch.test.NodeRoles.dataOnlyNode;
 import static org.hamcrest.Matchers.allOf;
 import static org.hamcrest.Matchers.containsString;
@@ -189,7 +190,9 @@ public class FileSettingsServiceIT extends ESIntegTestCase {
     public void testReservedStatePersistsOnRestart() throws Exception {
         internalCluster().setBootstrapMasterNodeIndex(0);
         logger.info("--> start master node");
-        final String masterNode = internalCluster().startMasterOnlyNode();
+        final String masterNode = internalCluster().startMasterOnlyNode(
+            Settings.builder().put(INITIAL_STATE_TIMEOUT_SETTING.getKey(), "0s").build()
+        );
         assertMasterNode(internalCluster().masterClient(), masterNode);
         var savedClusterState = setupClusterStateListener(masterNode);
Footnotes

  1. edit to add: or it takes >30s to elect a master at startup, which is possible if e.g. not all the nodes start up at once
