Deadlock on configuration application in NodeImpl when disruptors are full #1105

Closed

alievmirza opened this issue May 20, 2024 · 9 comments

@alievmirza
Contributor

alievmirza commented May 20, 2024

Describe the bug

There is a deadlock in NodeImpl when working with full LogManagerImpl#diskQueue, FSMCallerImpl#taskQueue and NodeImpl#writeLock.

  1. NodeImpl#executeApplyingTasks() takes NodeImpl#writeLock and calls LogManager#appendEntries()
  2. LogManager tries to enqueue a task to diskQueue which is full, hence it blocks until a task gets consumed from diskQueue
  3. diskQueue is consumed by StableClosureEventHandler
  4. StableClosureEventHandler tries to enqueue a task to FSMCallerImpl#taskQueue, which is also full, so this also blocks until a task gets consumed from FSMCallerImpl#taskQueue
  5. FSMCallerImpl#taskQueue is consumed by ApplyTaskHandler
  6. ApplyTaskHandler calls NodeImpl#onConfigurationChangeDone(), which tries to take NodeImpl#writeLock

As a result, there is a deadlock: NodeImpl#writeLock -> LogManager#diskQueue -> FSMCallerImpl#taskQueue -> NodeImpl#writeLock (disruptors are used as blocking queues in JRaft, so, when full, they act like locks).
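
For illustration, here is a minimal, self-contained sketch of the same cycle, collapsed to two resources: a ReentrantLock stands in for NodeImpl#writeLock and a bounded ArrayBlockingQueue stands in for a full disruptor. All names in the sketch are made up for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the cycle described above: a bounded blocking queue plays
// the role of a full disruptor, a ReentrantLock plays the role of NodeImpl#writeLock.
public class DeadlockSketch {
    static final ReentrantLock writeLock = new ReentrantLock();
    // Capacity 1 so the "disruptor" fills up immediately, like the test's tiny ring buffers.
    static final BlockingQueue<Runnable> diskQueue = new ArrayBlockingQueue<>(1);

    public static void main(String[] args) throws Exception {
        diskQueue.put(() -> { });             // pre-fill: the "disruptor" is now full

        Thread consumer = new Thread(() -> {
            // Consumer (like ApplyTaskHandler) needs writeLock before it can finish a task.
            writeLock.lock();
            try {
                diskQueue.poll();             // would free a slot, but we never get here
            } finally {
                writeLock.unlock();
            }
        });

        writeLock.lock();                     // producer, like executeApplyingTasks()
        try {
            consumer.start();
            diskQueue.put(() -> { });         // blocks: queue is full and the consumer can't drain it
            // Producer holds writeLock and waits on the queue; consumer waits on writeLock.
        } finally {
            writeLock.unlock();               // never reached -> deadlock
        }
    }
}
```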

This was caught by com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload which uses extremely short disruptors (2 items max each).
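
For context, that test makes the ring buffers tiny through the node's options. A hedged sketch of such a configuration follows; the setter name setDisruptorBufferSize on RaftOptions is assumed here, so check the RaftOptions class in your JRaft version.

```java
import com.alipay.sofa.jraft.option.NodeOptions;
import com.alipay.sofa.jraft.option.RaftOptions;

// Illustrative only: shrink the internal disruptor ring buffers so overload
// (and therefore this deadlock) becomes easy to hit.
public class TinyDisruptorOptions {
    public static NodeOptions tinyBuffers() {
        RaftOptions raftOptions = new RaftOptions();
        raftOptions.setDisruptorBufferSize(2);   // 2 slots, as in testNodeTaskOverload (setter name assumed)

        NodeOptions nodeOptions = new NodeOptions();
        nodeOptions.setRaftOptions(raftOptions);
        return nodeOptions;                      // pass to RaftGroupService / Node as usual
    }
}
```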

Steps to reproduce

Run com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload in a loop; on my local machine the failure reproduces within 50-100 runs.

Environment

  • SOFAJRaft version: v1.3.14 (latest commit 890033a)
  • JVM version (e.g. java -version): openjdk version "11.0.23"
  • OS version (e.g. uname -a): macOS 14.5
  • Maven version: 3.9.6
  • IDE version: IntelliJ IDEA 2024.1 (Community Edition)
@fengjiachun
Contributor

Thank you for your feedback.

Could you provide the test code for reproduction? I couldn't replicate the issue on my machine.
Could you please add a related test and send a PR?

@alievmirza
Contributor Author

alievmirza commented May 20, 2024

Oh, sorry, I gave the wrong name for the test class (fixed in the description); the test is com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload. To reproduce, just run the test in a loop until it fails. For that purpose I used Edit Run Configuration in IntelliJ IDEA for the test: Modify options -> Repeat -> Until Failure. You also need to comment out the following code in com.alipay.sofa.jraft.core.NodeTest#setupNodeTest so that the test can be run several times in a row.

StorageOptionsFactory.registerRocksDBTableFormatConfig(GROUP_ID, RocksDBLogStorage.class, StorageOptionsFactory
            .getDefaultRocksDBTableConfig().setBlockCacheSize(256 * SizeUnit.MB));
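
If you prefer to drive the loop outside the IDE, here is a minimal sketch using JUnit 4's JUnitCore/Request API as a stand-in for IDEA's "Repeat: Until Failure" configuration; the run count below is arbitrary.

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Request;
import org.junit.runner.Result;

// Keep running the single test method until it fails, then print the failures.
public class RepeatUntilFailure {
    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> testClass = Class.forName("com.alipay.sofa.jraft.core.NodeTest");
        JUnitCore core = new JUnitCore();
        for (int run = 1; run <= 200; run++) {
            Result result = core.run(Request.method(testClass, "testNodeTaskOverload"));
            System.out.println("run " + run + " failures=" + result.getFailureCount());
            if (!result.wasSuccessful()) {
                result.getFailures().forEach(System.out::println);
                break;
            }
        }
    }
}
```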

@alievmirza
Contributor Author

Also, I can provide a thread dump:
https://gist.github.com/alievmirza/07443fd2ae4e3f7db70fadaa06f4cc1c

@fengjiachun
Contributor

Thank you, I have reproduced it on my machine.

@fengjiachun
Contributor

The producer thread handling com.alipay.sofa.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1409) owns the write lock:

[screenshot: producer thread stack trace]

And then the consumer thread tries to acquire the same write lock:

[screenshot: consumer thread stack trace]

This is how the deadlock occurs when the disruptor is overloaded.

@killme2008
Contributor

Good catch! It's a known bug. Related issues #138 and #720 have been addressed and a fix has been attempted at #764.

It seems that we still have some work to do.

@alievmirza
Contributor Author

Do you think this issue is dangerous, or is it quite rare in real scenarios? Do you have any plans or ideas on how to properly fix it?

@killme2008
Contributor

> Do you think this issue is dangerous, or is it quite rare in real scenarios? Do you have any plans or ideas on how to properly fix it?

In real scenarios, it is quite uncommon. It occurs only when the node is overloaded, and in such circumstances, manual intervention like adding new nodes or resources may be the only viable solution.

How can it be fixed? I have not considered it carefully. One potential solution is to replace some of the blocking lock() calls on Node#writeLock with tryLock, enabling other tasks to make progress.
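
For illustration, a hedged sketch of that tryLock idea (not the actual NodeImpl code; the timeout and error handling are placeholders):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

// Sketch: instead of blocking on writeLock (which can close the
// writeLock -> diskQueue -> taskQueue -> writeLock cycle), give up after a
// bounded wait and let the caller fail the task, e.g. with RaftError.EBUSY.
final class TryLockSketch {
    static boolean runWithWriteLock(Lock writeLock, Runnable criticalSection) {
        boolean acquired = false;
        try {
            acquired = writeLock.tryLock(100, TimeUnit.MILLISECONDS); // bounded wait, value is illustrative
            if (!acquired) {
                return false;                 // fail fast instead of blocking forever
            }
            criticalSection.run();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (acquired) {
                writeLock.unlock();
            }
        }
    }
}
```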

@killme2008
Contributor

I tried a fix in #1109, which makes applying tasks fail fast when the log manager is overloaded.
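
For reference, a minimal sketch of the fail-fast idea using LMAX Disruptor's RingBuffer#tryPublishEvent (illustrative only, not the actual change in #1109):

```java
import com.lmax.disruptor.EventTranslator;
import com.lmax.disruptor.RingBuffer;

// Sketch: use the non-blocking tryPublishEvent instead of publishEvent, so a
// full ring buffer is reported to the caller (e.g. as a busy error) rather
// than blocking while a lock is held.
final class FailFastPublish {
    static <E> boolean offer(RingBuffer<E> ringBuffer, EventTranslator<E> translator) {
        if (!ringBuffer.tryPublishEvent(translator)) {
            // Buffer full: reject the task instead of blocking the caller.
            return false;
        }
        return true;
    }
}
```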
