Deadlock on configuration application in NodeImpl when disruptors are full #1105
Comments
Thank you for your feedback. Could you provide the test code for reproduction? I couldn't replicate the issue on my machine.
Oh, sorry, I sent the wrong name for the base test class (fixed that in the description); the test is `com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload`.

I can also provide the thread dump.
Thank you, I have reproduced it on my machine.
Do you think this issue is dangerous, or is it rare in real scenarios? Do you have any plans or ideas for how to properly fix it?
In real scenarios it is quite uncommon: it occurs only when the node is overloaded, and in such circumstances manual intervention, like adding new nodes or resources, may be the only viable solution. How can it be fixed? I have not considered it carefully. One potential solution is to replace some
I tried a fix at #1109, which makes it fail fast when the log manager is overloaded while applying tasks.
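For context, the fail-fast idea can be sketched against the LMAX Disruptor API. This is not the actual patch in the PR; the `FailFastPublisher` wrapper and its `offer` method below are hypothetical names. The key point is that `RingBuffer#tryPublishEvent` returns `false` when the ring buffer is full, instead of blocking the way `publishEvent` does, so the caller can reject the task rather than wait while holding a lock.

```java
import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.RingBuffer;

// Hypothetical wrapper sketching the fail-fast idea: tryPublishEvent(...)
// returns false when the ring buffer has no free slots, instead of
// blocking the way publishEvent(...) does.
final class FailFastPublisher<E, A> {
    private final RingBuffer<E> ringBuffer;
    private final EventTranslatorOneArg<E, A> translator;

    FailFastPublisher(RingBuffer<E> ringBuffer, EventTranslatorOneArg<E, A> translator) {
        this.ringBuffer = ringBuffer;
        this.translator = translator;
    }

    /**
     * Returns false instead of blocking when the disruptor is full,
     * so the caller can fail the task with a busy-style error status.
     */
    boolean offer(A arg) {
        return ringBuffer.tryPublishEvent(translator, arg);
    }
}
```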
Describe the bug
There is a deadlock in `NodeImpl` when working with a full `LogManagerImpl#diskQueue`, `FSMCallerImpl#taskQueue`, and `NodeImpl#writeLock`:

1. `NodeImpl#executeApplyingTasks()` takes `NodeImpl.writeLock` and calls `LogManager.appendEntries()`.
2. `LogManager` tries to enqueue a task to `diskQueue`, which is full, so it blocks until a task is consumed from `diskQueue`.
3. `diskQueue` is consumed by `StableClosureEventHandler`.
4. `StableClosureEventHandler` tries to enqueue a task to `FSMCallerImpl#taskQueue`, which is also full, so it too blocks until a task is consumed from `FSMCallerImpl#taskQueue`.
5. `FSMCallerImpl#taskQueue` is consumed by `ApplyTaskHandler`.
6. `ApplyTaskHandler` calls `NodeImpl#onConfigurationChangeDone()`, which tries to take `NodeImpl#writeLock`.

As a result, there is a lock cycle: `NodeImpl#writeLock` -> `LogManager#diskQueue` -> `FSMCallerImpl#taskQueue` -> `NodeImpl#writeLock` (disruptors are used as blocking queues in JRaft, so when full they act like locks). A minimal model of this cycle is sketched below.

This was caught by `com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload`, which uses extremely short disruptors (2 items max each).
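The cycle can be modeled in plain Java, with `ArrayBlockingQueue` standing in for the disruptor ring buffers and a `ReentrantLock` for `NodeImpl#writeLock`. This is a self-contained sketch of the pattern, not JRaft code; running it deadlocks all three threads deterministically.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Minimal model of the cycle: bounded queues stand in for the full
// disruptors, a ReentrantLock stands in for NodeImpl#writeLock.
public class DisruptorDeadlockSketch {
    static final ReentrantLock writeLock = new ReentrantLock();                   // NodeImpl#writeLock
    static final BlockingQueue<Runnable> diskQueue = new ArrayBlockingQueue<>(2); // LogManagerImpl#diskQueue
    static final BlockingQueue<Runnable> taskQueue = new ArrayBlockingQueue<>(2); // FSMCallerImpl#taskQueue

    public static void main(String[] args) throws InterruptedException {
        // Pre-fill both queues so every put() below blocks, as in the overload scenario.
        for (int i = 0; i < 2; i++) {
            diskQueue.put(() -> {});
            taskQueue.put(() -> {});
        }

        // "NodeImpl#executeApplyingTasks": take the write lock first...
        writeLock.lock();

        // "StableClosureEventHandler": would drain diskQueue, but first pushes
        // a result into the already-full taskQueue -> blocks.
        new Thread(() -> {
            try {
                taskQueue.put(() -> {}); // blocks: taskQueue is full
                diskQueue.take();        // never reached, so diskQueue stays full
            } catch (InterruptedException ignored) {
            }
        }).start();

        // "ApplyTaskHandler": would drain taskQueue, but its task
        // ("onConfigurationChangeDone") needs the write lock -> blocks.
        new Thread(() -> {
            writeLock.lock();            // blocks: held by the main thread
            try {
                taskQueue.poll();
            } finally {
                writeLock.unlock();
            }
        }).start();

        // ...then append entries into the full diskQueue -> blocks while still
        // holding the lock. The cycle writeLock -> diskQueue -> taskQueue ->
        // writeLock is closed; none of the three threads can make progress.
        diskQueue.put(() -> {});         // blocks forever
        writeLock.unlock();              // never reached
    }
}
```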
Steps to reproduce

Run `com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload` in a loop several times; on my local machine it is reproducible within 50-100 runs. A possible harness for the loop is sketched below.
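One way to script the loop from Java, assuming JUnit 4 on the classpath (the `RepeatNodeTest` harness below is hypothetical, not part of the repository). Note that when the deadlock strikes, the current run hangs rather than fails, so a thread dump of the stuck JVM is the telltale sign.

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Request;
import org.junit.runner.Result;

// Hypothetical harness: run the single test method up to 100 times.
// When the deadlock strikes, the current run simply hangs; take a
// thread dump (e.g. jstack) to see the lock cycle described above.
public class RepeatNodeTest {
    public static void main(String[] args) throws Exception {
        Class<?> testClass = Class.forName("com.alipay.sofa.jraft.core.NodeTest");
        Request single = Request.method(testClass, "testNodeTaskOverload");
        for (int run = 1; run <= 100; run++) {
            Result result = new JUnitCore().run(single);
            System.out.printf("run %d: failures=%d%n", run, result.getFailureCount());
            if (!result.wasSuccessful()) {
                break;
            }
        }
    }
}
```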
Environment

- JVM version (`java -version`): openjdk version "11.0.23"
- OS version (`uname -a`): macOS 14.5