Deadlock on configuration application in NodeImpl when disruptors are full #1105

Closed

alievmirza opened this issue May 20, 2024 · 9 comments

@alievmirza
Contributor

alievmirza commented May 20, 2024

Describe the bug

There is a deadlock in NodeImpl when working with full LogManagerImpl#diskQueue, FSMCallerImpl#taskQueue and NodeImpl#writeLock.

  1. NodeImpl#executeApplyingTasks() takes NodeImpl#writeLock and calls LogManager#appendEntries()
  2. LogManager tries to enqueue a task to diskQueue which is full, hence it blocks until a task gets consumed from diskQueue
  3. diskQueue is consumed by StableClosureEventHandler
  4. StableClosureEventHandler tries to enqueue a task to FSMCallerImpl#taskQueue, which is also full, so this also blocks until a task gets consumed from FSMCallerImpl#taskQueue
  5. FSMCallerImpl#taskQueue is consumed by ApplyTaskHandler
  6. ApplyTaskHandler calls NodeImpl#onConfigurationChangeDone(), which tries to take NodeImpl#writeLock

As a result, there is a deadlock: NodeImpl#writeLock -> LogManager#diskQueue -> FSMCallerImpl#taskQueue -> NodeImpl#writeLock (disruptors are used as blocking queues in JRaft, so, when full, they act like locks).
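
For illustration, here is a minimal, self-contained sketch of the same cycle, collapsed to two resources: a ReentrantLock stands in for NodeImpl#writeLock and a bounded ArrayBlockingQueue stands in for a full disruptor. All names in the sketch are made up for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the cycle described above: a bounded blocking queue plays
// the role of a full disruptor, a ReentrantLock plays the role of NodeImpl#writeLock.
public class DeadlockSketch {
    static final ReentrantLock writeLock = new ReentrantLock();
    // Capacity 1 so the "disruptor" fills up immediately, like the test's tiny ring buffers.
    static final BlockingQueue<Runnable> diskQueue = new ArrayBlockingQueue<>(1);

    public static void main(String[] args) throws Exception {
        diskQueue.put(() -> { });             // pre-fill: the "disruptor" is now full

        Thread consumer = new Thread(() -> {
            // Consumer (like ApplyTaskHandler) needs writeLock before it can finish a task.
            writeLock.lock();
            try {
                diskQueue.poll();             // would free a slot, but we never get here
            } finally {
                writeLock.unlock();
            }
        });

        writeLock.lock();                     // producer, like executeApplyingTasks()
        try {
            consumer.start();
            diskQueue.put(() -> { });         // blocks: queue is full and the consumer can't drain it
            // Producer holds writeLock and waits on the queue; consumer waits on writeLock.
        } finally {
            writeLock.unlock();               // never reached -> deadlock
        }
    }
}
```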

This was caught by com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload which uses extremely short disruptors (2 items max each).
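
For context, that test makes the ring buffers tiny through the node's options. A hedged sketch of such a configuration follows; the setter name setDisruptorBufferSize on RaftOptions is assumed here, so check the RaftOptions class in your JRaft version.

```java
import com.alipay.sofa.jraft.option.NodeOptions;
import com.alipay.sofa.jraft.option.RaftOptions;

// Illustrative only: shrink the internal disruptor ring buffers so overload
// (and therefore this deadlock) becomes easy to hit.
public class TinyDisruptorOptions {
    public static NodeOptions tinyBuffers() {
        RaftOptions raftOptions = new RaftOptions();
        raftOptions.setDisruptorBufferSize(2);   // 2 slots, as in testNodeTaskOverload (setter name assumed)

        NodeOptions nodeOptions = new NodeOptions();
        nodeOptions.setRaftOptions(raftOptions);
        return nodeOptions;                      // pass to RaftGroupService / Node as usual
    }
}
```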

Steps to reproduce

Run com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload in a loop; on my local machine the failure reproduces within 50-100 runs.

Environment

  • SOFAJRaft version: v1.3.14 (latest commit 890033a)
  • JVM version (e.g. java -version): openjdk version "11.0.23"
  • OS version (e.g. uname -a): macOS 14.5
  • Maven version: 3.9.6
  • IDE version: IntelliJ IDEA 2024.1 (Community Edition)
@fengjiachun
Contributor

Thank you for your feedback.

Could you provide the test code for reproduction? I couldn't replicate the issue on my machine.
Could you please add a related test and send a PR?

@alievmirza
Contributor Author

alievmirza commented May 20, 2024

Oh, sorry, I gave the wrong name for the test class (fixed in the description); the test is com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload. To reproduce, just run the test in a loop until it fails. For that purpose I used Edit Run Configuration in IntelliJ IDEA for the test: Modify options -> Repeat -> Until Failure. You also need to comment out the following code in com.alipay.sofa.jraft.core.NodeTest#setupNodeTest so that the test can be run several times in a row.

StorageOptionsFactory.registerRocksDBTableFormatConfig(GROUP_ID, RocksDBLogStorage.class, StorageOptionsFactory
            .getDefaultRocksDBTableConfig().setBlockCacheSize(256 * SizeUnit.MB));
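
If you prefer to drive the loop outside the IDE, here is a minimal sketch using JUnit 4's JUnitCore/Request API as a stand-in for IDEA's "Repeat: Until Failure" configuration; the run count below is arbitrary.

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Request;
import org.junit.runner.Result;

// Keep running the single test method until it fails, then print the failures.
public class RepeatUntilFailure {
    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> testClass = Class.forName("com.alipay.sofa.jraft.core.NodeTest");
        JUnitCore core = new JUnitCore();
        for (int run = 1; run <= 200; run++) {
            Result result = core.run(Request.method(testClass, "testNodeTaskOverload"));
            System.out.println("run " + run + " failures=" + result.getFailureCount());
            if (!result.wasSuccessful()) {
                result.getFailures().forEach(System.out::println);
                break;
            }
        }
    }
}
```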

@alievmirza
Contributor Author

Also, I can provide a thread dump:
https://gist.github.com/alievmirza/07443fd2ae4e3f7db70fadaa06f4cc1c

@fengjiachun
Contributor

Thank you, I have reproduced it on my machine.

@fengjiachun
Contributor

The producer thread handling com.alipay.sofa.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1409) owns the write lock:

[screenshot: producer thread stack trace]

And then the consumer thread tries to acquire the same write lock:

[screenshot: consumer thread stack trace]

This is how the deadlock occurs when the disruptor is overloaded.

@killme2008
Contributor

Good catch! It's a known bug. Related issues #138 and #720 have been addressed and a fix has been attempted at #764.

It seems that we still have some work to do.

@alievmirza
Contributor Author

Do you think this issue is dangerous, or is it quite rare in real scenarios? Do you have any plans or ideas on how to properly fix it?

@killme2008
Contributor

> Do you think this issue is dangerous, or is it quite rare in real scenarios? Do you have any plans or ideas on how to properly fix it?

In real scenarios, it is quite uncommon. It occurs only when the node is overloaded, and in such circumstances, manual intervention like adding new nodes or resources may be the only viable solution.

How can it be fixed? I have not considered it carefully. One potential solution is to replace some of the blocking lock() calls on Node#writeLock with tryLock, enabling other tasks to make progress.
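
For illustration, a hedged sketch of that tryLock idea (not the actual NodeImpl code; the timeout and error handling are placeholders):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

// Sketch: instead of blocking on writeLock (which can close the
// writeLock -> diskQueue -> taskQueue -> writeLock cycle), give up after a
// bounded wait and let the caller fail the task, e.g. with RaftError.EBUSY.
final class TryLockSketch {
    static boolean runWithWriteLock(Lock writeLock, Runnable criticalSection) {
        boolean acquired = false;
        try {
            acquired = writeLock.tryLock(100, TimeUnit.MILLISECONDS); // bounded wait, value is illustrative
            if (!acquired) {
                return false;                 // fail fast instead of blocking forever
            }
            criticalSection.run();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (acquired) {
                writeLock.unlock();
            }
        }
    }
}
```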

@killme2008
Contributor

I tried a fix in #1109, which makes applying tasks fail fast when the log manager is overloaded.
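
For reference, a minimal sketch of the fail-fast idea using LMAX Disruptor's RingBuffer#tryPublishEvent (illustrative only, not the actual change in #1109):

```java
import com.lmax.disruptor.EventTranslator;
import com.lmax.disruptor.RingBuffer;

// Sketch: use the non-blocking tryPublishEvent instead of publishEvent, so a
// full ring buffer is reported to the caller (e.g. as a busy error) rather
// than blocking while a lock is held.
final class FailFastPublish {
    static <E> boolean offer(RingBuffer<E> ringBuffer, EventTranslator<E> translator) {
        if (!ringBuffer.tryPublishEvent(translator)) {
            // Buffer full: reject the task instead of blocking the caller.
            return false;
        }
        return true;
    }
}
```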
