Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
main (development)
Please describe the bug 🐞
The TabletServer may run into a deadlock in some cases, so that no request can be processed.
From the jstack output, I saw that some threads are trying to acquire leaderIsrUpdateLock of Replica but cannot acquire it. They block forever.
After adding debug logs, I saw that two threads hold locks and never release them.
One thread (call it A) is:
om.alibaba.fluss.utils.concurrent.LockUtils [] - lock id 10555894, current stacktrace
[java.base/java.lang.Thread.getStackTrace(Thread.java:1610),
com.alibaba.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:60),
com.alibaba.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:76),
com.alibaba.fluss.server.replica.Replica.appendRecordsToLeader(Replica.java:807),
com.alibaba.fluss.server.replica.ReplicaManager.appendToLocalLog(ReplicaManager.java:860),
com.alibaba.fluss.server.replica.ReplicaManager.appendRecordsToLog(ReplicaManager.java:389),
The other thread (call it B) is:
lock id 10558468, current stacktrace [java.base/java.lang.Thread.getStackTrace(Thread.java:1610),
com.alibaba.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:60),
com.alibaba.fluss.server.replica.delay.DelayedOperation.safeTryComplete(DelayedOperation.java:122),
com.alibaba.fluss.server.replica.delay.DelayedOperationManager$Watcher.tryCompletedWatched(DelayedOperationManager.java:286),
com.alibaba.fluss.server.replica.delay.DelayedOperationManager.checkAndComplete(DelayedOperationManager.java:136),
com.alibaba.fluss.server.replica.Replica.tryCompleteDelayedOperations(Replica.java:1344),
It seems the sequence is:
1. A acquires the read lock of leaderIsrUpdateLock in method appendRecordsToLeader.
2. Another thread tries to acquire the write lock of leaderIsrUpdateLock.
3. B acquires the lock of DelayedOperation and tries to complete a delayed fetch operation, which then tries to acquire the read lock of leaderIsrUpdateLock in method fetchOffsetSnapshot.
4. A tries to acquire the lock of DelayedOperation because of the following code in method appendRecordsToLeader:
    if (hwIncreased) {
        tryCompleteDelayedOperations();
    }
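But the read-write lock leaderIsrUpdateLock is a non-fair lock, which gives threads waiting for the write lock higher priority than new readers. So, in step 3,
B can't acquire the read lock of leaderIsrUpdateLock, so it never finishes and never releases the lock of DelayedOperation. That in turn means A can't acquire the lock of DelayedOperation.
Meanwhile, the writer from step 2 is waiting for A to release its read lock, which closes the cycle. That is how the deadlock happens.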
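The blocking in step 3 can be reproduced in isolation. The following is a minimal sketch, assuming leaderIsrUpdateLock behaves like a default (non-fair) java.util.concurrent.locks.ReentrantReadWriteLock. The class and method names are hypothetical and are not taken from the Fluss code base. A timed tryLock is used instead of lock() so that the demo terminates instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class, not part of Fluss.
final class QueuedWriterBlocksReaderDemo {

    /** Returns true if reader B could NOT get the read lock while a writer was queued. */
    static boolean newReaderIsBlocked() throws InterruptedException {
        // Assumption: same kind of lock as leaderIsrUpdateLock (non-fair by default).
        ReentrantReadWriteLock leaderIsrUpdateLock = new ReentrantReadWriteLock();
        boolean[] acquiredByB = new boolean[1];

        // Step 1: "thread A" (the current thread) holds the read lock.
        leaderIsrUpdateLock.readLock().lock();
        Thread writer = new Thread(() -> {
            // Step 2: a writer queues up behind A's read lock.
            leaderIsrUpdateLock.writeLock().lock();
            leaderIsrUpdateLock.writeLock().unlock();
        });
        try {
            writer.start();
            // Wait until the writer is parked in the lock's wait queue.
            while (writer.getState() != Thread.State.WAITING) {
                Thread.sleep(1);
            }
            // Step 3: "thread B" asks for the read lock while the writer is queued.
            Thread readerB = new Thread(() -> {
                try {
                    if (leaderIsrUpdateLock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                        acquiredByB[0] = true;
                        leaderIsrUpdateLock.readLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            readerB.start();
            readerB.join();
        } finally {
            // A releases its read lock so the writer can proceed and the demo can exit.
            leaderIsrUpdateLock.readLock().unlock();
        }
        writer.join();
        return !acquiredByB[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reader B blocked behind queued writer: " + newReaderIsBlocked());
    }
}
```

In the real scenario B uses a blocking acquire while holding the DelayedOperation lock, and A never releases its read lock because it is waiting for that same DelayedOperation lock, so nothing ever times out.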
Solution
Do not acquire the lock of DelayedOperation in step 4, i.e. do not call tryCompleteDelayedOperations() there, just like Kafka.
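One possible shape of this change, as a rough sketch only: record whether the high watermark increased while holding the read lock, and call tryCompleteDelayedOperations() only after the read lock has been released. The class ReplicaSketch, its fields, and maybeIncrementHighWatermark are hypothetical simplifications for illustration. They are not the actual Fluss Replica code, and the real fix may differ.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, heavily simplified stand-in for Replica.
final class ReplicaSketch {
    private final ReentrantReadWriteLock leaderIsrUpdateLock = new ReentrantReadWriteLock();
    private final AtomicLong highWatermark = new AtomicLong(0L);

    // Bookkeeping for the demo only.
    int callbackInvocations = 0;
    int readHoldsSeenByCallback = -1;

    void appendRecordsToLeader(long newEndOffset) {
        boolean hwIncreased;
        leaderIsrUpdateLock.readLock().lock();
        try {
            // Only compute the flag while holding the read lock.
            hwIncreased = maybeIncrementHighWatermark(newEndOffset);
        } finally {
            leaderIsrUpdateLock.readLock().unlock();
        }
        // The DelayedOperation lock is taken only after leaderIsrUpdateLock is released.
        if (hwIncreased) {
            tryCompleteDelayedOperations();
        }
    }

    private boolean maybeIncrementHighWatermark(long candidate) {
        long previous = highWatermark.getAndAccumulate(candidate, Math::max);
        return candidate > previous;
    }

    private void tryCompleteDelayedOperations() {
        // In the real code this would lock each DelayedOperation; here we only
        // record how many read holds the calling thread still has (expected: 0).
        callbackInvocations++;
        readHoldsSeenByCallback = leaderIsrUpdateLock.getReadHoldCount();
    }
}
```

With this ordering, thread A no longer holds the read lock of leaderIsrUpdateLock when it waits for the lock of DelayedOperation, so the queued writer can proceed and B's read-lock request is eventually served.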
Are you willing to submit a PR?
- I'm willing to submit a PR!