
Tabletserver may run into deadlock status #762

@luoyuxia

Description

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

main (development)

Please describe the bug 🐞

The TabletServer may run into a deadlock in some cases, so that no request can be processed.
From the jstack output, I saw some threads trying to acquire the leaderIsrUpdateLock of Replica but never succeeding; they block forever.
After adding debug logs, I saw two threads that hold locks but never release them.
One thread, call it A, is:

com.alibaba.fluss.utils.concurrent.LockUtils                 [] - lock id 10555894, current stacktrace 
[java.base/java.lang.Thread.getStackTrace(Thread.java:1610), 
com.alibaba.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:60), 
com.alibaba.fluss.utils.concurrent.LockUtils.inReadLock(LockUtils.java:76), 
com.alibaba.fluss.server.replica.Replica.appendRecordsToLeader(Replica.java:807), 
com.alibaba.fluss.server.replica.ReplicaManager.appendToLocalLog(ReplicaManager.java:860), 
com.alibaba.fluss.server.replica.ReplicaManager.appendRecordsToLog(ReplicaManager.java:389), 

Another thread, call it B, is:

lock id 10558468, current stacktrace [java.base/java.lang.Thread.getStackTrace(Thread.java:1610), 
com.alibaba.fluss.utils.concurrent.LockUtils.inLock(LockUtils.java:60), 
com.alibaba.fluss.server.replica.delay.DelayedOperation.safeTryComplete(DelayedOperation.java:122), 
com.alibaba.fluss.server.replica.delay.DelayedOperationManager$Watcher.tryCompletedWatched(DelayedOperationManager.java:286), 
com.alibaba.fluss.server.replica.delay.DelayedOperationManager.checkAndComplete(DelayedOperationManager.java:136), 
com.alibaba.fluss.server.replica.Replica.tryCompleteDelayedOperations(Replica.java:1344), 

It seems the sequence is:

  1. A acquired the read lock of leaderIsrUpdateLock in the method appendRecordsToLeader.
  2. Another thread is trying to acquire the write lock of leaderIsrUpdateLock.
  3. B acquired the lock of a DelayedOperation and tries to complete the delayed fetch operation, which then tries to acquire the read lock of leaderIsrUpdateLock in the method fetchOffsetSnapshot.
  4. A tries to acquire the lock of the DelayedOperation due to the following code in the method appendRecordsToLeader:

if (hwIncreased) {
    tryCompleteDelayedOperations();
}

But since leaderIsrUpdateLock is a non-fair read-write lock, a queued writer gets priority over new readers. So in step 3,
B cannot acquire the read lock of leaderIsrUpdateLock and therefore never releases the lock of the DelayedOperation. As a result, A cannot acquire the lock of the DelayedOperation either.
That is how the deadlock happens.
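
To illustrate the underlying JDK behavior, here is a standalone demo (not Fluss code; the class and variable names are made up for illustration): once a writer is queued on a non-fair ReentrantReadWriteLock, a reader from another thread blocks even though the lock is currently only read-held, which is why B in step 3 can never get the read lock while the writer from step 2 is waiting.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterBlocksReaderDemo {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair, like leaderIsrUpdateLock

        lock.readLock().lock(); // "A" holds the read lock (appendRecordsToLeader)

        Thread writer = new Thread(() -> {
            lock.writeLock().lock();   // step 2: queued behind A's read lock
            lock.writeLock().unlock();
        });
        writer.start();
        Thread.sleep(200); // give the writer time to enqueue

        Thread readerB = new Thread(() -> {
            try {
                // step 3: "B" tries to take the read lock (fetchOffsetSnapshot) but is
                // held back by the queued writer; this times out and prints false.
                boolean acquired = lock.readLock().tryLock(1, TimeUnit.SECONDS);
                System.out.println("B acquired read lock: " + acquired);
                if (acquired) {
                    lock.readLock().unlock();
                }
            } catch (InterruptedException ignored) {
            }
        });
        readerB.start();
        readerB.join();

        lock.readLock().unlock(); // release A's read lock so the writer can finish
        writer.join();
    }
}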

Solution

Do not acquire the lock of the DelayedOperation in step 4, i.e. do not call tryCompleteDelayedOperations() while still holding the read lock of leaderIsrUpdateLock, just like Kafka.
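
For illustration, a minimal self-contained sketch of that direction (not actual Fluss or Kafka code; delayedOperationLock, doAppend and hwIncreased are hypothetical stand-ins): the delayed operations are completed only after the read lock of leaderIsrUpdateLock has been released, so the DelayedOperation lock is never taken under the read lock.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class AppendOutsideLockSketch {
    private final ReadWriteLock leaderIsrUpdateLock = new ReentrantReadWriteLock();
    private final Object delayedOperationLock = new Object(); // stands in for DelayedOperation's lock

    public void appendRecordsToLeader(Runnable doAppend, Supplier<Boolean> hwIncreased) {
        leaderIsrUpdateLock.readLock().lock();
        try {
            doAppend.run(); // append to the local log, maybe advance the high watermark
        } finally {
            leaderIsrUpdateLock.readLock().unlock();
        }

        // Outside the read lock: completing delayed operations here can no longer
        // deadlock with a thread that holds delayedOperationLock and is waiting
        // for the read lock of leaderIsrUpdateLock.
        if (hwIncreased.get()) {
            tryCompleteDelayedOperations();
        }
    }

    private void tryCompleteDelayedOperations() {
        synchronized (delayedOperationLock) {
            // try to complete delayed fetch/produce operations
        }
    }
}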

Are you willing to submit a PR?

  • I'm willing to submit a PR!
