Replication deadlock when replica times out #3649

@andydunstall

We're seeing a Dragonfly replication deadlock in our test suite, which seems to happen when a replica times out.

I can reproduce the replication deadlock locally (or at least what I'm assuming is the same issue).

Run two Dragonfly processes, intentionally setting the master's replication_timeout to a very small value (100ms) to force the replica to time out:

./dragonfly --alsologtostderr --dbfilename= --port 7000 --replication_timeout 100
./dragonfly --alsologtostderr --dbfilename= --port 7001

Then start populating the master with 5GB:

redis-cli -p 7000 debug populate 5000000 test 1000 rand

Then, while it's populating, configure the replica:

redis-cli -p 7001 replicaof localhost 7000

The replica will time out as expected, but then the master is partially deadlocked.

INFO hangs even after you shut down the replica, and attempting to add another replica also hangs because the master is unresponsive.

Running v1.22.0 on AWS t4g.medium. I added some logs and I think it's a deadlock while attempting to lock DflyCmd::mu_, but that's as far as I got.

Edit: poking around a bit out of curiosity, it looks like DflyCmd::BreakStalledFlowsInShard never releases mu_ because it blocks on replica_ptr->Cancel() while holding the lock.
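
To make the suspected pattern concrete, here is a minimal standalone sketch. It is not Dragonfly's code: `FakeReplica`, `FinishFlow`, `InfoStarvedWhileCancelBlocks`, and the 200ms wait are all hypothetical, plain threads are used instead of Dragonfly's fibers, and a `std::timed_mutex` stands in for `DflyCmd::mu_` so the second caller gives up instead of hanging the way INFO does.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

// Hypothetical stand-in for a replica whose Cancel() blocks until its
// flow has shut down. This is NOT Dragonfly's real replica type.
struct FakeReplica {
  std::promise<void> flow_done;
  std::future<void> done = flow_done.get_future();

  void Cancel() {
    assert(done.valid());
    done.wait();  // blocks until FinishFlow() is called
  }
  void FinishFlow() { flow_done.set_value(); }
};

// Returns true if an INFO-like caller could not acquire the mutex while
// another thread was blocked inside Cancel() with the mutex held.
bool InfoStarvedWhileCancelBlocks() {
  std::timed_mutex mu;  // stands in for DflyCmd::mu_ (assumption)
  FakeReplica replica;
  std::promise<void> lock_taken;
  std::future<void> lock_taken_fut = lock_taken.get_future();

  // Plays the role of BreakStalledFlowsInShard: takes the lock, then
  // blocks on Cancel() without releasing it.
  std::thread breaker([&] {
    std::lock_guard<std::timed_mutex> guard(mu);
    lock_taken.set_value();
    replica.Cancel();
  });

  lock_taken_fut.wait();  // make sure the breaker really holds the lock

  // INFO-like caller: in this sketch it gives up after 200ms.
  bool acquired = mu.try_lock_for(200ms);
  if (acquired) mu.unlock();

  replica.FinishFlow();  // unblock Cancel() so the sketch terminates
  breaker.join();
  return !acquired;
}
```

In this sketch `InfoStarvedWhileCancelBlocks()` returns true: the lock is unavailable for as long as `Cancel()` is blocked. If `Cancel()` never returns, as seems to happen on the master here, every later caller that needs the mutex waits indefinitely.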

Labels: bug, important, urgent
