Description
We're seeing a Dragonfly replication deadlock on our test suite, which seems to happen when a replica times out.
I can reproduce the replication deadlock locally (or at least what I'm assuming is the same issue).
Running two Dragonfly processes, but intentionally setting the master process's replication_timeout to a very small value (100ms) to force the replica to time out:
./dragonfly --alsologtostderr --dbfilename= --port 7000 --replication_timeout 100
./dragonfly --alsologtostderr --dbfilename= --port 7001
Then start populating the master with 5GB:
redis-cli -p 7000 debug populate 5000000 test 1000 rand
Then, while it's populating, configure the replica:
redis-cli -p 7001 replicaof localhost 7000
The replica will timeout as expected, but then the master is partially deadlocked.
INFO hangs even after you shut down the replica, and attempting to add another replica also hangs since the master is unresponsive.
Running v1.22.0 on AWS t4g.medium. I added some logs and I think it's a deadlock attempting to lock DflyCmd::mu_, but that's as far as I got.
Edit: poking around a bit out of curiosity, it looks like DflyCmd::BreakStalledFlowsInShard never releases mu_ because it blocks on replica_ptr->Cancel().
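To make the suspected pattern concrete, here is a minimal standalone sketch of what I think is happening. This is not Dragonfly's actual code: the class, the ReplicaInfo struct, and the method bodies are simplified assumptions that only mirror the names from the report (mu_, BreakStalledFlowsInShard, Cancel). It shows a mutex being held across a blocking Cancel() wait, so any other command that needs the same mutex (INFO, a new REPLICAOF) hangs forever.

```cpp
// Sketch of the suspected deadlock, NOT Dragonfly's implementation.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct ReplicaInfo {
  std::mutex flow_mu;
  std::condition_variable flow_cv;
  bool flows_done = false;

  // Blocks until all replication flows acknowledge cancellation.
  // If a flow can only make progress once the command mutex is
  // released, this wait never returns.
  void Cancel() {
    std::unique_lock lk(flow_mu);
    flow_cv.wait(lk, [this] { return flows_done; });
  }
};

struct DflyCmdSketch {
  std::mutex mu_;       // guards replica bookkeeping
  ReplicaInfo replica_;

  // Called when the master decides a replica has stalled (timeout).
  void BreakStalledFlowsInShard() {
    std::lock_guard lk(mu_);  // mu_ taken here...
    replica_.Cancel();        // ...and held across a blocking wait
  }                           // mu_ only released after Cancel() returns

  // Any command that also needs mu_ now blocks forever behind
  // BreakStalledFlowsInShard().
  void Info() {
    std::lock_guard lk(mu_);
    std::cout << "info\n";
  }
};

int main() {
  DflyCmdSketch cmd;
  std::thread breaker([&] { cmd.BreakStalledFlowsInShard(); });

  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::thread info([&] { cmd.Info(); });  // hangs: mu_ is never released

  breaker.join();  // this program deadlocks by design
  info.join();
}
```

If that reading is right, it would match the symptoms above: once the replica times out and BreakStalledFlowsInShard runs, INFO and any new REPLICAOF stay blocked on mu_ even after the replica process is gone.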