Description
We're seeing a Dragonfly replication deadlock on our test suite, which seems to happen when a replica times out.
I can reproduce the replication deadlock locally (or at least what I'm assuming is the same issue).
Running two Dragonfly processes, but intentionally setting the master process's replication_timeout to a very small value (100ms) to force the replica to time out:
./dragonfly --alsologtostderr --dbfilename= --port 7000 --replication_timeout 100
./dragonfly --alsologtostderr --dbfilename= --port 7001
Then start populating the master with 5GB:
redis-cli -p 7000 debug populate 5000000 test 1000 rand
Then, while it's populating, configure the replica:
redis-cli -p 7001 replicaof localhost 7000
The replica will timeout as expected, but then the master is partially deadlocked.
INFO hangs even after you shut down the replica, and attempting to add another replica also hangs since the master is unresponsive.
Running v1.22.0 on AWS t4g.medium. I added some logs and I think it's a deadlock attempting to lock DflyCmd::mu_, but that's as far as I got.
Edit: poking around a bit out of curiosity, it looks like DflyCmd::BreakStalledFlowsInShard never releases mu_ because it blocks on replica_ptr->Cancel().
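To make the suspected pattern concrete, here is a minimal standalone sketch of what I think is happening. This is not Dragonfly's actual code: the class, the ReplicaInfo struct, and the method bodies are simplified assumptions that only mirror the names from the report (mu_, BreakStalledFlowsInShard, Cancel). It shows a mutex being held across a blocking Cancel() wait, so any other command that needs the same mutex (INFO, a new REPLICAOF) hangs forever.

```cpp
// Sketch of the suspected deadlock, NOT Dragonfly's implementation.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct ReplicaInfo {
  std::mutex flow_mu;
  std::condition_variable flow_cv;
  bool flows_done = false;

  // Blocks until all replication flows acknowledge cancellation.
  // If a flow can only make progress once the command mutex is
  // released, this wait never returns.
  void Cancel() {
    std::unique_lock lk(flow_mu);
    flow_cv.wait(lk, [this] { return flows_done; });
  }
};

struct DflyCmdSketch {
  std::mutex mu_;       // guards replica bookkeeping
  ReplicaInfo replica_;

  // Called when the master decides a replica has stalled (timeout).
  void BreakStalledFlowsInShard() {
    std::lock_guard lk(mu_);  // mu_ taken here...
    replica_.Cancel();        // ...and held across a blocking wait
  }                           // mu_ only released after Cancel() returns

  // Any command that also needs mu_ now blocks forever behind
  // BreakStalledFlowsInShard().
  void Info() {
    std::lock_guard lk(mu_);
    std::cout << "info\n";
  }
};

int main() {
  DflyCmdSketch cmd;
  std::thread breaker([&] { cmd.BreakStalledFlowsInShard(); });

  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::thread info([&] { cmd.Info(); });  // hangs: mu_ is never released

  breaker.join();  // this program deadlocks by design
  info.join();
}
```

If that reading is right, it would match the symptoms above: once the replica times out and BreakStalledFlowsInShard runs, INFO and any new REPLICAOF stay blocked on mu_ even after the replica process is gone.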