Skip to content

Commit

Permalink
roachtest: enable DistSender circuit breakers in failover/partial/lea…
Browse files Browse the repository at this point in the history
…se-leader

DistSender circuit breakers are useful in this test to avoid
artificially inflated latencies due to the way the test measures
failover time (pMax, no timeouts). Without circuit breakers, a request
stuck on the partitioned leaseholder will get blocked indefinitely,
despite the range recovering on the other side of the partition and
becoming available to all new traffic. As a result, the test won't
differentiate between temporary and permanent range unavailability. We
have other tests which demonstrate the benefit of DistSender circuit
breakers (especially when applications do not use statement timeouts),
so we don't need to test them here.

With this change, the test's failover time drops from:

| lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) |
|-----------------:|----------------------:|------------------:|
| 60,129           | 60,129                | 60,129            |

down to:

| lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) |
|-----------------:|----------------------:|------------------:|
| 60,129           | 22,549                | 31,139            |

Notice that expiration and leader leases now recover, while epoch leases
remain unavailable indefinitely.

Epic: None
Release note: None
  • Loading branch information
nvanbenschoten committed Oct 23, 2024
1 parent 9e49a0b commit 9b08a1e
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions pkg/cmd/roachtest/tests/failover.go
Original file line number Diff line number Diff line change
Expand Up @@ -539,6 +539,18 @@ func runFailoverPartialLeaseLeader(ctx context.Context, t test.Test, c cluster.C
settings := install.MakeClusterSettings()
settings.Env = append(settings.Env, "COCKROACH_SCAN_MAX_IDLE_TIME=100ms") // speed up replication

// DistSender circuit breakers are useful in this test to avoid artificially
// inflated latencies due to the way the test measures failover time. Without
// circuit breakers, a request stuck on the partitioned leaseholder will get
// blocked indefinitely, despite the range recovering on the other side of the
// partition. As a result, the test won't differentiate between temporary and
// permanent range unavailability. We have other tests which demonstrate the
// benefit of DistSender circuit breakers (especially when applications do not
// use statement timeouts), so we don't need to test them here.
// TODO(arul): this can be removed if/when we turn on DistSender circuit
// breakers for all ranges by default.
settings.ClusterSettings["kv.dist_sender.circuit_breakers.mode"] = "all ranges"

m := c.NewMonitor(ctx, c.CRDBNodes())

failer := makeFailer(t, c, m, failureModeBlackhole, settings, rng).(PartialFailer)
Expand Down

0 comments on commit 9b08a1e

Please sign in to comment.