Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
131624: roachtest: add `failover` variants with leader leases r=nvanbenschoten a=nvanbenschoten Part of #132762. Leader leases have different availability properties than epoch leases under most failure modes. This patch adds failover test variants that use leader leases where possible. Initial test results: | test | lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) | parity with expiration | |:---------------------------------------------|-----------------:|----------------------:|------------------:|:----------------------:| | failover/chaos/read-only | 60,129 | 18,253 | 60,129 | ✔ | | failover/chaos/read-write | 60,129 | 20,401 | 60,129 | ❌❌❌ | | failover/liveness/blackhole | 9,663 | 369 | 335 | ✔ | | failover/liveness/blackhole-recv | 11,274 | 402 | 369 | ✔ | | failover/liveness/blackhole-send | 9,663 | 385 | 469 | ✔ | | failover/liveness/crash | 8,053 | 352 | 318 | ✔ | | failover/liveness/deadlock | 24,696 | 385 | 369 | ✔ | | failover/liveness/disk-stall | 26,843 | 369 | 419 | ✔ | | failover/liveness/pause | 10,200 | 385 | 436 | ✔ | | failover/non-system/blackhole | 7,247 | 7,516 | 15,032 | ❌❌ | | failover/non-system/blackhole-recv | 12,348 | 10,737 | 18,253 | ❌❌ | | failover/non-system/blackhole-send | 6,979 | 6,979 | 8,053 | ❌ | | failover/non-system/crash | 7,247 | 6,979 | 9,126 | ❌ | | failover/non-system/deadlock | 60,129 | 60,129 | 60,129 | ✔ | | failover/non-system/disk-stall | 22,548 | 22,548 | 25,769 | ❌ | | failover/non-system/pause | 7,247 | 7,247 | 9,126 | ❌ | | failover/partial/lease-gateway | 8,589 | 19,327 [^1] | 60,129 | ❌❌❌ | | failover/partial/lease-leader | 60,129 | 22,549 [^2] | 31,139 [^2] | ❌❌ | | failover/partial/lease-liveness | 8,589 | 301 | 318 | ✔ | | failover/system-non-liveness/blackhole | 369 | 402 | 352 | ✔ | | failover/system-non-liveness/blackhole-recv | 335 | 285 | 318 | ✔ | | failover/system-non-liveness/blackhole-send | 402 | 419 | 335 | ✔ | | failover/system-non-liveness/crash | 419 | 301 | 453 | ✔ | | failover/system-non-liveness/deadlock | 369 | 352 | 402 | ✔ | | failover/system-non-liveness/disk-stall | 402 | 318 | 453 | ✔ | | failover/system-non-liveness/pause | 369 | 385 | 335 | ✔ | _note: because of the way the test measures pMax, anything under 1,000ms is essentially "no impact"_ **Key _(comparing leader vs. expiration)_**: ✔ = parity ❌ = minor regression ❌❌ = major regression ❌❌❌ = unavailability [^1]: I don't understand why expiration-based lease perform worse than epoch-based leases on this test. [^2]: With #133214. Epic: none Release note: None 133214: roachtest: enable DistSender circuit breakers in failover/partial/lease-leader r=nvanbenschoten a=nvanbenschoten DistSender circuit breakers are useful in this test to avoid artificially inflated latencies due to the way the test measures failover time (pMax, no timeouts). Without circuit breakers, a request stuck on the partitioned leaseholder will get blocked indefinitely, despite the range recovering on the other side of the partition and becoming available to all new traffic. As a result, the test won't differentiate between temporary and permanent range unavailability. We have other tests which demonstrate the benefit of DistSender circuit breakers (especially when applications do not use statement timeouts), so we don't need to test them here. With this change, the test's meaured failover time drops from: | lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) | |-----------------:|----------------------:|------------------:| | 60,129 | 60,129 | 60,129 | down to: | lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) | |-----------------:|----------------------:|------------------:| | 60,129 | 22,549 | 31,139 | This is because the circuit breakers place a 10s timeout on all KV requests, so no request gets stuck indefinitely. Notice that expiration and leader leases now recover, while epoch leases remain unavailable indefinitely. Epic: None Release note: None 133281: sql: allow old enum value '1' for sql.defaults.vectorize r=mw5h a=michae2 Fixes: #133278 Release note (bug fix): This commit fixes a bug which causes new connections to fail with the following error after upgrading to v24.2: ``` ERROR: invalid value for parameter "vectorize": "unknown(1)" SQLSTATE: 22023 HINT: Available values: off,on,experimental_always ``` In order to hit this bug, the cluster must have: 1. been on version v21.1 at some point in the past 2. run `SET CLUSTER SETTING sql.defaults.vectorize = 'on';` on v21.1 3. not set sql.defaults.vectorize after upgrading past v21.1 4. upgraded all the way to v24.2 The conditions required for this bug can be detected using: ``` SELECT * FROM system.settings WHERE name = 'sql.defaults.vectorize'; ``` If the value is '1', the following statement should be run to fix it before upgrading to v24.2: ``` RESET CLUSTER SETTING sql.defaults.vectorize; ``` This commit fixes the bug by making '1' a legal value for sql.defaults.vectorize again (mapping to 'on'). Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
- Loading branch information