131624: roachtest: add `failover` variants with leader leases r=nvanbenschoten a=nvanbenschoten

Part of #132762.

Leader leases have different availability properties than epoch leases under most failure modes. This patch adds failover test variants that use leader leases where possible.
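
As a rough illustration of what a "variant" means here, the sketch below condenses the per-lease-type behavior from this patch's `test_runner.go` and `failover.go` changes into two standalone functions. The `LeaseType` type, `clusterSettings`, and `nameSuffix` are simplified, hypothetical stand-ins written for this example, not the roachtest registry API; only the setting names, values, and suffix strings are taken from the diff.

```go
package main

import "fmt"

// LeaseType is a simplified stand-in for registry.LeaseType (assumption:
// the real type has more values, such as the default and metamorphic ones).
type LeaseType int

const (
	EpochLeases LeaseType = iota
	ExpirationLeases
	LeaderLeases
)

// clusterSettings returns the cluster settings that select each lease type,
// using the setting names and values that appear in this patch.
func clusterSettings(l LeaseType) map[string]string {
	switch l {
	case EpochLeases:
		return map[string]string{
			"kv.expiration_leases_only.enabled":             "false",
			"kv.raft.leader_fortification.fraction_enabled": "0.0",
		}
	case ExpirationLeases:
		return map[string]string{
			"kv.expiration_leases_only.enabled": "true",
		}
	case LeaderLeases:
		return map[string]string{
			"kv.expiration_leases_only.enabled":             "false",
			"kv.raft.leader_fortification.fraction_enabled": "1.0",
		}
	default:
		panic(fmt.Sprintf("unknown lease type: %d", int(l)))
	}
}

// nameSuffix returns the suffix that distinguishes each test variant.
// Epoch leases keep the empty suffix, as in the patch.
func nameSuffix(l LeaseType) string {
	switch l {
	case EpochLeases:
		return ""
	case ExpirationLeases:
		return "/lease=expiration"
	case LeaderLeases:
		return "/lease=leader"
	default:
		panic(fmt.Sprintf("unknown lease type: %d", int(l)))
	}
}

func main() {
	for _, l := range []LeaseType{EpochLeases, ExpirationLeases, LeaderLeases} {
		fmt.Printf("suffix=%q settings=%v\n", nameSuffix(l), clusterSettings(l))
	}
}
```

Note that the epoch variant now pins leader fortification to `0.0` explicitly, so the epoch and leader variants differ only in that one setting.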

Initial test results:

| test                                         | lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) | parity with expiration |
|:---------------------------------------------|-----------------:|----------------------:|------------------:|:----------------------:|
| failover/chaos/read-only                     | 60,129           | 18,253                | 60,129            | ✔                      |
| failover/chaos/read-write                    | 60,129           | 20,401                | 60,129            | ❌❌❌                    |
| failover/liveness/blackhole                  | 9,663            | 369                   | 335               | ✔                      |
| failover/liveness/blackhole-recv             | 11,274           | 402                   | 369               | ✔                      |
| failover/liveness/blackhole-send             | 9,663            | 385                   | 469               | ✔                      |
| failover/liveness/crash                      | 8,053            | 352                   | 318               | ✔                      |
| failover/liveness/deadlock                   | 24,696           | 385                   | 369               | ✔                      |
| failover/liveness/disk-stall                 | 26,843           | 369                   | 419               | ✔                      |
| failover/liveness/pause                      | 10,200           | 385                   | 436               | ✔                      |
| failover/non-system/blackhole                | 7,247            | 7,516                 | 15,032            | ❌❌                     |
| failover/non-system/blackhole-recv           | 12,348           | 10,737                | 18,253            | ❌❌                     |
| failover/non-system/blackhole-send           | 6,979            | 6,979                 | 8,053             | ❌                      |
| failover/non-system/crash                    | 7,247            | 6,979                 | 9,126             | ❌                      |
| failover/non-system/deadlock                 | 60,129           | 60,129                | 60,129            | ✔                      |
| failover/non-system/disk-stall               | 22,548           | 22,548                | 25,769            | ❌                      |
| failover/non-system/pause                    | 7,247            | 7,247                 | 9,126             | ❌                      |
| failover/partial/lease-gateway               | 8,589            | 19,327 [^1]           | 60,129            | ❌❌❌                    |
| failover/partial/lease-leader                | 60,129           | 22,549 [^2]           | 31,139 [^2]       | ❌❌                     |
| failover/partial/lease-liveness              | 8,589            | 301                   | 318               | ✔                      |
| failover/system-non-liveness/blackhole       | 369              | 402                   | 352               | ✔                      |
| failover/system-non-liveness/blackhole-recv  | 335              | 285                   | 318               | ✔                      |
| failover/system-non-liveness/blackhole-send  | 402              | 419                   | 335               | ✔                      |
| failover/system-non-liveness/crash           | 419              | 301                   | 453               | ✔                      |
| failover/system-non-liveness/deadlock        | 369              | 352                   | 402               | ✔                      |
| failover/system-non-liveness/disk-stall      | 402              | 318                   | 453               | ✔                      |
| failover/system-non-liveness/pause           | 369              | 385                   | 335               | ✔                      |

_note: because of the way the test measures pMax, anything under 1,000ms is essentially "no impact"_

**Key _(comparing leader vs. expiration)_**:
✔ = parity
❌ = minor regression
❌❌ = major regression
❌❌❌ = unavailability

[^1]: I don't understand why expiration-based leases perform worse than epoch-based leases on this test.
[^2]: With #133214.

Epic: none
Release note: None

133214: roachtest: enable DistSender circuit breakers in failover/partial/lease-leader r=nvanbenschoten a=nvanbenschoten

DistSender circuit breakers are useful in this test to avoid artificially inflated latencies due to the way the test measures failover time (pMax, no timeouts). Without circuit breakers, a request stuck on the partitioned leaseholder will get blocked indefinitely, despite the range recovering on the other side of the partition and becoming available to all new traffic. As a result, the test won't differentiate between temporary and permanent range unavailability. We have other tests which demonstrate the benefit of DistSender circuit breakers (especially when applications do not use statement timeouts), so we don't need to test them here.

With this change, the test's measured failover time drops from:

| lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) |
|-----------------:|----------------------:|------------------:|
| 60,129           | 60,129                | 60,129            |

down to:

| lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) |
|-----------------:|----------------------:|------------------:|
| 60,129           | 22,549                | 31,139            |

This is because the circuit breakers place a 10s timeout on all KV requests, so no request gets stuck indefinitely. Notice that expiration and leader leases now recover, while epoch leases remain unavailable indefinitely.
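
The effect of bounding each request can be sketched with a plain `context.WithTimeout`. This is a generic illustration of the idea only, not how DistSender circuit breakers are implemented; `stuckRequest`, `sendWithTimeout`, and `releasedByTimeout` are hypothetical names, and a short timeout stands in for the 10s mentioned above so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stuckRequest models a request sent to a partitioned leaseholder: it never
// receives a response and returns only once its context is done.
func stuckRequest(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// sendWithTimeout bounds a single attempt so that it cannot block forever.
func sendWithTimeout(
	parent context.Context, timeout time.Duration, send func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return send(ctx)
}

// releasedByTimeout reports whether a stuck request was released by the
// deadline after the given number of milliseconds.
func releasedByTimeout(ms int) bool {
	timeout := time.Duration(ms) * time.Millisecond
	err := sendWithTimeout(context.Background(), timeout, stuckRequest)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// Without a deadline, stuckRequest would block indefinitely and a
	// pMax-style measurement would only ever see the full test duration.
	fmt.Println(releasedByTimeout(50))
}
```

Once the stuck attempt returns an error, a caller can retry against the range's new leaseholder, which is what lets the measured maximum latency reflect recovery rather than the length of the test.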

Epic: None
Release note: None

133281: sql: allow old enum value '1' for sql.defaults.vectorize r=mw5h a=michae2

Fixes: #133278

Release note (bug fix): This commit fixes a bug which causes new connections to fail with the following error after upgrading to v24.2:

```
ERROR: invalid value for parameter "vectorize": "unknown(1)"
SQLSTATE: 22023
HINT: Available values: off,on,experimental_always
```

In order to hit this bug, the cluster must have:
1. been on version v21.1 at some point in the past
2. run `SET CLUSTER SETTING sql.defaults.vectorize = 'on';` on v21.1
3. not set sql.defaults.vectorize after upgrading past v21.1
4. upgraded all the way to v24.2

The conditions required for this bug can be detected using:

```
SELECT * FROM system.settings WHERE name = 'sql.defaults.vectorize';
```

If the value is '1', the following statement should be run to fix it before upgrading to v24.2:

```
RESET CLUSTER SETTING sql.defaults.vectorize;
```

This commit fixes the bug by making '1' a legal value for sql.defaults.vectorize again (mapping to 'on').
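
A minimal sketch of the resulting mapping, assuming a simplified stand-in for `sessiondatapb.VectorizeExecMode`: the type, constant names, and `modeNames` table below are hypothetical, and the `unknown(%d)` fallback merely mirrors the error text quoted above rather than the real implementation.

```go
package main

import "fmt"

// vectorizeMode is a hypothetical stand-in for sessiondatapb.VectorizeExecMode.
type vectorizeMode int32

const (
	modeUnset              vectorizeMode = 0
	modeDeprecated201Auto  vectorizeMode = 1 // old "201auto" value from v21.1 and earlier
	modeOn                 vectorizeMode = 2
	modeExperimentalAlways vectorizeMode = 3
	modeOff                vectorizeMode = 4
)

// modeNames lists only the values that have their own user-visible name.
var modeNames = map[vectorizeMode]string{
	modeOn:                 "on",
	modeExperimentalAlways: "experimental_always",
	modeOff:                "off",
}

// String treats both the unset value and the legacy value 1 as "on".
func (m vectorizeMode) String() string {
	if m == modeUnset || m == modeDeprecated201Auto {
		m = modeOn
	}
	if name, ok := modeNames[m]; ok {
		return name
	}
	return fmt.Sprintf("unknown(%d)", int32(m))
}

func main() {
	for m := modeUnset; m <= modeOff; m++ {
		fmt.Printf("%d: %s\n", int32(m), m)
	}
}
```

With this mapping, a stored value of 0, 1, or 2 is reported as `on`, so a cluster that still has `1` in `system.settings` no longer hits the invalid-value error.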

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
3 people committed Oct 24, 2024
4 parents 88ba5a1 + 961c233 + 9b08a1e + c756c80 commit 4efd2b4
Showing 10 changed files with 105 additions and 9 deletions.
2 changes: 1 addition & 1 deletion docs/generated/settings/settings-for-tenants.txt
@@ -295,7 +295,7 @@ This session variable default should now be configured using ALTER ROLE... SET:
sql.defaults.use_declarative_schema_changer enumeration on "default value for use_declarative_schema_changer session setting;disables new schema changer by default [off = 0, on = 1, unsafe = 2, unsafe_always = 3]
This cluster setting is being kept to preserve backwards-compatibility.
This session variable default should now be configured using ALTER ROLE... SET: https://www.cockroachlabs.com/docs/stable/alter-role.html" application
-sql.defaults.vectorize enumeration on "default vectorize mode [on = 0, on = 2, experimental_always = 3, off = 4]
+sql.defaults.vectorize enumeration on "default vectorize mode [on = 0, on = 1, on = 2, experimental_always = 3, off = 4]
This cluster setting is being kept to preserve backwards-compatibility.
This session variable default should now be configured using ALTER ROLE... SET: https://www.cockroachlabs.com/docs/stable/alter-role.html" application
sql.defaults.zigzag_join.enabled boolean false "default value for enable_zigzag_join session setting; disallows use of zig-zag join by default
2 changes: 1 addition & 1 deletion docs/generated/settings/settings.html
@@ -253,7 +253,7 @@
<tr><td><div id="setting-sql-defaults-transaction-rows-written-err" class="anchored"><code>sql.defaults.transaction_rows_written_err</code></div></td><td>integer</td><td><code>0</code></td><td>the limit for the number of rows written by a SQL transaction which - once exceeded - will fail the transaction (or will trigger a logging event to SQL_INTERNAL_PERF for internal transactions); use 0 to disable<br/>This cluster setting is being kept to preserve backwards-compatibility.<br/>This session variable default should now be configured using <a href="alter-role.html"><code>ALTER ROLE... SET</code></a></td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-sql-defaults-transaction-rows-written-log" class="anchored"><code>sql.defaults.transaction_rows_written_log</code></div></td><td>integer</td><td><code>0</code></td><td>the threshold for the number of rows written by a SQL transaction which - once exceeded - will trigger a logging event to SQL_PERF (or SQL_INTERNAL_PERF for internal transactions); use 0 to disable<br/>This cluster setting is being kept to preserve backwards-compatibility.<br/>This session variable default should now be configured using <a href="alter-role.html"><code>ALTER ROLE... SET</code></a></td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-sql-defaults-use-declarative-schema-changer" class="anchored"><code>sql.defaults.use_declarative_schema_changer</code></div></td><td>enumeration</td><td><code>on</code></td><td>default value for use_declarative_schema_changer session setting;disables new schema changer by default [off = 0, on = 1, unsafe = 2, unsafe_always = 3]<br/>This cluster setting is being kept to preserve backwards-compatibility.<br/>This session variable default should now be configured using <a href="alter-role.html"><code>ALTER ROLE... SET</code></a></td><td>Serverless/Dedicated/Self-Hosted</td></tr>
-<tr><td><div id="setting-sql-defaults-vectorize" class="anchored"><code>sql.defaults.vectorize</code></div></td><td>enumeration</td><td><code>on</code></td><td>default vectorize mode [on = 0, on = 2, experimental_always = 3, off = 4]<br/>This cluster setting is being kept to preserve backwards-compatibility.<br/>This session variable default should now be configured using <a href="alter-role.html"><code>ALTER ROLE... SET</code></a></td><td>Serverless/Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-sql-defaults-vectorize" class="anchored"><code>sql.defaults.vectorize</code></div></td><td>enumeration</td><td><code>on</code></td><td>default vectorize mode [on = 0, on = 1, on = 2, experimental_always = 3, off = 4]<br/>This cluster setting is being kept to preserve backwards-compatibility.<br/>This session variable default should now be configured using <a href="alter-role.html"><code>ALTER ROLE... SET</code></a></td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-sql-defaults-zigzag-join-enabled" class="anchored"><code>sql.defaults.zigzag_join.enabled</code></div></td><td>boolean</td><td><code>false</code></td><td>default value for enable_zigzag_join session setting; disallows use of zig-zag join by default<br/>This cluster setting is being kept to preserve backwards-compatibility.<br/>This session variable default should now be configured using <a href="alter-role.html"><code>ALTER ROLE... SET</code></a></td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-sql-distsql-temp-storage-workmem" class="anchored"><code>sql.distsql.temp_storage.workmem</code></div></td><td>byte size</td><td><code>64 MiB</code></td><td>maximum amount of memory in bytes a processor can use before falling back to temp storage</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-sql-guardrails-max-row-size-err" class="anchored"><code>sql.guardrails.max_row_size_err</code></div></td><td>byte size</td><td><code>512 MiB</code></td><td>maximum size of row (or column family if multiple column families are in use) that SQL can write to the database, above which an error is returned; use 0 to disable</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
7 changes: 6 additions & 1 deletion pkg/cmd/roachtest/registry/test_spec.go
@@ -224,6 +224,8 @@ func (l LeaseType) String() string {
return "epoch"
case ExpirationLeases:
return "expiration"
+case LeaderLeases:
+return "leader"
case MetamorphicLeases:
return "metamorphic"
default:
@@ -238,8 +240,11 @@ const (
EpochLeases
// ExpirationLeases uses expiration leases for all ranges.
ExpirationLeases
+// LeaderLeases uses leader leases where possible.
+LeaderLeases
// MetamorphicLeases randomly chooses epoch or expiration
-// leases (across the entire cluster)
+// leases (across the entire cluster).
+// TODO(nvanbenschoten): add leader leases to this mix.
MetamorphicLeases
)

4 changes: 4 additions & 0 deletions pkg/cmd/roachtest/test_runner.go
@@ -902,8 +902,12 @@ func (r *testRunner) runWorker(
case registry.DefaultLeases:
case registry.EpochLeases:
c.clusterSettings["kv.expiration_leases_only.enabled"] = "false"
+c.clusterSettings["kv.raft.leader_fortification.fraction_enabled"] = "0.0"
case registry.ExpirationLeases:
c.clusterSettings["kv.expiration_leases_only.enabled"] = "true"
+case registry.LeaderLeases:
+c.clusterSettings["kv.expiration_leases_only.enabled"] = "false"
+c.clusterSettings["kv.raft.leader_fortification.fraction_enabled"] = "1.0"
case registry.MetamorphicLeases:
enabled := prng.Float64() < 0.5
c.status(fmt.Sprintf("metamorphically setting kv.expiration_leases_only.enabled = %t",
26 changes: 24 additions & 2 deletions pkg/cmd/roachtest/tests/failover.go
@@ -59,10 +59,20 @@ var rangeLeaseRenewalDuration = func() time.Duration {
// requests are successful with nominal latencies. See also:
// https://github.com/cockroachdb/cockroach/issues/103654
func registerFailover(r registry.Registry) {
-for _, leases := range []registry.LeaseType{registry.EpochLeases, registry.ExpirationLeases} {
+leaseTypes := []registry.LeaseType{registry.EpochLeases, registry.ExpirationLeases, registry.LeaderLeases}
+for _, leases := range leaseTypes {
var leasesStr string
-if leases == registry.ExpirationLeases {
+switch leases {
+case registry.EpochLeases:
+// TODO(nvanbenschoten): when leader leases become the default, we should
+// change this to "/lease=epoch" and change leader leases to "".
+leasesStr = ""
+case registry.ExpirationLeases:
leasesStr = "/lease=expiration"
+case registry.LeaderLeases:
+leasesStr = "/lease=leader"
+default:
+panic(errors.AssertionFailedf("unknown lease type: %v", leases))
}

for _, readOnly := range []bool{false, true} {
@@ -539,6 +549,18 @@ func runFailoverPartialLeaseLeader(ctx context.Context, t test.Test, c cluster.C
settings := install.MakeClusterSettings()
settings.Env = append(settings.Env, "COCKROACH_SCAN_MAX_IDLE_TIME=100ms") // speed up replication

+// DistSender circuit breakers are useful in this test to avoid artificially
+// inflated latencies due to the way the test measures failover time. Without
+// circuit breakers, a request stuck on the partitioned leaseholder will get
+// blocked indefinitely, despite the range recovering on the other side of the
+// partition. As a result, the test won't differentiate between temporary and
+// permanent range unavailability. We have other tests which demonstrate the
+// benefit of DistSender circuit breakers (especially when applications do not
+// use statement timeouts), so we don't need to test them here.
+// TODO(arul): this can be removed if/when we turn on DistSender circuit
+// breakers for all ranges by default.
+settings.ClusterSettings["kv.dist_sender.circuit_breakers.mode"] = "all ranges"
+
m := c.NewMonitor(ctx, c.CRDBNodes())

failer := makeFailer(t, c, m, failureModeBlackhole, settings, rng).(PartialFailer)
3 changes: 2 additions & 1 deletion pkg/cmd/roachtest/tests/tpcc.go
@@ -1280,7 +1280,8 @@ type tpccBenchSpec struct {
// Encryption-At-Rest / EAR).
EncryptionEnabled bool
// ExpirationLeases enables use of expiration-based leases.
-ExpirationLeases bool
+ExpirationLeases bool
+// TODO(nvanbenschoten): add a leader lease variant.
EnableDefaultScheduledBackup bool
// SharedProcessMT, if true, indicates that the cluster should run in
// shared-process mode of multi-tenancy.
6 changes: 5 additions & 1 deletion pkg/sql/exec_util.go
@@ -586,7 +586,11 @@ var VectorizeClusterMode = settings.RegisterEnumSetting(
func() map[sessiondatapb.VectorizeExecMode]string {
m := make(map[sessiondatapb.VectorizeExecMode]string, len(sessiondatapb.VectorizeExecMode_name))
for k := range sessiondatapb.VectorizeExecMode_name {
-// Note that VectorizeExecMode.String() remaps "unset" to "on".
+// Note that for historical reasons, VectorizeExecMode.String() remaps
+// both "unset" and "201auto" to "on", so we end up with a map like:
+// 0: on, 1: on, 2: on, 3: experimental_always, 4: off. This means that
+// after SET CLUSTER SETTING sql.defaults.vectorize = 'on'; we could have
+// 0, 1, or 2 in system.settings and must handle all three cases as 'on'.
m[sessiondatapb.VectorizeExecMode(k)] = sessiondatapb.VectorizeExecMode(k).String()
}
return m
54 changes: 54 additions & 0 deletions pkg/sql/logictest/testdata/logic_test/system
@@ -1428,3 +1428,57 @@ ALTER TABLE system.public.tenant_usage CONFIGURE ZONE USING
num_replicas = 5,
constraints = '[]',
lease_preferences = '[]'
+
+# Regression test for 133278, that clusters with sql.defaults.vectorize set to
+# old value '1' can still start new connections.
+
+statement ok
+UPSERT INTO system.settings (name, value, "valueType") VALUES ('sql.defaults.vectorize', '1', 'e')
+
+query TT
+SELECT name, value FROM system.settings WHERE name = 'sql.defaults.vectorize'
+----
+sql.defaults.vectorize 1
+
+query T
+SHOW CLUSTER SETTING sql.defaults.vectorize
+----
+on
+
+statement ok
+SET vectorize = DEFAULT
+
+query T
+SHOW vectorize
+----
+on
+
+# Make sure we can open a new connection.
+user testuser newsession
+
+user root
+
+statement ok
+RESET CLUSTER SETTING sql.defaults.vectorize
+
+statement ok
+RESET vectorize
+
+query TT
+SELECT name, value FROM system.settings WHERE name = 'sql.defaults.vectorize'
+----
+
+query T
+SHOW CLUSTER SETTING sql.defaults.vectorize
+----
+on
+
+query T
+SHOW vectorize
+----
+on
+
+user testuser
+
+statement ok
+RESET vectorize
5 changes: 4 additions & 1 deletion pkg/sql/sessiondatapb/session_data.go
@@ -50,7 +50,7 @@ func (c DataConversionConfig) GetFloatPrec(typ *types.T) int {
}

func (m VectorizeExecMode) String() string {
-if m == VectorizeUnset {
+if m == VectorizeUnset || m == DeprecatedVectorize201Auto {
m = VectorizeOn
}
name, ok := VectorizeExecMode_name[int32(m)]
@@ -72,6 +72,9 @@ func VectorizeExecModeFromString(val string) (VectorizeExecMode, bool) {
if m == VectorizeUnset {
return 0, false
}
+if m == DeprecatedVectorize201Auto {
+m = VectorizeOn
+}
return m, true
}

5 changes: 4 additions & 1 deletion pkg/sql/sessiondatapb/session_data.proto
@@ -171,7 +171,10 @@ enum VectorizeExecMode {
// the first enum value as zero is required by proto3. This is mapped to
// VectorizeOn.
unset = 0 [(gogoproto.enumvalue_customname) = "VectorizeUnset"];
-reserved 1;
+// DeprecatedVectorized201Auto is only possible for clusters that have been
+// upgraded from v21.1 or older and had the old setting of "201auto" which
+// was mapped to "on" in v21.1. It is now an alias for "on".
+deprecated201auto = 1 [(gogoproto.enumvalue_customname) = "DeprecatedVectorize201Auto"];
// VectorizeOn means that any supported queries will be run using the
// columnar execution.
on = 2 [(gogoproto.enumvalue_customname) = "VectorizeOn"];