Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-24.3: roachtest: add failover variants with leader leases #133369

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Oct 24, 2024

Backport 1/1 commits from #131624 on behalf of @nvanbenschoten.

/cc @cockroachdb/release


Part of #132762.

Leader leases have different availability properties than epoch leases under most failure modes. This patch adds failover test variants that use leader leases where possible.

Initial test results:

test lease=epoch (ms) lease=expiration (ms) lease=leader (ms) parity with expiration
failover/chaos/read-only 60,129 18,253 60,129
failover/chaos/read-write 60,129 20,401 60,129 ❌❌❌
failover/liveness/blackhole 9,663 369 335
failover/liveness/blackhole-recv 11,274 402 369
failover/liveness/blackhole-send 9,663 385 469
failover/liveness/crash 8,053 352 318
failover/liveness/deadlock 24,696 385 369
failover/liveness/disk-stall 26,843 369 419
failover/liveness/pause 10,200 385 436
failover/non-system/blackhole 7,247 7,516 15,032 ❌❌
failover/non-system/blackhole-recv 12,348 10,737 18,253 ❌❌
failover/non-system/blackhole-send 6,979 6,979 8,053
failover/non-system/crash 7,247 6,979 9,126
failover/non-system/deadlock 60,129 60,129 60,129
failover/non-system/disk-stall 22,548 22,548 25,769
failover/non-system/pause 7,247 7,247 9,126
failover/partial/lease-gateway 8,589 19,327 1 60,129 ❌❌❌
failover/partial/lease-leader 60,129 22,549 2 31,139 2 ❌❌
failover/partial/lease-liveness 8,589 301 318
failover/system-non-liveness/blackhole 369 402 352
failover/system-non-liveness/blackhole-recv 335 285 318
failover/system-non-liveness/blackhole-send 402 419 335
failover/system-non-liveness/crash 419 301 453
failover/system-non-liveness/deadlock 369 352 402
failover/system-non-liveness/disk-stall 402 318 453
failover/system-non-liveness/pause 369 385 335

note: because of the way the test measures pMax, anything under 1,000ms is essentially "no impact"

Key (comparing leader vs. expiration):
✔ = parity
❌ = minor regression
❌❌ = major regression
❌❌❌ = unavailability

Epic: none
Release note: None


Release justification: testing only

Footnotes

  1. I don't understand why expiration-based lease perform worse than epoch-based leases on this test.

  2. With roachtest: enable DistSender circuit breakers in failover/partial/lease-leader #133214. 2

Part of #132762.

Leader leases have different availability properties than epoch leases under
most failure modes. This patch adds failover test variants that use leader
leases where possible.

Initial test results:

| test                                         | lease=epoch (ms) | lease=expiration (ms) | lease=leader (ms) | parity with expiration |
|:---------------------------------------------|-----------------:|----------------------:|------------------:|:----------------------:|
| failover/chaos/read-only                     | 60,129           | 18,253                | 60,129            | ✔                      |
| failover/chaos/read-write                    | 60,129           | 20,401                | 60,129            | XXX                    |
| failover/liveness/blackhole                  | 9,663            | 369                   | 335               | ✔                      |
| failover/liveness/blackhole-recv             | 11,274           | 402                   | 369               | ✔                      |
| failover/liveness/blackhole-send             | 9,663            | 385                   | 469               | ✔                      |
| failover/liveness/crash                      | 8,053            | 352                   | 318               | ✔                      |
| failover/liveness/deadlock                   | 24,696           | 385                   | 369               | ✔                      |
| failover/liveness/disk-stall                 | 26,843           | 369                   | 419               | ✔                      |
| failover/liveness/pause                      | 10,200           | 385                   | 436               | ✔                      |
| failover/non-system/blackhole                | 7,247            | 7,516                 | 15,032            | XX                     |
| failover/non-system/blackhole-recv           | 12,348           | 10,737                | 18,253            | XX                     |
| failover/non-system/blackhole-send           | 6,979            | 6,979                 | 8,053             | X                      |
| failover/non-system/crash                    | 7,247            | 6,979                 | 9,126             | X                      |
| failover/non-system/deadlock                 | 60,129           | 60,129                | 60,129            | ✔                      |
| failover/non-system/disk-stall               | 22,548           | 22,548                | 25,769            | X                      |
| failover/non-system/pause                    | 7,247            | 7,247                 | 9,126             | X                      |
| failover/partial/lease-gateway               | 8,589            | 19,327 [^1]           | 60,129            | XXX                    |
| failover/partial/lease-leader                | 60,129           | 22,549 [^2]           | 31,139 [^2]       | XX                     |
| failover/partial/lease-liveness              | 8,589            | 301                   | 318               | ✔                      |
| failover/system-non-liveness/blackhole       | 369              | 402                   | 352               | ✔                      |
| failover/system-non-liveness/blackhole-recv  | 335              | 285                   | 318               | ✔                      |
| failover/system-non-liveness/blackhole-send  | 402              | 419                   | 335               | ✔                      |
| failover/system-non-liveness/crash           | 419              | 301                   | 453               | ✔                      |
| failover/system-non-liveness/deadlock        | 369              | 352                   | 402               | ✔                      |
| failover/system-non-liveness/disk-stall      | 402              | 318                   | 453               | ✔                      |
| failover/system-non-liveness/pause           | 369              | 385                   | 335               | ✔                      |

_note: because of the way the test measures pMax, anything under 1,000ms is essentially "no impact"_

**Key _(comparing leader vs. expiration)_**:
✔ = parity
X = minor regression
XX = major regression
XXX = unavailability

[^1]: I don't understand why expiration-based lease perform worse than epoch-based leases on this test.
[^2]: With #133214.

Epic: none
Release note: None
@blathers-crl blathers-crl bot requested a review from a team as a code owner October 24, 2024 17:01
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Oct 24, 2024
@blathers-crl blathers-crl bot requested review from nameisbhaskar and vidit-bhat and removed request for a team October 24, 2024 17:01
Copy link
Author

blathers-crl bot commented Oct 24, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Oct 24, 2024
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten merged commit 6247e46 into release-24.3 Oct 29, 2024
20 of 21 checks passed
@nvanbenschoten nvanbenschoten deleted the blathers/backport-release-24.3-131624 branch October 29, 2024 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants