Backport of [VAULT-17827] Rollback manager worker pool into release/1.14.x #23691

hc-github-team-secure-vault-core
Collaborator

Backport

This PR is auto-generated from #22567 to be assessed for backporting due to the inclusion of the label backport/1.14.x.

🚨 Warning: automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, but not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@miagilepner
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/vault/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.


(Description is still a WIP)

This PR adds a worker pool to the rollback manager with a default size of 256. The size of the worker pool can be adjusted with the environment variable VAULT_ROLLBACK_WORKERS.
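
As a rough illustration of the idea (not the code in this PR), a fixed-size pool whose size is read from VAULT_ROLLBACK_WORKERS could look like the sketch below. The names `workerCount` and `runPool` are hypothetical, and falling back to 256 on an empty, non-numeric, or non-positive value is an assumption of the sketch rather than confirmed behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

// defaultRollbackWorkers mirrors the default pool size from the PR description.
const defaultRollbackWorkers = 256

// workerCount parses the raw value of VAULT_ROLLBACK_WORKERS.
// Assumption: invalid or non-positive values fall back to the default.
func workerCount(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultRollbackWorkers
	}
	return n
}

// runPool (hypothetical) runs every job on a fixed number of worker
// goroutines and returns once all jobs have finished.
func runPool(workers int, jobs []func()) {
	queue := make(chan func())
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				job()
			}
		}()
	}
	for _, job := range jobs {
		queue <- job
	}
	close(queue)
	wg.Wait()
}

func main() {
	workers := workerCount(os.Getenv("VAULT_ROLLBACK_WORKERS"))

	var mu sync.Mutex
	completed := 0
	jobs := make([]func(), 9000) // one stand-in rollback per mount
	for i := range jobs {
		jobs[i] = func() {
			mu.Lock()
			completed++
			mu.Unlock()
		}
	}

	runPool(workers, jobs)
	fmt.Printf("workers=%d completed=%d\n", workers, completed)
}
```

With this shape, the number of goroutines doing rollback work stays at the pool size no matter how many mounts exist, which is the property the scheduler latency comparison is about.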

Considerations:

  • The worker pool removes the goroutine scheduling pressure:
    Scheduler latency profile with unlimited workers, with 9000 mounts: [image]
    Scheduler latency profile with 256 workers, with 9000 mounts: [image]
  • The worker pool queue is limited by the number of mounts, because the rollback manager ensures that there's never more than 1 operation submitted to the worker pool per mount.
  • If backends take longer than 60 seconds to complete their rollback operations, the workers can't keep up. The queue remains stable in size, but rollbacks are triggered less often. Rollback operations have a request context timeout of 90 seconds, so if all of the mounts are timing out, rollbacks could end up triggering every (# mounts / # workers) * 90 seconds, rather than every 60 seconds.
  • Rollback operations can cause backends to do two things: trigger their PeriodicFunc and call WALRollback with a collection of WAL entries. To be clear, these WAL entries are not the same WAL that Vault uses for replication. This is a separate, namespace/mount-scoped storage location, and the path is only written to by plugins via framework.PutWAL. By default, the WAL entries that get passed to the WALRollback function are any entries older than 10 minutes.
  • unmount and remount operations trigger a rollback through the rollback manager, then wait for the rollback to complete before continuing. Because we're now using a worker pool, it's possible that unmount and remount operations will take longer to complete. Note that unmount and remount can be called by replication invalidation operations.
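
To put numbers on the timeout case above, here is a small sketch of the (# mounts / # workers) * 90 seconds estimate. `worstCaseInterval` is a hypothetical helper, not something in this PR, and clamping the result to the 60-second period when the workers can keep up is an assumption of the sketch. For 9000 mounts and 256 workers the formula gives about 3164 seconds, or roughly 53 minutes between rollbacks for a given mount.

```go
package main

import (
	"fmt"
	"time"
)

const (
	rollbackPeriod  = 60 * time.Second // normal rollback interval from the description
	rollbackTimeout = 90 * time.Second // request context timeout from the description
)

// worstCaseInterval estimates how often a single mount gets a rollback when
// every rollback runs until it times out: (mounts / workers) * timeout.
// Assumption: the interval never drops below the normal period.
func worstCaseInterval(mounts, workers int, timeout, period time.Duration) time.Duration {
	if workers < 1 {
		workers = 1
	}
	estimate := time.Duration(float64(mounts) / float64(workers) * float64(timeout))
	if estimate < period {
		return period
	}
	return estimate
}

func main() {
	for _, workers := range []int{64, 256, 1024} {
		interval := worstCaseInterval(9000, workers, rollbackTimeout, rollbackPeriod)
		fmt.Printf("mounts=9000 workers=%d interval=%s\n", workers, interval)
	}
}
```

This is only a back-of-the-envelope model. It ignores mounts that finish quickly, so real intervals should be shorter unless every backend is timing out.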

Overview of commits

@hc-github-team-secure-vault-core force-pushed the backport/miagilepner/rollback-manager-worker-pool/luckily-major-dragon branch 2 times, most recently from 77f8d45 to 5c81b94, on October 17, 2023 08:40
@hashicorp-cla

hashicorp-cla commented Oct 17, 2023

CLA assistant check
All committers have signed the CLA.

@github-actions bot added the hashicorp-contributed-pr label ("If the PR is HashiCorp (i.e. not-community) contributed") on Oct 17, 2023
* workerpool implementation

* rollback tests

* website documentation

* add changelog

* fix failing test
* fix flaky rollback test

* better fix

* switch to defer

* add comment
@miagilepner force-pushed the backport/miagilepner/rollback-manager-worker-pool/luckily-major-dragon branch from e210ff6 to 0dcb93d on October 17, 2023 08:54
@miagilepner marked this pull request as ready for review on October 17, 2023 08:55
@miagilepner requested a review from a team as a code owner on October 17, 2023 08:55
@miagilepner added this to the 1.14.5 milestone on Oct 17, 2023
@github-actions

Build Results:
All builds succeeded! ✅

@github-actions

CI Results:
All Go tests succeeded! ✅

@miagilepner requested a review from banks on October 17, 2023 09:29
Member

@banks left a comment

LGTM!

@miagilepner merged commit 93efe66 into release/1.14.x on Oct 17, 2023
@miagilepner deleted the backport/miagilepner/rollback-manager-worker-pool/luckily-major-dragon branch on October 17, 2023 12:33