Vault is unable to restore a large snapshot #24245

benvanstaveren · 2023-11-23T12:45:10Z

Describe the bug
Vault is seemingly unable to restore a 29 GB snapshot

To Reproduce
Get yourself a nice big snapshot, attempt to restore it to a newly initialized cluster (using -force), and watch the errors.

Expected behavior
A snapshot to be restored

Environment:

Vault Server Version (retrieve with vault status): 1.15.2
Vault CLI Version (retrieve with vault version): 1.15.2
Server Operating System/Architecture: Linux AMD64

Vault server configuration file(s):

disable_mlock = true
storage "raft" {
    path    = "/opt/vault/data"
    node_id = "myshinynode"
}
service_registration "consul" {
    address      = "127.0.0.1:8500"
}

ui = "true"
pid_file = "/opt/vault/vault.pid"

listener "tcp" {
    address                             = "10.x.x.2:8200"
    tls_disable                         = "true"
    proxy_protocol_behavior             = "allow_authorized"
    proxy_protocol_authorized_addrs     = "10.x.x.29"
    x_forwarded_for_authorized_addrs    = "10.x.x.29"
    x_forwarded_for_reject_not_authorized = "false"
    x_forwarded_for_reject_not_present  = "false"
    max_request_duration                = "3600s"
    max_request_size                    = -1
}
listener "tcp" {
    address                             = "127.0.0.1:8200"
    tls_disable                         = "true"
    max_request_duration                = "3600s"
    max_request_size                    = -1
}
api_addr = "http://10.x.x.2:8200"
cluster_addr = "https://10.x.x.2:8201"

Additional context
I can't replicate the exact error message at the moment due to being in the middle of an attempt at recovery using some filthy methods, but:

attempt #1: "could not read request body"
then increased the vault client timeout by export VAULT_CLIENT_TIMEOUT=86400s
attempt #2: "could not read request body"
then set the max_request_duration and max_request_size in the vault listeners config
attempt #3..n: "broken pipe"

This tells me that the vault client is attempting to dump the entire 29 GB to the vault server in one sitting, and the vault server is obviously not liking this very much.

It's mildly annoying that the "official" backup and restore method isn't actually working to restore the backup I made...

benvanstaveren · 2023-11-23T23:07:24Z

To add: if you use curl instead of the vault client to restore a (now) 31 GB snapshot, you need a machine with more than 64 GB of memory, because otherwise the OOM killer will get you. I'm concerned this is a problem for Consul and Nomad snapshots as well, and it puts me ill at ease about restoring from disastrous outages.

benvanstaveren · 2023-11-25T02:30:48Z

Set up a fresh server with vault and 128 GB of RAM, plus a reasonably freshly taken snapshot, and this is the result of attempting to restore said snapshot with curl:

# curl -v --header 'X-Vault-Token: redacted.token' --request POST --data-binary @2023-11-23-22h13.snap http://127.0.0.1:8200/v1/sys/storage/raft/snapshot-force
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:8200...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8200 (#0)
> POST /v1/sys/storage/raft/snapshot-force HTTP/1.1
> Host: 127.0.0.1:8200
> User-Agent: curl/7.68.0
> Accept: */*
> X-Vault-Token: redacted.token
> Content-Length: 33029265974
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Cache-Control: no-store
< Content-Type: application/json
< Strict-Transport-Security: max-age=31536000; includeSubDomains
< Date: Sat, 25 Nov 2023 02:26:24 GMT
< Content-Length: 43
< Connection: close
<
{"errors":["failed to read request body"]}
* we are done reading and this is set to close, stop send
* Closing connection 0

The vault log (with level debug) shows only the following:

Nov 25 03:26:24 mynode vault[13768]: 2023-11-25T02:26:24.941Z [DEBUG] core: completed_request: start_time=2023-11-25T02:25:52Z duration=32749ms client_id="" client_address=127.0.0.1:38232 status_code=500 request_path=/v1/sys/storage/raft/snapshot-force request_method=POST

What do I do? I'm now running our supposed "vault cluster" on a single node, that I can back up to snapshots, but I apparently cannot restore said snapshots. I'd like a solution...

benvanstaveren · 2023-11-25T02:55:46Z

Welp. Solution found: increasing http_read_timeout on the listener did the trick. I do still feel the default timeout on this is too low for production use; I'm quite sure I'm not the only one with large snapshots to restore. Anyway, I'll close this, but it may be worth documenting this somewhere (i.e. large snapshots -> increase the HTTP read timeout).
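
For anyone landing here later, a minimal sketch of what that listener change might look like, based on the loopback listener from the config posted above. The "3600s" value is an assumption chosen to match max_request_duration; the value actually used was not stated. The listener documentation gives the default for http_read_timeout as 30s, which would be consistent with the roughly 32-second request duration in the debug log above.

```hcl
listener "tcp" {
    address                             = "127.0.0.1:8200"
    tls_disable                         = "true"
    max_request_duration                = "3600s"
    max_request_size                    = -1
    # Assumed value, not taken from the thread: allow enough time for the
    # server to read the whole snapshot upload before the HTTP read times out.
    http_read_timeout                   = "3600s"
}
```

The listener stanza is only read at startup, so Vault would presumably need a restart for the new timeout to apply, and the value should be sized to how long the upload actually takes on your hardware.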

benvanstaveren closed this as completed Nov 25, 2023
