-
Notifications
You must be signed in to change notification settings - Fork 4.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds support for snapshots and restores. #2396
Conversation
Not ready for review yet - still adding tests and documentation. Putting up here to checkpoint working end-to-end and to get any early feedback. |
c0543e9
to
8c85b52
Compare
89624b6
to
6991d25
Compare
bc95660
to
06e926e
Compare
I realized I had two servers in bootstrap mode so this wasn't a good setup.
4669469
to
7d7e596
Compare
This is independent of Consul so we can pull this out into a library later if we want to.
Snapshots require the Raft snapshots to be readable, which isn't supported in dev mode. Send a clear error instead of a deep-down Raft one.
This gives folks a way to extract data even if the cluster has no leader.
// as part of these requests. | ||
func (s *HTTPServer) Snapshot(resp http.ResponseWriter, req *http.Request) (interface{}, error) { | ||
var args structs.SnapshotRequest | ||
s.parseDC(req, &args.Datacenter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not use the normal parse
that handles DC, Token, and Stale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add the full set of QueryOptions
in here since not all of them are supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks super good to me! Left some minor comments.
setMeta(resp, &reply.QueryMeta) | ||
return nil | ||
} | ||
if err := s.agent.SnapshotRPC(&args, req.Body, resp, replyFn); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to provide req.Body
for the SnapshotSave?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good call - it's not worth plumbing it through since it will be ignored.
|
||
case "PUT": | ||
args.Op = structs.SnapshotRestore | ||
if err := s.agent.SnapshotRPC(&args, req.Body, resp, nil); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also respond to the response body on restore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't - it's all in the return code and HTTP headers. I plumbed this in case we add something later that's returned.
|
||
func (c *SnapshotCommand) Help() string { | ||
helpText := ` | ||
Usage: consul snapshot <subcommand> [options] [args] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have a consul snapshot inspect
to view metadata about a snapshot, like the indexes, checksums, etc. This way if you have multiple snapshots you can help disambiguate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is kind of a monster PR and this would require a little work to open the archive from the CLI, I created #2438 to track this. Totally agree - this is also where we might do things like let you snag a KV entry out of a snapshot for an emergency.
} | ||
|
||
// Stream the snapshot. | ||
if _, err := io.Copy(out, snap); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we guard against a nil out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good call since this is a client function designed for general use.
} | ||
|
||
// Stream the snapshot. | ||
if _, err := io.Copy(out, snap); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, wonder if this could potentially have a nil out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
// Make a scratch file to receive the contents so that we don't buffer | ||
// everything in memory. This gets deleted in Close() since we keep it | ||
// around for re-reading. | ||
archive, err := ioutil.TempFile("", "snapshot") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should nest this inside the data dir
? We already have a tmp
we use for various things in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about that, though this really is temp stuff - like if something crashed / box rebooted kind of thing you wouldn't want this to persist. Also thought like if you were in a container or using some network volume you wouldn't want this to go there where you might have mapped a data dir.
}() | ||
|
||
// Wrap the file writer in a gzip compressor. | ||
compressor := gzip.NewWriter(archive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, this is done at a higher level. Any reason to not push this down to the tar generation level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had it there originally but changed it so that all the archive code would be independent of the compression. Seemed like a good layering since it would be kind of a trivial detail in the archive code to just wrap everything so we can push that up and possibly change it later w/o touching the tar part at all.
|
||
// Make a scratch file to receive the contents of the snapshot data so | ||
// we can avoid buffering in memory. | ||
snap, err := ioutil.TempFile("", "snapshot") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, maybe nest in data dir
// Snapshots don't work in dev mode because we need Raft's snapshots to | ||
// be readable. Better to present a clear error than one from deep down | ||
// in the Raft snapshot store. | ||
if s.config.DevMode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if this was possible, as it lets you do some interesting things with bootstrapping vault on a dev cluster, then doing a save/restore into a PKI-enabled cluster. Anyways, room for future enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - this is kind of lame and unintuitive. I captured fixing this here - #2439.
|
||
// keep will disarm the defer on success if we are returning the caller | ||
// our connection to stream the output. | ||
var keep bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the connection switches into this snapshot mode, is there a point of keeping it? Why not just always close given the infrequency of the operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah it's passed up to the caller to copy the data out of it and then close it (io.ReadCloser is returned) so we shouldn't close it here. I realized we didn't close it up above so I'll clean that up.
} | ||
|
||
// Feed the snapshot into Raft. | ||
future := r.Restore(&metadata, snap, 10*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 minutes seems like a really long enqueue timeout...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was trying to be conservative :-) Agree this is probably too much though - I'll lower it to like 60 seconds (there's disk i/o and stuff in here, so for huge restores it could take a little while).
case <-r.shutdownCh: | ||
return errorFuture{ErrRaftShutdown} | ||
case r.userRestoreCh <- restore: | ||
// If the restore is ingested then wait for it to complete. | ||
if err := restore.Error(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd to return a future since this call always blocks. If we always block, may as well return an err
, otherwise we should not block on the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense - this is kind of a weird mix - will change to return an error.
if args.AllowStale { | ||
return nil, fmt.Errorf("stale not allowed for restore") | ||
} | ||
if err := snapshot.Restore(s.logger, in, s.raft); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to call the leader hooks here so that we can reset tombstone GC and session timers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Late to the party here but this looks great! 🎉 🎉 🎉 |
This adds a new API for taking full snapshots of Consul's state, and for restoring it. The intent of this is to make it easier to back up consul and recover in the event of a disaster.