Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for snapshots and restores. #2396

Merged
merged 42 commits into from
Oct 26, 2016
Merged

Adds support for snapshots and restores. #2396

merged 42 commits into from
Oct 26, 2016

Conversation

slackpad
Copy link
Contributor

@slackpad slackpad commented Oct 7, 2016

This adds a new API for taking full snapshots of Consul's state, and for restoring it. The intent of this is to make it easier to back up consul and recover in the event of a disaster.

@slackpad
Copy link
Contributor Author

slackpad commented Oct 7, 2016

Not ready for review yet - still adding tests and documentation. Putting up here to checkpoint working end-to-end and to get any early feedback.

@slackpad slackpad force-pushed the f-snapshot-restore branch 2 times, most recently from c0543e9 to 8c85b52 Compare October 7, 2016 15:58
This is independent of Consul so we can pull this out into a library
later if we want to.
Snapshots require the Raft snapshots to be readable, which isn't supported
in dev mode. Send a clear error instead of a deep-down Raft one.
// as part of these requests.
func (s *HTTPServer) Snapshot(resp http.ResponseWriter, req *http.Request) (interface{}, error) {
var args structs.SnapshotRequest
s.parseDC(req, &args.Datacenter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use the normal parse that handles DC, Token, and Stale?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add the full set of QueryOptions in here since not all of them are supported.

Copy link
Member

@armon armon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks super good to me! Left some minor comments.

setMeta(resp, &reply.QueryMeta)
return nil
}
if err := s.agent.SnapshotRPC(&args, req.Body, resp, replyFn); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to provide req.Body for the SnapshotSave?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good call - it's not worth plumbing it through since it will be ignored.


case "PUT":
args.Op = structs.SnapshotRestore
if err := s.agent.SnapshotRPC(&args, req.Body, resp, nil); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also respond to the response body on restore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't - it's all in the return code and HTTP headers. I plumbed this in case we add something later that's returned.


func (c *SnapshotCommand) Help() string {
helpText := `
Usage: consul snapshot <subcommand> [options] [args]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a consul snapshot inspect to view metadata about a snapshot, like the indexes, checksums, etc. This way if you have multiple snapshots you can help disambiguate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is kind of a monster PR and this would require a little work to open the archive from the CLI, I created #2438 to track this. Totally agree - this is also where we might do things like let you snag a KV entry out of a snapshot for an emergency.

}

// Stream the snapshot.
if _, err := io.Copy(out, snap); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we guard against a nil out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good call since this is a client function designed for general use.

}

// Stream the snapshot.
if _, err := io.Copy(out, snap); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, wonder if this could potentially have a nil out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

// Make a scratch file to receive the contents so that we don't buffer
// everything in memory. This gets deleted in Close() since we keep it
// around for re-reading.
archive, err := ioutil.TempFile("", "snapshot")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should nest this inside the data dir? We already have a tmp we use for various things in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that, though this really is temp stuff - like if something crashed / box rebooted kind of thing you wouldn't want this to persist. Also thought like if you were in a container or using some network volume you wouldn't want this to go there where you might have mapped a data dir.

}()

// Wrap the file writer in a gzip compressor.
compressor := gzip.NewWriter(archive)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, this is done at a higher level. Any reason to not push this down to the tar generation level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it there originally but changed it so that all the archive code would be independent of the compression. Seemed like a good layering since it would be kind of a trivial detail in the archive code to just wrap everything so we can push that up and possibly change it later w/o touching the tar part at all.


// Make a scratch file to receive the contents of the snapshot data so
// we can avoid buffering in memory.
snap, err := ioutil.TempFile("", "snapshot")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, maybe nest in data dir

// Snapshots don't work in dev mode because we need Raft's snapshots to
// be readable. Better to present a clear error than one from deep down
// in the Raft snapshot store.
if s.config.DevMode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice if this was possible, as it lets you do some interesting things with bootstrapping vault on a dev cluster, then doing a save/restore into a PKI-enabled cluster. Anyways, room for future enhancement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - this is kind of lame and unintuitive. I captured fixing this here - #2439.


// keep will disarm the defer on success if we are returning the caller
// our connection to stream the output.
var keep bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the connection switches into this snapshot mode, is there a point of keeping it? Why not just always close given the infrequency of the operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah it's passed up to the caller to copy the data out of it and then close it (io.ReadCloser is returned) so we shouldn't close it here. I realized we didn't close it up above so I'll clean that up.

}

// Feed the snapshot into Raft.
future := r.Restore(&metadata, snap, 10*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 minutes seems like a really long enqueue timeout...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was trying to be conservative :-) Agree this is probably too much though - I'll lower it to like 60 seconds (there's disk i/o and stuff in here, so for huge restores it could take a little while).

case <-r.shutdownCh:
return errorFuture{ErrRaftShutdown}
case r.userRestoreCh <- restore:
// If the restore is ingested then wait for it to complete.
if err := restore.Error(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems odd to return a future since this call always blocks. If we always block, may as well return an err, otherwise we should not block on the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense - this is kind of a weird mix - will change to return an error.

if args.AllowStale {
return nil, fmt.Errorf("stale not allowed for restore")
}
if err := snapshot.Restore(s.logger, in, s.raft); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to call the leader hooks here so that we can reset tombstone GC and session timers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@slackpad slackpad added this to the 0.7.1 milestone Oct 25, 2016
@slackpad slackpad added the type/enhancement Proposed improvement or new feature label Oct 25, 2016
@slackpad slackpad merged commit c01a387 into master Oct 26, 2016
@slackpad slackpad deleted the f-snapshot-restore branch October 26, 2016 02:20
@ryanuber
Copy link
Member

Late to the party here but this looks great! 🎉 🎉 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement Proposed improvement or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants