Restoring snapshots suddenly stopped working #4738

Closed
far-blue opened this issue Oct 2, 2018 · 3 comments
Labels: waiting-reply (Waiting on response from Original Poster or another individual in the thread)

Comments

far-blue commented Oct 2, 2018

Overview of the Issue

We've been testing a Consul / Vault setup for use in the company and have set up hourly snapshots for backup. While testing these snapshots, we found that somewhere between the 9am and 10am snapshots on 22 September, snapshots from all three nodes in the cluster suddenly started failing to restore.

The particular error is:
Error restoring snapshot: Unexpected response code: 500 (failed to read snapshot file: failed checking integrity of snapshot: hash check failed for "meta.json")

Looking at the source code, this suggests that the snapshot file was correctly inflated and unpacked and that the files exist, but that the calculated SHA-256 of the meta.json file doesn't match the hash in the SHA256SUMS file.

However, if we manually unpack the snapshot and run the sha256sum CLI tool, the checks pass, and a hash freshly generated with the tool matches the entry in the SHA256SUMS file.
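
The manual check described above can be scripted. The following is a minimal sketch, not Consul's own verification code. It assumes the snapshot is a gzip-compressed tar archive and that SHA256SUMS uses the `sha256sum` output format (`<hex digest>  <name>`); the function name `verify_snapshot` is hypothetical.

```python
import hashlib
import io
import tarfile


def verify_snapshot(snapshot_bytes: bytes) -> dict:
    """Return {member name: True/False} for every entry listed in SHA256SUMS.

    Assumption: the snapshot is a gzip-compressed tar archive containing a
    SHA256SUMS file in sha256sum format ("<hex digest>  <name>").
    """
    with tarfile.open(fileobj=io.BytesIO(snapshot_bytes), mode="r:gz") as tar:
        # Read every regular file in the archive into memory.
        files = {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }

    results = {}
    for line in files["SHA256SUMS"].decode().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # sha256sum prefixes binary-mode entries with '*'.
        name = name.strip().lstrip("*")
        actual = hashlib.sha256(files.get(name, b"")).hexdigest()
        # A listed file that is missing from the archive counts as a failure.
        results[name] = name in files and actual == expected.lower()
    return results
```

For a snapshot on disk this could be called as `verify_snapshot(open("backup.snap", "rb").read())`, where `backup.snap` is a placeholder file name. If every entry comes back `True` for a snapshot that Consul refuses to restore, the mismatch is more likely in how the restore path reads the archive than in the archive contents.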

Reproduction Steps

Restore the snapshot as usual and it will fail, whereas the snapshot taken an hour earlier restores successfully.
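
To pin down which hourly snapshot is the first to fail, the restore attempts can be run in order. This is only a sketch under stated assumptions: `restore_ok` and `first_failing` are hypothetical helper names, snapshot file names are assumed to sort chronologically, and `consul snapshot restore` is assumed to exit non-zero on failure. Restoring replaces the cluster's state, so this should only be pointed at a test cluster.

```python
import subprocess
from typing import Callable, Iterable, Optional


def restore_ok(path: str) -> bool:
    """Try to restore one snapshot with the Consul CLI.

    Assumption: the command exits non-zero when the restore fails.
    WARNING: a successful restore overwrites the cluster's current state.
    """
    result = subprocess.run(
        ["consul", "snapshot", "restore", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def first_failing(
    paths: Iterable[str],
    check: Callable[[str], bool] = restore_ok,
) -> Optional[str]:
    """Return the first snapshot (in sorted order) for which check() fails.

    Assumption: sorting the file names yields chronological order, for
    example because each name embeds a timestamp. Returns None if every
    snapshot passes.
    """
    for path in sorted(paths):
        if not check(path):
            return path
    return None
```

The `check` parameter is injectable so that a non-destructive check can be substituted for the real restore when a test cluster is not available.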

Consul info for both Client and Server

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = 
    revision = 48d287ef
    version = 1.2.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 192.168.x.x:8300
    server = true
raft:
    applied_index = 100612
    commit_index = 100612
    fsm_pending = 0
    last_contact = 25.211601ms
    last_log_index = 100612
    last_log_term = 2
    last_snapshot_index = 100427
    last_snapshot_term = 2
    latest_configuration = [{Suffrage:Voter ID:21e2f522-ca4b-7683-243b-0c967c3b654d Address:192.168.x.x:8300} {Suffrage:Voter ID:fd4fbfa5-b1e0-0bf6-a4fa-46fb60f06185 Address:192.168.x.y:8300} {Suffrage:Voter ID:f516f0e5-9b90-5419-391e-75c9adf94a55 Address:192.168.x.z:8300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 2
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 78
    max_procs = 4
    os = linux
    version = go1.10.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 3
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6
    members = 3
    query_queue = 0
    query_time = 1

Operating system and Environment details

CentOS Linux release 7.5.1804 (Core)

far-blue (Author) commented Oct 2, 2018

To be clear, older snapshots still restore without any issues.

pearkes added the "needs-investigation" label ("The issue described is detailed and complex.") on Oct 8, 2018

pearkes (Contributor) commented Oct 26, 2018

Thanks for reporting this.

Do newer snapshots restore successfully as well? Is it possible that the snapshot process, the filesystem, or something else was interrupted or modified in some way for that specific snapshot?

pearkes added the "waiting-reply" label ("Waiting on response from Original Poster or another individual in the thread") and removed the "needs-investigation" label on Oct 26, 2018

pearkes (Contributor) commented Oct 26, 2018

I think this is a duplicate of #4452. Please report any further information there, thanks!
