snapshot save error: hash check failed for "meta.json" #3933

Closed
tkald opened this issue Mar 4, 2018 · 11 comments

tkald commented Mar 4, 2018

Description of the Issue (and unexpected/desired result)

When running consul snapshot save consul.bak from the command line on a server node, the following error is returned:
Error verifying snapshot file: failed to read snapshot file: failed checking integrity of snapshot: hash check failed for "meta.json"
The consul.bak file itself is still created.

Reproduction steps

Run consul snapshot save consul.bak on a server node.

consul version for both Client and Server

Client: [client version here]
Server: Consul v1.0.6 Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

consul info for both Client and Server

Client:

[Client `consul info` here]

Server:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 9a494b5f
        version = 1.0.6
consul:
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 192.168.40.163:8300
        server = true
raft:
        applied_index = 107869621
        commit_index = 107869621
        fsm_pending = 0
        last_contact = 88.753954ms
        last_log_index = 107869621
        last_log_term = 228
        last_snapshot_index = 107869621
        last_snapshot_term = 228
        latest_configuration = [{Suffrage:Voter ID:701ab7c8-734e-e13c-3e18-c5b7efafa59b Address:192.168.40.161:8300} {Suffrage:Voter ID:6d220527-4180-0114-2c6c-0bb4e7fefb23 Address:192.168.40.162:8300} {Suffrage:Voter ID:cf6607b9-b9c1-7f90-7164-c7060a0eb0ed Address:192.168.40.163:8300}]
        latest_configuration_index = 1
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 228
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 260
        max_procs = 2
        os = linux
        version = go1.9.3
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 150
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 3529
        members = 15
        query_queue = 0
        query_time = 81
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 186
        members = 3
        query_queue = 0
        query_time = 1

Operating system and Environment details

Ubuntu 16.04

Log Fragments or Link to gist

Mar  4 14:02:44 consul1 consul[11411]:     2018/03/04 14:02:44 [DEBUG] http: Request GET /v1/snapshot (150.647991ms) from=127.0.0.1:53493
Mar  4 14:02:44 consul1 consul[11411]: http: Request GET /v1/snapshot (150.647991ms) from=127.0.0.1:53493

TIP: Use -log-level=TRACE on the client and server to capture the maximum log detail.

@preetapan
Contributor

@tkald I am on Ubuntu 16.04 as well and unable to reproduce this. Is it possible you have disk corruption? https://help.ubuntu.com/community/FilesystemTroubleshooting

Please reopen if you continue to see this after ruling out disk problems.

tkald commented Mar 6, 2018

I tried to run this backup task on 3 different consul server nodes.
It is highly unlikely that all 3 nodes (running on different hardware) developed disk corruption at the same time.

tkald commented Mar 6, 2018

Also, Consul's internal/automatic snapshots are created just fine:

Mar  6 19:40:13 consul2 consul[6737]:     2018/03/06 19:40:13 [INFO] consul.fsm: snapshot created in 158.767µs
Mar  6 19:40:13 consul2 consul[6737]:     2018/03/06 19:40:13 [INFO] raft: Starting snapshot up to 113372088
Mar  6 19:40:13 consul2 consul[6737]:     2018/03/06 19:40:13 [INFO] snapshot: Creating new snapshot at /opt/consul/data/raft/snapshots/228-113372088-1520358013464.tmp
Mar  6 19:40:13 consul2 consul[6737]: consul.fsm: snapshot created in 158.767µs
Mar  6 19:40:13 consul2 consul[6737]: raft: Starting snapshot up to 113372088
Mar  6 19:40:13 consul2 consul[6737]: snapshot: Creating new snapshot at /opt/consul/data/raft/snapshots/228-113372088-1520358013464.tmp
Mar  6 19:40:13 consul2 consul[6737]:     2018/03/06 19:40:13 [INFO] snapshot: reaping snapshot /opt/consul/data/raft/snapshots/228-113355693-1520357410548
Mar  6 19:40:13 consul2 consul[6737]: snapshot: reaping snapshot /opt/consul/data/raft/snapshots/228-113355693-1520357410548
Mar  6 19:40:13 consul2 consul[6737]:     2018/03/06 19:40:13 [INFO] raft: Compacting logs from 113353652 to 113361848
Mar  6 19:40:13 consul2 consul[6737]: raft: Compacting logs from 113353652 to 113361848
Mar  6 19:40:13 consul2 consul[6737]:     2018/03/06 19:40:13 [INFO] raft: Snapshot to 113372088 complete
Mar  6 19:40:13 consul2 consul[6737]: raft: Snapshot to 113372088 complete

@rikwasmus

Can reproduce it with Consul 1.0.7 (the Debian SID package).

consul --version
Consul 1.0.7-dev
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Error verifying snapshot file: failed to read snapshot file: failed checking integrity of snapshot: hash check failed for "meta.json"

Consistently, on all nodes, all on different hardware.

yesteph commented Aug 3, 2018

Same issue with consul 1.1.0.

[root@i-03f7f697d05a6944b consul]# consul snapshot save -stale -http-addr=${leader}:8500 test
Error verifying snapshot file: failed to read snapshot file: failed checking integrity of snapshot: hash check failed for "meta.json"

[root@i-03f7f697d05a6944b consul]# tar -xvf test
meta.json
state.bin
SHA256SUMS

[root@i-03f7f697d05a6944b consul]# cat SHA256SUMS
3c01a7a67f617dd1914e8baf219ba3f68cd1accb1b6a945b7eae65c8838c7344 meta.json
3795782d3bfeaf633289c3b57132f1d1eb24336f343c846340632e1758aea2fa state.bin

[root@i-03f7f697d05a6944b consul]# sha256sum meta.json
3c01a7a67f617dd1914e8baf219ba3f68cd1accb1b6a945b7eae65c8838c7344 meta.json

[root@i-03f7f697d05a6944b consul]# consul version
Consul v1.1.0
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

yesteph commented Aug 3, 2018

The cluster was in a strange state, as one of the 4 servers was leaving (per "http://${LOCAL_IP}:8500/v1/operator/autopilot/health").

After resolving that, there were no more problems with snapshots.
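
A quick way to spot that state (a rough sketch; ${LOCAL_IP} is just a placeholder for one of your server addresses) is to query the autopilot health endpoint and check that every server reports Healthy: true:

curl -s http://${LOCAL_IP}:8500/v1/operator/autopilot/health
# each entry in the "Servers" list should show "Healthy": true; a server that is
# leaving or unreachable shows false and will also stand out in `consul members`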

@rikwasmus

(Note that I can still reproduce the issue without any such 'strange states' :) )

@robincher

Hi,

I encountered this issue on RHEL too. Is there any workaround for it? :)

maf23 commented Nov 27, 2018

I have a cluster running 1.3.0 which constantly exhibits this problem. I am no Go expert, but I did some digging. If I unpack the snapshot file with gtar, the extracted meta.json ends with a newline. However, when Consul calculates the checksum while inspecting the file internally, the data returned by tar.NewReader does not appear to include that newline.

The checksum in SHA256SUMS is correct for the file with the newline.

Extracting and then repacking the snapshot files with gtar does not fix the problem.
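
One way to see the mismatch from outside Consul (a rough sketch, assuming GNU tar/coreutils and that consul.bak is a snapshot produced by consul snapshot save):

tar -xf consul.bak                  # extracts meta.json, state.bin, SHA256SUMS
grep meta.json SHA256SUMS           # checksum recorded at save time
sha256sum meta.json                 # matches: the extracted file ends with a newline
head -c -1 meta.json | sha256sum    # hash of the content minus the final byte differs

If the hypothesis above is right, that last value is what the internal integrity check ends up comparing against SHA256SUMS, which would explain the failure.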

pearkes commented Nov 27, 2018

@maf23 Can you report that over on #4452? I believe this may be the same issue. See the note there about a possible root cause.

Thanks for the info.

maf23 commented Nov 27, 2018

I continued experimenting, and I can repair a broken snapshot by doing this (see the command sketch after the list):

  1. Unpack the snapshot with gtar
  2. Edit the extracted meta.json file and add a blank character (keep the json valid)
  3. Calculate the sha256 checksum for the modified meta.json and update SHA256SUMS
  4. Pack the files into a snapshot with gtar

This repacked archive now passes the inspect test.

The original meta.json (in the broken file) was exactly 513 bytes long.
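
For reference, roughly the same repair as shell commands (a sketch only; it assumes consul.bak is the broken snapshot, GNU tar/coreutils are available, and that the saved snapshot is a gzip-compressed tar, so drop -z if yours turns out to be a plain tar):

mkdir repack && cd repack
tar -xf ../consul.bak                               # unpack meta.json, state.bin, SHA256SUMS
printf ' ' >> meta.json                             # append a blank character; the JSON stays valid
sha256sum meta.json state.bin > SHA256SUMS          # recompute checksums for the modified file
tar -czf ../consul-fixed.bak meta.json state.bin SHA256SUMS
consul snapshot inspect ../consul-fixed.bak         # should now pass the integrity check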
