Sporadic error when saving snapshots: unexpected EOF #6893

Open
lomori opened this issue Dec 6, 2019 · 10 comments
Labels
theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/bug (Feature does not function as expected)

Comments

@lomori

lomori commented Dec 6, 2019


Overview of the Issue

I keep getting random errors when creating backups of our Consul cluster.

$ consul snapshot save /consul/data/20191206120000-1.snap
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF

Reproduction Steps

Just keep creating snapshots and you will see that it sometimes fails (a scripted loop is sketched after the transcript below).

$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068184
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068185
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068186
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068186
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068188
$ consul snapshot save /consul/data/20191206120000-1.snap
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068194
$ consul snapshot save /consul/data/20191206120000-1.snap
Saved and verified snapshot to index 1068195
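For scripted testing, a loop along these lines keeps saving snapshots and tallies the failures. This is only a sketch; the snapshot path, attempt count, and sleep interval are arbitrary placeholders, not values from the report.

#!/usr/bin/env bash
# Sketch: run `consul snapshot save` repeatedly and count how many attempts fail.
# Path and loop parameters are placeholders for illustration.
failures=0
for i in $(seq 1 100); do
    if ! consul snapshot save "/consul/data/repro-$i.snap"; then
        failures=$((failures + 1))
    fi
    sleep 1
done
echo "failed $failures out of 100 attempts"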

Consul info for both Client and Server

$ consul version
Consul v1.6.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Server info
agent:
        check_monitors = 0
        check_ttls = 3
        checks = 3
        services = 3
build:
        prerelease =
        revision = 9be6dfc3
        version = 1.6.1
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 10.129.8.88:8300
        server = true
raft:
        applied_index = 1068318
        commit_index = 1068318
        fsm_pending = 0
        last_contact = 10.52169ms
        last_log_index = 1068318
        last_log_term = 332
        last_snapshot_index = 1061622
        last_snapshot_term = 332
        latest_configuration = [{Suffrage:Voter ID:faea4c4e-2b87-83af-b831-9b3fe9b34c58 Address:10.131.12.242:8300} {Suffrage:Voter ID:a1b1991e-8f46-45b6-a09f-3c3cd1155bba Address:10.130.13.199:8300} {Suffrage:Voter ID:3261cc81-307f-c0be-3e7a-e224704bf2a3 Address:10.129.8.88:8300} {Suffrage:Voter ID:a12d5052-dc73-b1af-ed86-313e6c0d038c Address:10.130.6.140:8300} {Suffrage:Voter ID:14e82d1e-a6f2-eba8-d2ff-cc648a5daaac Address:10.129.12.38:8300}]
        latest_configuration_index = 1047983
        num_peers = 4
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 332
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 92
        max_procs = 4
        os = linux
        version = go1.12.1
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 19
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 137
        members = 5
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 193
        members = 5
        query_queue = 0
        query_time = 1


@benumbed

benumbed commented Dec 7, 2019

This happens to me on 1.6.2 on arm64. It was also happening on a different cluster on amd64 before I upgraded that cluster to 1.6.2; now all is quiet on that front.

Errors from the logs of the member which requested the snapshot:

2019/12/07 00:00:02 [ERR] http: Request GET /v1/snapshot, error: failed to stream snapshot: read tcp 172.xx.xx.xx:57649->172.xx.xx.xx:8300: read: connection reset by peer from=127.0.0.1:42446

The cluster leader's logs look normal:

2019/12/07 00:00:02 [INFO] consul.fsm: snapshot created in 182.885µs
2019/12/07 00:00:02 [INFO]  raft: Starting snapshot up to 164365
2019/12/07 00:00:02 [INFO] snapshot: Creating new snapshot at /var/lib/consul/data/raft/snapshots/940-164365-1575705602005.tmp
2019/12/07 00:00:02 [INFO] snapshot: reaping snapshot /var/lib/consul/data/raft/snapshots/933-163151-1575698401708
2019/12/07 00:00:02 [INFO]  raft: Compacting logs from 153522 to 154125
2019/12/07 00:00:02 [INFO]  raft: Snapshot to 164365 complete

Something additional that's interesting: every single one of the failed snapshots has the exact same size, 32871 bytes, a little over a hundred bytes more than 32K.

Example:

-rw-r----- 1 consul consul 85640 Dec 6 14:00 20191206-1400.snap
-rw-r----- 1 consul consul 32871 Dec 6 16:00 20191206-1600.snap
-rw-r----- 1 consul consul 85635 Dec 6 18:00 20191206-1800.snap
-rw-r----- 1 consul consul 32871 Dec 6 20:00 20191206-2000.snap
-rw-r----- 1 consul consul 32871 Dec 6 21:00 20191206-2100.snap
-rw-r----- 1 consul consul 32871 Dec 7 00:00 20191207-0000.snap

All the numbers around 85K are good snapshots, the rest are failures.

This does not seem to happen on the cluster leader, just the follower nodes.
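Given that pattern, a quick pass over the existing backup files with consul snapshot inspect could sort the good ones from the bad. This is only a sketch: the backup path is a placeholder, stat -c%s assumes GNU stat on Linux, and I'm assuming inspect errors out on a truncated file.

#!/usr/bin/env bash
# Sketch: check which saved snapshot files still pass verification.
# Assumes `consul snapshot inspect` fails on a truncated file; the backup path is a placeholder.
for f in /var/lib/consul/backups/*.snap; do
    if consul snapshot inspect "$f" > /dev/null 2>&1; then
        echo "OK   $f ($(stat -c%s "$f") bytes)"
    else
        echo "BAD  $f ($(stat -c%s "$f") bytes)"
    fi
done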

@danlsgiga

Having the same issue on 1.7.1... here's the output of the same command run 3 times consecutively:

[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF
[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Error saving snapshot: Unexpected response code: 500 (failed to decode response: read tcp 192.168.0.1:60444->192.168.0.2:8300: read: connection reset by peer)
[root@server server]# CONSUL_HTTP_TOKEN=my-token ./bin/consul snapshot save consul.snapshot
Saved and verified snapshot to index 23436477

@agaudreault

It also happens on Consul v1.6.4

@jsosulska added the theme/internals and type/bug labels on Apr 20, 2020
@ChipV223
Contributor

Hi all!

Just wanted to check if anyone has seen this issue when trying to create snapshots since the release of v1.7.3, which includes a pull request that addresses this error.

@ChipV223
Contributor

Hi all!

Closing this, as I just tried to repro this behavior with v1.9.4 by running a watch command that saved a Consul snapshot every 3 seconds for 20 minutes, without hitting the error. But do feel free to re-open if you see this behavior in any of the more recent Consul releases.
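For reference, such a repro loop could look roughly like the following. This is only a sketch: the snapshot path and failure log are placeholders, and this may not match the exact command that was used.

# Sketch: save a snapshot every 3 seconds; log any failure with a timestamp.
watch -n 3 'consul snapshot save /tmp/repro.snap || echo "snapshot failed at $(date)" >> /tmp/snapshot-failures.log'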

@wsams

wsams commented May 7, 2021

For what it's worth, this is occurring on k8s 1.19 with the consul-helm chart, Consul 1.9.3.

[ERROR] agent.http: Request error: method=GET url=/v1/snapshot from=10.200.90.8:51322 error="failed to stream snapshot: read tcp 10.200.87.97:48112->10.200.54.76:8300: read: connection reset by peer"

@drewby08

I'm seeing very similar behavior while running snapshots in agent mode under Consul Enterprise 1.9.3:

2021-06-16T15:12:05.528-0400 [INFO]  snapshot.license: Automated Backups feature is now licensed
2021-06-16T15:12:05.532-0400 [INFO]  license: Consul license updated
2021-06-16T15:12:05.554-0400 [INFO]  snapshot: Waiting to obtain leadership...
2021-06-16T15:12:05.570-0400 [INFO]  snapshot: Obtained leadership
2021-06-16T15:12:11.003-0400 [ERROR] snapshot: Snapshot failed (will retry at next interval): error="snapshot verification failed: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF"
2021-06-16T16:06:59.728-0400 [WARN]  snapshot: Lost leadership, stopping snapshots
2021-06-16T16:06:59.749-0400 [INFO]  snapshot: Waiting to obtain leadership...
2021-06-16T16:06:59.750-0400 [ERROR] snapshot: Unable to use leader lock that was held previously and presumed lost, giving up the lock (will retry): error="Lock already held"
2021-06-16T16:06:59.752-0400 [ERROR] snapshot: Error trying to fetch lock key: error="Unexpected response code: 500"

@shomeprasanjit

shomeprasanjit commented Nov 15, 2021

We are running Consul 1.8.0 as pods via the Helm chart deployment, and while performing snapshot save and restore we got the same error:

Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF

Is there a fix for this yet?

Also, as a workaround, can anyone verify whether taking the snapshot from the leader avoids the problem? (A sketch of how to target the leader is below.)

@ChipV223 please re-open this issue since this is not fixed.
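On the leader workaround, one way to try it is to point the CLI at the leader's own agent rather than a follower. This is only a rough sketch: the HTTP port 8500, the column parsing of the peer list, and the output file name are assumptions, and whether this actually avoids the error is exactly the open question above.

#!/usr/bin/env bash
# Sketch: locate the current Raft leader and save the snapshot against its HTTP API directly,
# instead of going through a follower. Port 8500 and the column parsing are assumptions.
LEADER_IP=$(consul operator raft list-peers | awk '$4 == "leader" {print $3}' | cut -d: -f1)
CONSUL_HTTP_ADDR="http://${LEADER_IP}:8500" consul snapshot save consul.snapshot

If ACLs are enabled, CONSUL_HTTP_TOKEN still needs to be set, as in the earlier examples in this thread.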

@ChipV223 reopened this on Nov 17, 2021
@Amier3
Contributor

Amier3 commented Nov 23, 2021

Thanks for re-opening this @ChipV223!

This seems like a strange and persistent issue. I mentioned it to a few of the other engineers, and the recent activity being focused on Consul in k8s leads us to believe it could be a PV issue...

Also, @benumbed noted that:

For the failed snapshots, every single one of them has the same exact size 32871, a little over a hundred bytes over 32K.

This makes us wonder whether the issue only occurs at this exact size (or at a different but consistent size) for others. So my questions to @shomeprasanjit, @wsams, and @drewby08 would be:

  • Are you using Consul in containers? If so, how are you handling PVs?
  • Are all the failed snapshots you are seeing the exact same size in bytes?

@shomeprasanjit

@Amier3: Thanks for following up.

Here are your answers:

Are you using consul in containers? and if so how are you all handling PVs?

Ans) Yes, we are running Consul as pods via the standard Helm chart, with gp3 EBS volumes as persistent storage, managed by a StatefulSet.

Are all the failed snapshots you all are seeing the exact same size in bytes?

Ans) Not sure exactly; the sizes were probably in the MBs, but definitely not in bytes.
