
Failed to restore etcd from a snapshot due to resolving peer URL failure #14456

Closed
xzycn opened this issue Sep 13, 2022 · 20 comments

Comments

@xzycn

xzycn commented Sep 13, 2022

I have done the following:

  1. Copy a snapshot file to some place, named etcd-snapshot.db.
  2. Scale the StatefulSet to 0.
  3. Start a static pod with etcdctl and mount the PVC used by the etcd members (a sketch of steps 2 and 3 follows the restore command below).
  4. Execute the restore command:
etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --initial-cluster=apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --initial-cluster-token=etcd-cluster-k8s --initial-advertise-peer-urls=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --name apisix-etcd-2  --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data
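A hedged sketch of steps 2 and 3 (the namespace, StatefulSet, PVC, image name and mount path are assumptions inferred from the peer URLs above, and a plain pod is used here rather than a true static pod):

kubectl -n apisix scale statefulset apisix-etcd --replicas=0

kubectl -n apisix apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: etcd-restore
spec:
  restartPolicy: Never
  containers:
  - name: etcdctl
    image: bitnami/etcd:3.4.16        # any image that ships etcdctl
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: member-data
      mountPath: /data
  volumes:
  - name: member-data
    persistentVolumeClaim:
      claimName: data-apisix-etcd-2   # assumed PVC name for member apisix-etcd-2
EOF

kubectl -n apisix exec -it etcd-restore -- sh   # then run the restore command from step 4, pointing --data-dir under the mount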

However, the restore command in step 4 fails: the pods have already been shut down, so their pod DNS records no longer exist, and I get these errors:

{"level":"warn","ts":1663053636.7199378,"caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","host":"apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","retry-interval":1,"error":"lookup apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local on 192.168.0.2:53: no such host"}

If I restore without extra options:

etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data

Everything is OK except that the node starts as a single-node cluster; etcdctl member list only shows itself :(

So, how should I restore etcd deployed in Kubernetes? Thank you in advance.

etcd Version: 3.4.16
Git SHA: d19fbe5
Go Version: go1.12.17
Go OS/Arch: linux/amd64

@ahrtr
Member

ahrtr commented Sep 15, 2022

Thanks @xzycn for raising this ticket. It looks like an issue to me.

The error is coming from VerifyBootstrap. Specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches the corresponding URL included in --initial-cluster, it may need to resolve the TCP address.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", to etcdutl to bypass the issue when restoring a snapshot.

It should be an easy fix. Anyone should feel free to deliver a PR for this, and we can have more discussion under the PR.

@ahrtr ahrtr changed the title [Help Wanted] How should I restore ETCD deployed in K8S? Failed to store etcd from snapshot due to resolving peer URL failure Sep 16, 2022
@ahrtr ahrtr changed the title Failed to store etcd from snapshot due to resolving peer URL failure Failed to restore etcd from a snapshot due to resolving peer URL failure Sep 16, 2022
@pchan
Contributor

pchan commented Sep 16, 2022

The error is coming from VerifyBootstrap. Specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches the corresponding URL included in --initial-cluster, it may need to resolve the TCP address.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", to etcdutl to bypass the issue when restoring a snapshot.

Why do we need to bypass this during restore? Is there a specific reason why the URL resolution fails in this case?

It should be an easy fix. Anyone should feel free to deliver a PR for this, and we can have more discussion under the PR.

I would like to work on this. To replicate it, will the etcd commands in [1] alone be enough, or are there other circumstances needed to make the URL resolution fail?

[1] https://etcd.io/docs/v3.5/op-guide/recovery/

@ahrtr
Member

ahrtr commented Sep 16, 2022

Why do we need to bypass this during restore? Is there a specific reason why the URL resolution fails in this case?

Because the etcd pods aren't running when restoring from the snapshot, a URL like apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.
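For example, the per-pod DNS records of a headless service exist only while the pods are running. A quick check from a throwaway pod (a hedged sketch; the busybox image is an assumption) shows the lookup failing once the StatefulSet is scaled down:

kubectl -n apisix run dnscheck --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local
# while the pods are down this fails with NXDOMAIN ("can't resolve"),
# which is exactly what the restore's URL verification runs into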

I would like to work on this. To replicate it, will the etcd commands in [1] alone be enough, or are there other circumstances needed to make the URL resolution fail?

Please feel free to deliver a PR, and follow #14456 (comment) to reproduce and fix this issue. I think the commands alone should be enough to reproduce it, but eventually we need to verify the real scenario raised by the reporter (@xzycn).

@xzycn
Author

xzycn commented Sep 17, 2022

@ahrtr This command comes from the helm chart https://github.com/apache/apisix-helm-chart/tree/master/charts/apisix/charts; etcd is a subchart of the chart called apisix.

@hasethuraman

hasethuraman commented Sep 19, 2022

The comment below is with regard to 3.5.*.

I did hit this issue. Here is a workaround for the time being that does not touch the chart.

Assume you have a snapshot and the etcd cluster is down.

Steps (a consolidated sketch for one member follows the list):

1. Bring up the etcd cluster (etcd1, etcd2, etcd3). Let's say the data dirs are /tmp/etcd1/data, /tmp/etcd2/data and /tmp/etcd3/data respectively. (If the data is corrupted, back up the data directories and start afresh.)
2. Run the restore command with a new data directory (--data-dir /tmp/etcd{1,2,3}/data.backup in the restore command), i.e. /tmp/etcd1/data.backup, /tmp/etcd2/data.backup and /tmp/etcd3/data.backup.
3. Bring down the etcd cluster (if it is Kubernetes, scale the replicas from 3 to 0).
4. mv /tmp/etcd1/data /tmp/etcd1/data.prev
   mv /tmp/etcd2/data /tmp/etcd2/data.prev
   mv /tmp/etcd3/data /tmp/etcd3/data.prev
5. mv /tmp/etcd1/data.backup /tmp/etcd1/data
   mv /tmp/etcd2/data.backup /tmp/etcd2/data
   mv /tmp/etcd3/data.backup /tmp/etcd3/data
6. Bring up the etcd cluster (scale the replicas from 0 to 3).
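A minimal shell sketch of steps 2, 4 and 5 for a single member (the snapshot file name is an assumption, and whether additional restore flags are needed is discussed in the comments below; repeat for etcd2 and etcd3):

# step 2: restore the snapshot into a fresh directory
etcdctl snapshot restore etcd-snapshot.db --data-dir /tmp/etcd1/data.backup

# steps 4 and 5: with the cluster stopped, swap the directories
mv /tmp/etcd1/data        /tmp/etcd1/data.prev
mv /tmp/etcd1/data.backup /tmp/etcd1/data
# then bring the cluster back up (step 6)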

@xzycn
Author

xzycn commented Sep 19, 2022

@hasethuraman
In your steps, is the restore command run with only the one option "--data-dir"? If so, won't each member start as a single node? If not, using an option (e.g. --initial-cluster) with a domain will cause the problem described in the title.

@hasethuraman

@hasethuraman In your steps, is the restore command run with only the one option "--data-dir"? If so, won't each member start as a single node? If not, using an option (e.g. --initial-cluster) with a domain will cause the problem described in the title.

Correct. The restore command arguments I tried are the same as in https://etcd.io/docs/v3.3/op-guide/recovery/#restoring-a-cluster

@pchan
Contributor

pchan commented Sep 21, 2022

I am having trouble replicating the problem. I created 2 etcd members (static configuration) in a cluster, similar to @xzycn's command line. When I restore, it seems to work fine without producing the problematic log message. Note that I used etcd version 3.5.5 and etcdutl (rather than etcdctl, whose snapshot restore command is deprecated). I have given the command lines below. It could be because I am using IP addresses rather than hostnames. Also, I noticed that the message is a warning and not fatal; does it prevent the restore from completing?

etcd Version: 3.5.5

Create cluster (2 such instances)

/tmp/etcd-download-test/etcd --name etcd1 \
  --initial-advertise-peer-urls http://10.160.0.9:2380 \
  --listen-peer-urls http://10.160.0.9:2380 \
  --listen-client-urls http://10.160.0.9:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.160.0.9:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 \
  --initial-cluster-state new \
  --data-dir /home/prasadc/etcd_data

Create snapshot

etcdctl snapshot save hello.db

restore from snapshot

cat ./restore.sh

/tmp/etcd-download-test/etcdutl snapshot restore --skip-hash-check hello.db --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls http://10.160.0.9:2380 --name etcd1  --data-dir /home/prasadc/etcd_data_restore

./restore.sh
2022-09-21T10:31:04Z    info    snapshot/v3_snapshot.go:248     restoring snapshot      {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:117\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.Start\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/ctl.go:50\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/main.go:23\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
2022-09-21T10:31:04Z    info    membership/store.go:141 Trimming membership information from the backend...
2022-09-21T10:31:04Z    info    membership/cluster.go:421       added member    {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "7012f0c6b3126ac4", "added-peer-peer-urls": ["http://10.160.0.10:2380"]}
2022-09-21T10:31:04Z    info    membership/cluster.go:421       added member    {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "f3655e2d7dd93afe", "added-peer-peer-urls": ["http://10.160.0.9:2380"]}
2022-09-21T10:31:05Z    info    snapshot/v3_snapshot.go:269     restored snapshot       {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap"}

@ahrtr
Member

ahrtr commented Sep 22, 2022

You need to reproduce this issue using an unresolvable URL, such as http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380
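For example (a sketch; the data dir is an assumption), re-running the restore above with the headless-service hostname from the original report should hit the same "failed to resolve URL Host" path on 3.4 and 3.5:

/tmp/etcd-download-test/etcdutl snapshot restore --skip-hash-check hello.db \
  --name apisix-etcd-0 \
  --initial-cluster apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --initial-cluster-token etcd-cluster-k8s \
  --initial-advertise-peer-urls http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --data-dir /home/prasadc/etcd_data_restore_unresolvable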

@sanjeev98kumar

@ahrtr I want to work on this issue. Could you please assign it to me?

@ahrtr
Member

ahrtr commented Sep 25, 2022

Thanks @sanjeev98kumar

@pchan are you still working on this issue?

@pchan
Contributor

pchan commented Sep 26, 2022

@pchan are you still working on this issue?

Yes, I will implement the following part. I expect to have a PR or an update soon.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", to etcdutl to bypass the issue when restoring a snapshot.

@ahrtr
Member

ahrtr commented Sep 26, 2022

Thanks @pchan for the update.

@sanjeev98kumar Please find something else to work on. FYI. find-something-to-work-on

@pchan
Contributor

pchan commented Oct 3, 2022

@ahrtr I have created a PR (#14546) that attempts to fix this by adding a flag. Could you please review it and add reviewers? I wasn't able to follow everything in the contributing guide. It passes make test-unit. I am looking for feedback and will update the PR with the rest of the steps specified in the contributing guide.

@ahrtr
Member

ahrtr commented Oct 3, 2022

I just realized that the main branch doesn't actually have this issue; it can only be reproduced on 3.5 and 3.4. The issue has already been resolved by b272b98 on the main branch. Please backport the commit to both 3.5 and 3.4. Thanks.
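A rough sketch of the backport flow (the remote and branch names here are assumptions; etcd's release branches are release-3.5 and release-3.4):

git fetch upstream
git checkout -b backport-14456-release-3.5 upstream/release-3.5
git cherry-pick -x b272b98                   # resolve any conflicts and run the tests
git push origin backport-14456-release-3.5   # then open a PR against release-3.5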

@ahrtr
Member

ahrtr commented Oct 3, 2022

The original PR is #13224

@pchan
Contributor

pchan commented Oct 11, 2022

I have created a cherry-pick of #13224 in #14573 for release-3.5. What are the next steps? Should I run the tests on 3.5?

@pchan
Contributor

pchan commented Oct 11, 2022

Why do we need to bypass this during restore? Is there a specific reason why the URL resolution fails in this case?

Because the etcd pods aren't running when restoring from the snapshot, a URL like apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.

The back-ported fix front-loads the string comparison between the advertise peer URLs (--initial-advertise-peer-urls) and the initial cluster URLs (--initial-cluster), so that resolution is not called when they match. So if a user gives different URL strings that resolve to the same IP address, the issue will still manifest, and the only way to prevent that is to use the flag. I checked the reporter's description and the backport should be enough.
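To illustrate that caveat (a hypothetical sketch; the snapshot file name and data dirs are assumptions): the first invocation below uses byte-identical URL strings, so after the backport no resolution should be attempted; the second names the same host but with a trailing dot, so the string comparison fails and resolution would presumably still be attempted (and fail while the pods are down):

# identical URL strings: compared textually, no DNS lookup needed
etcdutl snapshot restore etcd-snapshot.db --name apisix-etcd-2 \
  --initial-advertise-peer-urls http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --initial-cluster apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --data-dir /tmp/restore-a

# same host spelled differently (trailing dot): string comparison fails, resolution is attempted
etcdutl snapshot restore etcd-snapshot.db --name apisix-etcd-2 \
  --initial-advertise-peer-urls http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local.:2380 \
  --initial-cluster apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --data-dir /tmp/restore-b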

@ahrtr
Member

ahrtr commented Oct 11, 2022

I have created a cherry-pick of #13224 in #14573 for release-3.5. What are the next steps? Should I run the tests on 3.5?

Could you double check whether 3.4 has this issue and backport it to release-3.4 as well if needed? Thanks.

@ahrtr
Member

ahrtr commented Oct 12, 2022

Resolved in #14577 and #14573

The fix will be included in 3.5.6 and 3.4.22.

@pchan please add a changelog item for both 3.4 and 3.5. FYI. #14573 (comment)

@ahrtr ahrtr closed this as completed Oct 12, 2022