
Failed to restore etcd from a snapshot due to resolving peer URL failure #14456

Closed
xzycn opened this issue Sep 13, 2022 · 20 comments

Comments

@xzycn

xzycn commented Sep 13, 2022

I have done the following:

  1. Copy a snapshot file to some place, named etcd-snapshot.db.
  2. Scale the StatefulSet to 0.
  3. Start a static pod with etcdctl and mount the PVC used by the etcd members (a sketch of steps 2 and 3 follows the restore command below).
  4. Execute the restore command:
etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --initial-cluster=apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --initial-cluster-token=etcd-cluster-k8s --initial-advertise-peer-urls=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 --name apisix-etcd-2  --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data
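A hedged sketch of steps 2 and 3 (the namespace, StatefulSet, PVC, image name and mount path are assumptions inferred from the peer URLs above, and a plain pod is used here rather than a true static pod):

kubectl -n apisix scale statefulset apisix-etcd --replicas=0

kubectl -n apisix apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: etcd-restore
spec:
  restartPolicy: Never
  containers:
  - name: etcdctl
    image: bitnami/etcd:3.4.16        # any image that ships etcdctl
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: member-data
      mountPath: /data
  volumes:
  - name: member-data
    persistentVolumeClaim:
      claimName: data-apisix-etcd-2   # assumed PVC name for member apisix-etcd-2
EOF

kubectl -n apisix exec -it etcd-restore -- sh   # then run the restore command from step 4, pointing --data-dir under the mount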

However, the restore command in step 4 fails: the pods have already been shut down, so their pod DNS records no longer exist, and I get these errors:

{"level":"warn","ts":1663053636.7199378,"caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","host":"apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380","retry-interval":1,"error":"lookup apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local on 192.168.0.2:53: no such host"}

If I restore without extra options:

etcdctl snapshot restore --skip-hash-check etcd-snapshot.db --data-dir=/opt/nfsdata/apisix-data-apisix-etcd-2-pvc-f8ef09a4-e8f2-404d-8d14-63b905e324be/data

Everything is OK except that the node starts as a single-node cluster; etcdctl member list only shows itself :(

So, how should I restore etcd deployed in Kubernetes? Thank you in advance.

etcd Version: 3.4.16
Git SHA: d19fbe5
Go Version: go1.12.17
Go OS/Arch: linux/amd64

@ahrtr
Member

ahrtr commented Sep 15, 2022

Thanks @xzycn for raising this ticket. It looks like an issue to me.

The error is coming from VerifyBootstrap. Specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches the corresponding URL included in --initial-cluster, it may need to resolve the TCP address.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", to etcdutl to bypass the issue when restoring a snapshot.

It should be an easy fix. Anyone should feel free to deliver a PR for this, and we can have more discussion under the PR.

@ahrtr ahrtr changed the title [Help Wanted] How should I restore ETCD deployed in K8S? Failed to store etcd from snapshot due to resolving peer URL failure Sep 16, 2022
@ahrtr ahrtr changed the title Failed to store etcd from snapshot due to resolving peer URL failure Failed to restore etcd from a snapshot due to resolving peer URL failure Sep 16, 2022
@pchan
Contributor

pchan commented Sep 16, 2022

The error is coming from VerifyBootstrap. Specifically, it's coming from netutil.URLStringsEqual. When etcdctl/etcdutl tries to verify whether --initial-advertise-peer-urls matches the corresponding URL included in --initial-cluster, it may need to resolve the TCP address.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", to etcdutl to bypass the issue when restoring a snapshot.

Why do we need to bypass this during restore? Is there a specific reason why the URL resolution fails in this case?

It should be an easy fix. Anyone should feel free to deliver a PR for this, and we can have more discussion under the PR.

I would like to work on this. To replicate it, will the etcd commands in [1] alone be enough, or are there other circumstances needed to make the URL resolution fail?

[1] https://etcd.io/docs/v3.5/op-guide/recovery/

@ahrtr
Member

ahrtr commented Sep 16, 2022

Why do we need to bypass this during restore? Is there a specific reason why the URL resolution fails in this case?

Because the etcd pods aren't running when restoring from the snapshot, a URL like apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.
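For example, the per-pod DNS records of a headless service exist only while the pods are running. A quick check from a throwaway pod (a hedged sketch; the busybox image is an assumption) shows the lookup failing once the StatefulSet is scaled down:

kubectl -n apisix run dnscheck --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local
# while the pods are down this fails with NXDOMAIN ("can't resolve"),
# which is exactly what the restore's URL verification runs into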

I would like to work on this. To replicate it, will the etcd commands in [1] alone be enough, or are there other circumstances needed to make the URL resolution fail?

Please feel free to deliver a PR, and follow #14456 (comment) to reproduce and fix this issue. I think the commands alone should be enough to reproduce it, but eventually we need to verify the real scenario raised by the reporter (@xzycn).

@xzycn
Author

xzycn commented Sep 17, 2022

@ahrtr This command comes from the helm chart https://github.com/apache/apisix-helm-chart/tree/master/charts/apisix/charts; etcd is a subchart of the chart called apisix.

@hasethuraman

hasethuraman commented Sep 19, 2022

The comment below is with regard to 3.5.*.

I did hit this issue. Here is a workaround for the time being that does not touch the chart.

Assume you have a snapshot and the etcd cluster is down.

Steps (a consolidated sketch for one member follows the list):

1. Bring up the etcd cluster (etcd1, etcd2, etcd3). Let's say the data dirs are /tmp/etcd1/data, /tmp/etcd2/data and /tmp/etcd3/data respectively. (If the data is corrupted, back up the data directories and start afresh.)
2. Run the restore command with a new data directory (--data-dir /tmp/etcd{1,2,3}/data.backup in the restore command), i.e. /tmp/etcd1/data.backup, /tmp/etcd2/data.backup and /tmp/etcd3/data.backup.
3. Bring down the etcd cluster (if it is Kubernetes, scale the replicas from 3 to 0).
4. mv /tmp/etcd1/data /tmp/etcd1/data.prev
   mv /tmp/etcd2/data /tmp/etcd2/data.prev
   mv /tmp/etcd3/data /tmp/etcd3/data.prev
5. mv /tmp/etcd1/data.backup /tmp/etcd1/data
   mv /tmp/etcd2/data.backup /tmp/etcd2/data
   mv /tmp/etcd3/data.backup /tmp/etcd3/data
6. Bring up the etcd cluster (scale the replicas from 0 to 3).
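A minimal shell sketch of steps 2, 4 and 5 for a single member (the snapshot file name is an assumption, and whether additional restore flags are needed is discussed in the comments below; repeat for etcd2 and etcd3):

# step 2: restore the snapshot into a fresh directory
etcdctl snapshot restore etcd-snapshot.db --data-dir /tmp/etcd1/data.backup

# steps 4 and 5: with the cluster stopped, swap the directories
mv /tmp/etcd1/data        /tmp/etcd1/data.prev
mv /tmp/etcd1/data.backup /tmp/etcd1/data
# then bring the cluster back up (step 6)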

@xzycn
Author

xzycn commented Sep 19, 2022

@hasethuraman
In your steps, is the restore command run with only the one option "--data-dir"? If so, won't each member start as a single node? If not, using an option (e.g. --initial-cluster) with a domain will cause the problem described in the title.

@hasethuraman

@hasethuraman In your steps, is the restore command run with only the one option "--data-dir"? If so, won't each member start as a single node? If not, using an option (e.g. --initial-cluster) with a domain will cause the problem described in the title.

Correct. The restore command arguments I tried are the same as in https://etcd.io/docs/v3.3/op-guide/recovery/#restoring-a-cluster

@pchan
Contributor

pchan commented Sep 21, 2022

I am having trouble replicating the problem. I created 2 etcd members (static configuration) in a cluster, similar to @xzycn's command line. When I restore, it seems to work fine without producing the problematic log message. Note that I used etcd version 3.5.5 and etcdutl (rather than etcdctl, whose snapshot restore command is deprecated). I have given the command lines below. It could be because I am using IP addresses rather than hostnames. Also, I noticed that the message is a warning and not fatal; does it prevent the restore from completing?

etcd Version: 3.5.5

Create cluster (2 such instances)

/tmp/etcd-download-test/etcd --name etcd1 \
  --initial-advertise-peer-urls http://10.160.0.9:2380 \
  --listen-peer-urls http://10.160.0.9:2380 \
  --listen-client-urls http://10.160.0.9:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.160.0.9:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 \
  --initial-cluster-state new \
  --data-dir /home/prasadc/etcd_data

Create snapshot

etcdctl snapshot save hello.db

restore from snapshot

cat ./restore.sh

/tmp/etcd-download-test/etcdutl snapshot restore --skip-hash-check hello.db --initial-cluster etcd1=http://10.160.0.9:2380,etcd2=http://10.160.0.10:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls http://10.160.0.9:2380 --name etcd1  --data-dir /home/prasadc/etcd_data_restore

./restore.sh
2022-09-21T10:31:04Z    info    snapshot/v3_snapshot.go:248     restoring snapshot      {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.snapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:117\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/local/google/home/siarkowicz/.gvm/pkgsets/go1.16.15/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.Start\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/ctl.go:50\nmain.main\n\t/tmp/etcd-release-3.5.5/etcd/release/etcd/etcdutl/main.go:23\nruntime.main\n\t/usr/local/google/home/siarkowicz/.gvm/gos/go1.16.15/src/runtime/proc.go:225"}
2022-09-21T10:31:04Z    info    membership/store.go:141 Trimming membership information from the backend...
2022-09-21T10:31:04Z    info    membership/cluster.go:421       added member    {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "7012f0c6b3126ac4", "added-peer-peer-urls": ["http://10.160.0.10:2380"]}
2022-09-21T10:31:04Z    info    membership/cluster.go:421       added member    {"cluster-id": "5856bec5b20bce76", "local-member-id": "0", "added-peer-id": "f3655e2d7dd93afe", "added-peer-peer-urls": ["http://10.160.0.9:2380"]}
2022-09-21T10:31:05Z    info    snapshot/v3_snapshot.go:269     restored snapshot       {"path": "hello.db", "wal-dir": "/home/prasadc/etcd_data_restore/member/wal", "data-dir": "/home/prasadc/etcd_data_restore", "snap-dir": "/home/prasadc/etcd_data_restore/member/snap"}

@ahrtr
Member

ahrtr commented Sep 22, 2022

You need to reproduce this issue using an unresolvable URL, such as http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380
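For example (a sketch; the data dir is an assumption), re-running the restore above with the headless-service hostname from the original report should hit the same "failed to resolve URL Host" path on 3.4 and 3.5:

/tmp/etcd-download-test/etcdutl snapshot restore --skip-hash-check hello.db \
  --name apisix-etcd-0 \
  --initial-cluster apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --initial-cluster-token etcd-cluster-k8s \
  --initial-advertise-peer-urls http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --data-dir /home/prasadc/etcd_data_restore_unresolvable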

@sanjeev98kumar

@ahrtr I want to work on this issue. Could you please assign it to me?

@ahrtr
Member

ahrtr commented Sep 25, 2022

Thanks @sanjeev98kumar

@pchan are you still working on this issue?

@pchan
Contributor

pchan commented Sep 26, 2022

@pchan are you still working on this issue?

Yes, I will implement the following part. I expect to have a PR or an update soon.

The proposed fix is to add a flag, something like "--ignore-bootstrap-verify", to etcdutl to bypass the issue when restoring a snapshot.

@ahrtr
Member

ahrtr commented Sep 26, 2022

Thanks @pchan for the update.

@sanjeev98kumar Please find something else to work on. FYI. find-something-to-work-on

@pchan
Contributor

pchan commented Oct 3, 2022

@ahrtr I have created a PR (#14546) that attempts to fix this by adding a flag. Could you please review it and add reviewers? I wasn't able to follow everything in the contributing guide. It passes make test-unit. I am looking for feedback and will update the PR with the rest of the steps specified in the contributing guide.

@ahrtr
Member

ahrtr commented Oct 3, 2022

I just realized that the main branch doesn't actually have this issue; it can only be reproduced on 3.5 and 3.4. The issue has already been resolved by b272b98 on the main branch. Please backport the commit to both 3.5 and 3.4. Thanks.
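A rough sketch of the backport flow (the remote and branch names here are assumptions; etcd's release branches are release-3.5 and release-3.4):

git fetch upstream
git checkout -b backport-14456-release-3.5 upstream/release-3.5
git cherry-pick -x b272b98                   # resolve any conflicts and run the tests
git push origin backport-14456-release-3.5   # then open a PR against release-3.5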

@ahrtr
Member

ahrtr commented Oct 3, 2022

The original PR is #13224

@pchan
Contributor

pchan commented Oct 11, 2022

I have created a cherry-pick of #13224 in #14573 for release-3.5. What are the next steps? Should I run the tests on 3.5?

@pchan
Contributor

pchan commented Oct 11, 2022

Why do we need to bypass this during restore? Is there a specific reason why the URL resolution fails in this case?

Because the etcd pods aren't running when restoring from the snapshot, a URL like apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local can't be resolved. Please refer to the reporter's description above.

The back-ported fix front-loads the string comparison between the advertise peer URLs (--initial-advertise-peer-urls) and the initial cluster URLs (--initial-cluster), so that resolution is not called when they match. So if a user gives different URL strings that resolve to the same IP address, the issue will still manifest, and the only way to prevent that is to use the flag. I checked the reporter's description and the backport should be enough.
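To illustrate that caveat (a hypothetical sketch; the snapshot file name and data dirs are assumptions): the first invocation below uses byte-identical URL strings, so after the backport no resolution should be attempted; the second names the same host but with a trailing dot, so the string comparison fails and resolution would presumably still be attempted (and fail while the pods are down):

# identical URL strings: compared textually, no DNS lookup needed
etcdutl snapshot restore etcd-snapshot.db --name apisix-etcd-2 \
  --initial-advertise-peer-urls http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --initial-cluster apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --data-dir /tmp/restore-a

# same host spelled differently (trailing dot): string comparison fails, resolution is attempted
etcdutl snapshot restore etcd-snapshot.db --name apisix-etcd-2 \
  --initial-advertise-peer-urls http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local.:2380 \
  --initial-cluster apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
  --data-dir /tmp/restore-b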

@ahrtr
Member

ahrtr commented Oct 11, 2022

I have created a cherry-pick of #13224 in #14573 for release-3.5. What are the next steps? Should I run the tests on 3.5?

Could you double check whether 3.4 has this issue and backport it to release-3.4 as well if needed? Thanks.

@ahrtr
Member

ahrtr commented Oct 12, 2022

Resolved in #14577 and #14573

The fix will be included in 3.5.6 and 3.4.22.

@pchan please add a changelog item for both 3.4 and 3.5. FYI. #14573 (comment)

@ahrtr ahrtr closed this as completed Oct 12, 2022