
Test etcd container in K8s cluster #15139

Closed
ahrtr opened this issue Jan 18, 2023 · 12 comments · Fixed by #15505

Comments

@ahrtr
Member

ahrtr commented Jan 18, 2023

What would you like to be added?

Currently all the existing automated test cases run etcd as a process instead of a container, but etcd usually runs as a pod/container (e.g. a static pod) in K8s. So it would be better to add automated test cases that run an etcd container in a K8s cluster.

At a minimum, we should do some sanity testing for this scenario. For example, all members should be able to communicate with each other via their K8s service names (e.g. etcd-1.etcd-headless-svc.svc.cluster.local). I am not sure whether we can use Kind to do this.
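
For illustration, here is a minimal sketch of what such a sanity check on a Kind cluster could look like. This is only an assumption, not an agreed design: the cluster name, service name, and pod name are hypothetical, a single member is used for brevity, and a real test would run three members (e.g. via a StatefulSet).

kind create cluster --name etcd-sanity

# A single-member etcd pod behind a headless service, so the member is
# reachable at a DNS name of the form <pod>.<svc>.<namespace>.svc.cluster.local.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: etcd-headless
spec:
  clusterIP: None
  selector:
    app: etcd
  ports:
  - name: client
    port: 2379
  - name: peer
    port: 2380
---
apiVersion: v1
kind: Pod
metadata:
  name: etcd-0
  labels:
    app: etcd
spec:
  hostname: etcd-0
  subdomain: etcd-headless
  containers:
  - name: etcd
    image: gcr.io/etcd-development/etcd:v3.5.7
    command:
    - /usr/local/bin/etcd
    - --name=s1
    - --data-dir=/etcd-data
    - --listen-client-urls=http://0.0.0.0:2379
    - --advertise-client-urls=http://etcd-0.etcd-headless.default.svc.cluster.local:2379
    - --listen-peer-urls=http://0.0.0.0:2380
    - --initial-advertise-peer-urls=http://etcd-0.etcd-headless.default.svc.cluster.local:2380
    - --initial-cluster=s1=http://etcd-0.etcd-headless.default.svc.cluster.local:2380
EOF

kubectl wait --for=condition=Ready pod/etcd-0 --timeout=120s
# The actual sanity check: the member must be healthy via its K8s service DNS name.
kubectl exec etcd-0 -- /usr/local/bin/etcdctl \
  --endpoints=http://etcd-0.etcd-headless.default.svc.cluster.local:2379 \
  endpoint health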

The other alternative I am considering is to leverage the existing K8s pipeline to do such testing, but in that case we should clearly document the guide/steps so that everyone can follow them.

Why is this needed?

It can improve the quality of etcd releases.

@serathius
Member

Not sure about it; have we got any complaints about problems with running etcd in K8s? There are many different ways to deploy etcd in K8s, and I don't see any benefit in testing just one of them. I think it's up to the projects packaging etcd for K8s to test it properly.

On the other hand, the etcd project distributes the container image that those projects use. This is the artifact passed between the etcd project and K8s distributions. We might want to test the image itself to reduce the chances that we break K8s users.

To test the etcd image we don't need to run K8s; we can just start etcd with docker, which should be much simpler.

@serathius
Member

serathius commented Jan 18, 2023

The idea to add a test is good; however, I would want to avoid adding a new testing framework at this time, as we still haven't addressed the issues of test flakiness and of deduplicating the integration and e2e tests.

@ahrtr
Member Author

ahrtr commented Jan 18, 2023

Testing the etcd image itself is a must-have. We can do some sanity testing with a command like docker run.

Testing an etcd pod/container in K8s is a nice-to-have to me. Note that we are not going to test K8s itself; instead we are testing etcd's integration with K8s. K8s is the most important user of etcd, and I think it deserves the effort. At the very least we should run some sanity tests in K8s before publishing each release.

Some issues which aren't discovered by etcd's existing workflows/pipelines might be relatively easy to reproduce in a K8s environment, such as #14420. We also received an issue reporting that etcd's DNS name (e.g. etcd-0.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc) can't be correctly resolved or reverse-looked-up in a K8s environment (#15062). Although #15062 doesn't look like an etcd issue, it was reported to be related to a recent change (#14573).
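
For illustration, the DNS part of such a sanity check could be as small as the sketch below; the pod and service names are hypothetical (reusing the headless-service naming from my first comment), and busybox is used only because it ships nslookup.

# Forward lookup: the member's service DNS name must resolve inside the cluster.
kubectl run dns-check -i --rm --image=busybox:1.36 --restart=Never -- \
  nslookup etcd-0.etcd-headless.default.svc.cluster.local

# Reverse lookup: the pod IP should map back to the same DNS name.
POD_IP=$(kubectl get pod etcd-0 -o jsonpath='{.status.podIP}')
kubectl run dns-rcheck -i --rm --image=busybox:1.36 --restart=Never -- \
  nslookup "$POD_IP"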

Automatically testing the etcd image itself using docker should be an easy task, but I'm not sure what the best approach is to sanity test the etcd image in K8s. An automated test is definitely preferred, but a manual test might also be acceptable as a first step.

cc @ptabor @spzala @liggitt @dims as well.

@dims
Contributor

dims commented Jan 18, 2023

cc @chaochn47 @geetasg

@serathius
Member

I think you are talking about two different use cases: etcd being used by K8s as the storage running in the control plane, and etcd deployed in K8s running on the nodes. K8s is the biggest user of etcd for the first use case; the second is much less popular.

I would like to first classify the errors to decide which use case we should focus on, for example the issues you linked above.

Overall I'm not as concerned about K8s testing, as every etcd release is immediately tested in the K8s CI in both scenarios and used for detecting performance regressions. I don't think tightening the feedback loop will decrease the number of issues in a noticeable way.

@ahrtr
Member Author

ahrtr commented Jan 19, 2023

Overall I'm not as concerned about K8s testing, as every etcd release is immediately tested in the K8s CI in both scenarios and used for detecting performance regressions.

Yes, that's good. The only minor concern is that it's post-release verification. It would be better to do some simple verification before we publish each release.

I think you are talking about two different use cases: etcd being used by K8s as the storage running in the control plane, and etcd deployed in K8s running on the nodes. K8s is the biggest user of etcd for the first use case; the second is much less popular.

Ideally we can cover both. But the former doesn't seem feasible for us to automate ourselves, and it's OK to leverage the K8s CI even though it's post-release verification. The latter should be relatively easy.

Please see the discussion in #14402. It concerns the former case (etcd being used by K8s as the storage running in the control plane) mentioned above.

@ahrtr
Member Author

ahrtr commented Jan 28, 2023

As a first step, let's add a GitHub workflow to do some sanity testing of the etcd image on each PR. Anyone should feel free to deliver a PR for this; please refer to the release text, pasted below:

rm -rf /tmp/etcd-data.tmp && mkdir -p /tmp/etcd-data.tmp && \
  docker rmi gcr.io/etcd-development/etcd:v3.5.7 || true && \
  docker run \
  -p 2379:2379 \
  -p 2380:2380 \
  --mount type=bind,source=/tmp/etcd-data.tmp,destination=/etcd-data \
  --name etcd-gcr-v3.5.7 \
  gcr.io/etcd-development/etcd:v3.5.7 \
  /usr/local/bin/etcd \
  --name s1 \
  --data-dir /etcd-data \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://0.0.0.0:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://0.0.0.0:2380 \
  --initial-cluster s1=http://0.0.0.0:2380 \
  --initial-cluster-token tkn \
  --initial-cluster-state new \
  --log-level info \
  --logger zap \
  --log-outputs stderr

docker exec etcd-gcr-v3.5.7 /usr/local/bin/etcd --version
docker exec etcd-gcr-v3.5.7 /usr/local/bin/etcdctl version
docker exec etcd-gcr-v3.5.7 /usr/local/bin/etcdutl version
docker exec etcd-gcr-v3.5.7 /usr/local/bin/etcdctl endpoint health
docker exec etcd-gcr-v3.5.7 /usr/local/bin/etcdctl put foo bar
docker exec etcd-gcr-v3.5.7 /usr/local/bin/etcdctl get foo

@pchan
Contributor

pchan commented Feb 6, 2023

As a first step, let's add a GitHub workflow to do some sanity testing of the etcd image on each PR. Anyone should feel free to deliver a PR for this; please refer to the release text, pasted below:

The pseudo code for this seems to be:

Build the binaries
Do a docker push to gcr.io/etcd-development/etcd:<some-tag>
Check docker exec with some basic get/put

Will that constitute the "first step"? I can work on a PR which accomplishes the above. Is the next step etcd being deployed in K8s running on nodes?

@ahrtr
Member Author

ahrtr commented Feb 6, 2023

Build the binaries
Do a docker push to gcr.io/etcd-development/etcd:<some-tag>
Check docker exec with some basic get/put

There is no need to push the image to any registry for now; I think we just need to build the binaries, build the image, and then verify the local image. Actually we already have a release workflow which builds images, so I would suggest adding an additional step or job to it. Please feel free to deliver a PR for this. Thanks.
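
For illustration, here is a hedged sketch of what such a step or job might run, mirroring the release text above. The image tag is a placeholder, and the docker build line assumes a Dockerfile that packages the locally built binaries; the real workflow would reuse whatever the release job already builds.

make build                                            # build the binaries
docker build -t gcr.io/etcd-development/etcd:test .   # build a local image (tag is hypothetical)
docker run --detach --name etcd-smoke \
  gcr.io/etcd-development/etcd:test \
  /usr/local/bin/etcd --name s1 --data-dir /etcd-data \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://0.0.0.0:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://0.0.0.0:2380 \
  --initial-cluster s1=http://0.0.0.0:2380
sleep 5   # crude startup wait; a retry loop would be more robust in CI
docker exec etcd-smoke /usr/local/bin/etcdctl endpoint health
docker exec etcd-smoke /usr/local/bin/etcdctl put foo bar
docker exec etcd-smoke /usr/local/bin/etcdctl get foo
docker rm -f etcd-smoke   # clean up so the job leaves no state behind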

Is the next step etcd being deployed in K8s running on nodes?

Since Kubernetes is the most important user of etcd, running etcd as the K8s apiserver's backend makes more sense. But it doesn't seem easy to automate that ourselves. As I mentioned in #15139 (comment), K8s's CI verifies each etcd release, but that is post-release verification rather than ex-ante. It would also be acceptable if we could do this manually via K8s's CI, but it should be clearly documented.

@ahrtr
Member Author

ahrtr commented Feb 21, 2023

@pchan any update?

@ahrtr
Member Author

ahrtr commented Feb 21, 2023

This issue is a priority for me, and I expect us to add some sanity tests per my comment above (#15139 (comment)), and also to add some cases to verify the image architecture (please see #15270).
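
For illustration, a minimal sketch of the kind of architecture check meant here; the local tag is a placeholder:

# Check the OS/arch baked into a locally built image.
docker image inspect gcr.io/etcd-development/etcd:test \
  --format 'os={{.Os}} arch={{.Architecture}}'

# For a published multi-arch image, inspect the manifest list in the registry.
docker manifest inspect gcr.io/etcd-development/etcd:v3.5.7 | grep architecture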

@pchan
Contributor

pchan commented Feb 21, 2023

@pchan any update?

I will prioritize this and come up with a PR within this week.
