
IPv6 containers experience connectivity issues with large simultaneous file downloads #2817

Closed
chen-anders opened this issue Feb 29, 2024 · 8 comments
Labels: bug, stale

Comments

chen-anders commented Feb 29, 2024

What happened:

Observed behavior is that large simultaneous downloads stall out and eventually we receive a "connection reset by peer" error. Sometimes, we also see TLS connection errors and DNS resolution errors, which cause some downloads to immediately error out.

These errors only affect downloads from IPv6 servers/endpoints. IPv4 works perfectly fine.
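A quick way to confirm the address-family split when reproducing (a sketch, not necessarily how we originally tested; any large file hosted on a dual-stack endpoint works as $URL) is to pin GNU wget to one address family:

```bash
# Same endpoint, forced to one address family at a time (standard GNU wget flags).
URL="https://example.com/large-file.bin"   # placeholder; substitute a large dual-stack-hosted file
wget -4 -O /dev/null "$URL"                # IPv4 only: downloads complete normally
wget -6 -O /dev/null "$URL"                # IPv6 only: large parallel downloads stall or reset
```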

Example error output

Sometimes we see errors while establishing TLS connections over HTTPS, or DNS resolution failures:

test9 | Connecting to embed-ssl.wistia.com (embed-ssl.wistia.com)|2600:9000:244d:7800:1e:c86:4140:93a1|:443... connected.
test9 | Unable to establish SSL connection.
test9 | exit status 4
test3 | Resolving embed-ssl.wistia.com (embed-ssl.wistia.com)... failed: Try again.
test3 | wget: unable to resolve host address 'embed-ssl.wistia.com'
test3 | exit status 4

We host-mounted the CNI logs on the hosts where we performed the testing, but didn't see any relevant log entries during our tests.

What you expected to happen:

Downloads complete without connection errors

How to reproduce it (as minimally and precisely as possible):

We have a Procfile that runs 9 downloads of a 700MB file in parallel.
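The exact contents are in the Procfile fetched in the steps below; as a rough sketch of its shape (placeholder URL, not the real one), it is nine entries of the form:

```
test1: wget -O /dev/null https://example.com/700MB-file.bin
test2: wget -O /dev/null https://example.com/700MB-file.bin
...
test9: wget -O /dev/null https://example.com/700MB-file.bin
```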

Debian Slim Container

Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/debian/debian:bullseye-slim --command -- bash

apt-get update && apt-get install -y wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile

Alpine Container

Launch a container: kubectl run -it --rm ipv6-reset-test-alpine --image public.ecr.aws/docker/library/alpine:3.19.1 --command -- ash

apk add wget # use non-busybox wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile

Anything else we need to know?:

Environment is a dual-stack IPv4/IPv6 VPC. We've been able to reproduce this on nodes in both public and private subnets.
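A quick sanity check (illustrative; pod name taken from the repro steps above) that the test pods are actually getting IPv6 addresses:

```bash
# The IP column should show an IPv6 address for pods in an IPv6 EKS cluster.
kubectl get pod ipv6-reset-test-debian -o wide
# List the pod's IPv6 addresses without needing iproute2 inside the container.
kubectl exec ipv6-reset-test-debian -- cat /proc/net/if_inet6
```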

Environment:
Kubernetes Versions:

  • 1.28.5 (eks.7) w/ kube-proxy v1.28.2-eksbuild.2
  • 1.29.0 (eks.1) w/ kube-proxy v1.29.0-eksbuild.2

Reproduced across AL2/Ubuntu/Bottlerocket (EKS Managed Nodegroups) with the following kernel versions:

  • AL2: 5.10.209-198.858.amzn2.aarch64 / 5.10.209-198.858.amzn2.x86_64
  • Ubuntu 22: 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP
  • Ubuntu 20: 5.15.0-1048-aws #53~20.04.1-Ubuntu SMP
  • Bottlerocket: 1.18.0-7452c37e, 1.19.2-29cc92cc

Reproduced on AWS VPC CNI versions:

  • v1.16.3-eksbuild.2
  • v1.15.1-eksbuild.1

Instance types used:

  • m6g.xlarge
  • c6g.xlarge
  • m7a.8xlarge
  • m6a.8xlarge

jdn5126 (Contributor) commented Mar 4, 2024

@chen-anders I suggest filing an AWS support case here, as the complexity of this issue will likely require debug sessions and cluster access.

In the meantime, I recommend collecting the node logs from the AL2 reproduction by executing the following bash script: https://github.com/awslabs/amazon-eks-ami/blob/main/log-collector-script/linux/eks-log-collector.sh
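A minimal invocation sketch (assuming the raw URL mirrors the repo path above):

```bash
# Download and run the EKS log collector as root on the affected worker node.
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh
# The script prints the location of the resulting log tarball (typically under /var/log).
```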

chen-anders (Author) commented

Hi @jdn5126,

We don't have an AWS support plan that would allow us to file a technical support case. In the meantime, I'm going to be working with the team to try to get the requisite logs you're asking for. The production workload currently runs on Bottlerocket OS - is there a similar log collector script we can use there?

jdn5126 (Contributor) commented Mar 6, 2024

> Hi @jdn5126,
>
> We don't have an AWS support plan that would allow us to file a technical support case. In the meantime, I'm going to be working with the team to try to get the requisite logs you're asking for. The production workload currently runs on Bottlerocket OS - is there a similar log collector script we can use there?

I see that Bottlerocket has a section on logs: https://github.com/bottlerocket-os/bottlerocket#logs, but it does not look like it collects everything that we would need. I wonder if we can use the same strategy laid out there to execute the EKS AMI bash script.
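For reference, a rough sketch of the Bottlerocket-native path described in that README section (exact steps may vary by version):

```bash
# From the Bottlerocket admin container, drop into the host namespace and run logdog.
sudo sheltie
logdog
# Per the Bottlerocket docs, the archive is written to:
#   /var/log/support/bottlerocket-logs.tar.gz
# Copy it off the node (e.g. with scp) before sharing.
```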

github-actions bot commented May 6, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions bot added the stale label on May 6, 2024

acj commented May 7, 2024

Sorry for the delay on our end. We're still planning to collect and share logs.


acj commented May 29, 2024

We've repeated our tests over the past few days and are not able to repro the download stall anymore. We haven't made any related changes to our infrastructure and are still puzzled by the behavior.

A few notes for anyone who might run into the same problem:

  • Downloads seemed to stall more frequently on Bottlerocket- and Ubuntu-based EKS worker nodes than on AL2-based ones
  • We think we were able to repro the issue (it was very similar, at least) in March on bare EC2 instances running Ubuntu, so it's unclear whether this was a VPC CNI issue at all
  • The stalls seemed vaguely correlated with network load, happening somewhat more frequently when load was heavy

Hopefully this is resolved. Thanks for your help!

github-actions bot removed the stale label on Jul 26, 2024

github-actions bot commented

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions bot added the stale label on Sep 24, 2024

github-actions bot commented Oct 9, 2024

Issue closed due to inactivity.

github-actions bot closed this as not planned on Oct 9, 2024