IPv6 containers experience connectivity issues with large simultaneous file downloads #2817

chen-anders · 2024-02-29T16:35:42Z

What happened:

Large simultaneous downloads stall and eventually fail with a "connection reset by peer" error. Sometimes we also see TLS connection errors and DNS resolution errors, which cause some downloads to fail immediately.

These errors only affect downloads from IPv6 servers/endpoints. IPv4 works perfectly fine.
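One way to check the IPv4/IPv6 difference on a single endpoint is to force the address family per request. The following is a minimal sketch, assuming GNU wget (whose `-4` and `-6` options restrict connections to IPv4 or IPv6 respectively). The helper name `compare_families` and the example URL are made up for illustration and are not part of the original report.

```shell
# Try the same download over IPv4 and then IPv6 and report each result.
# compare_families is a hypothetical helper; -4/-6 are GNU wget's
# --inet4-only/--inet6-only options.
compare_families() {
  url="$1"
  for family in 4 6; do
    if wget "-$family" -q -O /dev/null "$url"; then
      echo "IPv$family: ok"
    else
      # In the else branch, $? still holds wget's exit status.
      echo "IPv$family: failed (wget exit status $?)"
    fi
  done
}

# Example (needs network access): compare_families "https://example.com/"
```

A single small request may well succeed over both families, since the stalls described here showed up under large parallel downloads.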

Example error output

Sometimes we see errors around establishing connections over HTTPS:

test9 | Connecting to embed-ssl.wistia.com (embed-ssl.wistia.com)|2600:9000:244d:7800:1e:c86:4140:93a1|:443... connected.
test9 | Unable to establish SSL connection.
test9 | exit status 4

test3 | Resolving embed-ssl.wistia.com (embed-ssl.wistia.com)... failed: Try again.
test3 | wget: unable to resolve host address 'embed-ssl.wistia.com'
test3 | exit status 4
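For readers decoding the output above: the `exit status 4` lines appear to report the exit status of the failed wget process, and the GNU wget manual documents exit status 4 as a network failure. That matches both the TLS and the DNS failures shown. A minimal lookup sketch, assuming GNU wget's documented exit codes (the function name `wget_exit_meaning` is hypothetical):

```shell
# Map a GNU wget exit status to the meaning given in the wget manual.
# wget_exit_meaning is a hypothetical helper name, not part of wget.
wget_exit_meaning() {
  case "$1" in
    0) echo "no problems occurred" ;;
    1) echo "generic error" ;;
    2) echo "parse error" ;;
    3) echo "file I/O error" ;;
    4) echo "network failure" ;;
    5) echo "SSL verification failure" ;;
    6) echo "authentication failure" ;;
    7) echo "protocol error" ;;
    8) echo "server issued an error response" ;;
    *) echo "unknown exit status" ;;
  esac
}

wget_exit_meaning 4   # prints "network failure"
```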

We host-mounted the CNI logs on the hosts where we ran the tests, but didn't see any related log entries during testing.

What you expected to happen:

Downloads complete without connection errors

How to reproduce it (as minimally and precisely as possible):

We have a Procfile that runs 9 downloads of a 700MB file in parallel.
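To give a rough idea of the shape of such a Procfile, here is a minimal sketch that generates one with N parallel wget entries. It is not the actual Procfile from the wistia/eks-ipv6-reset-example repository: the helper name `make_procfile`, the placeholder URL, and the `-O /dev/null` flag are assumptions for illustration. The `testN` process names mirror the prefixes seen in the error output.

```shell
# Print a Procfile with N parallel wget entries (test1..testN) for one URL.
# make_procfile, the URL, and the wget flags are illustrative assumptions;
# the real Procfile lives in the wistia/eks-ipv6-reset-example repository.
make_procfile() {
  count="$1"
  url="$2"
  i=1
  while [ "$i" -le "$count" ]; do
    # Procfile format: one "name: command" entry per line.
    echo "test$i: wget -O /dev/null $url"
    i=$((i + 1))
  done
}

make_procfile 9 "https://example.com/large-file.bin"
```

Redirecting the output to a file named Procfile would produce something that can be passed to hivemind in the same way as in the steps below.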

Debian Slim Container

Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/debian/debian:bullseye-slim --command -- bash

apt-get update && apt-get install -y wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile

Alpine Container

Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/docker/library/alpine:3.19.1 --command -- ash

apk add wget # use non-busybox wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile

Anything else we need to know?:

Environment is a dual-stack IPv4/IPv6 VPC. We've been able to reproduce this on nodes in both public and private subnets.

Environment:
Kubernetes Versions:

1.28.5 (eks.7) w/ kube-proxy v1.28.2-eksbuild.2
1.29.0 (eks.1) w/ kube-proxy v1.29.0-eksbuild.2

Reproduced across AL2, Ubuntu, and Bottlerocket nodes launched via EKS Managed Nodegroups, with the following kernel/OS versions:

AL2: 5.10.209-198.858.amzn2.aarch64 / 5.10.209-198.858.amzn2.x86_64
Ubuntu 22: 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP
Ubuntu 20: 5.15.0-1048-aws #53~20.04.1-Ubuntu SMP
Bottlerocket: 1.18.0-7452c37e, 1.19.2-29cc92cc

Reproduced on AWS VPC CNI versions:

v1.16.3-eksbuild.2
v1.15.1-eksbuild.1

Instance types used:

m6g.xlarge
c6g.xlarge
m7a.8xlarge
m6a.8xlarge


jdn5126 · 2024-03-04T17:29:55Z

@chen-anders I suggest filing an AWS support case, as the complexity of this issue will likely require debug sessions and cluster access.

In the meantime, I recommend collecting the node logs from the AL2 reproduction by executing the following bash script: https://github.com/awslabs/amazon-eks-ami/blob/main/log-collector-script/linux/eks-log-collector.sh
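A minimal sketch of what running the collector on an AL2 node could look like, assuming the node has `curl` and `sudo` available. The wrapper name `collect_eks_logs` is hypothetical, and the raw download URL is derived from the GitHub link above rather than taken from this thread.

```shell
# Download and run the EKS log collector script on an AL2 node.
# collect_eks_logs is a hypothetical wrapper; the raw URL below is derived
# from the GitHub blob link and is an assumption.
collect_eks_logs() {
  script_url="https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh"
  # -f fails on HTTP errors, -sS stays quiet but shows errors, -L follows redirects.
  curl -fsSL -o eks-log-collector.sh "$script_url" || return 1
  sudo bash eks-log-collector.sh
}

# Example (run on the node itself): collect_eks_logs
```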

chen-anders · 2024-03-06T09:45:07Z

Hi @jdn5126,

We don't have an AWS support plan that would allow us to file a technical support case. In the meantime, I'm going to be working with the team to try to get the requisite logs you're asking for. The production workload currently runs on Bottlerocket OS - is there a similar log collector script we can use there?

jdn5126 · 2024-03-06T16:09:01Z

I see that Bottlerocket has a section on logs (https://github.com/bottlerocket-os/bottlerocket#logs), but it does not look like it collects everything that we would need. I wonder if we can use the same strategy laid out there to execute the EKS AMI bash script.

github-actions · 2024-05-06T00:03:29Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

acj · 2024-05-07T13:03:00Z

Sorry for the delay on our end. We're still planning to collect and share logs.

acj · 2024-05-29T19:16:44Z

We've repeated our tests over the past few days and are not able to repro the download stall anymore. We haven't made any related changes to our infrastructure and are still puzzled by the behavior.

A few notes for anyone who might run into the same problem:

Downloads seemed to stall more frequently on Bottlerocket- and Ubuntu-based EKS worker nodes than on AL2-based ones
We think we were able to repro the issue (it was very similar, at least) in March on bare EC2 instances running Ubuntu, so it's unclear whether this was a VPC CNI issue at all
The stalls seemed vaguely correlated with network load, happening somewhat more frequently when load was heavy

Hopefully this is resolved. Thanks for your help!

github-actions · 2024-09-24T00:04:05Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions · 2024-10-09T00:03:59Z

Issue closed due to inactivity.

chen-anders added the bug label on Feb 29, 2024

github-actions bot added the stale label on May 6, 2024

github-actions bot removed the stale label on Jul 26, 2024

github-actions bot added the stale label on Sep 24, 2024

github-actions bot closed this as not planned on Oct 9, 2024
