bug(kernel): process hangs due to io_uring #2070

Closed

dbarrosop opened this issue Nov 21, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@dbarrosop

dbarrosop commented Nov 21, 2024

What happened:

In our staging environment we have Karpenter configured to roll out new AMIs as they are released, so we can detect issues before upgrading production. After rolling out new nodes with the AMI amazon-eks-node-al2023-arm64-standard-1.29-v20241115 (image ID ami-068228c3265014a86), workloads with CPU limits break. Reverting the AMI to the previous version amazon-eks-node-al2023-arm64-standard-1.29-v20241109 (image ID ami-0f12332f7a73b222b) fixes the problem.

What you expected to happen:

Upgrading shouldn't break workloads.

How to reproduce it (as minimally and precisely as possible):

Environment:

  • AWS Region: eu-central-1
  • Instance Type(s): t4g/c6g/c6gd
  • Cluster Kubernetes version: v1.29.8-eks-a737599
  • Node Kubernetes version: v1.29.10-eks-94953ac
  • AMI Version: amazon-eks-node-al2023-arm64-standard-1.29-v20241115 (ami-068228c3265014a86)

You can replicate the issue with the following steps (there may be easier ways, but this is close to how we noticed it, since we use these nodes to build software; you can probably ignore or remove the affinities and tolerations):

apiVersion: v1
kind: Pod
metadata:
  name: deleteme17
  namespace: some-namespace
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nhost-nodegroup
            operator: In
            values:
            - build-node
  containers:
  - command:
    - sleep
    - "100000000000000"
    image: ubuntu:noble-20241015
    imagePullPolicy: Always
    name: deleteme
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 1Gi
  tolerations:
  - effect: NoSchedule
    key: nhost-nodegroup
    value: build-node
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
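
One way to apply this manifest and open a shell in the pod, assuming kubectl access to the cluster and that the manifest above is saved as pod.yaml (this step is not spelled out in the original report; the pod name and namespace are taken from the manifest):

kubectl apply -f pod.yaml
kubectl exec -it deleteme17 -n some-namespace -- bash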

Then, from a shell inside the pod, run:

mkdir asd
cd asd
apt-get update
apt-get install -y npm nodejs
npm install express --loglevel verbose

What you should notice is that the installation eventually just freezes. Reverting to the previous AMI makes it work again. Removing the CPU limits also fixes the issue, which might be a clue as to where the problem lies.

Thanks!

@dbarrosop added the bug label Nov 21, 2024
@bryantbiggs
Contributor

This is an issue in the upstream Linux kernel; see amazonlinux/amazon-linux-2023#840 (comment) for details and a possible interim workaround.

@cartermckinnon cartermckinnon changed the title bug(amazon-eks-node-al2023-arm64-standard-1.29-v20241115): AMI breaks workloads bug(kernel): npm hangs due to io_uring Nov 21, 2024
@cartermckinnon cartermckinnon changed the title bug(kernel): npm hangs due to io_uring bug(kernel): process hangs due to io_uring Nov 21, 2024
@cartermckinnon
Member

@dbarrosop Setting the environment variable UV_USE_IO_URING=0 will mitigate this for npm. The fixed kernel will be included in an AMI release as soon as it’s available 👍
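
For anyone applying that workaround declaratively, a minimal sketch of where the variable could go in the repro pod spec above (the container name and image are reused from that manifest; this snippet is an illustration, not part of the original comment):

  containers:
  - name: deleteme
    image: ubuntu:noble-20241015
    env:
    # UV_USE_IO_URING=0 tells libuv (which Node.js and npm run on) not to use io_uring
    - name: UV_USE_IO_URING
      value: "0"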

@dbarrosop
Author

Thanks a lot to both of you for the information :)

@infakt-HNP

Thank you, this also applies to yarn install. You saved us a lot of time.

@sylr
Contributor

sylr commented Dec 11, 2024

It would be nice if the team could release new AL2023 AMIs that ship the recently released kernel-6.1.119-129.201.amzn2023.

See: amazonlinux/amazon-linux-2023#840 (comment)

@sylr
Contributor

sylr commented Dec 11, 2024

For information, the latest official AL2023 EKS images (v20241205) still ship 6.1.115-126.197.amzn2023.x86_64.
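
For reference, one quick way to check which kernel each node is running is via the Node status (kernelVersion is a standard field under status.nodeInfo; this command is not from the original comment):

kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion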

@cartermckinnon
Member

@sylr there's a release in progress now that will address this: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20241213

The new AMIs will be available in all regions within 24 hours 👍

@ndbaker1
Member

Thanks all, the release for v20241213 with kernel-6.1.119-129.201.amzn2023 is complete 🚀
Closing this issue out.
