
Long-running exec calls prematurely exiting when other nodes scale down #748


Description

@boomanaiden154

We're currently building out a CI system (https://github.com/llvm/llvm-zorg/tree/main/premerge) that uses GitHub ARC on GKE with autoscaling. GitHub ARC has a feature where jobs can dynamically specify the containers they want to use, which ARC then spawns as additional pods. The main pod then executes commands in these additional pods using the API equivalent of kubectl exec.
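For concreteness, the exec in question is the pods/exec subresource that kubectl exec wraps. Below is a minimal client-go sketch of how such a call is typically driven; the namespace, pod, container, and command are placeholders for illustration, not the actual ARC code path:

```go
// Minimal sketch (not the exact ARC code): driving the pods/exec
// subresource with client-go, which is what kubectl exec wraps.
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

func execInPod(cfg *rest.Config, client kubernetes.Interface, namespace, pod, container string, command []string) error {
	// Build a POST against the pod's "exec" subresource, the same API
	// call that kubectl exec issues.
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return err
	}
	// StreamWithContext blocks until the remote process exits or the
	// underlying connection is torn down. The issue described here is
	// this call returning success before the command had actually finished.
	return exec.StreamWithContext(context.Background(), remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	})
}

func main() {
	// The in-cluster config and the names below are hypothetical,
	// purely to make the sketch runnable.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	cmd := []string{"/bin/sh", "-c", "long-running-build-step"}
	if err := execInPod(cfg, client, "arc-runners", "job-pod", "build", cmd); err != nil {
		panic(err)
	}
}
```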

We were running into issues where, if we had a long-running command (30+ minutes) and another node got scaled down (stopping the konnectivity-agent pod on it), the exec would return as if everything had succeeded without the command actually finishing. The timing of the konnectivity-agent pod deletions lines up best with when this happens, and it seems a likely candidate for the actual issue given that it manages the connection with the k8s control plane (to my understanding). I'm not entirely sure whether this is just a red herring, though.

We can try to assist with a minimal reproducer if needed, although the issue occurred quite rarely and only while our cluster was processing quite a few jobs, so I'm not sure whether there are reproduction criteria beyond autoscaling plus an active exec command. If this isn't the right forum for this type of issue, please let us know and we'll move it elsewhere.

We have some additional context in b/389220221 for those with access.
