Description
We're currently building out a CI system (https://github.com/llvm/llvm-zorg/tree/main/premerge) that uses GitHub ARC on a GKE setup with autoscaling. GitHub ARC has a feature where jobs can dynamically specify the containers they want to use, which ARC then spawns as additional pods. The main pod then executes commands in these additional pods using the API-call equivalent of kubectl exec.
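Roughly, the exec in question looks something like the following. This is just a sketch for illustration (pod, namespace, container, and command names are placeholders, and it uses the official Python client rather than ARC's actual implementation):

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
api = client.CoreV1Api()

# Equivalent of: kubectl exec -n arc-runners job-container-pod -c job-container -- /bin/sh -c 'run-build-step.sh'
resp = stream(
    api.connect_get_namespaced_pod_exec,
    "job-container-pod",          # placeholder pod name
    "arc-runners",                # placeholder namespace
    container="job-container",    # placeholder container name
    command=["/bin/sh", "-c", "run-build-step.sh"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
print(resp)
```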
We were running into an issue where, if a job ran a long-running command (30+ minutes) and another node was scaled down (stopping the konnectivity-agent pod on it), the exec would return as if everything had succeeded even though the command had not actually finished executing. The timing of the konnectivity-agent pod deletions lines up best with when this happens, and it seems a likely candidate for the root cause given that it manages the connection with the Kubernetes control plane (from my understanding). I'm not entirely sure whether this is just a red herring, though.
We can try to assist with a minimal reproducer if needed (a rough sketch of the idea is below), although it occurred quite rarely and only while our cluster was processing quite a few jobs, so I'm not sure whether there are reproduction criteria beyond autoscaling and an active exec command. If this isn't the right forum for this type of issue, please let us know and we'll move it elsewhere.
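The rough shape of the reproducer we have in mind (untested sketch, placeholder names): start an exec that only writes a sentinel file once it truly completes, trigger a node scale-down mid-exec, and then check whether the exec reported success without the sentinel existing.

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
api = client.CoreV1Api()

POD, NS, CONTAINER = "repro-pod", "default", "main"  # placeholders

# Long-running command that outlives the autoscaler's scale-down of another node;
# the sentinel file is only created if the command actually ran to completion.
stream(
    api.connect_get_namespaced_pod_exec,
    POD, NS, container=CONTAINER,
    command=["/bin/sh", "-c", "sleep 2400 && touch /tmp/exec-finished"],
    stderr=True, stdin=False, stdout=True, tty=False,
)

# The exec above returned without error; verify the sentinel is actually there.
# In the failure mode described above, it is missing despite the apparent success.
check = stream(
    api.connect_get_namespaced_pod_exec,
    POD, NS, container=CONTAINER,
    command=["/bin/sh", "-c", "test -f /tmp/exec-finished && echo done || echo MISSING"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
print(check.strip())
```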
We have some additional context in b/389220221 for those with access.