Description
We're currently building out a CI system (https://github.com/llvm/llvm-zorg/tree/main/premerge) that uses GitHub ARC on a GKE setup with autoscaling. GitHub ARC has a feature where jobs can dynamically specify the containers they want to use, which ARC then spawns as additional pods. The main pod then executes commands in these additional pods using the API-call equivalent of kubectl exec.
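Roughly, the exec in question looks something like the following. This is just a sketch for illustration (pod, namespace, container, and command names are placeholders, and it uses the official Python client rather than ARC's actual implementation):

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
api = client.CoreV1Api()

# Equivalent of: kubectl exec -n arc-runners job-container-pod -c job-container -- /bin/sh -c 'run-build-step.sh'
resp = stream(
    api.connect_get_namespaced_pod_exec,
    "job-container-pod",          # placeholder pod name
    "arc-runners",                # placeholder namespace
    container="job-container",    # placeholder container name
    command=["/bin/sh", "-c", "run-build-step.sh"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
print(resp)
```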
We were running into an issue where, if a job ran a long-running command (30+ minutes) and another node was scaled down (stopping the konnectivity-agent pod on it), the exec would return as if everything had succeeded even though the command had not actually finished executing. The timing of the konnectivity-agent pod deletions lines up best with when this happens, and it seems a likely candidate for the root cause given that it manages the connection with the Kubernetes control plane (from my understanding). I'm not entirely sure whether this is just a red herring, though.
We can try to assist with a minimal reproducer if needed (a rough sketch of the idea is below), although it occurred quite rarely and only while our cluster was processing quite a few jobs, so I'm not sure whether there are reproduction criteria beyond autoscaling and an active exec command. If this isn't the right forum for this type of issue, please let us know and we'll move it elsewhere.
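The rough shape of the reproducer we have in mind (untested sketch, placeholder names): start an exec that only writes a sentinel file once it truly completes, trigger a node scale-down mid-exec, and then check whether the exec reported success without the sentinel existing.

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
api = client.CoreV1Api()

POD, NS, CONTAINER = "repro-pod", "default", "main"  # placeholders

# Long-running command that outlives the autoscaler's scale-down of another node;
# the sentinel file is only created if the command actually ran to completion.
stream(
    api.connect_get_namespaced_pod_exec,
    POD, NS, container=CONTAINER,
    command=["/bin/sh", "-c", "sleep 2400 && touch /tmp/exec-finished"],
    stderr=True, stdin=False, stdout=True, tty=False,
)

# The exec above returned without error; verify the sentinel is actually there.
# In the failure mode described above, it is missing despite the apparent success.
check = stream(
    api.connect_get_namespaced_pod_exec,
    POD, NS, container=CONTAINER,
    command=["/bin/sh", "-c", "test -f /tmp/exec-finished && echo done || echo MISSING"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
print(check.strip())
```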
We have some additional context in b/389220221 for those with access.