Remediate intermittent deploy failures/timeout on prometheus #122

zachariahmiller · 2024-01-23T15:24:39Z

Describe what should be investigated or refactored

Occasionally in local dev, and frequently in CI, Prometheus fails to deploy successfully, and the deployment times out unless someone intervenes manually.

Additional context

This may be related to Pepr's termination of the Istio sidecar container on jobs, with the admission/patch jobs not getting killed, but we don't always get good feedback in CI as to whether this is the case.

corang · 2024-02-01T19:04:49Z

Just encountered this myself: the Istio sidecar isn't terminating correctly on the kube-prometheus patch job (or something like it). The job container is done, but the sidecar won't die.

zachariahmiller · 2024-02-01T21:05:53Z

Yeah, that's the assumption. Pepr is supposed to terminate that job. It's an edge case, but I'm not totally sure whether it's the watch getting dropped or something in the actual job termination code.

corang · 2024-02-05T18:17:12Z

I do want to say that after about 8 minutes the sidecar did eventually terminate, so it's like Pepr is getting hung up on something.

docandrew · 2024-05-04T06:05:02Z

I'm running into this as well in the monitoring namespace when trying UDS Core on RKE2. I had to manually kubectl debug into the pod and kill istio-proxy with the magic /quitquitquit URL to get the UDS Core deployment to continue.
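For anyone else who needs the same manual workaround, here is a minimal sketch of the request involved. It assumes a default Istio install where the sidecar agent listens on port 15020 and accepts a POST to /quitquitquit from localhost; the function names are hypothetical and nothing here comes from UDS Core or Pepr itself.

```python
import urllib.request

# Assumption: 15020 is the sidecar agent port in a default Istio install.
AGENT_PORT = 15020


def build_quit_request(host: str = "127.0.0.1", port: int = AGENT_PORT) -> urllib.request.Request:
    """Build the POST request that asks istio-proxy to exit."""
    url = f"http://{host}:{port}/quitquitquit"
    # An empty body is sent; the endpoint is assumed to take no payload.
    return urllib.request.Request(url, data=b"", method="POST")


def quit_sidecar(host: str = "127.0.0.1", port: int = AGENT_PORT, timeout: float = 5.0) -> int:
    """Send the quit request and return the HTTP status code."""
    with urllib.request.urlopen(build_quit_request(host, port), timeout=timeout) as resp:
        return resp.status
```

Calling `quit_sidecar()` only makes sense from a process that shares the stuck pod's network namespace, such as the debug container mentioned above. It is a stopgap for unblocking a deploy, not a fix for whatever keeps the sidecar alive.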

mjnagel · 2024-05-28T14:45:16Z

Should be resolved by #419 - leaving open until that is released and we have feedback though.

mjnagel · 2024-06-20T17:40:57Z

Tentatively closing. Please reopen or create a new issue if you encounter this problem in the latest core versions.

zachariahmiller added the possible-bug label (Something may not be working) · Jan 23, 2024

mjnagel added the ci label (Issues pertaining to CI / Pipelines / Testing) · Mar 28, 2024

mjnagel closed this as completed Jun 20, 2024
