we have no alerts for label_sync failing to run #9121
Comments
Can we migrate it to a
We need to create a cluster key that targets a cluster with trusted credentials, and write automation to ensure that only sig-testing jobs can use it.
/milestone v1.13
Unfortunately the testgrid approach isn't useful for us, and we have the same problem! Maybe we could have a Prometheus alert if we publish per-job failure metrics?
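To make the per-job metrics idea concrete, here is a minimal sketch of what publishing failure counts could look like, rendered in the Prometheus text exposition format using only the Python standard library. The metric name `maintenance_job_failures_total`, the `job_name` label, and the function name are hypothetical and not taken from test-infra. The sketch also assumes job names contain no quotes or backslashes, so no label escaping is done.

```python
from typing import Mapping


def render_failure_metrics(failures: Mapping[str, int]) -> str:
    """Render per-job failure counts in the Prometheus text exposition format."""
    # Hypothetical metric name, chosen only for this illustration.
    name = "maintenance_job_failures_total"
    lines = [
        f"# HELP {name} Failed runs per maintenance job.",
        f"# TYPE {name} counter",
    ]
    # Sort job names so the output is stable between scrapes.
    for job in sorted(failures):
        # "job_name" is used instead of "job" because Prometheus attaches
        # its own "job" label to scraped series.
        lines.append(f'{name}{{job_name="{job}"}} {failures[job]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_failure_metrics({"label_sync": 2, "branchprotector": 0}), end="")
```

An alerting rule could then fire when the counter for a given job increases. Note that a counter like this would not catch a job that hangs and stops reporting altogether.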
/milestone clear
I'm opting to punt this if the testgrid approach isn't good enough. Feel free to add it back in to v1.14 if you're inclined to take the approach you suggested, @stevekuznetsov.
You guys have the background to run it as a periodic now, so I think that should work ...
/milestone v1.14
OK, let's see if we can get to this.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
It's running as a periodic now, after @cjwagner changed it, so alerting on it in testgrid should not be hard?
Already done: test-infra/testgrid/config.yaml Lines 2191 to 2194 in ae2330f
We could probably tighten the alert window now.
/kind cleanup
/area label_sync
ref: #9054
While trying to deploy a new copy of the label_sync cronjob, I noticed that it had hung a few days ago. An invalid label description caused it to hang, and no more copies were run subsequently. How could we alert on this kind of situation?
I'm thinking of a pattern where logs are uploaded to GCS and consumed by testgrid, just like all of our other maintenance jobs, and an alert is set up to fire if we see a failure (or no success within a certain time).
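The alerting condition described above (fire on a failure, or when there has been no success within a certain time) can be sketched roughly as follows. This is an illustration only and not how testgrid implements alerting. `JobRun`, `alert_reason`, and the `max_age` parameter are hypothetical names, and the run history is assumed to have already been read from wherever results are stored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class JobRun:
    """One finished run of a maintenance job (hypothetical record type)."""
    finished: datetime
    passed: bool


def alert_reason(runs: Iterable[JobRun], now: datetime,
                 max_age: timedelta) -> Optional[str]:
    """Return why an alert should fire, or None if the job looks healthy."""
    ordered = sorted(runs, key=lambda r: r.finished)
    if not ordered:
        return "no runs recorded"
    # Case 1: the most recent recorded run failed.
    if not ordered[-1].passed:
        return "latest run failed"
    # Case 2: the latest run passed, but too long ago. This is the case
    # that covers a hung job that has stopped producing new results.
    if now - ordered[-1].finished > max_age:
        return "no success within window"
    return None
```

For example, with a `max_age` of one day, a job whose last successful run finished three days ago would report "no success within window", which is the hung-cronjob situation described in this issue.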