Added alerts for failing connectors and tasks #10315
Thanks for the PR.
Given that users can enable auto-restarting of failed connectors and tasks, the message should be adjusted so that it does not suggest the failure will not recover, since you cannot know that.
You should also not make any changes to the examples folder, as it contains only the released files.
ee862ef to 943d27b
Thank you for the feedback! I suppose that if there is a mechanism for auto-restarting failed connectors and tasks, then maybe the alert should wait. I have read this article: https://strimzi.io/blog/2023/01/25/auto-restarting-connectors/ but it is not completely clear to me how long the alert should wait. Is it about 30 minutes? How long is the backoff on the last restart cycle?
As far as I'm concerned, a failure is always a failure, so I do not think there is anything wrong with raising the alerts right away with an improved message. If you want to add some wait, I would maybe wait for the first restart only. Something like 5 minutes should probably cover it.
943d27b to 8de09c0
The state will transition into "restarting", so I added for: 1m to the alerts. That should raise the alert between the first and second restart.
8de09c0 to 52ec612
I think 1 minute will not make much difference. After the connector fails in Connect, the operator will normally need 0-2 minutes to discover that it failed and restart it if needed. So 1 minute would fall right in the middle of that window and give unpredictable results. I would either set it to 2+ minutes to give the operator a chance to recover it, or keep it immediate. It also looks like you opened this based on an old main branch, so I think you will need to rebase it to resolve the conflict in the CHANGELOG (the change should go under the
Also, could you please try to fix the DCO sign-off? The instructions should be under the Details link next to the DCO status.
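The timing argument about the for: duration can be illustrated with a small sketch. This is a deliberately simplified model and not how Prometheus or the operator is implemented: it assumes the failed state lasts exactly as long as the operator needs to discover and restart the connector (somewhere in the 0-2 minute window mentioned above), that the first restart succeeds, and that the alert fires as soon as the condition has held for the whole for: duration. Scrape and rule evaluation intervals are ignored, and the function names are hypothetical.

```python
def alert_fires(recovery_delay_s: float, for_s: float) -> bool:
    """Return True if the alert fires before the connector recovers.

    Assumption: the connector stays failed for exactly recovery_delay_s
    seconds, and the alert fires once the failed state has been held
    for at least for_s seconds.
    """
    return recovery_delay_s >= for_s


def fired_fraction(for_s: float, max_delay_s: int = 120, step_s: int = 10) -> float:
    """Fraction of recovery delays in [0, max_delay_s] that trigger the alert.

    The delays are sampled every step_s seconds to stand in for the
    operator needing anywhere between 0 and 2 minutes to react.
    """
    delays = range(0, max_delay_s + 1, step_s)
    fired = sum(alert_fires(d, for_s) for d in delays)
    return fired / len(delays)


if __name__ == "__main__":
    # Compare an immediate alert, a 1 minute wait, and a wait above 2 minutes.
    for for_s in (0, 60, 150):
        print(f"for: {for_s}s -> fires in {fired_fraction(for_s):.0%} of sampled cases")
```

Under these assumptions an immediate alert always fires and a wait longer than 2 minutes never fires for a failure that the first restart recovers, while for: 1m fires only for some of the sampled delays, which is the unpredictable behaviour described above.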
52ec612 to f611b15
…omatically recovered and need manual intervention. Signed-off-by: Laszlo I. Hunyady <laszlo.istvan.hunyady@gmail.com>
Updated the description of the alerts
Added for: 1m to the alerts to wait a loop of auto restart
Signed-off-by: Laszlo I. Hunyady <laszlo.istvan.hunyady@gmail.com>
Signed-off-by: Laszlo I. Hunyady <laszlo.istvan.hunyady@gmail.com>
f611b15 to cb7afd0
LGTM. Thanks.
Thanks for the PR.
Signed-off-by: Laszlo I. Hunyady <laszlo.istvan.hunyady@gmail.com>
Type of change
Description
Added alerts for failing connectors and tasks, as these cannot be automatically recovered and need manual intervention.