If an error propagates all the way out, bail execution #3009
Conversation
@olemarkus in general I don't like the idea of ignoring an error in one source. Consider the update controller scenario where you run current version A and update to version B. In A all sources M, N, and K work, but in version B source M does not work for some reason.
Is that to increase the visibility of the failure? In our case, where the ambassador host source failed, it broke all the other sources as well. This means it's very easy for one cluster tenant to break external-dns for everyone else. Of course, if the sources behave more nicely, such as ambassador hosts after #3008, then it makes sense to fail faster. Maybe.
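For context, a minimal Go sketch of the aggregation pattern being described, assuming a multi-source wrapper that returns on the first child error; the names and types are illustrative, not the actual external-dns code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type Endpoint struct{ DNSName string }

// Source mirrors the idea of a discovery source (Ingress, ambassador Host, ...).
type Source interface {
	Endpoints(ctx context.Context) ([]*Endpoint, error)
}

type multiSource struct{ children []Source }

// Endpoints returns on the first failing child, so one broken source hides
// every source listed after it; this is the failure mode described above.
func (ms *multiSource) Endpoints(ctx context.Context) ([]*Endpoint, error) {
	var result []*Endpoint
	for _, s := range ms.children {
		endpoints, err := s.Endpoints(ctx)
		if err != nil {
			return nil, err // remaining sources are never consulted
		}
		result = append(result, endpoints...)
	}
	return result, nil
}

// brokenSource stands in for something like the failing ambassador host source.
type brokenSource struct{}

func (brokenSource) Endpoints(ctx context.Context) ([]*Endpoint, error) {
	return nil, errors.New("ambassador host: cannot list hosts")
}

func main() {
	ms := &multiSource{children: []Source{brokenSource{}}}
	if _, err := ms.Endpoints(context.Background()); err != nil {
		fmt.Println("all sources skipped:", err)
	}
}
```

With this shape, logging inside the loop and continuing would isolate the broken source instead of hiding all of them, which is the trade-off being debated here.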
If you, for example, deploy a new version of external-dns, it should either work or break, but it should not delete records if one source cannot be fetched, which would likely break half of the cluster ingress.
I could agree if it did crashloop. But it doesn't. It just logs and restarts the iteration. It may very well be that one should make this fatal instead.
@olemarkus great catch; that one should be a fatal!
How do we proceed with this?
@olemarkus do you mean the change to fatal?
Either this PR or change to fatal, yes.
Fatal would be preferred |
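To make the two options concrete, here is a hedged sketch of the control loop in question, assuming a RunOnce-style reconcile pass; it mirrors the idea of the change, not the actual external-dns controller code:

```go
package main

import (
	"context"
	"errors"
	"time"

	log "github.com/sirupsen/logrus"
)

// runOnce stands in for a single reconcile pass over sources and provider.
func runOnce(ctx context.Context) error {
	return errors.New("source failure that propagated all the way out")
}

func main() {
	ctx := context.Background()
	for {
		if err := runOnce(ctx); err != nil {
			// Before this PR: log the error and fall through to the next
			// iteration, e.g. log.Error(err).
			log.Fatal(err) // after: bail execution so the failure crashloops visibly
		}
		time.Sleep(time.Minute)
	}
}
```

The difference is only one call, but the operational behavior changes completely: log.Error hides the failure in the pod logs, while log.Fatal surfaces it as a pod restart.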
(Commits updated: force-pushed from 352cf68 to 07dc39a)
Done.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: olemarkus, szuecs

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
This seems to be partially responsible for #3754. I think this change should be reverted. What's the point of a reconciling controller if it exits and crashes on every error it encounters?
Agreed, this completely breaks the controller. So if any error occurs, the controller just blows up, after never doing that since its inception? Terrible idea.
As mentioned in the first comment, my original PR was to log and carry on, but this approach was preferred instead. In the long run I do think this approach is better, but it may be a bit painful in the beginning, as some sources/providers may propagate errors on what is actually an acceptable path.
Sorry, I'm confused... how is this approach better? As far as I can tell, the controller is completely broken and non-functional now, as ANY issue with creating DNS records causes the entire thing to blow up. We have a ton of clusters that have multiple teams and multiple apps on them. As it is today, if one batch of records is erroring, it doesn't affect anything else; other teams can still deploy.

In an attempt to make an analogy: imagine if AWS functioned so that if someone created an S3 bucket with a bad bucket policy, no one could then create an S3 bucket, because the S3 console would crash and become non-functional. How would that ever be a good idea?
I agree, fail fast is great for a single request, but in the case of a live-forever controller it would make more sense to log and continue. There's no guarantee that all good resources will be serviced before the bad resource forces an exit; the controller is orphaning good resources. I will be downgrading my external-dns and not participating in this flavor of execution. Please consider making the fail-fast behavior opt-out through a command-line argument.
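For illustration, a hypothetical sketch of the opt-out being requested: gating the fail-fast behavior behind a command-line flag. The flag name and wiring are invented here; external-dns does not necessarily expose such an option.

```go
package main

import (
	"errors"
	"flag"
	"log"
)

// failFast is a hypothetical flag; defaulting to true matches the behavior
// introduced by this PR, while --fail-fast=false restores log-and-continue.
var failFast = flag.Bool("fail-fast", true,
	"exit on control-loop errors instead of logging and continuing (hypothetical)")

func handleLoopError(err error) {
	if *failFast {
		log.Fatalf("control loop error, exiting: %v", err)
	}
	log.Printf("control loop error, continuing: %v", err)
}

func main() {
	flag.Parse()
	handleLoopError(errors.New("example source failure"))
}
```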
I just wanted to add my two cents on this. We are using AWS EKS and Route53. I deployed a set of Kubernetes Ingress objects with annotations to create records, and external-dns created them without issue. However, on the next run of updates, the external-dns pod kept crashing with a fatal error complaining that those same records already exist. I wasted hours troubleshooting this problem before finding this PR. The external-dns controller should not even be complaining about records already existing, let alone completely crashing, especially when it created those records in the first place. This is a bad design choice, and this PR should either be reverted entirely or reworked so that the controller does not explode when it finds records that it created. I have pinned my external-dns chart version to v1.12.2 until this bug is fixed.
Description
Related to #3008: when external-dns is configured with multiple sources and one of them fails, all other sources are skipped as well. The original version of this PR logged and continued rather than aborting completely; after review, the approach was changed to the opposite at the top level: if an error propagates all the way out to the control loop, log and exit rather than log and carry on.