Configuration reloading needs rethinking from error handling perspective #6226
Comments
@open-telemetry/collector-approvers @amdprophet @portertech @Aneurysm9 any thoughts on this?
What is
Oops, sorry. I mean
There are multiple things that you are touching here, and independent of the
Should we treat these things separately?
I think this is fair. I was approaching this from the remote configuration perspective, but you are right, an argument can be made that these are related concerns even with a local file config. For example: what happens if the machine is restarted and the Collector cannot start because the port it tries to listen on is not available? We currently fail the entire Collector. Should this be the case, or might we want behavior more similar to what I described, with retrying, etc.?
BTW, this may also be important if there is no remote configuration enabled but there is status reporting to a remote destination. OpAMP, for example, allows this: you can choose to have status reporting only, even if the agent is not receiving its config from remote sources. OpAMP allows reporting the effective config and the health of the agent. I believe this can still be very valuable even when you are using a local file as the config.
Not saying it is not, but I think these are two separate problems that we need to discuss.
In the retry case, this is mentioned:
How would this work in the following case where the machine is restarted? Would the good config be cached somewhere that persists machine restarts?
It would be really helpful to have a diagram of what happens in all these various cases; it would be great for documentation and would help make sense of all the permutations :)
@codeboten agree. So we need a design proposal with the state machine (maybe) for the case where loading a config fails or the loaded config is invalid. Also related to #5966 /cc @swiatekm-sumo who also talked about this. I feel we need a champion to think about this problem and propose a state diagram (transitions). Once we agree on that, we can discuss how to achieve it in code.
@bogdandrutu @codeboten Here is the proposed transition diagram, assuming we get rid of ReportFatalError: cc @open-telemetry/collector-approvers |
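Since the diagram image is not reproduced here, a rough sketch of the kind of transitions under discussion (the state and event names are my assumption, not the actual proposal) could be encoded as a small table: invalid configs and failed starts park the Collector in a reporting/waiting state instead of exiting the process.

```go
package main

import "fmt"

// State of the config-reload loop. Names are illustrative only.
type State int

const (
	Idle           State = iota // running with a good config
	Validating                  // new config received, being validated
	Applying                    // config valid, restarting the Service
	AwaitingRemedy              // invalid config or failed Start(); report and wait
)

type Event int

const (
	NewConfig Event = iota
	ValidOK
	ValidFail
	StartOK
	StartFail
)

// next encodes the proposed behavior: no transition ever terminates the
// process; failures route to AwaitingRemedy, where the confmap.Provider
// can report the problem and wait for a corrected config.
func next(s State, e Event) State {
	switch {
	case e == NewConfig:
		return Validating
	case s == Validating && e == ValidOK:
		return Applying
	case s == Validating && e == ValidFail:
		return AwaitingRemedy
	case s == Applying && e == StartOK:
		return Idle
	case s == Applying && e == StartFail:
		return AwaitingRemedy
	}
	return s
}

func main() {
	s := Idle
	// A bad config is reported, then a corrected config is applied.
	for _, e := range []Event{NewConfig, ValidFail, NewConfig, ValidOK, StartOK} {
		s = next(s, e)
	}
	fmt.Println(s == Idle) // true
}
```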
I think that restoring the "last known good config" is against all SRE principles (you end up with different, somewhat random configs running in production), so I am OK to have that only if we default to not doing it and exit even in the case of an error when the "last known config" is not empty.
I agree with not restoring the "last known good config". But do we need to exit? We can keep trying to Start() (maybe it is a transient error) and keep waiting for a new config. In the meantime, if the confmap.Provider is using OpAMP, it can report that something bad happened and the operator can intervene. Also, independently of the above, one more thing: you said restoring the last config is bad because you can have different configs running in production, but we are going to have some config differences across the fleet no matter what. This can happen if:
Does this negate the argument against restoring the config, or do you still think it is a bad idea? Anyway, I don't mind eliminating the "restore" step if we believe it is harmful. Maybe it can be an option that the user can select?
It's not exactly the same thing, but it's definitely normal to halt the rollout of a bad config change, and most orchestrators will do so automatically. The end result of that is that we have the old config live even though a new one was rolled out, which is what we're talking about here. I think the argument for making this optional is right though, as the user will typically want their orchestrator to manage this. But if they want to use OTRM without an orchestrator, they definitely don't want the collector to terminate on a bad config.
This was previously raised in comments here.
Currently, if the resolved config is invalid we shut down the `Service` and exit the process. This works fine for Collectors with local configuration sources: failing fast and showing where the problem is when you try to start the Collector is the expected behavior.

On the contrary, the current behavior is ill-suited for remote configuration management implemented via a `confmap.Provider`. Config changes in this case are often unattended (the config is rolled out at scale and no one is typically baby-sitting individual Collectors), and if anything goes wrong during the config update the Collector is seen as suddenly crashing, losing the connection to the remote config source and losing any chance to recover.

Instead of this we may want a more desirable behavior:

- If the config is invalid, do not try to restart the `Service` at all. Instead we may want to be able to report the invalid config to the `confmap.Provider` so that it can report back to the remote config source. This requires validating the config first and only after that trying to restart the `Service`.
- If the `Service.Start()` function fails, we may want to report the failure to the `confmap.Provider` after the failed `Start()`, and restore the last known good config (calling `Service.Start()` again). If `Start()` fails with the last known good config as well (which may happen if some external condition is different), retry it with exponential backoff until it succeeds.

An alternative is to say that `confmap.Provider` is not recommended for use with watchable remote config sources that can change the config while the Collector is unattended. Instead we can recommend that such watchable remote config sources only be used with an external Supervisor process which takes care of restarting failed Collector processes. This is a valid approach if we want to go with it, but we need to be explicit that the current `confmap.Provider` design is not suited for such use cases.