Bug 1816904: Ensure the Installer checks for the correct number of compute replicas before exiting #4071
Conversation
@kirankt: This pull request references Bugzilla bug 1816904, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/label platform/baremetal
/assign @abhinavdahiya
Personally I don't think it is the installer's responsibility to check whether the compute nodes are running, or how many.
We should probably work with machine-api-operator to track a condition on its ClusterOperator that reports the machines that haven't yet become nodes; the installer can then surface that condition as part of the error message when waiting for cluster completion fails.
If cluster completion succeeds, worker nodes that are still provisioning or haven't provisioned were not required for install-complete, so it's weird for the installer to prompt about them; the cluster already has an alert for such machines in those cases.
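For illustration only, here is a sketch of how the installer could read such a condition off the machine-api ClusterOperator via openshift/client-go. The condition type `MachinesNotYetNodes` is invented for this example, since machine-api reports no such condition today, and the kubeconfig path is a placeholder:

```go
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Sketch wiring only: build a client from a local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "machine-api", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cond := range co.Status.Conditions {
		// "MachinesNotYetNodes" is a hypothetical condition type. If machine-api
		// reported it, the installer could append cond.Message to its own
		// timeout error instead of counting nodes itself.
		if cond.Type == configv1.ClusterStatusConditionType("MachinesNotYetNodes") &&
			cond.Status == configv1.ConditionTrue {
			fmt.Println("machines pending:", cond.Message)
		}
	}
}
```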
Hi, thanks for the feedback. I was wondering if there is an example of how we currently use a ClusterOperator in the installer.
Instead of an error, can we just print a warning before we call logComplete? In the newest commit I'm now printing a warning rather than an error, so it's not too disruptive. WDYT?
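A minimal sketch of that warning approach, assuming a standard client-go clientset and a `desiredCompute` count already parsed from the install config; the function name is illustrative, not the PR's actual diff:

```go
package replicacheck

import (
	"context"

	"github.com/sirupsen/logrus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// warnIfComputeShort logs a warning (not an error) when fewer worker nodes
// exist than the install config requested, leaving the exit path untouched.
func warnIfComputeShort(ctx context.Context, client kubernetes.Interface, desiredCompute int64) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/worker",
	})
	if err != nil {
		logrus.Warnf("could not list worker nodes: %v", err)
		return
	}
	if got := int64(len(nodes.Items)); got < desiredCompute {
		logrus.Warnf("install config requested %d compute replicas, but only %d worker nodes are present", desiredCompute, got)
	}
}
```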
Everyone points to everyone else as responsible for handling this 🤷♂️ We've previously discussed this on openshift/enhancements#328 (comment)
@abhinavdahiya What do you think of the new changes? We're just printing a warning message before exiting. IMO it's pretty harmless and will not really affect other platforms.
/retest
/retest |
@kirankt: The following tests failed, say /retest to rerun all failed tests.
@abhinavdahiya Ideally we would like to exit with a non-zero return code rather than just a warning, which is what this PR is currently doing. Also, I've ensured that this affects only the baremetal platform. PTAL and let me know what you think.
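For concreteness, a hedged sketch of the difference between the two behaviors being discussed; the function name and `strict` flag are inventions for illustration, not the PR's actual code:

```go
package replicacheck

import (
	"errors"
	"fmt"

	"github.com/sirupsen/logrus"
)

// checkComputeReplicas compares ready workers against the requested count.
// In strict mode it returns an error, so the CLI can exit non-zero; otherwise
// it only warns and the install is still reported as complete.
func checkComputeReplicas(nodesReady, desired int64, strict bool) error {
	if nodesReady >= desired {
		return nil
	}
	msg := fmt.Sprintf("only %d of %d requested compute replicas are ready", nodesReady, desired)
	if strict {
		return errors.New(msg)
	}
	logrus.Warn(msg)
	return nil
}
```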
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…h an error message if the number of worker replicas doesn't match the replicas requested in the install config. The error is non-fatal; the cluster will be operational, and the message simply reports the inconsistency between the requested worker replicas and what was available after installation.
/retest
@kirankt: The following tests failed, say /retest to rerun all failed tests.
Even for baremetal, my concerns are still the same as in #4071 (review), and none of them have been fundamentally answered. How do you ensure, or figure out, whether a compute node that is missing or not yet running was actually required for the install-complete signal? Marking installs as failed in these cases is IMO inaccurate and should be avoided. If you can update this PR to honor that constraint, I'm more than happy to have the installer run this check for install-complete. Secondly, people also use this code path for UPI clusters, and expecting/requiring those customers to match the install-config replica count seems like an unnecessary requirement.
Thanks, @abhinavdahiya, for taking the time to re-review. I understand your concern; it makes no sense to error out on an otherwise perfectly installed cluster. What about how this PR stands now? Currently we're only printing a warning message about compute replicas. Can we at least do this (for the baremetal platform only) and then continue to the normal exit? I'd be OK with this compromise. These bugs are being reported a lot, so this doesn't seem to be a corner case. At least if the warning message exists, we can inform end users about it and they can take action on day 2.
FWIW, UPI uses platform=none for its installs, and AFAIK this should not affect them. :-)
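To make the baremetal-only gating concrete, a minimal sketch assuming the field names in github.com/openshift/installer/pkg/types (compute pools with `*int64` Replicas, and a nil `Platform.BareMetal` on non-baremetal installs); it reuses `warnIfComputeShort` from the earlier sketch, and the wiring is illustrative rather than the PR's diff:

```go
package replicacheck

import (
	"context"

	"github.com/openshift/installer/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// desiredComputeReplicas sums the requested replicas across compute pools.
func desiredComputeReplicas(ic *types.InstallConfig) int64 {
	var total int64
	for _, pool := range ic.Compute {
		if pool.Replicas != nil {
			total += *pool.Replicas
		}
	}
	return total
}

// maybeWarnCompute runs the check only for baremetal, so UPI installs using
// platform "none" (and every other platform) are unaffected.
func maybeWarnCompute(ctx context.Context, ic *types.InstallConfig, client kubernetes.Interface) {
	if ic.Platform.BareMetal == nil {
		return
	}
	warnIfComputeShort(ctx, client, desiredComputeReplicas(ic))
}
```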
I moved this bug out of the baremetal installer queue and back to the core installer team. This bug was reported in March, and we've been playing a game of musical chairs with it. I don't think we're any closer to a resolution, and IMHO it really should be fixed for every platform in OCP. Let's close this PR; feel free to reopen if there's a path forward that fixes it for everyone.
/close
@stbenjam: Closed this PR.
@kirankt: This pull request references Bugzilla bug 1816904. The bug has been updated to no longer refer to the pull request using the external bug tracker.
This PR ensures that the installer prints a non-fatal error message before exiting if the number of worker replicas doesn't match the replicas requested in the install config.
These errors happen mostly on the baremetal platform, where nodes may take longer to finish introspection or may enter an error state, so the installer can exit before the requested compute/worker replicas are available. Since we ask the user for the number of replicas in the install config, it makes sense to at least verify that those replicas have been successfully deployed before exiting.
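One way the race described above could be softened, sketched as an assumption rather than taken from this PR: poll briefly for the requested worker count before printing the non-fatal message, giving slow introspection a grace period. The helper name and timeouts are arbitrary:

```go
package replicacheck

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForComputeReplicas polls for the requested worker count before giving
// up with a non-fatal message, absorbing slow baremetal introspection.
func waitForComputeReplicas(ctx context.Context, client kubernetes.Interface, desired int64) {
	err := wait.PollImmediate(15*time.Second, 10*time.Minute, func() (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/worker",
		})
		if err != nil {
			return false, nil // treat API hiccups as transient; keep polling
		}
		return int64(len(nodes.Items)) >= desired, nil
	})
	if err != nil {
		logrus.Warnf("timed out waiting for %d compute replicas; the cluster is usable, but check Machine status as a day-2 task", desired)
	}
}
```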