feat: set defaults for ignoredUnrecoverableEvents operator config #1310
Signed-off-by: Mykhailo Kuznietsov <mkuznets@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: mkuznyetsov.
Thanks for the PR @mkuznyetsov :)
Please run make fmt, but make sure you have goimports installed as well, since the format CI check is currently failing: go install golang.org/x/tools/cmd/goimports@latest
Some thoughts:
I think there are 3 important cases to test:
- Is the FailedScheduling event ignored by default? Your current test case covers this.
- Can users remove the FailedScheduling event from the ignoredUnrecoverableEvents list? In my testing, this is possible by setting ignoredUnrecoverableEvents to an empty array [] -- however, just adding ignoredUnrecoverableEvents: with no value won't work. To test this, run kubectl edit dwoc -n $NAMESPACE:
The following works:
 apiVersion: controller.devfile.io/v1alpha1
 config:
   routing:
     clusterHostSuffix: 192.168.49.2.nip.io
     defaultRoutingClass: basic
   workspace:
+    ignoredUnrecoverableEvents: []
     imagePullPolicy: Always
     progressTimeout: 60s
 kind: DevWorkspaceOperatorConfig
The following will not work:
 apiVersion: controller.devfile.io/v1alpha1
 config:
   routing:
     clusterHostSuffix: 192.168.49.2.nip.io
     defaultRoutingClass: basic
   workspace:
+    ignoredUnrecoverableEvents:
     imagePullPolicy: Always
     progressTimeout: 60s
 kind: DevWorkspaceOperatorConfig
IMO, this behaviour is acceptable.
- What happens when we add an extra ignoredUnrecoverableEvent? Does it merge the user-provided event(s) with the default event list (which contains FailedScheduling)? Or does it overwrite the default list with the user-provided event(s)?
Since the DWOC CR doesn't currently show that the FailedScheduling event is being ignored, I would expect it to overwrite the default list with the user-provided list.
However, merging the default event list with the user-provided list might make sense if we use Kubebuilder annotations to set the default value at the CR level as well.
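To make the three cases above concrete, here is a minimal Go sketch of the two candidate behaviours. It is not the operator's actual code: the workspaceConfig type, the overwrite and merge helpers, and the FailedMount event are assumptions made for illustration. It relies only on the fact that encoding/json decodes null to a nil slice and [] to an empty, non-nil slice, which is one plausible reason why an empty array is treated as "explicitly set" while a bare key is not.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workspaceConfig is a hypothetical, trimmed stand-in for the operator's
// workspace config; only the field under discussion is modelled.
type workspaceConfig struct {
	IgnoredUnrecoverableEvents []string `json:"ignoredUnrecoverableEvents"`
}

// defaultIgnoredEvents is the assumed default list proposed in this PR.
var defaultIgnoredEvents = []string{"FailedScheduling"}

// overwrite returns the user's list whenever it is set (even if empty),
// and a copy of the defaults only when the field is unset (nil).
func overwrite(user []string) []string {
	if user == nil {
		return append([]string{}, defaultIgnoredEvents...)
	}
	return user
}

// merge returns the union of the defaults and the user's list,
// defaults first, without duplicates.
func merge(user []string) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, list := range [][]string{defaultIgnoredEvents, user} {
		for _, e := range list {
			if !seen[e] {
				seen[e] = true
				out = append(out, e)
			}
		}
	}
	return out
}

// parse decodes a JSON document into the hypothetical config and
// returns the event list exactly as decoded (nil, empty, or populated).
func parse(doc string) []string {
	var cfg workspaceConfig
	if err := json.Unmarshal([]byte(doc), &cfg); err != nil {
		panic(err)
	}
	return cfg.IgnoredUnrecoverableEvents
}

func main() {
	unset := parse(`{"ignoredUnrecoverableEvents": null}`)
	empty := parse(`{"ignoredUnrecoverableEvents": []}`)
	extra := parse(`{"ignoredUnrecoverableEvents": ["FailedMount"]}`)

	fmt.Println(overwrite(unset)) // [FailedScheduling]
	fmt.Println(overwrite(empty)) // []
	fmt.Println(overwrite(extra)) // [FailedMount]
	fmt.Println(merge(extra))     // [FailedScheduling FailedMount]
}
```

With overwrite semantics, users can drop FailedScheduling by setting an empty list; with merge semantics they could not, which is worth keeping in mind when choosing between the two.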
// if a transient cluster issue is triggering false-positives (for example, if
// the cluster occasionally encounters FailedScheduling events). Events listed
// here will not trigger DevWorkspace failures.
// be ignored when deciding to fail a DevWorkspace startup.
I'm not entirely sure we need to mention the cluster auto-scaler in DWO (or rewrite the docs here). It might be better to mention this in the Che Cluster CRD documentation, since the ignoredUnrecoverableEvents can be configured from the Che Cluster CRD.
Instead, I would suggest:
- Mentioning "By default, the FailedScheduling is ignored"
- Removing the "(for example, if the cluster occasionally encounters FailedScheduling events)" since this example is no longer valid now that the FailedScheduling event is ignored by default
// For example, a FailedScheduling event, that occurs when workspace cannot start
// due to exceeding available resources, should not fail the workspace startup, if there is
// an autoscaler configured on the cluster, and we want to wait until it provisions additional resources.
// FailedScheduling event can also occur as a false-positive, as a result of a transient cluster issue.
I suggest experimenting with kubebuilder annotations for the IgnoredUnrecoverableEvents field.
We should try setting the default array value. I think this would be done with +kubebuilder:default:={"FailedScheduling"}
I believe that should be enough to populate the IgnoredUnrecoverableEvents list in the DWOC. Make sure you re-generate the CRDs in a separate commit by running: make update_devworkspace_api update_devworkspace_crds generate_all
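As a rough sketch of what the suggested marker could look like on the Go type: the WorkspaceConfig struct below is a trimmed, hypothetical stand-in and not the real DWO type, and the json tag is an assumption. The +kubebuilder:default marker only influences the generated CRD schema, so the default is applied by the API server when the object is stored; the Go zero value of the field is unaffected.

```go
package main

import "fmt"

// WorkspaceConfig is a hypothetical, trimmed stand-in for the operator
// config type; the real struct has many more fields.
type WorkspaceConfig struct {
	// IgnoredUnrecoverableEvents lists event reasons that should not
	// fail a DevWorkspace startup.
	// +kubebuilder:default:={"FailedScheduling"}
	IgnoredUnrecoverableEvents []string `json:"ignoredUnrecoverableEvents,omitempty"`
}

func main() {
	// The marker is a comment as far as the Go compiler is concerned:
	// a freshly constructed value still has a nil slice. The default
	// only appears once the regenerated CRD is installed and the API
	// server fills in the field.
	var cfg WorkspaceConfig
	fmt.Println(cfg.IgnoredUnrecoverableEvents == nil) // true
}
```

If this approach works as hoped, the default would be visible in the stored DWOC, which addresses the visibility concern raised earlier in the review.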
Something to note: This entire PR might be dropped and re-implemented in Che-Operator if we can get the kubebuilder approach working. We'd want Che admins to see that the FailedScheduling event is ignored by default, and there would be no advantage to duplicating this code change in both DWO and Che-Operator (unless users who use DWO in isolation want this feature; however, that is not the current reason why we're resolving #1280).
What does this PR do?
Adds the FailedScheduling event to the default ignoredUnrecoverableEvents list in the operator config.
(This PR is an alternative to #1306.)
The relevant docs should also be updated:
https://eclipse.dev/che/docs/stable/administration-guide/configuring-machine-autoscaling/#_when_the_autoscaler_adds_a_new_node
What issues does this PR fix or reference?
#1280
Is it tested? How?
Create a workspace whose resource requests/limits exceed the available resources (modified samples/plain.yaml):
Check the workspace status; the workspace keeps trying to start until it times out after 5 minutes:
PR Checklist
/test v8-devworkspace-operator-e2e, v8-che-happy-path (to trigger)
- v8-devworkspace-operator-e2e: DevWorkspace e2e test
- v8-che-happy-path: Happy path for verification integration with Che