Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move operator configuration into a custom resource and enable ignoring unrecoverable events #598

Merged
merged 15 commits into from
Sep 23, 2021

Conversation

amisevsk
Copy link
Collaborator

What does this PR do?

  • Move Operator configuration into a custom resource (DevWorkspaceOperatorConfiguration, shortname dwoc), hopefully making configuration easier
    • Default config is stored internally and overridden by the CR, so deleting entries/the CR restores default values. This means that the operator works fine (on OpenShift, at least) without any additional resources.
    • On OpenShift, clusterRoutingSuffix is detected automatically, must be supplied in Kubernetes
    • Since config is a CR, it cannot be part of combined.yaml anymore (rejected as unrecognized). I've moved creation of a config from env vars to a separate step in make install
  • Add support for ignoring unrecoverable events

Currently, the config CR's name is hard-coded to devworkspace-operator-config. It might be worth changing that since it could be unclear (you have to create a CR with this specific name)

One potential task to complete before merging is to implement a sort of migration process that converts an existing configmap into a config CR, to allow seamless migration from previous installs.

Hopefully the commit history is legible; the change is fairly wide-ranging due to how frequently config is used in DWO.

What issues does this PR fix or reference?

Closes #550
Closes #191
Related to #577

Is it tested? How?

  1. Deploy DWO
  2. Try to start DevWorkspace that will encounter unrecoverable event
    cat <<EOF | kubectl apply -f -
    kind: DevWorkspace
    apiVersion: workspace.devfile.io/v1alpha2
    metadata:
      name: theia-next
    spec:
      started: true
      template:
        components:
          - name: theia
            plugin:
              uri: https://che-plugin-registry-main.surge.sh/v3/plugins/eclipse/che-theia/next/devfile.yaml
              components:
                - name: theia-ide
                  container:
                    memoryRequest: 100Gi
                    memoryLimit: 100Gi
    EOF
  3. Workspace fails to start; `kubectl delete dw theia-next
  4. Apply a new configuration to ignore FailedScheduling events:
    cat <<EOF | kubectl apply -f -
    kind: DevWorkspaceOperatorConfig
    apiVersion: controller.devfile.io/v1alpha1
    metadata:
      name: devworkspace-operator-config
    config:
      workspace:
        ignoredUnrecoverableEvents:
          - FailedScheduling
    EOF
  5. Re-apply workspace from step 2 -- startup should hang but not fail.

Bonus:
dwo-config-explain

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

@amisevsk amisevsk changed the title Config in crd Move operator configuration into a custom resource and enable ignoring unrecoverable events Sep 16, 2021
@amisevsk
Copy link
Collaborator Author

I'm curious is we can reproduce the e2e test failure from #593 in this PR

/test v8-devworkspace-operator-e2e, v8-che-happy-path

@sleshchenko
Copy link
Member

go tests fail with

vet: pkg/library/flatten/flatten_test.go:237:11: SetupControllerCfg not declared by package testutil
make: *** [Makefile:171: vet] Error 2
Error: Process completed with exit code 2.

Copy link
Member

@sleshchenko sleshchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, I haven't tested yet but flushing code review result.

PROJECT Outdated Show resolved Hide resolved
pkg/config/env.go Outdated Show resolved Hide resolved
pkg/config/sync.go Show resolved Hide resolved
pkg/config/defaults.go Show resolved Hide resolved
updatePublicConfig()
}

func updatePublicConfig() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to report in logs when changes are detected and Config is reloaded? Probably also print the result how it's merged with default

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had this for debugging and removed it before opening the PR :D

I'll re-add

Copy link
Collaborator Author

@amisevsk amisevsk Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Currently, we log non-default values only (otherwise it can be very long).

One question: should we be logging clusterHostSuffix? Someone using DWO might want to suppress the URL of their cluster in logs, especially if sharing publicly in e.g. a bug report.

pkg/config/sync.go Show resolved Hide resolved
defaultTerminalDockerimageProperty = "devworkspace.default_dockerimage.redhat-developer.web-terminal"
)

func (wc *ControllerConfig) GetDefaultTerminalDockerimage() (*dw.Component, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think e2e test needs to be adapted after these changes:

------------------------------
• Failure [1.527 seconds]
[Create OpenShift Web Terminal Workspace]
/home/sleshche/projects/devworkspace-operator/test/e2e/pkg/tests/devworkspaces_tests.go:25
  Check that pod creator can execute a command in the container [It]
  /home/sleshche/projects/devworkspace-operator/test/e2e/pkg/tests/devworkspaces_tests.go:56

  Cannot execute command in the devworkspace container. Error: `exit status 1`. Exec output: `Error from server (BadRequest): container dev is not valid for pod workspaceda33612faf9843c6-5c775dc868-26vs7         

Currently, it references plugin from internal registry
https://github.com/devfile/devworkspace-operator/blob/main/samples/web-terminal.yaml#L16

I think the way we should rework it - make it not terminal specific test but restricted access. Then we can pretty any container component and controller.devfile.io/restricted-access: "true" annotation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's hoping I fixed it -- I don't have OpenShift handy to test changes 🤞

} else {
config.ConfigMapReference.Namespace = os.Getenv(infrastructure.WatchNamespaceEnvVar)
}
err = config.WatchControllerConfig(mgr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Che Operator needs to be adapted to these changes not to break Che + DevWorkspace integration on Kubernetes:

https://github.com/eclipse-che/che-operator/blob/main/pkg/deploy/dev-workspace/dev_workspace.go#L332

please note that Operator CR must be already available when Che Operator does this logic

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened PR eclipse-che/che-operator#1081 (note no config CRD is necessary to run DWO anymore, and clusterHostSuffix is only required for the basic routing solver, which is outside the Che Operator use-case).

kind: DevWorkspaceOperatorConfig
apiVersion: controller.devfile.io/v1alpha1
metadata:
name: devworkspace-operator-config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: devworkspace-operator-config
name: devworkspace-operator-config
namespace: ${NAMESPACE}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I got my routing suffix wrongly assigned
Screenshot_20210920_144601

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes, previously, on OpenShift we would always overwrite clusterHostSuffix from a route, even if the configmap contained a different value. With the current changes, we only create the test route when clusterHostSuffix is unset (which doesn't happen with the makefile, since it creates a config).

I'll think about how best to handle this -- should we just assume OpenShift users want to use the default fill for route.spec.host?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the Makefile to not set a default ROUTING_SUFFIX anymore. On OpenShift, this should result in the config getting an empty value, which would be auto-filled by the controller. On Kubernetes, we auto-detect it in the makefile for minikube and print a warning if it's empty.

name: devworkspace-operator-config
config:
routing:
clusterHostSuffix: ${ROUTING_SUFFIX}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually shows another issue we have: it would be better to report: Waiting for main endpoint be ready, health check fails:
Screenshot_20210920_145448

it makes sense to create a separate issue for that. I'll do

Copy link
Member

@sleshchenko sleshchenko Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the background for one more issue:
After clusterSuffix is cleaned up, it's not propagated to existing workspaces, even after restart. Removing DevWorkspace Routing helps but it's not desired behavior I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm not sure how to reproduce the above issue (deployment is running but devworkspace reports waiting for workspace deployment). When I start a DevWorkspace locally, I see the Waiting for editor to start message (sometimes, only briefly) -- I assume if startup is really hung on the health check, that message should show up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After clusterSuffix is cleaned up, it's not propagated to existing workspaces, even after restart. Removing DevWorkspace Routing helps but it's not desired behavior I think.

This is because the DevWorkspaceRouting CR doesn't store clusterRouteSuffix, so reconciling the DevWorkspace won't trigger any reconciles to the DevWorkspaceRouting. Anything triggering a reconcile for the routing object will cause the change to be reflected (e.g. modifying annotations/labels on routes) but otherwise there are no events for the routing controller to respond to.

We could work around this by propagating the .spec.started field to DevWorkspaceRouting; this would at least enable stopping + restarting the workspace to propagate the config change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #602

@sleshchenko
Copy link
Member

@amisevsk
Copy link
Collaborator Author

Squashed the large mess of fixup commits

@sleshchenko
Copy link
Member

/test v8-devworkspace-operator-e2e, v8-che-happy-path

Copy link
Member

@sleshchenko sleshchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed changes carefully again but I have tested on OpenShift and it works fine.

Probably logs much be more reach, like what is the whole config piece, but not just what is changed. And what are actually default config are
Screenshot_20210922_174430
But let's move it out of the current PR scope.

Good job!

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
* Remove web-terminal defaulting functionality
* Move all env-var related config settings into same file
* Remove unused GetTlsInsecureSkipVerify

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Log all updates to internal configuration to hopefully aid diagnosing
issues in the future.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Add new testing DevWorkspace resource that utilizes restricted access
but does not attempt to use plugins from the internal registry.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Since ROUTING_SUFFIX has to be set for each Kubernetes cluster (and not
set in OpenShift), don't use any default value for ROUTING_SUFFIX. If
running in minikube, autodetect appropriate ROUTING_SUFFIX; otherwise
(if on Kubernetes) warn user that ROUTING_SUFFIX is unset.

On OpenShift, rely on default detection unless ROUTING_SUFFIX is
explicitly set

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@amisevsk
Copy link
Collaborator Author

e2e tests failed because ExecCommandInContainer assumed a hardcoded container name. Should be fixed now.

/test v8-devworkspace-operator-e2e, v8-che-happy-path

@amisevsk
Copy link
Collaborator Author

/retest

@openshift-ci
Copy link

openshift-ci bot commented Sep 22, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, JPinkney, sleshchenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JPinkney,amisevsk,sleshchenko]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@amisevsk amisevsk merged commit cb13216 into devfile:main Sep 23, 2021
@amisevsk amisevsk deleted the config-in-crd branch September 23, 2021 03:41
@amisevsk amisevsk mentioned this pull request Oct 8, 2021
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make unrecoverable events in controller configurable Remove configmap from operator
3 participants