Move operator configuration into a custom resource and enable ignoring unrecoverable events #598

amisevsk · 2021-09-16T17:13:53Z

What does this PR do?

Move Operator configuration into a custom resource (DevWorkspaceOperatorConfig, shortname dwoc), hopefully making configuration easier
- Default config is stored internally and overridden by the CR, so deleting entries (or the whole CR) restores default values. This means that the operator works fine (on OpenShift, at least) without any additional resources.
- On OpenShift, clusterHostSuffix is detected automatically; on Kubernetes it must be supplied.
- Since config is a CR, it can no longer be part of combined.yaml (it is rejected as unrecognized). I've moved creation of a config from env vars to a separate step in make install.

Add support for ignoring unrecoverable events
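
The default-plus-override behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type names, the `mergeConfig` helper, and the placeholder default values are all assumptions made for the example.

```go
package main

import "fmt"

// RoutingConfig, WorkspaceConfig, and OperatorConfig are simplified,
// hypothetical stand-ins for the operator's config types.
type RoutingConfig struct {
	ClusterHostSuffix string
}

type WorkspaceConfig struct {
	IgnoredUnrecoverableEvents []string
}

type OperatorConfig struct {
	Routing   *RoutingConfig
	Workspace *WorkspaceConfig
}

// defaultConfig returns a fresh copy of the built-in defaults.
// The values are placeholders, not the operator's real defaults.
func defaultConfig() OperatorConfig {
	return OperatorConfig{
		Routing:   &RoutingConfig{ClusterHostSuffix: ""},
		Workspace: &WorkspaceConfig{IgnoredUnrecoverableEvents: []string{}},
	}
}

// mergeConfig starts from the defaults and copies over only the fields
// that are set in the CR. A nil CR (for example, after it was deleted)
// yields the defaults unchanged.
func mergeConfig(cr *OperatorConfig) OperatorConfig {
	merged := defaultConfig()
	if cr == nil {
		return merged
	}
	if cr.Routing != nil && cr.Routing.ClusterHostSuffix != "" {
		merged.Routing.ClusterHostSuffix = cr.Routing.ClusterHostSuffix
	}
	if cr.Workspace != nil && cr.Workspace.IgnoredUnrecoverableEvents != nil {
		// Copy the slice so later changes to the CR do not alias the merged config.
		merged.Workspace.IgnoredUnrecoverableEvents =
			append([]string(nil), cr.Workspace.IgnoredUnrecoverableEvents...)
	}
	return merged
}

func main() {
	cr := &OperatorConfig{
		Workspace: &WorkspaceConfig{IgnoredUnrecoverableEvents: []string{"FailedScheduling"}},
	}
	fmt.Println(mergeConfig(cr).Workspace.IgnoredUnrecoverableEvents)
	fmt.Println(mergeConfig(nil).Workspace.IgnoredUnrecoverableEvents)
}
```

Rebuilding the merged config from the defaults on every change, rather than mutating it in place, is what makes removing a field from the CR fall back to its default.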

Currently, the config CR's name is hard-coded to devworkspace-operator-config. It might be worth changing that since it could be unclear (you have to create a CR with this specific name)

One potential task to complete before merging is to implement a sort of migration process that converts an existing configmap into a config CR, to allow seamless migration from previous installs.

Hopefully the commit history is legible; the change is fairly wide-ranging due to how frequently config is used in DWO.

What issues does this PR fix or reference?

Closes #550
Closes #191
Related to #577

Is it tested? How?

Deploy DWO

Try to start a DevWorkspace that will encounter an unrecoverable event:

cat <<EOF | kubectl apply -f -
kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: theia-next
spec:
  started: true
  template:
    components:
      - name: theia
        plugin:
          uri: https://che-plugin-registry-main.surge.sh/v3/plugins/eclipse/che-theia/next/devfile.yaml
          components:
            - name: theia-ide
              container:
                memoryRequest: 100Gi
                memoryLimit: 100Gi
EOF

Workspace fails to start; `kubectl delete dw theia-next`

Apply a new configuration to ignore FailedScheduling events:

cat <<EOF | kubectl apply -f -
kind: DevWorkspaceOperatorConfig
apiVersion: controller.devfile.io/v1alpha1
metadata:
  name: devworkspace-operator-config
config:
  workspace:
    ignoredUnrecoverableEvents:
      - FailedScheduling
EOF

Re-apply workspace from step 2 -- startup should hang but not fail.
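
For readers unfamiliar with the feature, the check exercised by these steps might look something like the sketch below. It is an assumption-laden illustration: `shouldFailWorkspace` and the `unrecoverableReasons` set are hypothetical names, and the set's contents are examples rather than the operator's actual list.

```go
package main

import "fmt"

// unrecoverableReasons is an illustrative set of pod event reasons that
// would normally fail workspace startup. The real list may differ.
var unrecoverableReasons = map[string]bool{
	"FailedScheduling": true,
	"FailedMount":      true,
}

// shouldFailWorkspace reports whether an event with the given reason should
// fail startup, taking the configured ignoredUnrecoverableEvents into account.
func shouldFailWorkspace(reason string, ignored []string) bool {
	if !unrecoverableReasons[reason] {
		return false // not an unrecoverable event at all
	}
	for _, r := range ignored {
		if r == reason {
			return false // explicitly ignored via config
		}
	}
	return true
}

func main() {
	fmt.Println(shouldFailWorkspace("FailedScheduling", nil))
	fmt.Println(shouldFailWorkspace("FailedScheduling", []string{"FailedScheduling"}))
}
```

With FailedScheduling ignored, the workspace keeps waiting instead of failing, which matches the "hang but not fail" outcome expected in the last step.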

Bonus:

PR Checklist

E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
- v8-devworkspace-operator-e2e: DevWorkspace e2e test
- v8-che-happy-path: Happy path for verification integration with Che

amisevsk · 2021-09-17T14:25:13Z

I'm curious if we can reproduce the e2e test failure from #593 in this PR

/test v8-devworkspace-operator-e2e, v8-che-happy-path

sleshchenko · 2021-09-20T10:38:03Z

go tests fail with

vet: pkg/library/flatten/flatten_test.go:237:11: SetupControllerCfg not declared by package testutil
make: *** [Makefile:171: vet] Error 2
Error: Process completed with exit code 2.

sleshchenko

Good job, I haven't tested yet but flushing code review result.

PROJECT

apis/controller/v1alpha1/devworkspaceoperatorconfig_types.go

pkg/config/env.go

controllers/controller/devworkspacerouting/solvers/basic_solver.go

pkg/config/sync.go

pkg/config/defaults.go

sleshchenko · 2021-09-20T11:16:19Z

pkg/config/sync.go

+	updatePublicConfig()
+}
+
+func updatePublicConfig() {


Maybe it makes sense to log when changes are detected and the config is reloaded? Probably also print the result of merging it with the defaults.

I actually had this for debugging and removed it before opening the PR :D

I'll re-add

Added. Currently, we log non-default values only (otherwise it can be very long).

One question: should we be logging clusterHostSuffix? Someone using DWO might want to suppress the URL of their cluster in logs, especially if sharing publicly in e.g. a bug report.
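
One way to picture "log non-default values only", with optional redaction of sensitive fields such as clusterHostSuffix, is the sketch below. It flattens the config into string key/value pairs purely for illustration; `nonDefaultEntries` and the key names are hypothetical and do not reflect how the operator stores or logs its config.

```go
package main

import (
	"fmt"
	"sort"
)

// nonDefaultEntries returns "key=value" strings for every entry in current
// that differs from defaults, in sorted key order. Keys listed in redact
// have their values replaced so they do not appear in logs.
func nonDefaultEntries(defaults, current map[string]string, redact map[string]bool) []string {
	keys := make([]string, 0, len(current))
	for k := range current {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic log output

	var out []string
	for _, k := range keys {
		v := current[k]
		if v == defaults[k] {
			continue // unchanged from default: skip to keep logs short
		}
		if redact[k] {
			v = "<redacted>"
		}
		out = append(out, fmt.Sprintf("%s=%s", k, v))
	}
	return out
}

func main() {
	defaults := map[string]string{"routing.clusterHostSuffix": "", "workspace.ignoredUnrecoverableEvents": ""}
	current := map[string]string{
		"routing.clusterHostSuffix":            "apps.example.test",
		"workspace.ignoredUnrecoverableEvents": "FailedScheduling",
	}
	redact := map[string]bool{"routing.clusterHostSuffix": true}
	fmt.Println(nonDefaultEntries(defaults, current, redact))
}
```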

pkg/config/sync.go

sleshchenko · 2021-09-20T11:42:09Z

pkg/config/cmd_terminal.go

-	defaultTerminalDockerimageProperty = "devworkspace.default_dockerimage.redhat-developer.web-terminal"
-)
-
-func (wc *ControllerConfig) GetDefaultTerminalDockerimage() (*dw.Component, error) {


I think e2e test needs to be adapted after these changes:

------------------------------
• Failure [1.527 seconds]
[Create OpenShift Web Terminal Workspace]
/home/sleshche/projects/devworkspace-operator/test/e2e/pkg/tests/devworkspaces_tests.go:25
  Check that pod creator can execute a command in the container [It]
  /home/sleshche/projects/devworkspace-operator/test/e2e/pkg/tests/devworkspaces_tests.go:56

  Cannot execute command in the devworkspace container. Error: `exit status 1`. Exec output: `Error from server (BadRequest): container dev is not valid for pod workspaceda33612faf9843c6-5c775dc868-26vs7

Currently, it references a plugin from the internal registry:
https://github.com/devfile/devworkspace-operator/blob/main/samples/web-terminal.yaml#L16

I think we should rework it into a restricted-access test rather than a terminal-specific one. Then we can use pretty much any container component together with the controller.devfile.io/restricted-access: "true" annotation.

Here's hoping I fixed it -- I don't have OpenShift handy to test changes 🤞

sleshchenko · 2021-09-20T11:43:45Z

main.go

-	} else {
-		config.ConfigMapReference.Namespace = os.Getenv(infrastructure.WatchNamespaceEnvVar)
-	}
-	err = config.WatchControllerConfig(mgr)


Che Operator needs to be adapted to these changes not to break Che + DevWorkspace integration on Kubernetes:

https://github.com/eclipse-che/che-operator/blob/main/pkg/deploy/dev-workspace/dev_workspace.go#L332

Please note that the Operator CR must already be available when Che Operator runs this logic.

Opened PR eclipse-che/che-operator#1081 (note no config CRD is necessary to run DWO anymore, and clusterHostSuffix is only required for the basic routing solver, which is outside the Che Operator use-case).

sleshchenko · 2021-09-20T11:45:47Z

deploy/default-config.yaml

+kind: DevWorkspaceOperatorConfig
+apiVersion: controller.devfile.io/v1alpha1
+metadata:
+  name: devworkspace-operator-config


Suggested change

-  name: devworkspace-operator-config
+  name: devworkspace-operator-config
+  namespace: ${NAMESPACE}

Also, I got my routing suffix wrongly assigned

Ahh yes, previously, on OpenShift we would always overwrite clusterHostSuffix from a route, even if the configmap contained a different value. With the current changes, we only create the test route when clusterHostSuffix is unset (which doesn't happen with the makefile, since it creates a config).

I'll think about how best to handle this -- should we just assume OpenShift users want to use the default fill for route.spec.host?

I've updated the Makefile to not set a default ROUTING_SUFFIX anymore. On OpenShift, this should result in the config getting an empty value, which would be auto-filled by the controller. On Kubernetes, we auto-detect it in the makefile for minikube and print a warning if it's empty.
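
The precedence discussed in this thread (an explicit value wins, OpenShift falls back to auto-detection, plain Kubernetes has nothing to fall back on) can be sketched as below. `resolveClusterHostSuffix` and the `detect` callback are hypothetical names for this example; in particular, `detect` stands in for whatever mechanism the controller uses on OpenShift.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveClusterHostSuffix picks the routing suffix to use.
// An explicitly configured value always wins; on OpenShift an empty value
// is filled in by detect; on Kubernetes an empty value is reported as an
// error, which a caller may choose to surface as a warning only.
func resolveClusterHostSuffix(configured string, isOpenShift bool, detect func() (string, error)) (string, error) {
	if configured != "" {
		return configured, nil
	}
	if isOpenShift {
		return detect()
	}
	return "", errors.New("clusterHostSuffix is unset; on Kubernetes it must be supplied")
}

func main() {
	// Placeholder detection result, used only for this example.
	detect := func() (string, error) { return "apps.example.test", nil }

	fmt.Println(resolveClusterHostSuffix("", true, detect))
	fmt.Println(resolveClusterHostSuffix("192.168.49.2.nip.io", false, detect))
	fmt.Println(resolveClusterHostSuffix("", false, detect))
}
```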

sleshchenko · 2021-09-20T11:56:10Z

deploy/default-config.yaml

+  name: devworkspace-operator-config
+config:
+  routing:
+    clusterHostSuffix: ${ROUTING_SUFFIX}


This actually shows another issue we have: it would be better to report: Waiting for main endpoint to be ready, health check fails:

It makes sense to create a separate issue for that. I'll do that.

Probably the background for one more issue:
After clusterSuffix is cleaned up, it's not propagated to existing workspaces, even after restart. Removing DevWorkspace Routing helps but it's not desired behavior I think.

Hmm I'm not sure how to reproduce the above issue (deployment is running but devworkspace reports waiting for workspace deployment). When I start a DevWorkspace locally, I see the Waiting for editor to start message (sometimes, only briefly) -- I assume if startup is really hung on the health check, that message should show up.

> After clusterSuffix is cleaned up, it's not propagated to existing workspaces, even after restart. Removing DevWorkspace Routing helps but it's not desired behavior I think.

This is because the DevWorkspaceRouting CR doesn't store clusterHostSuffix, so reconciling the DevWorkspace won't trigger any reconciles of the DevWorkspaceRouting. Anything triggering a reconcile for the routing object will cause the change to be reflected (e.g. modifying annotations/labels on routes), but otherwise there are no events for the routing controller to respond to.

We could work around this by propagating the .spec.started field to DevWorkspaceRouting; this would at least enable stopping + restarting the workspace to propagate the config change.

Created #602

sleshchenko · 2021-09-21T16:39:32Z

PR check is failing https://github.com/devfile/devworkspace-operator/pull/598/checks?check_run_id=3654476483

amisevsk · 2021-09-21T19:02:05Z

Squashed the large mess of fixup commits

sleshchenko · 2021-09-22T12:17:32Z

/test v8-devworkspace-operator-e2e, v8-che-happy-path

sleshchenko

I haven't reviewed changes carefully again but I have tested on OpenShift and it works fine.

The logs should probably be richer: show the whole config, not just what changed, and show what the default values actually are.

But let's move it out of the current PR scope.

Good job!

Since ROUTING_SUFFIX has to be set for each Kubernetes cluster (and not set in OpenShift), don't use any default value for ROUTING_SUFFIX. If running in minikube, autodetect appropriate ROUTING_SUFFIX; otherwise (if on Kubernetes) warn user that ROUTING_SUFFIX is unset. On OpenShift, rely on default detection unless ROUTING_SUFFIX is explicitly set.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

amisevsk · 2021-09-22T16:21:57Z

e2e tests failed because ExecCommandInContainer assumed a hardcoded container name. Should be fixed now.
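
As an illustration of the kind of fix described here, the container name can be passed in rather than assumed. The `execArgs` helper below is hypothetical and is not the test suite's ExecCommandInContainer; it only shows building a `kubectl exec` argument list with the container supplied by the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// execArgs builds the argument list for `kubectl exec`, taking the target
// container as a parameter instead of hard-coding it.
func execArgs(namespace, pod, container string, command ...string) []string {
	args := []string{"exec", pod, "-n", namespace, "-c", container, "--"}
	return append(args, command...)
}

func main() {
	// Placeholder namespace, pod, and container names for the example.
	args := execArgs("test-ns", "workspace-pod", "restricted-access-container", "echo", "hello")
	fmt.Println("kubectl " + strings.Join(args, " "))
}
```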

/test v8-devworkspace-operator-e2e, v8-che-happy-path

amisevsk · 2021-09-22T17:03:43Z

/retest

openshift-ci · 2021-09-22T18:57:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, JPinkney, sleshchenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JPinkney,amisevsk,sleshchenko]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

amisevsk requested review from sleshchenko and JPinkney September 16, 2021 17:13

openshift-ci bot added the approved label Sep 16, 2021

amisevsk force-pushed the config-in-crd branch from c1a0053 to 090284d Compare September 16, 2021 17:29

amisevsk changed the title ~~Config in crd~~ Move operator configuration into a custom resource and enable ignoring unrecoverable events Sep 16, 2021

amisevsk force-pushed the config-in-crd branch from fbf5083 to 25366f6 Compare September 17, 2021 19:18

sleshchenko reviewed Sep 20, 2021

This was referenced Sep 20, 2021

Propagate workspace started state to DevWorkspaceRouting #602

Closed

WIP Remove deploy step for DevWorkspaceOperator configmap eclipse-che/che-operator#1081

Closed

amisevsk force-pushed the config-in-crd branch from 87ca323 to 914162f Compare September 21, 2021 18:03

amisevsk mentioned this pull request Sep 21, 2021

Introduce timeout for hanging workspaces #605

Merged

3 tasks

amisevsk force-pushed the config-in-crd branch from 914162f to 05aca76 Compare September 21, 2021 19:25

sleshchenko approved these changes Sep 22, 2021

openshift-ci bot assigned sleshchenko Sep 22, 2021

openshift-ci bot added the lgtm label Sep 22, 2021

amisevsk added 6 commits September 22, 2021 12:11

Add DWO config CRD to make managing configuration easier.

cbd060a

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Regenerate files to include new config CRD

cd4ce15

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Clean up existing config package in preparation for using CRD

484edf5

* Remove web-terminal defaulting functionality
* Move all env-var related config settings into same file
* Remove unused GetTlsInsecureSkipVerify

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Update config package to use new configuration CRD

ef6ab80

Refactor existing code to use new config CRD for settings

5bd6950

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Move previous config management code to separate package

6464995

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

amisevsk added 9 commits September 22, 2021 12:11

Remove configmap from kustomize templates

5fb3f27

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Update make install to create a config CR on cluster from env vars

ce93e6b

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Fill routing clusterHostSuffix automatically on OpenShift

6d4b169

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Add tests to configuration CRD handling

a9eb4be

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Add support for ignoring specific unrecoverable events

3cf436f

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Add logging to operator configuration updates

5f25457

Log all updates to internal configuration to hopefully aid diagnosing issues in the future.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Fix e2e tests to not rely on internal registry

07c52d3

Add new testing DevWorkspace resource that utilizes restricted access but does not attempt to use plugins from the internal registry.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

Update README.md to document configuration via the new CRD.

7d280c9

Signed-off-by: Angel Misevski <amisevsk@redhat.com>

amisevsk force-pushed the config-in-crd branch from 05aca76 to 7d280c9 Compare September 22, 2021 16:12

openshift-ci bot removed the lgtm label Sep 22, 2021

JPinkney approved these changes Sep 22, 2021

openshift-ci bot assigned JPinkney Sep 22, 2021

openshift-ci bot added the lgtm label Sep 22, 2021

amisevsk merged commit cb13216 into devfile:main Sep 23, 2021

amisevsk deleted the config-in-crd branch September 23, 2021 03:41

amisevsk mentioned this pull request Sep 30, 2021

Automatically migrate old configmap-based configuration to new CRD #626

Merged

3 tasks

amisevsk mentioned this pull request Oct 8, 2021

Remove internal registry #637

Merged

3 tasks
