pkg/operator: lay down ClusterOperator for MCO #386
Conversation
/hold
Force-pushed from 202e486 to e6e9415 (Compare)
Also, I'm still learning how to deploy this kind of change from the installer to test all of this out (I'd love any guidance if you have time 👼)
The operator needs to create and manage the ClusterOperator object.
This repo needs an e2e-aws job so you don't break e2e-aws with a merge here.
Probably a Prow issue, since other PRs are running it.
The operator doesn't need to create it because the CVO is going to create it from the manifest, no? As for managing it, I assumed this operator logic was already updating a ClusterOperator object; but sure, if it's not, that obviously needs to happen too. Without that, the PR will never pass e2e-aws.
@abhinavdahiya this came out of https://url.corp.redhat.com/dfe5ca9
I didn't go and verify Ben's assertion, but it sounds truthy to me, because clearly our operator has been failing and it hasn't been blocking installs.
Related to Colin's comment: indeed the MCO fails but the CVO is OK, whereas if another operator (wired in with a CO object) fails, the CVO reports failure as well.
It clearly states in the NOTE
That strikes me as a strange decision. Everything else in manifests is explicitly created/applied by the CVO, correct? Why not also create the clusteroperator resource? (Obviously it would be empty, but so what?) Why the special case? (Since our operator started out creating the clusteroperator programmatically, and I'm guessing this operator does the same, it's probably moot, but it's likely to catch someone out at some point.)
So, the code in this repo (and Ben's as well) is already initializing the CO programmatically, and that should remain? Just laying down the file makes sure the CVO is aware of it, right (besides Ben's comment above as to why this exception)?
Clearly installs are not gating on our operator status today. We're in agreement that they should be, right?
(Is there code somewhere in the CVO which explicitly skips them? I'm not seeing it offhand.) OK, so if we're in agreement that installs should gate on our operator, why is that not happening today, and how do we fix it?
My understanding is that we still need to ship that CO yaml so the CVO can monitor it and gate on us (since the CO management is done on our side already).
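For readers unfamiliar with the pattern under discussion: "laying down the CO yaml" means shipping a ClusterOperator manifest alongside the operator's other manifests, so the CVO creates the object and knows to wait on its status. A hypothetical sketch of such a manifest (the exact file name, apiVersion, and fields used by this PR may differ; this is illustrative only):

```yaml
# Illustrative only: not the exact file from this PR.
# The CVO applies manifests shipped with each operator image; including
# a ClusterOperator object here tells the CVO to watch and gate on it.
# The operator itself then fills in the status conditions at runtime.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: machine-config
spec: {}
```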
e6e9415
to
d4ab96d
Compare
We may be hitting some AWS limit on e2e, I guess.
/retest
/test e2e-aws-op
Pushed a dummy commit to check that tests fail if we fail to report Available status. I still think we need to instrument the e2e to check that clusterversion is OK for the machine-config operator, though, even if the installer should now gate on us if we fail something, right?
Force-pushed from 8fe082c to 009a8c4 (Compare)
/retest
Alright, this has started to pass now; latest commit named
Force-pushed from 230aa5c to 2aef1e8 (Compare)
so that the CVO is aware of our ClusterOperator and can gate on us being Available. Signed-off-by: Antonio Murdaca <runcom@linux.com>
Basically use the same pattern everywhere, as we were getting weird logs like:

time="2019-02-08T10:35:23Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config has not yet reported success"
time="2019-02-08T10:39:38Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed when progressing towards 3.11.0-587-g0e44a773-dirty because: error syncing: timed out waiting for the condition"

without knowing what was actually failing.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
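The pattern the commit message refers to is: keep a `lastErr` variable updated inside the poll closure, and wrap the generic "timed out" error with it on exit, so the CVO surfaces what was actually failing. A minimal stdlib sketch of that idea (`pollUntil` and `syncWithInformativeError` are hypothetical stand-ins for `wait.Poll` and the real sync functions, not the PR's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntil mimics the wait.Poll shape: retry condition every interval
// until it returns true, returns an error, or timeout elapses.
func pollUntil(interval, timeout time.Duration, condition func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := condition()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for the condition")
		}
		time.Sleep(interval)
	}
}

// syncWithInformativeError shows the lastErr pattern: the closure records
// the most recent transient error instead of aborting, and on timeout the
// generic error is wrapped with that detail.
func syncWithInformativeError(attemptsNeeded int) error {
	var lastErr error
	attempts := 0
	if err := pollUntil(time.Millisecond, 10*time.Millisecond, func() (bool, error) {
		attempts++
		if attempts < attemptsNeeded {
			// Hypothetical transient condition; keep polling rather than fail.
			lastErr = fmt.Errorf("pool has not progressed to latest configuration (attempt %d)", attempts)
			return false, nil
		}
		return true, nil
	}); err != nil {
		if lastErr != nil {
			return fmt.Errorf("error syncing: %v: %v", err, lastErr)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(syncWithInformativeError(3))    // converges within the timeout
	fmt.Println(syncWithInformativeError(1000)) // times out; error carries lastErr detail
}
```

Without the `lastErr` wrapping, the only message that escapes is "timed out waiting for the condition", which is exactly the unhelpful log line quoted above.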
Force-pushed from 2aef1e8 to cfb1056 (Compare)
Dropped 4108e90 and reworded the last commit as well; let's see what the CI says now.
But I keep circling back to the fact that clearly installs, and e2e-aws in general, were passing even when the MCO was in a failing state. Fixing that should be the primary goal, which hopefully this is doing?
Yes, I had previously added a test commit which failed to converge to Available for the MCO, and the CVO was indeed failing correctly. What you're describing, @cgwalters, was about the missing ClusterOperator, which now tells the CVO to watch for us; previously, without that, the CVO wasn't checking on us (this is my understanding).
I'm going to cite this example again: that PR went through just fine, but the MCO had … So I think the important bit is really this part of the CVO docs
The "not progressing" part here being key.
Edit: yeah... I think so. But I need to read the CVO code a bit more to feel like I know.
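The gating rule being discussed — the CVO waits on an operator until it is Available and no longer Progressing (and not Failing) — can be sketched in a few lines. This is a deliberately simplified, hypothetical model: the real types live in openshift/api and use string condition statuses, and the actual CVO logic is more involved.

```go
package main

import "fmt"

// Simplified stand-ins for ClusterOperator status conditions.
// The real API uses ConditionStatus strings ("True"/"False"/"Unknown");
// a bool is used here for brevity.
type ConditionType string

const (
	Available   ConditionType = "Available"
	Progressing ConditionType = "Progressing"
	Failing     ConditionType = "Failing"
)

type Condition struct {
	Type   ConditionType
	Status bool
}

// cvoConsidersDone models the "not progressing" rule from the CVO docs
// quoted above: the CVO stops waiting on an operator only when it reports
// Available=true, Progressing=false, and Failing=false. A missing
// condition is treated as false here.
func cvoConsidersDone(conds []Condition) bool {
	get := func(t ConditionType) bool {
		for _, c := range conds {
			if c.Type == t {
				return c.Status
			}
		}
		return false
	}
	return get(Available) && !get(Progressing) && !get(Failing)
}

func main() {
	// Available and settled: the CVO moves on.
	fmt.Println(cvoConsidersDone([]Condition{{Available, true}, {Progressing, false}}))
	// Available but still progressing: the CVO keeps waiting.
	fmt.Println(cvoConsidersDone([]Condition{{Available, true}, {Progressing, true}}))
}
```

This also illustrates why shipping the ClusterOperator manifest matters: if the object doesn't exist at all, there are no conditions for the CVO to evaluate, so nothing ever gates.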
/test e2e-aws-op
The previous error for e2e-aws-op:

time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.354Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/CreateVpc Details:"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 503 Service Unavailable"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Date: Fri, 08 Feb 2019 22:09:38 GMT"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] <?xml version=\"1.0\" encoding=\"UTF-8\"?>"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>11dcca80-bd04-463e-9359-6dff5be0652e</RequestID></Response>"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/CreateVpc failed, not retrying, error RequestLimitExceeded: Request limit exceeded."
/retest
/hold cancel
This is green now; I'm re-pushing to verify this again.
Signed-off-by: Antonio Murdaca <runcom@linux.com>
Force-pushed from cfb1056 to 34300cc (Compare)
Alrighty, all green!
@@ -329,7 +329,7 @@ func (optr *Operator) syncRequiredMachineConfigPools(config renderConfig) error
 		return err
 	}
 	var lastErr error
-	if err := wait.Poll(time.Second, 5*time.Minute, func() (bool, error) {
+	if err := wait.Poll(time.Second, 10*time.Minute, func() (bool, error) {
I feel like these timeouts may still be too short. But we can revisit that later.
Well, the installer waits up to 30 minutes anyway; we could bump up to that if needed, I guess.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Closes #383
Signed-off-by: Antonio Murdaca <runcom@linux.com>