
pkg/operator: lay down ClusterOperator for MCO #386

Merged: 3 commits merged into openshift:master from wire-co-to-cvo on Feb 9, 2019

Conversation

@runcom (Member) commented Feb 6, 2019:

Closes #383

Signed-off-by: Antonio Murdaca <runcom@linux.com>

@runcom (Member Author) commented Feb 6, 2019:

/hold

@openshift-ci-robot added the do-not-merge/hold and do-not-merge/work-in-progress labels Feb 6, 2019
@openshift-ci-robot added the size/M label Feb 6, 2019
@runcom (Member Author) commented Feb 6, 2019:

Also, I'm still learning how to deploy these kinds of changes from the installer to test all of this out (I'd love any guidance if you have time 👼)

@abhinavdahiya (Contributor) commented:

The operator needs to create and manage the clusteroperator object.

https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md

@bparees commented Feb 6, 2019:

this repo needs an e2e-aws job so you don't break e2e-aws w/ a merge here.

@runcom (Member Author) commented Feb 6, 2019:

> this repo needs an e2e-aws job so you don't break e2e-aws w/ a merge here.

probably a Prow issue since other PRs are running it

@bparees commented Feb 6, 2019:

> The operator needs to create and manage the clusteroperator object.

the operator doesn't need to create it because the CVO is going to create it from the manifest, no?

as for managing it, I assumed this operator logic was already updating a clusteroperator object, but sure, if it's not, that obviously needs to happen too... without that, the PR will never pass e2e-aws.

@cgwalters (Member) commented Feb 6, 2019:

@abhinavdahiya this came out of https://url.corp.redhat.com/dfe5ca9

> Related to this, if your operator is not creating a ClusterOperator resource as part of the CVO-managed manifest content, the CVO is not aware of your operator and will not wait for it to have an available=true condition before considering the install (and upgrade) complete.

I didn't go and verify Ben's assertion but...it sounds truthy to me because clearly our operator has been failing and it hasn't been blocking installs.

@runcom (Member Author) commented Feb 6, 2019:

Related to Colin's comment: indeed the MCO fails but the CVO is OK, whereas if another operator (wired in with a CO object) does fail, then the CVO reports failure as well.

NAME                                 VERSION                      AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config                                                    False       True          True      3h
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-02-06T11:34:02Z
  Generation:          1
  Resource Version:    143623
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 1a4d0c0f-2a03-11e9-9841-0a9111a1212a
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-02-06T11:34:02Z
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-02-06T11:34:02Z
    Message:               Progressing towards 3.11.0-576-gb4ddac16-dirty
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-02-06T11:39:47Z
    Message:               Failed when progressing towards 3.11.0-576-gb4ddac16-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Reason:                error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Status:                True
    Type:                  Failing
  Extension:
    Master:         pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
    Worker:         3 out of 3 nodes have updated to latest configuration worker-45815544f3a9a9a1ac4cc79405de4973
  Related Objects:  <nil>
  Versions:         <nil>
Events:             <none>
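For context, conditions like the ones above are written by the operator itself. Below is a minimal sketch of what that reporting might look like with the openshift/api config types; the helper name is hypothetical and the clientset signatures (the context and options arguments in particular) follow current openshift/client-go conventions rather than the vendored client in this PR:

```go
package operator

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reportProgressing is a hypothetical helper: it marks the machine-config
// ClusterOperator as Progressing towards a version and not yet Available,
// mirroring the conditions in the `oc describe` output above.
func reportProgressing(ctx context.Context, client configclient.Interface, version string) error {
	co, err := client.ConfigV1().ClusterOperators().Get(ctx, "machine-config", metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.Now()
	co.Status.Conditions = []configv1.ClusterOperatorStatusCondition{
		{Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse, LastTransitionTime: now},
		{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, LastTransitionTime: now,
			Message: fmt.Sprintf("Progressing towards %s", version)},
		// "Failing" was the condition name at the time; it was later renamed "Degraded".
		{Type: configv1.ClusterStatusConditionType("Failing"), Status: configv1.ConditionFalse, LastTransitionTime: now},
	}
	_, err = client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
	return err
}
```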

@openshift-ci-robot added the approved label Feb 6, 2019
@abhinavdahiya (Contributor) commented:

If you read https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#how-should-i-include-clusteroperator-custom-resource-in-manifests

It clearly states in the NOTE: "Also note that the ClusterOperator resource in /manifests is only a communication mechanism, to tell the ClusterVersionOperator which ClusterOperator resource to wait for. The ClusterVersionOperator does not create the ClusterOperator resource; this, and updating it, is the responsibility of the respective operator."
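Put differently: the stub in /manifests only tells the CVO which object to wait for, and the operator itself has to make sure the object actually exists and keep its status current. A rough sketch of that "ensure it exists" step — the function name is invented and the clientset signatures are assumptions, not code taken from this repo:

```go
package operator

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ensureClusterOperator is a hypothetical helper that creates the
// machine-config ClusterOperator when it is missing, since the CVO will not
// create it from the /manifests stub.
func ensureClusterOperator(ctx context.Context, client configclient.Interface) (*configv1.ClusterOperator, error) {
	co, err := client.ConfigV1().ClusterOperators().Get(ctx, "machine-config", metav1.GetOptions{})
	if err == nil {
		return co, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	co = &configv1.ClusterOperator{ObjectMeta: metav1.ObjectMeta{Name: "machine-config"}}
	return client.ConfigV1().ClusterOperators().Create(ctx, co, metav1.CreateOptions{})
}
```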

@bparees commented Feb 6, 2019:

> It clearly states in the NOTE: "Also note that the ClusterOperator resource in /manifests is only a communication mechanism, to tell the ClusterVersionOperator which ClusterOperator resource to wait for."

That strikes me as a strange decision. Everything else in manifests is explicitly created/applied by the CVO, correct? Why not also create the clusteroperator resource (obviously it would be empty, but so what)? Why the special case?

(Since our operator started out creating the clusteroperator programmatically, and I'm guessing this operator does the same, it's probably moot, but it's likely to catch someone out at some point.)

@runcom (Member Author) commented Feb 6, 2019:

> It clearly states in the NOTE: "Also note that the ClusterOperator resource in /manifests is only a communication mechanism, to tell the ClusterVersionOperator which ClusterOperator resource to wait for. The ClusterVersionOperator does not create the ClusterOperator resource; this, and updating it, is the responsibility of the respective operator."

So, the code in this repo (and Ben's as well) is actually already initializing the CO programmatically, and that should remain? Just laying down the file makes sure the CVO is aware of it, right (besides Ben's comment above about why this exception exists)?

@cgwalters (Member) commented:

Clearly installs are not gating on our operator status today. We're in agreement that they should be, right?

> The ClusterVersionOperator does not create the ClusterOperator resource, this and updating it is the responsibility of the respective operator.

(Is there code somewhere in the CVO which explicitly skips them? I'm not seeing it offhand.)

OK so if we're in agreement that installs should gate on our operator, why is that not happening today and how do we fix it?

@runcom (Member Author) commented Feb 6, 2019:

> OK so if we're in agreement that installs should gate on our operator, why is that not happening today and how do we fix it?

my understanding is that we still need to ship that CO yaml so the CVO can monitor it and gate on us (since the CO management is done on our side already)

@openshift-ci-robot added the size/XS label and removed the size/M label Feb 6, 2019
@runcom (Member Author) commented Feb 6, 2019:

We may be hitting some AWS limits on e2e, I guess:

level=error msg="\t* aws_vpc.new_vpc: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached."

level=error msg="\t* aws_s3_bucket.ignition: Error creating S3 bucket: TooManyBuckets: You have attempted to create more buckets than allowed"

@runcom (Member Author) commented Feb 7, 2019:

/retest

@runcom (Member Author) commented Feb 7, 2019:

/test e2e-aws-op

@runcom (Member Author) commented Feb 7, 2019:

Pushed a dummy commit to check that tests fail if we fail to report Available status. I still think we need to instrument the e2e to check that the clusterversion is OK for the machine-config operator though, even if the installer should now gate on us when we fail something, right?
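Such an e2e check could, for instance, poll the ClusterVersion object until it settles. This is only a sketch of the idea, assuming the test would use the openshift config client; the helper name and timeout are invented for illustration and are not the e2e change from this PR:

```go
package e2e

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForClusterVersionOK is a hypothetical e2e helper: it polls the
// ClusterVersion resource (named "version") until it reports Available=True
// and Failing!=True, i.e. until the CVO considers the cluster healthy.
func waitForClusterVersionOK(ctx context.Context, client configclient.Interface) error {
	return wait.Poll(10*time.Second, 30*time.Minute, func() (bool, error) {
		cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if err != nil {
			return false, nil // transient API errors: keep polling
		}
		var available, failing bool
		for _, c := range cv.Status.Conditions {
			switch c.Type {
			case configv1.OperatorAvailable:
				available = c.Status == configv1.ConditionTrue
			case configv1.ClusterStatusConditionType("Failing"):
				failing = c.Status == configv1.ConditionTrue
			}
		}
		return available && !failing, nil
	})
}
```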

@openshift-ci-robot added the size/S label and removed the size/XS label Feb 7, 2019
@runcom (Member Author) commented Feb 8, 2019:

/retest

@runcom (Member Author) commented Feb 8, 2019:

Alright, this has started to pass now; the latest commit, named x, bumps the timeout on the resources we're waiting on. @cgwalters @abhinavdahiya would you take another pass at this?

@abhinavdahiya (Contributor) commented:

This is missing why we are checking for those conditions...
The installer currently already waits for an available CVO, and that means an available MCO...

This needs a better title and commit message.

Otherwise looks good.

so that the CVO is aware of our ClusterOperator and can gate on us being
Available.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Basically use the same pattern everywhere as we were getting weird logs
like:

time="2019-02-08T10:35:23Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config has not yet reported success"
time="2019-02-08T10:39:38Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed when progressing towards 3.11.0-587-g0e44a773-dirty because: error syncing: timed out waiting for the condition"

w/o knowing what was actually failing.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
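The pattern this commit message describes, sketched here in simplified form (this is not the exact code from the commit), is to remember the last sync error inside the poll loop and fold it into the timeout error, so the CVO and installer logs show the real reason rather than a bare "timed out waiting for the condition":

```go
package operator

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPools is a simplified sketch of the lastErr pattern: poll a check
// function, remember its most recent failure, and wrap that failure into the
// timeout error instead of returning an uninformative timeout alone.
func waitForPools(check func() error) error {
	var lastErr error
	err := wait.Poll(time.Second, 10*time.Minute, func() (bool, error) {
		if err := check(); err != nil {
			lastErr = err
			return false, nil // not ready yet; keep the reason for later
		}
		return true, nil
	})
	if err == wait.ErrWaitTimeout && lastErr != nil {
		return fmt.Errorf("%v during syncRequiredMachineConfigPools: %v", err, lastErr)
	}
	return err
}
```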
@openshift-ci-robot added the size/M label and removed the size/L label Feb 8, 2019
@runcom (Member Author) commented Feb 8, 2019:

dropped 4108e90 and reworded the last commit as well, let's see what the CI says now

@cgwalters (Member) commented:

> installer currently already waits for available CVO, that means available MCO...

But I keep circling back to the fact that clearly installs and e2e-aws in general were passing even if the MCO was in failing state. Fixing that should be the primary goal - which hopefully this is doing?

@runcom (Member Author) commented Feb 8, 2019:

> But I keep circling back to the fact that clearly installs and e2e-aws in general were passing even if the MCO was in failing state. Fixing that should be the primary goal - which hopefully this is doing?

yes, I had previously added a test commit which failed to converge to Available for the MCO and the CVO was indeed failing correctly

what you're saying @cgwalters was about the missing ClusterOperator which now tells the CVO to watch for us - previously, w/o that, the CVO wasn't checking on us (this is my understanding)

@cgwalters (Member) commented Feb 8, 2019:

> installer currently already waits for available CVO, that means available MCO...

I'm going to cite this example again:
openshift/installer#1189 (comment)

That PR went through just fine, but the MCO had Available: False and Failing: True. Now...it does seem likely that we only flipped to unavailable+failing after some time in available+progressing.

So I think the important bit is really this part of the CVO docs

> It then waits for the instance in the cluster until ... .status.conditions report available, not progressing and not failed

The "not progressing" part here being key.

> what you're saying @cgwalters was about the missing ClusterOperator which now tells the CVO to watch for us - previously, w/o that, the CVO wasn't checking on us (this is my understanding)

edit: Yeah...I think so. But I need to read the CVO code a bit more to feel like I know.
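For reference, the "available, not progressing and not failed" criterion quoted above boils down to a check over the same conditions shown in the status dump earlier. A minimal sketch of that check (helper names invented, not taken from the CVO's code):

```go
package operator

import (
	configv1 "github.com/openshift/api/config/v1"
)

// conditionTrue reports whether the named condition is present and True.
func conditionTrue(co *configv1.ClusterOperator, t configv1.ClusterStatusConditionType) bool {
	for _, c := range co.Status.Conditions {
		if c.Type == t {
			return c.Status == configv1.ConditionTrue
		}
	}
	return false
}

// cvoConsidersDone mirrors the criterion from the CVO docs: Available must be
// True, and neither Progressing nor Failing (later renamed Degraded) may be True.
func cvoConsidersDone(co *configv1.ClusterOperator) bool {
	return conditionTrue(co, configv1.OperatorAvailable) &&
		!conditionTrue(co, configv1.OperatorProgressing) &&
		!conditionTrue(co, configv1.ClusterStatusConditionType("Failing"))
}
```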

@runcom (Member Author) commented Feb 8, 2019:

/test e2e-aws-op

@abhinavdahiya (Contributor) commented:

The previous error for e2e-aws-op was:

Validate Response ec2/CreateVpc failed, not retrying, error RequestLimitExceeded: Request limit exceeded

From openshift_install.log:

time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.354Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/CreateVpc Details:"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 503 Service Unavailable"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Date: Fri, 08 Feb 2019 22:09:38 GMT"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] <?xml version=\"1.0\" encoding=\"UTF-8\"?>"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>11dcca80-bd04-463e-9359-6dff5be0652e</RequestID></Response>"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/CreateVpc failed, not retrying, error RequestLimitExceeded: Request limit exceeded."

@runcom (Member Author) commented Feb 9, 2019:

/retest

@runcom (Member Author) commented Feb 9, 2019:

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label Feb 9, 2019
@runcom (Member Author) commented Feb 9, 2019:

This is green now, I'm re-pushing to verify this again

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@runcom (Member Author) commented Feb 9, 2019:

Alrighty, all green

```diff
@@ -329,7 +329,7 @@ func (optr *Operator) syncRequiredMachineConfigPools(config renderConfig) error
 		return err
 	}
 	var lastErr error
-	if err := wait.Poll(time.Second, 5*time.Minute, func() (bool, error) {
+	if err := wait.Poll(time.Second, 10*time.Minute, func() (bool, error) {
```
A reviewer (Member) commented on the diff above:
I feel like these timeouts may still be too short. But we can revisit that later.

@runcom (Member Author) replied:
well, the installer waits up to 30 mins anyway, we could bump up to that if needed I guess

@cgwalters (Member) commented:

/lgtm

@openshift-ci-robot added the lgtm label Feb 9, 2019
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 287c861 into openshift:master Feb 9, 2019
@runcom deleted the wire-co-to-cvo branch February 9, 2019 14:26