pkg/operator: lay down ClusterOperator for MCO #386
Conversation
/hold
Force-pushed from 202e486 to e6e9415 (Compare)
Also, I'm still learning how to deploy this kind of change from the installer to test all of this out (I'd love any guidance if you have time 👼)
The operator needs to create and manage the ClusterOperator object.
This repo needs an e2e-aws job so you don't break e2e-aws with a merge here.
Probably a Prow issue, since other PRs are running it.
The operator doesn't need to create it because the CVO is going to create it from the manifest, no? As for managing it, I assumed this operator logic was already updating a ClusterOperator object; but sure, if it's not, that obviously needs to happen too. Without that, the PR will never pass e2e-aws.
@abhinavdahiya this came out of https://url.corp.redhat.com/dfe5ca9
I didn't go and verify Ben's assertion, but it sounds truthy to me, because clearly our operator has been failing and it hasn't been blocking installs.
Related to Colin's comment: indeed the MCO fails but the CVO is OK, whereas if another operator (wired in with a CO object) fails, the CVO reports failure as well.
It clearly states in the NOTE
That strikes me as a strange decision. Everything else in manifests is explicitly created/applied by the CVO, correct? Why not also create the clusteroperator resource? (Obviously it would be empty, but so what?) Why the special case? (Since our operator started out creating the clusteroperator programmatically, and I'm guessing this operator does the same, it's probably moot, but it's likely to catch someone out at some point.)
So, the code in this repo (and Ben's as well) is already initializing the CO programmatically, and that should remain? Just laying down the file makes sure the CVO is aware of it, right (besides Ben's comment above as to why this exception)?
Clearly installs are not gating on our operator status today. We're in agreement that they should be, right?
(Is there code somewhere in the CVO which explicitly skips them? I'm not seeing it offhand.) OK, so if we're in agreement that installs should gate on our operator, why is that not happening today, and how do we fix it?
My understanding is that we still need to ship that CO yaml so the CVO can monitor it and gate on us (since the CO management is done on our side already).
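For readers unfamiliar with the pattern under discussion: "laying down the CO yaml" means shipping a ClusterOperator manifest alongside the operator's other manifests, so the CVO creates the object and knows to wait on its status. A hypothetical sketch of such a manifest (the exact file name, apiVersion, and fields used by this PR may differ; this is illustrative only):

```yaml
# Illustrative only: not the exact file from this PR.
# The CVO applies manifests shipped with each operator image; including
# a ClusterOperator object here tells the CVO to watch and gate on it.
# The operator itself then fills in the status conditions at runtime.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: machine-config
spec: {}
```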
e6e9415
to
d4ab96d
Compare
We may be hitting some AWS limit on e2e, I guess.
/retest
/test e2e-aws-op
Pushed a dummy commit to check that tests fail if we fail to report Available status. I still think we need to instrument the e2e to check that clusterversion is OK for the machine-config operator, though, even if the installer should now gate on us if we fail something, right?
Force-pushed from 8fe082c to 009a8c4 (Compare)
/retest
Alright, this has started to pass now; latest commit named
Force-pushed from 230aa5c to 2aef1e8 (Compare)
so that the CVO is aware of our ClusterOperator and can gate on us being Available. Signed-off-by: Antonio Murdaca <runcom@linux.com>
Basically use the same pattern everywhere, as we were getting weird logs like:

time="2019-02-08T10:35:23Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config has not yet reported success"
time="2019-02-08T10:39:38Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed when progressing towards 3.11.0-587-g0e44a773-dirty because: error syncing: timed out waiting for the condition"

without knowing what was actually failing.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
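The pattern the commit message refers to is: keep a `lastErr` variable updated inside the poll closure, and wrap the generic "timed out" error with it on exit, so the CVO surfaces what was actually failing. A minimal stdlib sketch of that idea (`pollUntil` and `syncWithInformativeError` are hypothetical stand-ins for `wait.Poll` and the real sync functions, not the PR's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntil mimics the wait.Poll shape: retry condition every interval
// until it returns true, returns an error, or timeout elapses.
func pollUntil(interval, timeout time.Duration, condition func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := condition()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for the condition")
		}
		time.Sleep(interval)
	}
}

// syncWithInformativeError shows the lastErr pattern: the closure records
// the most recent transient error instead of aborting, and on timeout the
// generic error is wrapped with that detail.
func syncWithInformativeError(attemptsNeeded int) error {
	var lastErr error
	attempts := 0
	if err := pollUntil(time.Millisecond, 10*time.Millisecond, func() (bool, error) {
		attempts++
		if attempts < attemptsNeeded {
			// Hypothetical transient condition; keep polling rather than fail.
			lastErr = fmt.Errorf("pool has not progressed to latest configuration (attempt %d)", attempts)
			return false, nil
		}
		return true, nil
	}); err != nil {
		if lastErr != nil {
			return fmt.Errorf("error syncing: %v: %v", err, lastErr)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(syncWithInformativeError(3))    // converges within the timeout
	fmt.Println(syncWithInformativeError(1000)) // times out; error carries lastErr detail
}
```

Without the `lastErr` wrapping, the only message that escapes is "timed out waiting for the condition", which is exactly the unhelpful log line quoted above.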
Force-pushed from 2aef1e8 to cfb1056 (Compare)
Dropped 4108e90 and reworded the last commit as well; let's see what the CI says now.
But I keep circling back to the fact that clearly installs, and e2e-aws in general, were passing even when the MCO was in a failing state. Fixing that should be the primary goal, which hopefully this is doing?
Yes, I had previously added a test commit which failed to converge to Available for the MCO, and the CVO was indeed failing correctly. What you're describing, @cgwalters, was about the missing ClusterOperator, which now tells the CVO to watch for us; previously, without that, the CVO wasn't checking on us (this is my understanding).
I'm going to cite this example again: that PR went through just fine, but the MCO had … So I think the important bit is really this part of the CVO docs
The "not progressing" part here being key.
Edit: yeah... I think so. But I need to read the CVO code a bit more to feel like I know.
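The gating rule being discussed — the CVO waits on an operator until it is Available and no longer Progressing (and not Failing) — can be sketched in a few lines. This is a deliberately simplified, hypothetical model: the real types live in openshift/api and use string condition statuses, and the actual CVO logic is more involved.

```go
package main

import "fmt"

// Simplified stand-ins for ClusterOperator status conditions.
// The real API uses ConditionStatus strings ("True"/"False"/"Unknown");
// a bool is used here for brevity.
type ConditionType string

const (
	Available   ConditionType = "Available"
	Progressing ConditionType = "Progressing"
	Failing     ConditionType = "Failing"
)

type Condition struct {
	Type   ConditionType
	Status bool
}

// cvoConsidersDone models the "not progressing" rule from the CVO docs
// quoted above: the CVO stops waiting on an operator only when it reports
// Available=true, Progressing=false, and Failing=false. A missing
// condition is treated as false here.
func cvoConsidersDone(conds []Condition) bool {
	get := func(t ConditionType) bool {
		for _, c := range conds {
			if c.Type == t {
				return c.Status
			}
		}
		return false
	}
	return get(Available) && !get(Progressing) && !get(Failing)
}

func main() {
	// Available and settled: the CVO moves on.
	fmt.Println(cvoConsidersDone([]Condition{{Available, true}, {Progressing, false}}))
	// Available but still progressing: the CVO keeps waiting.
	fmt.Println(cvoConsidersDone([]Condition{{Available, true}, {Progressing, true}}))
}
```

This also illustrates why shipping the ClusterOperator manifest matters: if the object doesn't exist at all, there are no conditions for the CVO to evaluate, so nothing ever gates.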
/test e2e-aws-op
The previous error for e2e-aws-op:

time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.354Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/CreateVpc Details:"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 503 Service Unavailable"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Date: Fri, 08 Feb 2019 22:09:38 GMT"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] <?xml version=\"1.0\" encoding=\"UTF-8\"?>"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>11dcca80-bd04-463e-9359-6dff5be0652e</RequestID></Response>"
time="2019-02-08T22:09:38Z" level=debug msg="2019-02-08T22:09:38.497Z [DEBUG] plugin.terraform-provider-aws: 2019/02/08 22:09:38 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/CreateVpc failed, not retrying, error RequestLimitExceeded: Request limit exceeded."
/retest
/hold cancel
This is green now; I'm re-pushing to verify this again.
Signed-off-by: Antonio Murdaca <runcom@linux.com>
Force-pushed from cfb1056 to 34300cc (Compare)
Alrighty, all green!
@@ -329,7 +329,7 @@ func (optr *Operator) syncRequiredMachineConfigPools(config renderConfig) error
 		return err
 	}
 	var lastErr error
-	if err := wait.Poll(time.Second, 5*time.Minute, func() (bool, error) {
+	if err := wait.Poll(time.Second, 10*time.Minute, func() (bool, error) {
I feel like these timeouts may still be too short. But we can revisit that later.
Well, the installer waits up to 30 minutes anyway; we could bump up to that if needed, I guess.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Closes #383
Signed-off-by: Antonio Murdaca <runcom@linux.com>