Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bootkube: Inject bootstrap MachineConfigs into cluster #1189

Closed
wants to merge 1 commit into from

Conversation

cgwalters
Copy link
Member

This is an ugly fix for
openshift/machine-config-operator#367

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 4, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cgwalters
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: flaper87

If they are not already assigned, you can assign the PR to them by writing /assign @flaper87 in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

# Copy the bootstrap MCs to inject into the target cluster
# Yes this is a brutal hack, need to improve the MCC bootstrap above
# 9a so we're after 99 - should change the others to 50- or something?
for x in /etc/mcs/bootstrap/machine-configs/*.yaml; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these come from in the first place? Looks like /etc/mcs has only been mentioned in 6da1973 (#879), but I don't see any installer code obviously populating it. Maybe it's here? If so, is there a reason you couldn't address this entirely within openshift/machine-config-operator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the current code is here openshift/machine-config-operator#367 (comment)

but I don't see any installer code obviously populating it.

AIUI it's the static pod which we are just removing above this code:
/etc/kubernetes/manifests/machineconfigoperator-bootstrap-pod.yaml

If so, is there a reason you couldn't address this entirely within openshift/machine-config-operator?

Maybe - I didn't write this code and am still learning things here, so alternative suggestions appreciated!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand things though...that bootstrap pod is what's serving out the Ignition to the masters. We don't have masters online until after echo "etcd cluster up. Killing etcd certificate signer..." (right?)

I believe we won't even have the machineconfigs CRD in the cluster until the MCO comes online. So we will end up waiting in openshift.service for that - similar to how we wait for the cluster API objects and inject those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I don't see any installer code obviously populating it.

AIUI it's the static pod which we are just removing above this code:
/etc/kubernetes/manifests/machineconfigoperator-bootstrap-pod.yaml

So can we update that code to make these alterations? It's maybe here calling here?

So we will end up waiting in openshift.service for that - similar to how we wait for the cluster API objects and inject those.

I'm not sure how this comes into this pull request, but I'm pushing to get openshift.service and openshift.sh functionality moved into cluster-bootstrap. See #1147, which I'm going to reroll after cluster-bootstrap picks up openshift/library-go#220.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So can we update that code to make these alterations?

Mmm...this code needs to generate content to inject into the target cluster. The static pod is running on the bootstrap before the cluster, right?

but I'm pushing to get openshift.service and openshift.sh functionality moved into cluster-bootstrap.

That makes sense to me...I was surprised at the low-tech nature of openshift.sh.

@abhinavdahiya
Copy link
Contributor

This is an ugly fix for
openshift/machine-config-operator#367

Why do you believe the bootstrap mcc and long-running mcc are creating 2 different final machineconfigs?

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2019
@cgwalters
Copy link
Member Author

Why do you believe the bootstrap mcc and long-running mcc are creating 2 different final machineconfigs?

Let's discuss that in openshift/machine-config-operator#367
Clearly a lot of people are hitting this.

@cgwalters
Copy link
Member Author

cgwalters commented Feb 4, 2019

Clearly a lot of people are hitting this.

Just to emphasize this, look at e.g. this recent PR to the installer which passed CI just fine...yet if you drill down into the clusteroperator status you'll notice the MCO is degraded:

"master": "pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.",

And if you look at the MCD logs in openshift-machine-config-operator_machine-config-daemon-82kdq_machine-config-daemon.log.gz:

I0202 02:27:31.338962 11501 daemon.go:875] While getting MachineConfig master-abdb3e951a9d508a63687694fe12b0cc, got: machineconfigs.machineconfiguration.openshift.io "master-abdb3e951a9d508a63687694fe12b0cc" not found. Retrying...

@cgwalters
Copy link
Member Author

We need to get to the point where we're gating e2e-aws on nodes not being degraded (which is an interesting topic...I'd been adding tests for this to the MCO repo but do we have a way to extend e2e-aws with tests that live in other git repos besides origin)?

I fully admit this PR is a hack but I'm trying to make some progress on this as it's blocking my work on OS updates.

@cgwalters
Copy link
Member Author

/test e2e-aws

1 similar comment
@cgwalters
Copy link
Member Author

/test e2e-aws

@openshift-ci-robot
Copy link
Contributor

@cgwalters: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-openstack c8d4ff6 link /test e2e-openstack
ci/prow/e2e-libvirt c8d4ff6 link /test e2e-libvirt
ci/prow/launch-aws c8d4ff6 link /test launch-aws
ci/prow/e2e-aws c8d4ff6 link /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Member Author

While I think this could still make sense...we don't need this right now since we did end up identifying the underlying problem in #1194

@cgwalters
Copy link
Member Author

Reopening this; just hit the failure case again here: openshift/machine-config-operator#363 (comment)

I know this isn't the most beautiful code but it should let us avoid this problem; we will be able to see the config drift.

@abhinavdahiya
Copy link
Contributor

I know this isn't the most beautiful code but it should let us avoid this problem;

Open to any PRs that use mc* render to handle this like other operators. no inline.

@deads2k
Copy link
Contributor

deads2k commented Jan 16, 2020

We have hit this in the field caused by a cluster-admin not getting all the options quite right. As I understand it, this can alleviate the "your cluster isn't installing" problem automatically in the unfortunate case and carries no negative side-effects in the common case. We should get this into a state where it (or something like it) can merge.

/reopen

@openshift-ci-robot
Copy link
Contributor

@deads2k: Failed to re-open PR: state cannot be changed. The install-mcd-bootstrap branch was force-pushed or recreated.

In response to this:

We have hit this in the field caused by a cluster-admin not getting all the options quite right. As I understand it, this can alleviate the "your cluster isn't installing" problem automatically in the unfortunate case and carries no negative side-effects in the common case. We should get this into a state where it (or something like it) can merge.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Copy link
Member Author

Amazingly, this PR rebases cleanly. However, looking at it now with months more of experience, I think we can clean this up...

@cgwalters
Copy link
Member Author

Moved this to #2936

@runcom
Copy link
Member

runcom commented Jan 16, 2020

As I understand it, this can alleviate the "your cluster isn't installing" problem automatically in the unfortunate case

I don't fully agree with that, in the event there's a drift we're installing something which is what the user wants but then the MCO will reconcile to something completely not intended but the cluster can still function and nobody notices, it should be fatal instaed

@deads2k
Copy link
Contributor

deads2k commented Jan 16, 2020

As I understand it, this can alleviate the "your cluster isn't installing" problem automatically in the unfortunate case

I don't fully agree with that, in the event there's a drift we're installing something which is what the user wants but then the MCO will reconcile to something completely not intended but the cluster can still function and nobody notices, it should be fatal instaed

reply in #2936 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants