
Enhancement for enabling single-node-openshift with workers #996

Merged
35 commits merged on Feb 23, 2022

Conversation

@omertuc (Contributor) commented Jan 5, 2022

No description provided.

## Summary

This enhancement aims to enable adding workers to a single-node cluster by
dealing with a "floating ingress" issue encountered in UPI-installed single

Is it only the "floating ingress" issue? Are other operators fine with it? Shouldn't we just set all the "control-plane" operators to run on the master in the case of SNO?

@omertuc (Contributor Author), Jan 6, 2022

If other operators get pods on the worker node, why would that be a problem? Why would we want to prevent that?

With ingress it's a problem because SNO users don't use a load balancer (as in day 1 they have no motivation to do so), so they might be surprised when ingress floats and their DNS entries become wrong once workers are added.
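
To make the pinning idea concrete, here's a minimal sketch of how the default router could be constrained to control-plane nodes through the IngressController API. This is only an illustration of the concept under discussion, not necessarily the mechanism the enhancement ends up using:

```yaml
# Hypothetical sketch: constrain the default router to control-plane nodes so
# the existing *.apps DNS entry (pointing at the single node) stays correct
# even after workers are added.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/master: ""
```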


## Alternatives

- Implement this change in the OpenShift Cluster Ingress Operator rather than the installer.
Contributor

It's already looking at it in order to decide the replica count
openshift/cluster-ingress-operator#519

Contributor Author

Right, but as mentioned, it's not sufficient to just look at the replica count; we need to look at the platform as well, because in the cloud with an LB, or on bare metal with keepalived, there's no point in pinning to the master pool.

Anyway, whether the IngressController change happens in the installer or in the ingress operator is pretty arbitrary, I don't really care either way - I'm mostly interested in discussing whether the idea makes sense in general.
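
For context, both signals mentioned here (the replica count via the topology fields, and the platform) live on the cluster-scoped Infrastructure config object. Here is a sketch of what it might report for a platform-"none" single-node cluster; the values are illustrative:

```yaml
# Illustrative status of the cluster-scoped Infrastructure object on a
# platform-"none" single control-plane node cluster.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: SingleReplica
  infrastructureTopology: SingleReplica
  platform: None            # deprecated in favor of platformStatus.type
  platformStatus:
    type: None
```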

control-plane and infrastructure topology fields, because those have the same value
whether you're on AWS (which doesn't suffer from the problems described in this enhancement)
or a UPI bootstrap-in-place installation. Maybe it could also look at the "platform"
field in the infrastructure CR, then only do this if the platform is "none".
Contributor

Besides the fact that SNO on the AWS platform isn't supported, we should probably strive for the same Ingress behavior on all platforms.
So if we decide to pin the ingress to the control-plane node on the "none" platform, we should do the same for other platforms.

Contributor Author

Besides the fact that SNO on the AWS platform isn't supported

It might be supported in the future, so I wouldn't like to dismiss it from the discussion entirely

we should probably strive for the same Ingress behavior on all platforms.
So if we decide to pin the ingress to the control-plane node on the "none" platform, we should do the same for other platforms.

Not sure that I agree with this. I'd rather ingress be as flexible as possible when the circumstances allow it, and only force it onto a smaller subset of nodes (the SNO node) when it's necessary.

I'd argue that it's more important for SNO to remain as similar to multi-node as possible than for it to remain similar to itself across different platforms. And by pinning only when necessary, you're closer to the former than the latter.

@omertuc (Contributor Author) commented Jan 6, 2022

I see, so it's not about how you do the bootstrap, it's about the platform.
Consider rephrasing bootstrap-in-place to None platform

Yes. I'll revise the enhancement to make this clearer

@omertuc (Contributor Author) commented Jan 6, 2022

Enhancement updated

- "@tsorya"
- TBD
approvers:
- TBD
Contributor

It would be good to have someone from the cluster infra team included as a reviewer, at least. @JoelSpeed can help identify someone.

Which other teams are affected by this work? @sinnykumari do you anticipate any MCO impact?


@dhellmann of course. We wanted to go over it first internally (even I didn't have time to review it so far) and then share it broadly for feedback and comments.
It will also affect the networking and installer teams. We'll ask for more feedback early next week.

@fakebanana

A SNO cluster's control plane can't scale beyond a single node - how do we ensure that adding additional worker nodes (and presumably more pods to manage) won't overload the control plane? What tests do we have to identify the failure modes that occur when the resources allocated to the control plane are exceeded? Do we have tests that exercise the systems which recover the control plane when these limits are exceeded, and the alerts which should fire when these systems fail to operate as intended (and do the alerts fire if the control plane is down)?

@omertuc (Contributor Author) commented Jan 17, 2022

A SNO cluster's control plane can't scale beyond a single node - how do we ensure that adding additional worker nodes (and presumably more pods to manage) won't overload the control plane?

That's a good point.

For now, according to https://issues.redhat.com/browse/MGMT-8414, the goal is to address the addition of a small number of worker nodes (this should have been, and will be, clarified in the enhancement's goals/non-goals section).

I think we'll have to test those limits in practice, and maybe set a limit on how many pods are allowed on these additional nodes (either a hard limit or a documented limit, i.e. "You may add additional worker nodes, but the total number of pods across all nodes in the cluster should not exceed the current advertised single-node limits"). I'll give it some thought and address it in the enhancement.

What tests do we have to identify the failure modes that occur when the resources allocated to the control plane are exceeded?

The current plan is simply launching a single control-plane node cluster, adding a few worker nodes to it, and then running conformance tests.

I'll search for prior art with regards to testing the control-plane by pushing it to its limit, and see if we can do something similar for this kind of topology.

Do we have tests that exercise the systems which recover the control plane when these limits are exceeded,

I'm not aware of any such systems. I'll check to see how things are currently handled in standard multi-node clusters.

and the alerts which should fire when these systems fail to operate as intended

I'm not aware of specific alerts relating to the control plane being overloaded. I'm assuming that if such alerts exist, there's no reason they wouldn't apply to the SNO + workers topology.

(and do the alerts fire if the control plane is down?)

Depends on what exactly goes wrong and the state of the cluster - Prometheus/Alertmanager can arbitrarily be running on the control plane node or on one of the other additional worker nodes. As long as they're both able to keep running, I don't see a reason why alerts wouldn't fire.

Thanks for your comment, it gave me some things I'll need to think about; not sure all of them will be addressed in this enhancement, though.

run on the single control plane node, and as a result any
`*.apps.<cluster>.<base>` DNS entries which originally pointed at the single
control plane node will remain correct even in the face of newly added worker
nodes.
Contributor

Any scale concerns here, since this means we'll only ever have a single router replica, no matter how many workers are added?

For that matter, any other scale concerns with a single-replica control plane topology and an unknown number of workers?


It's a good point, but I'm not sure whether it's different from multi-node, right? We don't have a way to block it in the multi-node scenario; mostly it's documentation. I would assume we'll have to do the same here. Also, I don't think we should recommend going over a handful of workers. If people want scale, they should go with an HA deployment. I would suggest adding it to the docs.


- As an OpenShift cluster administrator, I want to add worker nodes to my
existing single control-plane node cluster, so that it'll be able to meet
growing computation demands.
Contributor

What's the trade-off point here? At some point, if you have enough resources to run extra workers, why are you not running a full HA control plane topology?

How many workers do we expect people to add to a "single node" install, before we would have told them "if you're going to do that, you should just run a traditional HA cluster"?

@omertuc (Contributor Author), Jan 17, 2022

I personally don't have answers to these questions; I suggest taking a look at MGMT-8414 and the documents linked in it.

Some things that I think may make this make more sense:

  • If my understanding of the motivations behind the requirements is correct, the most common use case is only a single additional worker being added, and in that case it's not enough to form an HA control plane. Situations with 2 or 3 additional workers would be less common.

  • 3 machines behaving as 1 control plane + 2 workers are not equivalent to 3 machines that all act as both control plane and workers (compact cluster) - you pay for the HA of the control plane with, for example, a lot of memory for maintaining 3 k8s API servers rather than just 1.

  • I wouldn't be surprised if users actually did want to simply transform their 1 control plane + 2 workers cluster into a simple HA compact cluster if given the option, but this is not something we support and I believe it's a harder technical challenge to tackle - users would like to avoid re-installing their existing 1 + 2 cluster as a compact cluster - so for now, 1 + 2 is a good, simpler middle ground to answer user needs before we begin supporting live control plane expansion.

"Infrastructure Topology" is set to "SingleReplica". This will ensure that the
`router-default` deployment created by the Cluster Ingress Operator will always
run on the single control plane node, and as a result any
`*.apps.<cluster>.<base>` DNS entries which originally pointed at the single

How will workers get those DNS values? Or do we not need them for workers?

Contributor Author

We do need them for workers. The user is supposed to make sure DNS is configured "globally" and pointing to the control-plane node before trying to add workers.
In the case of Assisted, we'll do our best to do it automatically whenever possible with dnsmasq tricks.

@romfreiman

@omertuc I think let's add a non-goal to deal with the solution's scalability. The goal of this type of deployment is not to stress the cluster, but more to provide some extra capacity for the workload. wdyt?

@omertuc changed the title from "Enhancement for enabling single-node-openshift day2 workers expansion" to "Enhancement for enabling single-node-openshift with workers" on Jan 18, 2022
@romfreiman
@sdodson

@omertuc (Contributor Author) commented Jan 18, 2022

/test markdownlint

@omertuc (Contributor Author) commented Jan 18, 2022

Linter failures due to #1011

- "@romfreiman"
- "@eranco74"
- "@tsorya"
- TBD
@bparees (Contributor), Jan 18, 2022

Can you please make it clear from this section which teams are going to be impacted, either by having to implement something for this now or by supporting the implementation in the future? The teams should be tagged for review with an indication of what aspect of it they will be involved in.


Similarly, in the Assisted Installer, the user is able to complete the
installation without needing to define any DNS entries. This is currently made
possible by injecting a `MachineConfig` manifest targeting the "master" pool.
Contributor

Is that only for single-node deployments, or is the same configuration always added by the assisted installer?


only SNO
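
To illustrate the shape of such a manifest, here is a hypothetical sketch of a MachineConfig targeting the "master" pool that resolves the cluster's wildcard ingress domain to the node's own IP via dnsmasq. The name, file path, domain, and IP are all made up for illustration and are not what the Assisted Installer actually generates:

```yaml
# Hypothetical sketch only; name, path, domain, and IP are illustrative.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-master-sno-ingress-dns            # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/dnsmasq.d/single-node.conf   # hypothetical path
          mode: 420                               # 0644
          overwrite: true
          contents:
            # decodes to: address=/apps.example-cluster.example.com/192.0.2.10
            source: "data:,address%3D%2Fapps.example-cluster.example.com%2F192.0.2.10%0A"
```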

It's proposed that the only scenarios in which the "Infrastructure Topology"
parameter will be "HighlyAvailable" are when installing a single control-plane
node cluster in the cloud with one or more day-1 workers. Cloud deployments
have a load balancer so they can benefit from having highly-available ingress.
Contributor

We don't currently offer formal support for single-node deployments in the cloud. Is that expected to change as part of this work?

If the reason for using the HA setting is the presence of a load balancer, would we want to make it possible to configure the infrastructure topology for HA on metal (or platform none) deployments if there is also a load balancer there?

@romfreiman, Jan 18, 2022

We don't currently offer formal support for single-node deployments in the cloud. Is that expected to change as part of this work?

Nope. Cloud support is out of scope. But we did want to address the flow regardless.

Regarding HA for infrastructure topology - I personally think that it should always be SingleReplica. I think if customers need any kind of HA, they should move to a 3-node + workers cluster. Also, I don't think we should differentiate between on-prem and cloud environments unless it's a real must.

Contributor

Regarding HA for infrastructure topology - I personally think that it should always be SingleReplica. I think if customers need any kind of HA, they should move to a 3-node + workers cluster. Also, I don't think we should differentiate between on-prem and cloud environments unless it's a real must.

Being consistent is going to make it easier to reason about the system, so I support that as well, especially if we're still considering formal cloud support out of scope.

@omertuc (Contributor Author), Jan 18, 2022

I addressed this point in the third bullet of the "Alternatives" section.

Some caveats -

  • Setting SingleReplica in those scenarios would mean control plane topology == infrastructure topology under all circumstances; these two parameters would always be identical unless we one day allow users to choose/change them.

  • Making this consistent across the types of single-node deployments means we're losing consistency with regular HA OCP deployments.

  • Cloud deployments would lose the HA ingress benefits they can get from having a load balancer.

Anyway, I don't think the decision here is too critical; I don't mind either way. I'll update the enhancement to go with SingleReplica everywhere if you confirm you're okay with the above caveats.

Contributor

The control plane topology can also be "external", in the Hypershift case. So the infrastructure and control plane fields wouldn't always be the same.
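
For reference, a hypothetical install-config excerpt for the kind of topology under discussion, a single control-plane node installed with day-1 workers (shown here on the "none" platform; all values are purely illustrative):

```yaml
# Illustrative excerpt only; not a complete or supported install-config.
apiVersion: v1
baseDomain: example.com
metadata:
  name: sno-with-workers
controlPlane:
  name: master
  replicas: 1          # single control-plane node
compute:
  - name: worker
    replicas: 2        # day-1 workers
platform:
  none: {}
```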

platform solves the issue described in this enhancement with virtual IP
addresses/keepalived. This approach was dismissed due to much higher
development efforts and additional processes that would need to run on the
already resource constrained single control-plane node. Furthermore, even once
Contributor

Which additional processes would be running? What other implementation work is needed?


I assume @omertuc meant keepalived, haproxy, and so on - all the on-prem networking.

- "@staebler"
- "@derekwaynecarr"
approvers:
- TBD


@dhellmann @sdodson who should we add as approvers?

Contributor

There's an API change here, so we need an API approver to look at it. @mfojtik can you help identify someone with time to review this?

I would expect to see someone from the networking team familiar with ingress, too. @knobunc can you help find a reviewer?


- "@derekwaynecarr"
approvers:
- TBD
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)


@Miciah who's the approver from network-edge side?

@Miciah (Contributor) left a comment

I want to make sure I understand the defaulting behavior correctly. Otherwise, I just have a few stylistic suggestions and typo fixes. The overall approach looks good to me.
/approve

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Feb 23, 2022
omertuc and others added 11 commits February 23, 2022 10:38
Co-authored-by: Miciah Dashiel Butler Masters <mmasters@redhat.com>
@openshift-ci bot commented Feb 23, 2022

@omertuc: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci bot commented Feb 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Miciah (Contributor) commented Feb 23, 2022

/lgtm

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Feb 23, 2022
@openshift-merge-robot merged commit 08ffe91 into openshift:master on Feb 23, 2022
@dhellmann (Contributor)

This hadn't been fully reviewed, so it should not have been merged. @omertuc proposed a revert in #1040 and will propose another PR to resubmit the doc so we can get all of the teams involved to sign off before merging it.

omertuc added a commit to omertuc/enhancements that referenced this pull request on Feb 23, 2022: "Reopening openshift#996 since it got accidentally merged without sufficient reviews"

omertuc added a commit to omertuc/enhancements that referenced this pull request on Feb 28, 2022: "Reopening openshift#996 since it got accidentally merged without sufficient reviews"