Enhancement for enabling single-node-openshift with workers #996
Conversation
## Summary

This enhancement aims to enable adding workers to a single-node cluster by
dealing with a "floating ingress" issue encountered in UPI-installed single
Is it only the "floating ingress" issue? Are other operators fine with it? Shouldn't we just set all the "control-plane" operators to run on the master in the case of SNO?
If other operators get pods on the worker node, why would that be a problem? Why would we want to prevent that?
With ingress it's a problem because SNO users don't use a load balancer (on day 1 they have no motivation to do so), so they might get surprised when ingress floats and their DNS entries become wrong once workers are added.
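For concreteness, a minimal sketch of what pinning the default ingress to the control-plane node could look like at the IngressController level. This is illustrative only and not necessarily the mechanism the enhancement ends up using; the replica count and node-role label are assumptions.

```yaml
# Illustrative only: keep the default router on the (single) control-plane node.
# Field names come from the operator.openshift.io/v1 IngressController API;
# the label and replica count are assumptions for this sketch.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 1
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/master: ""
```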
## Alternatives

- Implement this change in the OpenShift Cluster Ingress Operator rather than the installer.
It's already looking at it in order to decide the replica count
openshift/cluster-ingress-operator#519
Right, but as mentioned, it's not sufficient to just look at the replica count; we need to look at the platform as well, because in cloud with a LB or on bare metal with keepalived there's no point in pinning to the master pool.
Anyway, whether the IngressController change happens in the installer or in the ingress operator is pretty arbitrary, and I don't really care either way - I'm mostly interested in discussing whether the idea makes sense in general.
control-plane and infrastructure topology fields, because those have the same value
whether you're on AWS (which doesn't suffer from the problems described in this enhancement)
or a UPI bootstrap-in-place installation. Maybe it could also look at the "platform"
field in the infrastructure CR, then only do this if the platform is "none".
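For reference, a sketch of the cluster Infrastructure CR fields being discussed here, roughly as they might appear on a "none"-platform bootstrap-in-place single-node install. The values are illustrative, not taken from the PR.

```yaml
# Illustrative values only; the point is that controlPlaneTopology and
# infrastructureTopology alone don't distinguish AWS SNO from a UPI
# bootstrap-in-place install, while the platform fields do.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: SingleReplica
  infrastructureTopology: SingleReplica
  platform: None
  platformStatus:
    type: None
```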
Besides the fact that SNO on the AWS platform isn't supported, we should probably strive for the same ingress behavior on all platforms.
So if we decide to pin the ingress to the control plane node on the "none" platform, we should do the same for other platforms.
> Besides the fact that SNO on the AWS platform isn't supported

It might be supported in the future, so I wouldn't like to dismiss it from the discussion entirely.

> we should probably strive for the same ingress behavior on all platforms. So if we decide to pin the ingress to the control plane node on the "none" platform, we should do the same for other platforms.

Not sure that I agree with this. I'd rather ingress be as flexible as possible when the circumstances allow it, and only force it onto a smaller subset of nodes (the SNO node) when it's necessary.
I'd argue that it's more important that SNO remain as similar to multi-node as possible, rather than remain similar to itself across different platforms. And by pinning only when it's necessary, you're closer to the former than the latter.
Yes. I'll revise the enhancement to make this clearer.
Enhancement updated.
- "@tsorya" | ||
- TBD | ||
approvers: | ||
- TBD |
It would be good to have someone from the cluster infra team included as reviewer at least. @JoelSpeed can help identify someone.
Which other teams are affected by this work? @sinnykumari do you anticipate any MCO impact?
@dhellmann of course. We wanted to go over it first internally (even I didn't have time to review it so far) and then share it broadly for feedback and comments.
It will also affect the networking and installer teams. We'll ask for more feedback early next week.
A SNO cluster's control plane can't scale beyond a single node - how do we ensure that adding additional worker nodes (and presumably more pods to manage) won't overload the control plane? What tests do we have to identify the failure modes which obtain when the resources allocated to the control plane are exceeded? Do we have tests that exercise the systems which recover the control plane when these limits are exceeded, and the alerts which should fire when these systems fail to operate as intended (and do the alerts fire if the control plane is down)?
That's a good point. For now according to https://issues.redhat.com/browse/MGMT-8414 the goal is addressing the addition of a small number of worker nodes (this should have been and will be clarified in the enhancement goals / non goals section). I think we'll have to test those limits in practice, and maybe set a limit on how many pods are allowed on these additional nodes (either hard limit or documentation limit, i.e. "You may add additional worker nodes but the total number of pods across all nodes in the cluster should not exceed current advertised single node limits"). I'll give it some thought and address it in the enhancement.
The current plan is simply launching a single control-plane node cluster, adding a few worker nodes to it, and then running conformance tests. I'll search for prior art with regards to testing the control-plane by pushing it to its limit, and see if we can do something similar for this kind of topology.
I'm not aware of any such systems. I'll check to see how things are currently handled in standard multi-node clusters.
I'm not aware of specific alerts relating to control plane being overloaded, I'm assuming if such alerts exist, there's no reason they wouldn't apply to the SNO + workers topology
Depends on what exactly goes wrong and the state of the cluster - Prometheus / Alertmanager can arbitrarily be running on the control plane node or one of the other additional worker nodes. As long as they both are able to keep running, I don't see a reason why alerts wouldn't fire. Thanks for your comment, it gave me some things I'll need to think about, though I'm not sure all of them will be addressed in this enhancement.
run on the single control plane node, and as a result any
`*.apps.<cluster>.<base>` DNS entries which originally pointed at the single
control plane node will remain correct even in the face of newly added worker
nodes.
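As a rough illustration of the effect described above (not the operator's actual output), the `router-default` Deployment would end up scheduled roughly like this, assuming the master node-role label is what ends up being used for placement.

```yaml
# Sketch (abridged, assumed values): a single router replica kept on the
# control-plane node, so the *.apps wildcard keeps resolving correctly
# after workers are added.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: router-default
  namespace: openshift-ingress
spec:
  replicas: 1
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
```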
any scale concerns here, since this means we'll only ever have a single router replica, no matter how many workers are added?
for that matter any other scale concerns with a single replica control plane topology and an unknown number of workers?
It's a good point, but I'm not sure whether it's different from multi-node, right? We don't have a way to block it in the multi-node scenario; mostly it's documentation. I would assume we'll have to do it here as well. Also, I don't think we should recommend going over a handful of workers. If people want scale, they should go with an HA deployment. I would suggest adding it to the docs.
- As an OpenShift cluster administrator, I want to add worker nodes to my
  existing single control-plane node cluster, so that it'll be able to meet
  growing computation demands.
what's the trade-off point here? At some point if you have enough resources to run extra workers, why are you not running a full HA control plane topology?
How many workers do we expect people to add to a "single node" install, before we would have told them "if you're going to do that, you should just run a traditional HA cluster"?
I personally don't have answers to these questions; I suggest taking a look at MGMT-8414 and the documents linked in it.
Some things that I think may make this make more sense:

- If my understanding of the motivations behind the requirements is correct, the most common use case is only a single additional worker being added, and in that case it's not enough to form an HA control plane. Situations with 2 or 3 additional workers would be less common.
- 3 machines behaving as 1 control plane + 2 workers are not equivalent to 3 machines that all act as both control plane and workers (a compact cluster) - you pay for the HA of the control plane with, for example, a lot of memory for maintaining 3 k8s API servers rather than just 1.
- I wouldn't be surprised if users actually did want to simply transform their 1 control plane + 2 workers cluster into a compact HA cluster if given the option, but this is not something we support and I believe it is a harder technical challenge to tackle - users would like to avoid re-installing their existing 1 + 2 cluster as a compact cluster - so for now, 1 + 2 is a good, simpler middle ground to answer user needs before we begin supporting live control plane expansion.
"Infrastructure Topology" is set to "SingleReplica". This will ensure that the | ||
`router-default` deployment created by the Cluster Ingress Operator will always | ||
run on the single control plane node, and as a result any | ||
`*.apps.<cluster>.<base>` DNS entries which originally pointed at the single |
How will workers have those DNS values? Or do we not need them for workers?
We do need them for workers. The user is supposed to make sure DNS is configured "globally" and pointing to the control-plane node before trying to add workers.
In the case of Assisted, we'll do our best to do it automatically whenever possible with dnsmasq tricks.
@omertuc I think let's add a non-goal to deal with the solution's scalability. The goal of this type of deployment is not to stress the cluster, but more to provide some extra capacity for the workload. wdyt?
/test markdownlint
Linter failures due to #1011
- "@romfreiman" | ||
- "@eranco74" | ||
- "@tsorya" | ||
- TBD |
Can you please make it clear from this section which teams are going to be impacted, either by having to implement something for this now or by supporting the implementation in the future? The teams should be tagged for review with an indication of what aspect of it they will be involved in.
Similarly, in the Assisted Installer, the user is able to complete the
installation without needing to define any DNS entries. This is currently made
possible by injecting a `MachineConfig` manifest targeting the "master" pool.
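For readers unfamiliar with that mechanism, a rough sketch of such a manifest is below. This is not the actual Assisted Installer manifest; the manifest name, domain, and IP are placeholders, and the dnsmasq snippet is URL-encoded into an Ignition data URL.

```yaml
# Rough sketch, not the real Assisted Installer manifest. It targets the
# "master" pool and drops a dnsmasq snippet that resolves the wildcard apps
# domain to the node's own IP (placeholder 192.0.2.10).
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-master-sno-dnsmasq
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/dnsmasq.d/single-node.conf
          mode: 420  # 0644
          contents:
            # decodes to: address=/apps.mycluster.example.com/192.0.2.10
            source: "data:,address%3D%2Fapps.mycluster.example.com%2F192.0.2.10%0A"
```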
Is that only for single-node deployments, or is the same configuration always added by the assisted installer?
only SNO
It's proposed that the only scenarios in which the "Infrastructure Topology"
parameter will be "HighlyAvailable" are when installing a single control-plane
node cluster in the cloud with one or more day-1 workers. Cloud deployments
have a load balancer so they can benefit from having highly-available ingress.
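For context, the "single control-plane node with day-1 workers" shape under discussion corresponds to an install-config along these lines. This is an illustrative fragment, not taken from the enhancement; names, counts, and the platform are placeholders.

```yaml
# Illustrative install-config.yaml fragment: one control-plane replica plus
# day-1 workers. Under the quoted proposal, whether the resulting
# infrastructure topology is HighlyAvailable or SingleReplica would depend on
# the platform (cloud with a load balancer vs. "none") and the worker count.
apiVersion: v1
baseDomain: example.com
metadata:
  name: mycluster
controlPlane:
  name: master
  replicas: 1
compute:
  - name: worker
    replicas: 2
platform:
  none: {}
```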
We don't currently offer formal support for single-node deployments in the cloud. Is that expected to change as part of this work?
If the reason for using the HA setting is the presence of a load balancer, would we want to make it possible to configure the infrastructure topology for HA on metal (or platform none) deployments if there is also a load balancer there?
> We don't currently offer formal support for single-node deployments in the cloud. Is that expected to change as part of this work?

Nope. Cloud support is out of scope. But we did want to address the flow regardless.
Regarding HA for infrastructure topology - I personally think that it should always be SingleReplica. I think if customers need any kind of HA, they should move to a 3-node + workers cluster. Also, I don't think we should differentiate between on-prem and cloud environments unless it's a real must.
> Regarding HA for infrastructure topology - I personally think that it should always be SingleReplica. I think if customers need any kind of HA, they should move to a 3-node + workers cluster. Also, I don't think we should differentiate between on-prem and cloud environments unless it's a real must.

Being consistent is going to make it easier to reason about the system, so I support that as well, especially if we're still considering formal cloud support out of scope.
I addressed this point in the third bullet of the "Alternatives" section.
Some caveats:

- Setting SingleReplica in those scenarios would mean control plane topology == infrastructure topology under all circumstances; these two parameters would always be identical unless we one day allow users to choose / change them.
- Making this consistent across the types of single-node deployments means we're losing consistency with regular HA OCP deployments.
- Cloud deployments would lose the HA ingress benefits they can get from having a load balancer.

Anyway, I don't think the decision here is too critical, and I don't mind either way. I'll update the enhancement to go with SingleReplica everywhere if you confirm you're okay with the above caveats.
The control plane topology can also be "external", in the Hypershift case. So the infrastructure and control plane fields wouldn't always be the same.
platform solves the issue described in this enhancement with virtual IP
addresses/keepalived. This approach was dismissed due to much higher
development efforts and additional processes that would need to run on the
already resource constrained single control-plane node. Furthermore, even once
Which additional processes would be running? What other implementation work is needed?
I assume @omertuc meant keepalived, haproxy, and the rest of the on-prem networking.
- "@staebler" | ||
- "@derekwaynecarr" | ||
approvers: | ||
- TBD |
@dhellmann @sdodson who should we add as approvers?
- "@derekwaynecarr" | ||
approvers: | ||
- TBD | ||
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers) |
@Miciah who's the approver from network-edge side?
Co-authored-by: Doug Hellmann <dhellmann@redhat.com>
I want to make sure I understand the defaulting behavior correctly. Otherwise, I just have a few stylistic suggestions and typo fixes. The overall approach looks good to me.
/approve
Co-authored-by: Miciah Dashiel Butler Masters <mmasters@redhat.com>
@omertuc: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Miciah
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
Reopening openshift#996 since it got accidentally merged without sufficient reviews