Allow workers to sit inside private subnets within an existing VPC #44
@pieterlange @mumoshu the original pull request does two things:
The reason using the existing
I have a feeling we should separate out the various use cases; I've kind of mixed them together here since a relatively small change was enough to cope with all of them.
Are you referring to the route to
I believe that ultimately this problem can be solved without constraining operator choices by implementing node pools. The controllers would be in a different pool. Workers in each AZ could be in different pools, one pool per AZ. Or they could all be in one pool and use one route table.
Yeah, our private subnet route tables look something like:
Having looked at it again just now, I think there is no way to switch traffic based on the source CIDR in a single route table. In other words, we can't use a single route table while still routing the worker traffic from each individual AZ to the appropriate NAT Gateway in the same AZ. I think generally we'd want a route table per AZ to mirror a typical multi-AZ NAT Gateway setup. The original pull was a simple version based on a whole AZ going down; however, if we are talking about service-level outages per AZ, I'm not sure how the failover could/should work when a single NAT Gateway is down. Probably something along the lines of what you mention with lambda/manual.
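For reference, a minimal CloudFormation-style sketch of that per-AZ pattern: one private route table per AZ, each with its default route pointing at the NAT gateway in the same AZ. Resource names here are illustrative only, not what kube-aws actually generates:

```yaml
# One private route table per AZ; each sends 0.0.0.0/0 to the NAT gateway
# in the same AZ, so losing one AZ does not take down egress for the others.
Resources:
  PrivateRouteTableA:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref Vpc
  PrivateDefaultRouteA:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayA   # NAT gateway living in AZ a

  PrivateRouteTableB:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref Vpc
  PrivateDefaultRouteB:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableB
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayB   # NAT gateway living in AZ b
```

Each private subnet in AZ a then associates with PrivateRouteTableA, and likewise for AZ b.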
The NAT gateways themselves are HA (or so Amazon says). The issue currently is that if the gateway lives in AZ-a and that entire AZ goes down, AZ-b and AZ-c lose connectivity to the internet until a new NAT gateway is booted in one of those AZs and the route table is updated.
👍
Yep, agreed. From the docs: Do you have more details on the node pools? I'd like to have a look over them.
@c-knowles I announced my POC for node pools in #46 (comment) yesterday! Could it be a foundation to address this issue? With the POC, we now have separate subnet(s) for each node pool (= a set of worker nodes). P.S. I don't recommend reading each commit in the node-pools branch for now, mainly because they are really dirty and not ready for review 😆
Cool, thanks. I was going to review it soon for this use case. Prior to that, what I would say is that the current setup we are using, from the code at coreos/coreos-kubernetes#716, uses a different subnet per AZ because each AZ has its own NAT Gateway. I think control over different AZs within a tool like kube-aws will usually come down to wanting control over what gets set per AZ (with sensible defaults).
@mumoshu After reviewing node pools further, I think this use case should be a matter of:
I think that's it. I'm going to try it out tomorrow. My only reservation would be that it's quite a complex setup for what I think is a very common scenario. I also could not see a way to disable the new separate etcd node creation; I don't really want/need one in the dev cluster, but I guess that's not supported any more?
@c-knowles Really looking forward to your results! Regarding the separate etcd nodes: yes, colocating etcd and the control plane in a single node is not supported. I also believe that a uniform architecture for the dev and prod envs would be good on several counts, e.g. dev-prod parity, less user confusion/support, and less code complexity. If the concern is cost, I guess you can use smaller instance types for etcd nodes in a dev env so the wasted cost would be minimal.
I tend to agree, to a certain extent. Our dev cluster is where we develop cluster changes/upgrades, and we run automated tests against it for our apps and deployment mechanisms. We may even have a few of those at any time. We actually have a staging cluster after dev as well. The cost has now gone up by at least 50% for each of those clusters (the minimum was 2 nodes, now it's 3). I'd like to support external etcd as well, such as compose.io. Updates/feedback so far:
@c-knowles A 50% increase is serious. I'd definitely like to make it more cost effective in other ways.
Could do. Is there any simple way to provide at least one controller? e.g. some ASG/fleet automated scaling rule+action like "if all spots are going to shut down, increment the on-demand ASG from min 0 to min 1". Or alternatively leave that for now and assume at least one spot will come back?
@c-knowles AFAIK there's no way to coordinate an ASG and a spot fleet like that. However, I guess setting a very low TargetCapacity (maybe 1) for a spot fleet and enabling the "diversified" strategy to diversify your instance types would work like that. For example, if you've chosen 1 unit = 3.75GB of memory:
would ideally bring up only 1 m3.medium on peaceful days, and only if the spot fleet loses its bids would a larger instance be brought up. c3.large, c4.large, m3.large and m4.large are each approximately 2x more expensive than m3.medium. Not tested though 😉
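To make the idea concrete, here is a sketch of such a diversified fleet at the EC2 Spot Fleet request level. The weights assume 1 unit ≈ 3.75GB of memory and, like the role ARN, are placeholders; how this maps onto kube-aws's own spot fleet options may differ:

```yaml
# Illustrative Spot Fleet request: 1 unit of target capacity, diversified
# across instance types. The idea (untested, as noted above) is that a single
# m3.medium normally covers the capacity, with larger types only filling in
# when m3.medium spot capacity is unavailable or outbid.
SpotFleetRequestConfig:
  TargetCapacity: 1
  AllocationStrategy: diversified
  IamFleetRole: arn:aws:iam::123456789012:role/spot-fleet-role  # placeholder
  LaunchSpecifications:
    - InstanceType: m3.medium
      WeightedCapacity: 1      # 3.75GB ≈ 1 unit
    - InstanceType: m3.large
      WeightedCapacity: 2      # 7.5GB ≈ 2 units
    - InstanceType: m4.large
      WeightedCapacity: 2
```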
That may be OK to start with. Possible future changes could be termination notices linked to an ASG action/lambda. BTW, if you are available in the #sig-aws k8s Slack it may be useful to chat there so we can keep this out of the issues.
@c-knowles I guess in that case my kube-spot-termination-notice-handler in combination with cluster-autoscaler would fit. My kube-spot-termination-notice-handler basically
However, I'm not yet sure if cluster-autoscaler supports scaling down an ASG to zero nodes, which I believe is required for your use case.
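For anyone following along, the general pattern such a handler follows is roughly the sketch below: a DaemonSet that polls the EC2 instance metadata endpoint for a spot termination notice and drains its node when one appears. This is only an illustration of the pattern, not the actual kube-spot-termination-notice-handler manifest; the image name is a placeholder and RBAC wiring is omitted:

```yaml
apiVersion: extensions/v1beta1  # API group of that era; apps/v1 on newer clusters
kind: DaemonSet
metadata:
  name: spot-termination-notice-handler
spec:
  template:
    metadata:
      labels:
        app: spot-termination-notice-handler
    spec:
      containers:
        - name: handler
          image: example/spot-termination-notice-handler  # placeholder image with curl + kubectl
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          command:
            - /bin/sh
            - -c
            # Poll the spot termination notice endpoint; it returns HTTP 200 with a
            # timestamp only once this instance has been marked for termination,
            # giving roughly a two-minute window to cordon and drain the node.
            - >
              while true; do
                if curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time; then
                  kubectl drain "$NODE_NAME" --force --ignore-daemonsets --delete-local-data;
                  sleep 300;
                fi;
                sleep 5;
              done
```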
I have good news! It's possible to fulfil this use case with only minor modifications. Assuming adding a couple more node pools works, we only need minor mods for this. I have a few feedback items which I will start putting pull requests in for. I may need some initial feedback on those items to get started.
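To make the target configuration concrete, here is roughly the shape I have in mind for one node pool per AZ, each pinned to an existing private subnet and to the route table whose default route targets that AZ's NAT gateway. Key names follow my reading of the main cluster.yaml and may not match the node pools' cluster.yaml exactly; all IDs and CIDRs are placeholders:

```yaml
# Node pool for AZ a, sitting inside an existing VPC's private subnet.
# Egress goes via the route table whose 0.0.0.0/0 route targets the NAT
# gateway in the same AZ (see the route table sketch earlier in the thread).
vpcId: vpc-0123456789abcdef0           # existing VPC (placeholder)
routeTableId: rtb-0aaaaaaaaaaaaaaaa    # existing private route table for AZ a (placeholder)
mapPublicIPs: false                    # workers get no public IPs
subnets:
  - availabilityZone: eu-west-1a
    instanceCIDR: 10.0.10.0/24         # placeholder range inside the existing VPC
```

A second pool would repeat this with the AZ-b subnet and route table, giving per-AZ egress without the main cluster needing to know about it.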
@mumoshu On my feedback above, I want to check a few things before doing the pull requests, if you could provide a very quick comment on each:
@c-knowles First of all, thanks for your feedback!

1: No. Anyway, it would be nice to make them configurable in

2: I'm not yet sure that copying the instance type is a good idea. I guess you should explicitly select an appropriate instance type regardless of what is selected in the main cluster. Copying stack tags would basically be good, as I assume stack tags are used to identify the whole cluster's resources, not only the main cluster or a node pool. Not 👍 for now, but I'd definitely like to discuss it more!

3: I'd like to keep cluster.yaml simple. Only keys for required configuration with no default values would be commented out in cluster.yaml. I'd rather fix the node pools' cluster.yaml so that it properly comments out

Basically 👍, but I'd rather fix the node pools' cluster.yaml instead of the main cluster's.

4: Indeed. For now, I guess we should at least error out when different AZs are specified in multiple subnets, instead of just removing all the

5: Yes! I'd appreciate it if you could add it, but I'm willing to do it myself. For example,

6: Thanks for your attention to documentation! I'd greatly appreciate it. Adding the brand-new documentation named like
Submitted #142 for the ASG definitions. Getting the defaults to work in golang was trickier than I'd hoped, but it works; we'll see what you think.
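For context, the shape I'm aiming for with the ASG definitions is roughly the following; the nesting and key names here are my assumption, so check #142 for the actual definitions:

```yaml
# Sketch of worker ASG sizing configuration: explicit min/max sizes plus how
# many instances must stay in service during a rolling update. When omitted,
# the defaults are intended to be derived in the Go config code (the tricky
# part mentioned above).
worker:
  autoScalingGroup:
    minSize: 1
    maxSize: 3
    rollingUpdateMinInstancesInService: 2
```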
@c-knowles I believe that almost all the TODOs related to this issue are addressed via recently merged PRs. Can we close this now? |
ping @c-knowles |
Use case proposal:
Similar to kubernetes/kops#428.
It was part of what coreos/coreos-kubernetes#716 was intended to cover. Pre-filing this issue so we can move discussion out of #35.