adds AWS infra for instantiating and destroying all baseline substratus components #170
Conversation
Hey @brandonjbjelland and team, I had a quick thought I wanted to pass along so I don't forget about it.
Karpenter may be a good option to get the intelligent auto-provisioning you are looking for; see the Karpenter docs on provisioners. I'll double-check what I ended up doing to get it working. I'm looking forward to testing Substratus out on Monday.
You've got a point there. I don't see a critical need yet, so we can skip them for now.
Spinning up EKS clusters with Terraform is such a pain, maybe by design? I wasn't expecting it to be so complex. The eksctl tool does seem to make it easier, though I'm not sure whether it fits all our needs or takes away too much flexibility. I personally prefer Terraform, but seeing the struggle and complexity of EKS, it might be fine to consider something like eksctl.

In the end we expect most users to already have a K8s cluster when they want to use Substratus, and those end users will choose their own tooling of choice to manage EKS + node groups. So the main purpose of the bundled installer/EKS cluster creator is mostly the development and PoC phase: getting rolling quickly with minimal issues.
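For a sense of scale, a minimal eksctl cluster config is roughly the following sketch; the cluster name, region, and node group sizing are placeholders, not anything Substratus actually ships:

```yaml
# Minimal eksctl ClusterConfig sketch; all names and sizes are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: substratus-dev      # placeholder cluster name
  region: us-west-2         # placeholder region
managedNodeGroups:
  - name: default
    instanceType: m5.large  # placeholder instance type
    desiredCapacity: 2
```

A config like this is created with `eksctl create cluster -f cluster.yaml` and torn down with `eksctl delete cluster -f cluster.yaml`.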
@BOsterbuhr thanks for the pointer! Karpenter looks viable here. Sidebar: if you're hoping to use Substratus on AWS, we're just getting started on adding support. Running on GCP is the paved path we have today.
In my experience, if you are just trying to get dev/PoC support for AWS quickly, then eksctl is probably the way to go. As for Karpenter, yeah, unfortunately AWS doesn't seem to be in a rush to support other cloud providers. Sidebar: I'll test on GCP first to get a better understanding of everything and will watch for AWS support.
We discussed internally today and arrived at this same consensus. In short, delivering a cluster is not our value add, not where we should spend time, and not where we should put a lot of code (which inevitably will rot, attract feature requests of its own, etc.). We just need the simplest possible way to get a minimum cluster up in each supported provider for someone starting at zero - that may or may not be through Terraform. Here it seems eksctl fits that bill.
Though we're being very careful with the dependencies we take on, I don't know that it changes our calculation here - Karpenter definitely presents itself as the simplest way to auto-scale EKS on heterogeneous hardware with minimal overhead. We very much feel incentivized to avoid having a long list of node groups for the different flavors of GPU-supported instances if that's our alternative.
Thank you! 🙏 Any feedback is highly appreciated, @BOsterbuhr! ❤️
Good stuff! Added some comments, but none of them are big deals
Karpenter seems to have Karpenter-specific node labels you have to use in your pod spec. This might require some more design discussion, and might require making use_karpenter a flag or exposing it on Substratus resources somehow?
Can you explain what you mean? Your pods shouldn't have to even know Karpenter exists.
Let's say you want to run a pod on an A100 GPU - how would you ensure the pod gets scheduled on a node that has an A100 GPU, on Karpenter vs non-Karpenter? You might have nodes with T4, V100 and A100 in the same cluster.

Note I might be totally wrong since I haven't used Karpenter myself. I was reading this: https://karpenter.sh/preview/concepts/scheduling/ That doc made me believe that in order for Karpenter to create a node that has an A100, I would have to set a nodeSelector in the pod to karpenter.k8s.aws/instance-gpu-name = a100 OR use affinity rules.

I've got a GCP background so this is my first time seriously looking into Karpenter. For reference, I'm hoping there is a label like cloud.google.com/gke-accelerator on EKS as well.
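For illustration, the nodeSelector approach described in that scheduling doc would look roughly like this; the pod name, image, and resource request are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: a100-workload                      # placeholder name
spec:
  nodeSelector:
    # Karpenter well-known label; asks Karpenter to provision an A100-backed instance
    karpenter.k8s.aws/instance-gpu-name: a100
  containers:
    - name: main
      image: my-training-image:latest      # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```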
Oh ok, that makes sense. So instead of making the end user understand that you are using Karpenter, you could just create a provisioner with a User-Defined Label as a requirement and then have your end user use that label as a node selector.
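A rough sketch of that idea, assuming a hypothetical user-defined label substratus.ai/accelerator and a pre-existing AWSNodeTemplate named default (Karpenter v1alpha5 API):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-a100                       # hypothetical provisioner name
spec:
  labels:
    substratus.ai/accelerator: a100    # hypothetical user-defined label
  requirements:
    - key: karpenter.k8s.aws/instance-gpu-name
      operator: In
      values: ["a100"]
  providerRef:
    name: default                      # assumes an existing AWSNodeTemplate
```

Pods would then only set `nodeSelector: {substratus.ai/accelerator: a100}` and never reference Karpenter's own labels directly.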
The issue is that there doesn't seem to be a node label that exposes the GPU type on AWS unless you use Karpenter. At the same time, we don't want to depend on Karpenter, and we want to ensure Substratus works well without it. A key principle of Substratus is to minimize dependencies so it's easier to get Substratus to run in any EKS cluster.

Actually, I might be wrong altogether and should just get a GPU node on EKS to verify myself. Seems there is in fact a label that would have the info we're looking for: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L38C28-L38C31
Yeah, you're right - it looks like that is where the Google label you mentioned is coming from as well: https://github.com/kubernetes/autoscaler/blob/fc5870f8eaf850dd1e18a5884a7491168dc5d8a0/cluster-autoscaler/cloudprovider/gce/gce_cloud_provider.go#L37

The only issue I could see is one I think you all brought up previously: when using the Kubernetes autoscaler you have to manage a separate node group for each different instance type. But that may be worth it if you don't want any dependencies.
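For reference, the two cluster-autoscaler GPU label keys those files define, used as pod-spec nodeSelector fragments; the values here are placeholders and depend on how the GPU node groups are actually labeled:

```yaml
# Pod spec fragment on EKS (cluster-autoscaler label from aws_cloud_provider.go);
# the value is a placeholder set by whoever labels the GPU node group.
nodeSelector:
  k8s.amazonaws.com/accelerator: nvidia-tesla-a100
---
# Equivalent fragment on GKE (label from gce_cloud_provider.go).
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-tesla-a100
```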
What is this change?
This change adds a script that we (and users) can use to create a complete substratus environment on a new AWS account.
Why make this change?
#12
Caveats, questions, TBD
Q: Are the local/ephemeral SSDs important for any sort of workload we support? Trying to understand if that's critical here too.
A: Size is important but local SSDs probably not. Leaving this out for now.
Unknown: I added Karpenter but haven't thoroughly tested how well it works at auto-provisioning; GPU-backed nodes in particular still need testing.
Caveat: The features we get enabled on a GKE cluster through simple flags are incredibly fussy on EKS and I don't have them working. I'll dig further (maybe I've missed some events) but I think we might need to manage these on our own - I've never seen the equivalent features fail on GKE. ~Equivalent features baked into EKS configuration fail consistently. I've seen timeouts across `coredns`, `vpc-cni`, and `ebs-csi` regardless of the order of deployment or how I do it. So far I've tried `eks_add_on` resources (docs) and other approaches; they all fail. We need to attach some of the IRSA roles created in the code here to those resources however we instantiate them, so if that's out of scope of Terraform, some additional outputs will be necessary (i.e., ARNs of the roles).~
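If cluster creation does end up going through eksctl per the discussion above, these same add-ons can be declared alongside their IRSA role attachments in the ClusterConfig - a rough sketch, with placeholder account ID, role names, and cluster metadata:

```yaml
# eksctl ClusterConfig excerpt; account ID, role names, and metadata are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: substratus-dev      # placeholder cluster name
  region: us-west-2         # placeholder region
iam:
  withOIDC: true            # IRSA requires the cluster's OIDC provider
addons:
  - name: vpc-cni
    serviceAccountRoleARN: arn:aws:iam::111122223333:role/vpc-cni-irsa   # placeholder
  - name: coredns
  - name: aws-ebs-csi-driver
    serviceAccountRoleARN: arn:aws:iam::111122223333:role/ebs-csi-irsa   # placeholder
```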
With time, I'll add to this...