New federation member on AWS backed by Pangeo #2449

Open
7 tasks
sgibson91 opened this issue Nov 28, 2022 · 9 comments

@sgibson91
Member

sgibson91 commented Nov 28, 2022

As discussed in jupyterhub/team-compass#501, Pangeo want to simplify their Binder infrastructure by removing dask-gateway. This effectively makes it an "ordinary" binder and, therefore, a great candidate as a federation member, especially since it will be deployed to AWS, which we don't currently cover. Since 2i2c are working with Pangeo (via a grant held at Columbia) to operationalise their cloud infrastructure, I will be taking the lead on the technical aspects of the deployment, and this issue will track that work.

Things to do

  • Set up new Terraform config for AWS
  • Deploy the cluster to an AWS account
  • Create helm config for a new binder deployment
  • Deploy a new Binder instance to the AWS cluster
  • Add the new cluster to a mybinder.org sub-domain
  • Add new cluster/binder to deploy.py and CI/CD
  • Add the new cluster to the federation
@yuvipanda
Contributor

yuvipanda commented Nov 29, 2022

I read through https://github.com/jupyterhub/mybinder.org-deploy/blob/master/terraform/ovh/main.tf, and based on that here is the list of AWS resources that I think need to be created:

  1. A VPC + subnets + security groups for the cluster to live in
  2. An EKS cluster control plane
  3. A 'core node' managed nodegroup
  4. A 'user node' managed nodegroup
  5. An ECR Container registry
  6. An IAM user that can be used by CI/CD to authenticate to the cluster for automated deployment
  7. IRSA set up to allow cluster-autoscaler to work
  8. Adding the cluster-autoscaler to the support chart, so the autoscaler can do its thing

There are two ways to do this:

  1. Use AWS Terraform modules - particularly https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest and https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws/latest
  2. Use raw AWS terraform primitives, like https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/eks_cluster.

Both are valid ways to proceed, depending on the amount of complexity we want to sustain. In general, I personally prefer option (2), as I think it is clearer and has fewer levels of complex abstractions. It also matches how we have done it on GKE right now (https://github.com/jupyterhub/mybinder.org-deploy/blob/master/terraform/modules/mybinder/resource.tf).
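For a sense of what option (2) involves, here is a minimal sketch of the control plane and one managed nodegroup built from raw `aws` provider resources. The resource names, the `subnet_ids` variable, the instance type and the scaling sizes are all hypothetical and not taken from any existing mybinder.org config. The managed policy ARNs are the standard AWS-managed ones for EKS.

```hcl
# Hypothetical input: subnets the cluster and its nodes will live in.
variable "subnet_ids" {
  type = list(string)
}

# Role assumed by the EKS control plane.
resource "aws_iam_role" "cluster" {
  name = "binder-eks-cluster" # hypothetical name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "eks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "cluster" {
  role       = aws_iam_role.cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}

resource "aws_eks_cluster" "binder" {
  name     = "binder" # hypothetical name
  role_arn = aws_iam_role.cluster.arn

  vpc_config {
    subnet_ids = var.subnet_ids
  }

  depends_on = [aws_iam_role_policy_attachment.cluster]
}

# Role assumed by the EC2 instances in the nodegroups.
resource "aws_iam_role" "node" {
  name = "binder-eks-node" # hypothetical name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "node" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
  ])
  role       = aws_iam_role.node.name
  policy_arn = each.value
}

# The 'core node' nodegroup; a 'user node' group would repeat this block.
resource "aws_eks_node_group" "core" {
  cluster_name    = aws_eks_cluster.binder.name
  node_group_name = "core"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.subnet_ids
  instance_types  = ["m5.large"] # hypothetical size

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 2
  }

  depends_on = [aws_iam_role_policy_attachment.node]
}
```

Even this trimmed version needs two IAM roles and four policy attachments before any node starts, and it still leaves out the VPC, the ECR registry and IRSA.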

However, I think in this specific case, (1) is also a workable alternative! This is primarily because the AWS setup for Kubernetes is far more fiddly than what exists for GKE, and that sucks. A lot of copypasta and gotchas need to be managed. This is the precise place where terraform modules shine, so perhaps using those here is a good idea. https://github.com/hashicorp/learn-terraform-provision-eks-cluster has a fairly good, bare-bones setup that does 1-4 in the list above easily. Maybe we can start there and build on it? The one modification I'd make to the example is that it has public and private subnets, and hence NAT enabled. We can survive with just a public subnet, and hence don't need NAT.
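A rough sketch of option (1) with that modification applied, meaning public subnets only and no NAT gateway, might look like the following. It assumes the input names used by major version 5 of the VPC module and major version 19 of the EKS module; other major versions rename some of these inputs. The names, CIDR ranges, availability zones, Kubernetes version and instance types are hypothetical placeholders.

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # assumed major version

  name = "binder-vpc" # hypothetical name
  cidr = "10.0.0.0/16"

  # Hypothetical zones; the thread does not name a region.
  azs            = ["us-west-2a", "us-west-2b"]
  public_subnets = ["10.0.1.0/24", "10.0.2.0/24"]

  # No private subnets, so no NAT gateway is needed.
  enable_nat_gateway = false

  # Nodes in a public subnet need a public IP to reach the internet.
  map_public_ip_on_launch = true
  enable_dns_hostnames    = true

  # Lets Kubernetes place public load balancers in these subnets.
  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0" # assumed major version

  cluster_name    = "binder" # hypothetical name
  cluster_version = "1.24"   # hypothetical version

  # Allows CI/CD to reach the API server from outside the VPC.
  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.public_subnets

  eks_managed_node_groups = {
    core = {
      instance_types = ["m5.large"] # hypothetical size
      min_size       = 1
      max_size       = 2
      desired_size   = 1
    }
    user = {
      instance_types = ["m5.xlarge"] # hypothetical size
      min_size       = 0
      max_size       = 10
      desired_size   = 1
    }
  }
}
```

This covers items 1-4 of the list. The ECR registry, the CI/CD credentials and the cluster-autoscaler setup would still be configured separately.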

@sgibson91
Member Author

sgibson91 commented Nov 29, 2022

Thanks @yuvipanda! I had a hybrid idea between (1) and (2), which was to use the official AWS provider for as much as possible, and then the module for the VPC network, because I think networking is the trickiest part. But perhaps that's unnecessary.

@sgibson91
Member Author

I think there's something missing from this list as well: whatever the AWS "speak" is for something that will allow the cluster to push and pull images from ECR. A role?
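In AWS terms this is usually an IAM policy attached to a role, for example the role used by the cluster's nodes. A minimal sketch, assuming a single hypothetical repository and a hypothetical `node_role_name` variable, could look like this. BinderHub may need broader permissions in practice, for example to create one repository per built image.

```hcl
# Hypothetical input: name of the IAM role used by the cluster's nodes.
variable "node_role_name" {
  type = string
}

resource "aws_ecr_repository" "binder" {
  name = "binder-images" # hypothetical name
}

data "aws_iam_policy_document" "ecr_push_pull" {
  # Obtaining a registry login token cannot be scoped to one repository.
  statement {
    actions   = ["ecr:GetAuthorizationToken"]
    resources = ["*"]
  }

  # Pull and push permissions, scoped to the repository above.
  statement {
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:BatchGetImage",
      "ecr:GetDownloadUrlForLayer",
      "ecr:InitiateLayerUpload",
      "ecr:UploadLayerPart",
      "ecr:CompleteLayerUpload",
      "ecr:PutImage",
    ]
    resources = [aws_ecr_repository.binder.arn]
  }
}

resource "aws_iam_policy" "ecr_push_pull" {
  name   = "binder-ecr-push-pull" # hypothetical name
  policy = data.aws_iam_policy_document.ecr_push_pull.json
}

resource "aws_iam_role_policy_attachment" "nodes_ecr" {
  role       = var.node_role_name
  policy_arn = aws_iam_policy.ecr_push_pull.arn
}
```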

@manics
Member

manics commented Nov 29, 2022

> An IAM user that can be used by CI/CD to authenticate to the cluster for automated deployment

Can you use an IAM role and GitHub OIDC instead? This avoids hard-coded secrets, since you get a temporary token on every run.
https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services
It's been working well for me when deploying infrastructure on AWS.
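As a rough sketch of that approach in Terraform: register GitHub's OIDC issuer as an identity provider, then create a role that only workflows from a given repository can assume. The role name and the `github_oidc_thumbprints` variable are hypothetical, and the repository in the `sub` condition is assumed to be `jupyterhub/mybinder.org-deploy`.

```hcl
# Hypothetical input: TLS certificate thumbprints for GitHub's OIDC issuer.
variable "github_oidc_thumbprints" {
  type = list(string)
}

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = var.github_oidc_thumbprints
}

data "aws_iam_policy_document" "github_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # Only accept tokens that were issued for AWS STS.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only accept tokens from workflows in this repository (assumed).
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:jupyterhub/mybinder.org-deploy:*"]
    }
  }
}

resource "aws_iam_role" "ci_deployer" {
  name               = "binder-ci-deployer" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.github_assume.json
}
```

The workflow then needs the `id-token: write` permission and a step such as `aws-actions/configure-aws-credentials` with `role-to-assume` set to this role's ARN. The role would also need to be granted access to the cluster, which is not shown here.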

@manics
Member

manics commented Nov 29, 2022

Also jupyterhub/binderhub#1055 (Add AWS ECR support (round two)) is probably relevant, I was going to review it but never had time 😞. It includes IAM instructions.

@sgibson91
Member Author

> Can you use an IAM role and GitHub OIDC instead?

Let's keep this as a stretch goal. I'm learning and so want to focus on the MVP for now. This feels like something we can switch to later, once we've got the basic infrastructure set up.

> Also jupyterhub/binderhub#1055 (Add AWS ECR support (round two)) is probably relevant, I was going to review it but never had time 😞. It includes IAM instructions.

Thank you

@yuvipanda
Contributor

@sgibson91 at least in this case, I think the EKS setup is probably more complicated than the network stuff. But totally ok to start from using modules for VPC and go from there.

@manics
Member

manics commented Nov 29, 2022

I've previously deployed a dev EKS cluster with https://github.com/hashicorp/learn-terraform-provision-eks-cluster
It took a bit of digging into the EKS module code when I wanted to customise it though.

@sgibson91
Member Author

I opened this PR with my work so far


3 participants