This repository was archived by the owner on Mar 22, 2024. It is now read-only.

owu-1/kubeflow-aws-cheap

Attempt to make the cheapest Kubernetes cluster with GPU nodes on AWS. Kubernetes is self-managed and everything is running on spot EC2 instances. This project is experimental.

The instructions below assume you have the AWS CLI with a profile set up (either an SSO profile or an IAM profile), Terraform, kOps, argocd (don't use 2.9.x; todo: remove), yq, jq, and kubelogin.
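A quick way to confirm the prerequisites are on your PATH is sketched below. The command names are assumptions about how each tool is invoked; kubelogin is left out because its binary name depends on how it was installed, and kubectl is included because later steps use it.

```bash
#!/bin/bash
# Print each tool named in the arguments that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed command names for the tools listed above.
missing=$(missing_tools aws terraform kops argocd yq jq kubectl)
if [ -n "$missing" ]; then
  # Unquoted expansion joins the one-per-line names with spaces.
  echo "Missing tools: $(echo $missing)" >&2
fi
```

Nothing is printed when every tool is found.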

Create a new file at cloud-computing/variables.sh based on cloud-computing/variables.example.sh. You can ignore kops_aws_profile for now, since the IAM user is created by Terraform first. You can also ignore cognito_user_pool_id, cognito_client_id, and cognito_client_secret until their values are known, which is after Terraform has run.
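To catch variables you forgot to fill in, a small helper like the following may be useful. It is a sketch, not part of the repository: unset_vars is a hypothetical name, and the variable list in the usage comment is only an example subset that you should adapt.

```bash
#!/bin/bash
# Print the name of every variable (given as arguments) that is unset or empty.
unset_vars() {
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Example usage (run from the repository root after creating variables.sh):
#   . cloud-computing/variables.sh
#   unset_vars terraform_aws_profile aws_region base_domain_name terraform_state_bucket
```

The deferred variables (kops_aws_profile and the three Cognito values) can be added to the list once they are known.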

Control plane nodes support AMD64/ARM64, 1 vCPU, 4 GiB memory (requests at CPU: 980m, memory: 676Mi). CPU nodes only support AMD64 because of deployKF (read "CPU Architecture" note). CPU nodes support 4 vCPU and 8 GiB memory (requests at CPU: 3530m, memory: 6088Mi).

Ensure that the terraform_state_bucket exists in the specified AWS region.
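If the bucket does not exist yet, it can be created with the AWS CLI. The sketch below wraps that in a hypothetical helper, ensure_state_bucket; the choice of profile in the usage comment is an assumption. Note that us-east-1 does not accept a LocationConstraint, so it is handled separately.

```bash
#!/bin/bash
# Create the bucket only if it is not already accessible.
# Arguments: bucket name, AWS region, AWS CLI profile.
ensure_state_bucket() (
  bucket=$1 region=$2 profile=$3
  if aws s3api head-bucket --bucket "$bucket" \
       --region "$region" --profile "$profile" >/dev/null 2>&1; then
    echo "exists"
  elif [ "$region" = "us-east-1" ]; then
    # us-east-1 rejects an explicit LocationConstraint.
    aws s3api create-bucket --bucket "$bucket" \
      --region "$region" --profile "$profile" >/dev/null && echo "created"
  else
    aws s3api create-bucket --bucket "$bucket" \
      --region "$region" --profile "$profile" \
      --create-bucket-configuration "LocationConstraint=$region" >/dev/null &&
      echo "created"
  fi
)

# Example usage (assumes variables.sh has been sourced):
#   ensure_state_bucket "$terraform_state_bucket" "$aws_region" "$terraform_aws_profile"
```

If head-bucket fails because the name is taken by another account, the create call fails too and nothing is printed.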

Enter the terraform environment by running source cloud-computing/terraform/env.sh.

Initialise terraform by running cloud-computing/terraform/init.sh.

Terraform will set up the necessary resources for kops. To provision the resources, run terraform apply.

Create an access key for the kops IAM user. Set up an AWS CLI profile for it and enter the profile name into cloud-computing/variables.sh as kops_aws_profile.
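The access key and profile can also be created from the command line. This is a sketch under assumptions: create_profile_for_user is a hypothetical helper, and the IAM user name in the usage comment is a guess; check what Terraform actually created.

```bash
#!/bin/bash
# Create an access key for an IAM user and store it in a new AWS CLI profile.
# Arguments: IAM user name, admin profile, new profile name, default region.
create_profile_for_user() (
  iam_user=$1 admin_profile=$2 new_profile=$3 region=$4
  # Text output is "ACCESS_KEY_ID<TAB>SECRET_ACCESS_KEY" on one line.
  keys=$(aws iam create-access-key --user-name "$iam_user" \
    --profile "$admin_profile" \
    --query 'AccessKey.[AccessKeyId,SecretAccessKey]' --output text) || exit 1
  # Split the two fields into $1 and $2.
  set -- $keys
  aws configure set aws_access_key_id "$1" --profile "$new_profile" &&
    aws configure set aws_secret_access_key "$2" --profile "$new_profile" &&
    aws configure set region "$region" --profile "$new_profile"
)

# Example usage (the IAM user name "kops" is an assumption):
#   create_profile_for_user kops "$terraform_aws_profile" kops "$aws_region"
```

The secret is only returned once by AWS, so it is written straight into the profile rather than printed.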

Find the values for cognito_user_pool_id, cognito_client_id, and cognito_client_secret and enter them into cloud-computing/variables.sh.
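One way to look these values up is through the AWS CLI, as sketched below. The helper names are hypothetical, and the sketch assumes the user pool lives in identity_provider_aws_region and is visible to the profile you pass in.

```bash
#!/bin/bash
# List user pools as "ID<TAB>NAME" lines. Arguments: region, profile.
list_cognito_user_pools() {
  aws cognito-idp list-user-pools --max-results 20 \
    --region "$1" --profile "$2" \
    --query 'UserPools[].[Id,Name]' --output text
}

# List app clients of a pool as "CLIENT_ID<TAB>NAME" lines.
# Arguments: user pool ID, region, profile.
list_cognito_clients() {
  aws cognito-idp list-user-pool-clients --user-pool-id "$1" \
    --region "$2" --profile "$3" \
    --query 'UserPoolClients[].[ClientId,ClientName]' --output text
}

# Print the secret of one app client.
# Arguments: user pool ID, client ID, region, profile.
cognito_client_secret() {
  aws cognito-idp describe-user-pool-client \
    --user-pool-id "$1" --client-id "$2" \
    --region "$3" --profile "$4" \
    --query 'UserPoolClient.ClientSecret' --output text
}

# Example usage (assumes variables.sh has been sourced):
#   list_cognito_user_pools "$identity_provider_aws_region" "$terraform_aws_profile"
```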

Enter the kops environment by running source cloud-computing/kops/env.sh.

Create the cluster configuration by running cloud-computing/kops/create-cluster.sh.

Deploy the cluster resources by running kops update cluster --name $cluster_domain_name --yes
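Spot instances can take a while to come up, so it may help to wait for the cluster to become healthy before continuing. A minimal sketch, assuming the kops environment from the previous steps is active (so kops can find its state store) and using a hypothetical helper name and an arbitrary 10-minute timeout:

```bash
#!/bin/bash
# Apply the cluster configuration, then wait until kops reports it healthy.
# Argument: cluster name.
deploy_and_validate() (
  name=$1
  kops update cluster --name "$name" --yes &&
    kops validate cluster --name "$name" --wait 10m
)

# Example usage:
#   deploy_and_validate "$cluster_domain_name"
```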

Configure kubectl for OIDC by running cloud-computing/kops/set-credentials.sh and by following the instructions it outputs.

Connect to the control plane instance via SSH from the EC2 console and run:

# todo: restrict access more https://developer.okta.com/blog/2021/11/08/k8s-api-server-oidc

kubectl apply -f - <<EOF
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: oidc-cluster-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: Group
  name: oidc:admin
EOF
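To check that the binding took effect, you can impersonate a member of the oidc:admin group from the same session. This is a sketch: the helper name and the user name oidc-test-user are placeholders, and impersonation only works if your current credentials are allowed to impersonate (cluster admin credentials are).

```bash
#!/bin/bash
# Ask the API server whether a member of the oidc:admin group may do anything.
# Prints "yes" or "no". The user name is an arbitrary placeholder.
oidc_admin_can_do_everything() {
  kubectl auth can-i '*' '*' --as=oidc-test-user --as-group=oidc:admin
}

# Example usage:
#   oidc_admin_can_do_everything
```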

Install ArgoCD and the deployKF plugin by following the deployKF instructions in the section "Install Plugin - New ArgoCD".

Apply app-of-apps in ArgoCD by running cloud-computing/deploykf/generate-and-apply-manifest.sh.

Deploy the nginx ingress controller by running cloud-computing/deploykf/deploy-nginx-ingress-controller.sh

deployKF provides a script that syncs the ArgoCD applications automatically, but it keeps getting stuck. It also waits 60 seconds for user input several times to ask whether you want to sync with pruning. The web UI can be used to sync manually instead.

Get the ArgoCD initial admin password by running kubectl -n argocd get secret/argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Port-forward ArgoCD by running kubectl port-forward svc/argocd-server -n argocd 8080:443
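With the port-forward running in another terminal, the argocd CLI can log in through it as well. The sketch below uses hypothetical helper names; --insecure is passed on the assumption that the forwarded endpoint still serves ArgoCD's default self-signed certificate.

```bash
#!/bin/bash
# Print the decoded initial admin password.
argocd_initial_password() {
  kubectl -n argocd get secret/argocd-initial-admin-secret \
    -o jsonpath="{.data.password}" | base64 -d
}

# Log in to ArgoCD through the local port-forward on port 8080.
argocd_login_local() {
  argocd login localhost:8080 --username admin \
    --password "$(argocd_initial_password)" --insecure
}

# Example usage:
#   argocd_login_local
```

The web UI is then reachable at https://localhost:8080 with the same credentials.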

Edit the security group nodes.$cluster_domain_name to include the deployKF HTTPS port (deploykf_https_port) as an inbound rule.
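The same rule can be added with the AWS CLI instead of the console. This is a sketch with a hypothetical helper name; which source CIDR to allow and which profile to use are your decisions, and the values in the usage comment are only placeholders.

```bash
#!/bin/bash
# Allow inbound TCP traffic on one port in the nodes.<cluster> security group.
# Arguments: cluster name, port, source CIDR, region, profile.
open_node_port() (
  cluster=$1 port=$2 cidr=$3 region=$4 profile=$5
  # Look the group up by name to get its ID.
  sg_id=$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=nodes.$cluster" \
    --query 'SecurityGroups[0].GroupId' --output text \
    --region "$region" --profile "$profile") || exit 1
  aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
    --protocol tcp --port "$port" --cidr "$cidr" \
    --region "$region" --profile "$profile"
)

# Example usage (203.0.113.0/24 is a documentation-only placeholder range):
#   open_node_port "$cluster_domain_name" "$deploykf_https_port" 203.0.113.0/24 \
#     "$aws_region" "$terraform_aws_profile"
```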

After app-of-apps is synced, create the Kubernetes secret that gives Kubeflow Pipelines access to the bucket by running kubectl create secret generic bucket-creds-backend -n kubeflow --from-literal=AWS_ACCESS_KEY_ID=insert-aws-access-key-id --from-literal=AWS_SECRET_ACCESS_KEY=insert-secret-access-key

To add the AWS Cognito OIDC configuration run cloud-computing/deploykf/apply-aws-cognito-config.sh.

Follow the deployKF instructions under "Sync Applications - ArgoCD Web UI".

Example cloud-computing/variables.sh file

#!/bin/bash

# AWS options to create the infrastructure with
terraform_aws_profile=root
kops_aws_profile=kops
# todo: name aws region variables better
aws_region=ap-northeast-2
aws_avalibility_region=ap-northeast-2a
identity_provider_aws_region=ap-northeast-2

# Domain names
base_domain_name=example.com
cluster_domain_name="kubernetes.$base_domain_name"

# Subdomain used for deployKF
deploykf_domain_name="deploykf.$cluster_domain_name"

# Admin email
admin_email="admin@$base_domain_name"
admin_email_routing_destination=user@example.net

# S3 bucket for terraform state
terraform_state_bucket=terraform-state-abc123

# S3 buckets for kubernetes cluster
state_store_bucket=state-store-abc123
oidc_store_bucket=oidc-store-abc123

# S3 bucket for kubeflow
kubeflow_pipelines_bucket=kubeflow-pipelines-abc123

# Ports used for deployKF
# todo: explain nodeport range
deploykf_http_port=30000 # ingress does not expose this port
deploykf_https_port=30001

# Control plane machine settings
control_plane_image=099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-arm64-server-20240228
control_plane_machine_type=m7g.medium
control_plane_volume_size=20

# CPU node machine settings
cpu_node_image=099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240228
cpu_node_machine_type=c5a.xlarge
cpu_node_volume_size=30

# GPU node machine settings
gpu_node_image=099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-arm64-server-20240228
gpu_node_machine_type=g5g.xlarge
gpu_node_volume_size=20

# Cloudflare
# todo: might be giving unnecessary permissions. double check
# Permissions
# Account | Email Routing Addresses | Edit
# Account | Email Routing Addresses | Read
# Zone    | Email Routing Rules     | Edit
# Zone    | Email Routing Rules     | Read
# Zone    | Zone Settings           | Edit
# Zone    | Zone Settings           | Read
# Zone    | Zone                    | Edit
# Zone    | Zone                    | Read
# Zone    | DNS                     | Edit
# Zone    | DNS                     | Read
# Account Resources
# Include | All accounts
# Zone Resources
# Include | All zones from an account | account-name
cloudflare_account_id=abcdef123
cloudflare_api_token=abcdef123

# Cognito
# todo: automatically fill this in after terraform apply
cognito_user_pool_id=ap-northeast-2_abcdef123
cognito_client_id=abcdef123
cognito_client_secret=abcdef123
cognito_domain_name="cognito.$base_domain_name"
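On the "explain nodeport range" todo: by default Kubernetes only allocates NodePort services from the range 30000-32767, which is why the example ports start at 30000. Assuming the deployKF ports are exposed as NodePorts and the cluster keeps the default range, a check like this hypothetical helper can catch a bad value early.

```bash
#!/bin/bash
# Succeed if the argument is an integer inside the default Kubernetes
# NodePort range (30000-32767).
is_valid_nodeport() {
  case $1 in
    '' | *[!0-9]*) return 1 ;;  # empty or not purely digits
  esac
  [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

# Example usage (assumes variables.sh has been sourced):
#   is_valid_nodeport "$deploykf_https_port" || echo "deploykf_https_port is out of range" >&2
```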

You can edit cloud-computing/kops/cluster.tmpl.yml to adjust the scaling min/max and to switch instances to on-demand for more stability. If you want T instances to run in standard mode, uncomment cpuCredits: standard for your instance group.

About

Experimenting with AI stuff
