diff --git a/docs/about/usage-tracking.md b/docs/about/usage-tracking.md index f62cdac86b..f7e9d7c0d7 100644 --- a/docs/about/usage-tracking.md +++ b/docs/about/usage-tracking.md @@ -12,8 +12,8 @@ Usage tracking for Kubeflow on AWS collects the instance ID used by one of the w Usage tracking is activated by default. If you deactivated usage tracking for your Kubeflow deployment and would like to activate it after the fact, you can do so at any time with the following command: -``` -kustomize build distributions/aws/aws-telemetry | kubectl apply -f - +```bash +kustomize build awsconfigs/common/aws-telemetry | kubectl apply -f - ``` ### Deactivate usage tracking @@ -23,11 +23,11 @@ kustomize build distributions/aws/aws-telemetry | kubectl apply -f - You can deactivate usage tracking by skipping the telemetry component installation in one of two ways: 1. For single line installation, comment out the [`aws-telemetry` line](https://github.com/awslabs/kubeflow-manifests/blob/main/docs/deployment/vanilla/kustomization.yaml#L57) in the `kustomization.yaml` file of your choosing: - ``` + ```bash # ./../aws-telemetry ``` 2. For individual component installation, **do not** install the `aws-telemetry` component: - ``` + ```bash # AWS Telemetry - This is an optional component. kustomize build awsconfigs/common/aws-telemetry | kubectl apply -f - ``` diff --git a/docs/deployment/README.md b/docs/deployment/README.md deleted file mode 100644 index d46bc28f86..0000000000 --- a/docs/deployment/README.md +++ /dev/null @@ -1,92 +0,0 @@ -# Kubeflow on AWS - -## Deployment Options - -In this directory you can find instructions for deploying Kubeflow on Amazon Elastic Kubernetes Service (Amazon EKS). Depending upon your use case you may choose to integrate your deployment with different AWS services. Following are various deployment options: - -### Components configured for RDS and S3 -Installation steps can be found [here](rds-s3) - -### Components configured for Cognito -Installation steps can be found [here](cognito) - -### Components configured for Cognito, RDS and S3 -Installation steps can be found [here](cognito-rds-s3) - -### Vanilla -Installation steps can be found [here](vanilla) - -## Add Ons - Services/Components that can be integrated with a Kubeflow deployment - -### Using EFS with Kubeflow -Installation steps can be found [here](add-ons/storage/efs) - -### Using FSx for Lustre with Kubeflow -Installation steps can be found [here](add-ons/storage/fsx-for-lustre) - - -### CloudWatch Logging and Container Insights -Amazon EKS offers Container Insights using Amazon CloudWatch which monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real time. You can use CloudWatch to collect and track metrics, which are variables you can measure for your resources and applications. FluentBit is used as the DaemonSet to send logs to CloudWatch Logs. Install AWS CloudWatch by following their [documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html). - -Installation steps can be found [here](add-ons/cloudwatch/README.md) - -## Security - -The scripts in this repository are meant to be used for development/testing purposes. We highly recommend to follow AWS security best practice documentation while provisioning AWS resources. We have added few references below. 
- -[Security best practices for Amazon Elastic Kubernetes Service (EKS)](https://aws.github.io/aws-eks-best-practices/security/docs/) -[Security best practices for AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/best-practices.html) -[Security best practices for Amazon Relational Database Service (RDS)](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.Security.html) -[Security best practices for Amazon Simple Storage Service (S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html) -[Security in Amazon Route53](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/security.html) -[Security in Amazon Certificate Manager (ACM)](https://docs.aws.amazon.com/acm/latest/userguide/security.html) -[Security best practices for Amazon Cognito user pools](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html) -[Security in Amazon Elastic Load Balancing (ELB)](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/security.html) - -## Usage Tracking - -AWS uses customer feedback and usage information to improve the quality of the services and software we offer to customers. We have added usage data collection to the AWS Kubeflow distribution in order to better understand customer usage and guide future improvements. Usage tracking for Kubeflow is activated by default, but is entirely voluntary and can be deactivated at any time. - -Usage tracking for Kubeflow on AWS collects the instance ID used by one of the worker nodes in a customer’s cluster. This data is sent back to AWS once per day. Usage tracking only collects the EC2 instance ID where Kubeflow is running and does not collect or export any other data to AWS. If you wish to deactivate this tracking, instructions are below. - -### How to activate usage tracking - -Usage tracking is activated by default. If you deactivated usage tracking for your Kubeflow deployment and would like to activate it after the fact, you can do so at any time with the following command: - -- ``` - kustomize build distributions/aws/aws-telemetry | kubectl apply -f - - ``` - -### How to deactivate usage tracking - -**Before deploying Kubeflow:** - -You can deactivate usage tracking by skipping the telemetry component installation in one of two ways: - -1. For single line installation, comment out the `aws-telemetry` line in the `kustomization.yaml` file. e.g. in [cognito-rds-s3 kustomization.yaml](cognito-rds-s3/kustomization.yaml#L59) file: - ``` - # ./../aws-telemetry - ``` -1. For individual component installation, **do not** install the `aws-telemetry` component: - ``` - # AWS Telemetry - This is an optional component. See usage tracking documentation for more information - kustomize build distributions/aws/aws-telemetry | kubectl apply -f - - ``` -**After deploying Kubeflow:** - -To deactivate usage tracking on an existing deployment, delete the `aws-kubeflow-telemetry` cronjob with the following command: - -``` -kubectl delete cronjob -n kubeflow aws-kubeflow-telemetry -``` - -### Information collected by usage tracking - -* **Instance ID** - We collect the instance ID used by one of the worker nodes in the customer’s EKS cluster. This collection occurs once per day. - -### Learn more - -The telemetry data we collect is in accordance with AWS data privacy policies. 
For more information, see the following: - -* [AWS Service Terms](https://aws.amazon.com/service-terms/) -* [Data Privacy](https://aws.amazon.com/compliance/data-privacy-faq/) diff --git a/docs/deployment/_index.md b/docs/deployment/_index.md new file mode 100644 index 0000000000..7c58ddf55b --- /dev/null +++ b/docs/deployment/_index.md @@ -0,0 +1,5 @@ ++++ +title = "Deployment" +description = "Deploy Kubeflow on AWS" +weight = 10 ++++ diff --git a/docs/deployment/add-ons/_index.md b/docs/deployment/add-ons/_index.md new file mode 100644 index 0000000000..b7dc394217 --- /dev/null +++ b/docs/deployment/add-ons/_index.md @@ -0,0 +1,5 @@ ++++ +title = "Add-ons" +description = "Add-on installation guides for Kubeflow on AWS" +weight = 70 ++++ diff --git a/docs/deployment/add-ons/cloudwatch/README.md b/docs/deployment/add-ons/cloudwatch/guide.md similarity index 51% rename from docs/deployment/add-ons/cloudwatch/README.md rename to docs/deployment/add-ons/cloudwatch/guide.md index 21744437a4..487338b4ed 100644 --- a/docs/deployment/add-ons/cloudwatch/README.md +++ b/docs/deployment/add-ons/cloudwatch/guide.md @@ -1,8 +1,12 @@ -# CloudWatch ContainerInsights on EKS ++++ +title = "CloudWatch" +description = "Set up CloudWatch ContainerInsights on Amazon EKS" +weight = 30 ++++ ## Verify Prerequisites -The EKS Cluster will need IAM service account roles associated with CloudWatchAgentServerPolicy attached. - ``` +The EKS cluster will need IAM service account roles associated with CloudWatchAgentServerPolicy attached. + ```bash export CLUSTER_NAME=<> export CLUSTER_REGION=<> @@ -10,11 +14,10 @@ eksctl utils associate-iam-oidc-provider --region=$CLUSTER_REGION --cluster=$CLU eksctl create iamserviceaccount --name cloudwatch-agent --namespace amazon-cloudwatch --cluster $CLUSTER_NAME --region $CLUSTER_REGION cloudwatch-agent --approve --override-existing-serviceaccounts --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy eksctl create iamserviceaccount --name fluent-bit --namespace amazon-cloudwatch --cluster $CLUSTER_NAME --region $CLUSTER_REGION --approve --override-existing-serviceaccounts --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy ``` - ## Install -To install an optimized quick start configuration enter the following command. -``` +To install an optimized QuickStart configuration, enter the following command: +```bash FluentBitHttpPort='2020' FluentBitReadFromHead='Off' [[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off'|| FluentBitReadFromTail='On' @@ -22,42 +25,36 @@ FluentBitReadFromHead='Off' curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's/{{cluster_name}}/'${CLUSTER_NAME}'/;s/{{region_name}}/'${CLUSTER_REGION}'/;s/{{http_server_toggle}}/"'${FluentBitHttpServer}'"/;s/{{http_server_port}}/"'${FluentBitHttpPort}'"/;s/{{read_from_head}}/"'${FluentBitReadFromHead}'"/;s/{{read_from_tail}}/"'${FluentBitReadFromTail}'"/' | kubectl apply -f - ``` -To verify the installation you can run the following command to see that metrics have been created. Note that it may take up to 15 minutes for the metrics to populate. - -``` +To verify the installation, you can run the `list-metrics` command and check that metrics have been created. It may take up to 15 minutes for the metrics to populate. 
+```bash aws cloudwatch list-metrics --namespace ContainerInsights --region $CLUSTER_REGION ``` -An example of the logs which will be available after installation are the logs of the pods on your cluster. This way the pod logs can still be accessed past their default storage time. Also allows for an easy way to view logs for all pods on your cluster without having to directly connect to your EKS cluster. +An example of the logs that will be available after installation are the logs of the Pods on your cluster. This way, the Pod logs can still be accessed past their default storage time. This also allows for an easy way to view logs for all Pods on your cluster without having to directly connect to your EKS cluster. The logs can be accessed by through CloudWatch log groups ![cloudwatch](./images/cloudwatch-logs.png) +To view individual Pod logs, select `/aws/containerinsights/YOUR_CLUSTER_NAME/application`. ![application](./images/cloudwatch-application-logs.png) -To view individual pod logs select /aws/containerinsights/YOUR_CLUSTER_NAME/application ![application](./images/cloudwatch-application-logs.png) - +The following image is an example of the `jupyter-web-app` Pod logs available through CloudWatch. ![jupyter-logs](./images/cloudwatch-pod-logs.png) -Here is an example of the jupyter-web-app pods logs available through CloudWatch ![jupyter-logs](./images/cloudwatch-pod-logs.png) +For a full list of metrics that are provided by default, see [Amazon EKS and Kubernetes Container Insights metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html). - -An example of the metrics that will be available after installation are pod_network_tx_bytes. The full list of metrics that are provided by default can be found [here](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html) - -The metrics are grouped by varying different parameters such as Cluster,Namespace,PodName +The metrics are grouped by varying parameters such as Cluster, Namespace, or PodName. ![cloudwatch-metrics](./images/cloudwatch-metrics.png) -An example of the graphed metrics for the istio-system namespace which deals with internet traffic +The following image is an example of the graphed metrics for the `istio-system` namespace that deals with internet traffic. ![cloudwatch-namespace-metrics](./images/cloudwatch-namespace-metrics.png) -The following guide provides instructions on viewing CloudWatch metrics https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html the metric namespace to select is ContainerInsights - -You can see the full list of logs and metrics through https://console.aws.amazon.com/cloudwatch/ - +See [Viewing available metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html) for more information on CloudWatch metrics. Select the ContainerInsights metric namespace. +You can see the full list of logs and metrics through the [Amazon CloudWatch AWS Console](https://console.aws.amazon.com/cloudwatch/). ## Uninstall -To uninstall CloudWatch ContainerInsights enter the following command. 
-``` +To uninstall CloudWatch ContainerInsights, enter the following command: +```bash curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's/{{cluster_name}}/'${ClusterName}'/;s/{{region_name}}/'${LogRegion}'/;s/{{http_server_toggle}}/"'${FluentBitHttpServer}'"/;s/{{http_server_port}}/"'${FluentBitHttpPort}'"/;s/{{read_from_head}}/"'${FluentBitReadFromHead}'"/;s/{{read_from_tail}}/"'${FluentBitReadFromTail}'"/' | kubectl delete -f - ``` -## Additional Information -Full documentation and additional configuration options are available through EKS [documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html) \ No newline at end of file +## Additional information +For full documentation and additional configuration options, see [Quick Start setup for Container Insights on Amazon EKS and Kubernetes](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html). \ No newline at end of file diff --git a/docs/deployment/add-ons/load-balancer/README.md b/docs/deployment/add-ons/load-balancer/README.md deleted file mode 100644 index 3b081d05e7..0000000000 --- a/docs/deployment/add-ons/load-balancer/README.md +++ /dev/null @@ -1,192 +0,0 @@ -# Exposing Kubeflow over Load Balancer - -This tutorial shows how to expose Kubeflow over a load balancer on AWS. - -## Before you begin - -Follow this guide only if you are **not** using `Cognito` as the authentication provider in your deployment. Cognito integrated deployment is configured with AWS Load Balancer controller by default to create an ingress managed application load balancer and exposes Kubeflow via a hosted domain. - -## Background - -Kubeflow does not offer a generic solution for connecting to Kubeflow over a load balancer because this process is highly dependent on your environment/cloud provider. On AWS, we use the [AWS Load Balancer controller](https://kubernetes-sigs.github.io/aws-load-balancer-controller/) which satisfies the Kubernetes [Ingress resource](https://kubernetes.io/docs/concepts/services-networking/ingress/) to create an [Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html) (ALB). When you create a Kubernetes `Ingress`, an ALB is provisioned that load balances application traffic. - -In order to connect to Kubeflow using a LoadBalancer, we need to setup HTTPS. The reason is that many of the Kubeflow web apps (e.g., Tensorboard Web App, Jupyter Web App, Katib UI) use [Secure Cookies](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies#restrict_access_to_cookies), so accessing Kubeflow with HTTP over a non-localhost domain does not work. - -To secure the traffic and use HTTPS, we must associate a Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificate with the load balancer. [AWS Certificate Manager](https://aws.amazon.com/certificate-manager/) is a service that lets you easily provision, manage, and deploy public and private Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates for use with AWS services and your internal connected resources. 
To create a certificate for use with the load balancer, [you must specify a domain name](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/create-https-listener.html#https-listener-certificates) i.e. certificates cannot be created for ALB DNS. You can register your domain using any domain service provider such as [Route53](https://aws.amazon.com/route53/), GoDoddy etc. - -## Prerequisites - -1. Kubeflow deployment on EKS with Dex as auth provider(default in [Vanilla](../../vanilla/README.md) Kubeflow). -1. Installed the tools mentioned in [prerequisite section of this](../../vanilla/README.md#prerequisites) document on the client machine. -1. Verify you are connected to right cluster, cluster has compute and the aws region is set to the region of cluster. - 1. Verify cluster name and region are exported - ``` - echo $CLUSTER_REGION - echo $CLUSTER_NAME - ``` - 1. Display the current cluster kubeconfig points to - ``` - kubectl config current-context - aws eks describe-cluster --name $CLUSTER_NAME - ``` -1. Verify the current directory is the root of the repository by running the `pwd` command. The output should be `` directory - - -## Create Load Balancer - -To make it easy to create the load balancer, you can use the [script provided in this section](#automated-script). If you prefer to use the automated scripts, you need to only execute the steps in the [automated script section](#automated-script). Read the following sections in this guide to understand what happens when you run the script or execute all the steps if you prefer to do it manually/hands-on. - -### Create Domain and Certificates - -As explained in the background section, you need a registered domain and TLS certificate to use HTTPS with load balancer. Since your top level domain(e.g. `example.com`) could have been registered at any service provider, for uniformity and taking advantage of the integration provided between Route53, ACM and Application Load Balancer, you will create a separate [sudomain](https://en.wikipedia.org/wiki/Subdomain) (e.g. `platform.example.com`) to host Kubeflow and a corresponding hosted zone in Route53 to route traffic for this subdomain. To get TLS support, you will need certificates for both the root domain(`*.example.com`) and subdomain(`*.platform.example.com`) in the region where your platform will be running(i.e. EKS cluster region). - -#### Create a subdomain - -1. Register a domain in any domain provider like [Route 53](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/domain-register.html) or GoDaddy.com etc. Lets assume this domain is `example.com`. It is handy to have a domain managed by Route53 to deal with all the DNS records you will have to add (wildcard for ALB DNS, validation for the certificate manager, etc) -1. Goto Route53 and create a subdomain to host kubeflow: - 1. Create a hosted zone for the desired subdomain e.g. `platform.example.com`. - 1. Copy the value of NS type record from the subdomain hosted zone (`platform.example.com`) - 1. ![subdomain-NS](./files/subdomain-NS.png) - 1. Create a `NS` type of record in the root `example.com` hosted zone for the subdomain `platform.example.com`. - 1. ![root-domain-NS-creating-NS](./files/root-domain-NS-creating-NS.png) - 1. Following is a screenshot of the record after creation in `example.com` hosted zone. - 1. ![root-domain-NS-created](./files/root-domain-NS-created.png) - -From this point onwards, we will be creating/updating the DNS records **only in the subdomain**. 
All the screenshots of hosted zone in the following sections/steps of this guide are for the subdomain. -#### Create certificates for domain - -Create the certificates for the domains in the region where your platform will be running(i.e. EKS cluster region) by following [this document](https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html#request-public-console) in the specified order. - -> **Note:** -> - The ceritificates are valid only after successful [validation of domain ownership](https://docs.aws.amazon.com/acm/latest/userguide/domain-ownership-validation.html) - - Following is a screenshot showing a certificate has been issued. Note: Status turns to `Issued` after few minutes of validation. - - ![successfully-issued-certificate](./files/successfully-issued-certificate.png) -> - If you choose DNS validation for the validation of the certificates, you will be asked to create a CNAME type record in the hosted zone. - - Following is a screenshot of CNAME record of the certificate in `platform.example.com` hosted zone for DNS validation: - - ![DNS-record-for-certificate-validation](./files/DNS-record-for-certificate-validation.png) - -1. Create a certificate for `*.example.com` in the region where your platform will be running -1. Create a certificate for `*.platform.example.com` in the region where your platform will be running - -### Configure Ingress - -1. Export the ARN of the certificate created for `*.platform.example.com`: - 1. `export certArn=<>` -1. Configure the parameters for [ingress](../../../../awsconfigs/common/istio-ingress/overlays/https/params.env) with the certificate ARN of the subdomain - 1. ``` - printf 'certArn='$certArn'' > awsconfigs/common/istio-ingress/overlays/https/params.env - ``` -### Configure Load Balancer Controller - -Setup resources required for the load balancer controller: - -1. Make sure all the subnets(public and private) corresponding to the EKS cluster are tagged according to the `Prerequisites` section in this [document](https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html). Ignore the requirement to have an existing ALB provisioned on the cluster. We will be deploying load balancer controller version 1.1.5 in the later section. - 1. Specifically check if the following tags exist on the subnets: - 1. `kubernetes.io/cluster/cluster-name` (replace `cluster-name` with your cluster name e.g. `kubernetes.io/cluster/my-k8s-cluster`). Add this tag in both private and public subnets. If you created the cluster using eksctl, you might be missing only this tag. Use the following command to tag all subnets by substituting the value of `TAG_VALUE` variable(`owned` or `shared`). Use `shared` as tag value if you have more than one cluster using the subnets: - - ``` - export TAG_VALUE=<> - export CLUSTER_SUBNET_IDS=$(aws ec2 describe-subnets --region $CLUSTER_REGION --filters Name=tag:alpha.eksctl.io/cluster-name,Values=$CLUSTER_NAME --output json | jq -r '.Subnets[].SubnetId') - for i in "${CLUSTER_SUBNET_IDS[@]}" - do - aws ec2 create-tags --resources ${i} --tags Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=${TAG_VALUE} - done - ``` - 1. `kubernetes.io/role/internal-elb`. Add this tag only to private subnets - 1. `kubernetes.io/role/elb`. Add this tag only to public subnets -1. Load balancer controller will use [IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)(IRSA) to access AWS services. An OIDC provider must exist for your cluster to use IRSA. 
Create an OIDC provider and associate it with for your EKS cluster by running the following command if your cluster doesn’t already have one: - 1. ``` - eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} --region ${CLUSTER_REGION} --approve - ``` -1. Create an IAM role with [these permissions](../../../../awsconfigs/infra_configs/iam_alb_ingress_policy.json) for the load balancer controller to use via a service account to access AWS services. - 1. ``` - export LBC_POLICY_NAME=alb_ingress_controller_${CLUSTER_REGION}_${CLUSTER_NAME} - export LBC_POLICY_ARN=$(aws iam create-policy --policy-name $LBC_POLICY_NAME --policy-document file://awsconfigs/infra_configs/iam_alb_ingress_policy.json --output text --query 'Policy.Arn') - eksctl create iamserviceaccount --name aws-load-balancer-controller --namespace kube-system --cluster ${CLUSTER_NAME} --region ${CLUSTER_REGION} --attach-policy-arn ${LBC_POLICY_ARN} --override-existing-serviceaccounts --approve - ``` -1. Configure the parameters for [load balancer controller](../../../../awsconfigs/common/aws-alb-ingress-controller/base/params.env) with the cluster name - 1. ``` - printf 'clusterName='$CLUSTER_NAME'' > awsconfigs/common/aws-alb-ingress-controller/base/params.env - ``` - -### Build Manifests and deploy components -Run the following command to build and install the components specified in this [kustomize](./kustomization.yaml) file. -``` -while ! kustomize build docs/deployment/add-ons/load-balancer | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done -``` - -### Update the domain with ALB address - -1. Check if ALB is provisioned. It takes around 3-5 minutes - 1. ``` - kubectl get ingress -n istio-system istio-ingress - NAME CLASS HOSTS ADDRESS PORTS AGE - istio-ingress * xxxxxx-istiosystem-istio-2af2-1100502020.us-west-2.elb.amazonaws.com 80 15d - ``` - 2. If `ADDRESS` is empty after a few minutes, check the logs of controller by following [this guide](https://www.kubeflow.org/docs/distributions/aws/troubleshooting-aws/#alb-fails-to-provision) -1. When ALB is ready, copy the DNS name of that load balancer and create a CNAME entry to it in Route53 under subdomain (`platform.example.com`) for `*.platform.example.com` - 1. ![subdomain-*.platform-record](./files/subdomain-*.platform-record.png) - -1. The central dashboard should now be available at `https://kubeflow.platform.example.com`. Open a browser and navigate to this URL. - -### Automated script - -1. Install dependencies for the script - ``` - cd tests/e2e - pip install -r requirements.txt - ``` -1. Substitute values in `tests/e2e/utils/load_balancer/config.yaml`. - 1. Registed root domain in `route53.rootDomain.name`. Lets assume this domain is `example.com` - 1. If your domain is managed in route53, enter the Hosted zone ID found under Hosted zone details in `route53.rootDomain.hostedZoneId`. Skip this step if your domain is managed by other domain provider. - 1. Name of the sudomain you want to host Kubeflow (e.g. `platform.example.com`) in `route53.subDomain.name`. - 1. Cluster name and region where kubeflow is deployed in `cluster.name` and `cluster.region` (e.g. us-west-2) respectively. - 1. The config file will look something like: - 1. ``` - cluster: - name: kube-eks-cluster - region: us-west-2 - route53: - rootDomain: - hostedZoneId: XXXX - name: example.com - subDomain: - name: platform.example.com - ``` -1. Run the script to create the resources - 1. ``` - PYTHONPATH=.. python utils/load_balancer/setup_load_balancer.py - ``` -1. 
The script will update the config file with the resource names/ids/ARNs it created. Following is a sample: - 1. ``` - kubeflow: - alb: - dns: xxxxxx-istiosystem-istio-2af2-1100502020.us-west-2.elb.amazonaws.com - serviceAccount: - name: alb-ingress-controller - namespace: kubeflow - policyArn: arn:aws:iam::123456789012:policy/alb_ingress_controller_kube-eks-clusterxxx - cluster: - name: kube-eks-cluster - region: us-west-2 - route53: - rootDomain: - certARN: arn:aws:acm:us-west-2:123456789012:certificate/9d8c4bbc-3b02-4a48-8c7d-d91441c6e5af - hostedZoneId: XXXXX - name: example.com - subDomain: - certARN: arn:aws:acm:us-west-2:123456789012:certificate/d1d7b641c238-4bc7-f525-b7bf-373cc726 - hostedZoneId: XXXXX - name: platform.example.com - ``` -1. The central dashboard should now be available at `https://kubeflow.platform.example.com`. Open a browser and navigate to this URL. - -## Clean up - -To delete the resources created in this guide, run the following commands from the root of repository: -Make sure you have the configuration file created by the script in `tests/e2e/utils/load_balancer/config.yaml`. If you did not use the script, plugin the name/ARN/id of the resources you created in the configuration file by referring the sample in Step 4 of [previous section](#automated-script) - -``` -cd tests/e2e -PYTHONPATH=.. python utils/load_balancer/lb_resources_cleanup.py -cd - -``` \ No newline at end of file diff --git a/docs/deployment/add-ons/load-balancer/guide.md b/docs/deployment/add-ons/load-balancer/guide.md new file mode 100644 index 0000000000..565542cd00 --- /dev/null +++ b/docs/deployment/add-ons/load-balancer/guide.md @@ -0,0 +1,196 @@ ++++ +title = "Load Balancer" +description = "Expose Kubeflow over Load Balancer on AWS" +weight = 40 ++++ + +This tutorial shows how to expose Kubeflow over a load balancer on AWS. + +## Before you begin + +Follow this guide only if you are **not** using `Cognito` as the authentication provider in your deployment. Cognito-integrated deployment is configured with the AWS Load Balancer controller by default to create an ingress-managed Application Load Balancer and exposes Kubeflow via a hosted domain. + +## Background + +Kubeflow does not offer a generic solution for connecting to Kubeflow over a Load Balancer because this process is highly dependent on your environment and cloud provider. On AWS, we use the [AWS Load Balancer (ALB) controller](https://kubernetes-sigs.github.io/aws-load-balancer-controller/), which satisfies the Kubernetes [Ingress resource](https://kubernetes.io/docs/concepts/services-networking/ingress/) to create an [Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html) (ALB). When you create a Kubernetes `Ingress`, an ALB is provisioned that load balances application traffic. + +In order to connect to Kubeflow using a Load Balancer, we need to setup HTTPS. Many of the Kubeflow web apps (e.g. Tensorboard Web App, Jupyter Web App, Katib UI) use [Secure Cookies](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies#restrict_access_to_cookies), so accessing Kubeflow with HTTP over a non-localhost domain does not work. + +To secure the traffic and use HTTPS, we must associate a Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificate with the Load Balancer. 
[AWS Certificate Manager](https://aws.amazon.com/certificate-manager/) is a service that lets you easily provision, manage, and deploy public and private SSL/TLS certificates for use with AWS services and your internal connected resources. To create a certificate for use with the Load Balancer, [you must specify a domain name](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/create-https-listener.html#https-listener-certificates) (i.e. certificates cannot be created for ALB DNS). You can register your domain using any domain service provider such as [Route53](https://aws.amazon.com/route53/) or GoDaddy.
+
+## Prerequisites
+This guide assumes that you have:
+1. A Kubeflow deployment on EKS with Dex as auth provider (the default setup in the [Vanilla](/docs/deployment/vanilla/guide/) deployment of Kubeflow on AWS).
+1. Installed the tools mentioned in the [general prerequisites](/docs/deployment/prerequisites/) guide on the client machine.
+1. Verified that you are connected to the right cluster, that the cluster has compute, and that the AWS region is set to the region of your cluster.
+ 1. Verify that your cluster name and region are exported:
+ ```bash
+ echo $CLUSTER_REGION
+ echo $CLUSTER_NAME
+ ```
+ 1. Display the current cluster that kubeconfig points to:
+ ```bash
+ kubectl config current-context
+ aws eks describe-cluster --name $CLUSTER_NAME
+ ```
+1. Verify that the current directory is the root of the repository by running the `pwd` command. The output should be ``.
+
+## Create Load Balancer
+
+To make it easy to create the Load Balancer, you can use the [script provided in this section](#automated-script). If you prefer to use the automated scripts, you only need to follow the steps in the [automated script section](#automated-script). Read the following sections in this guide to understand what happens when you run the automated script. This guide also walks you through all of the setup steps if you prefer to do things manually.
+
+### Create domain and certificates
+
+You need a registered domain and TLS certificate to use HTTPS with Load Balancer. Since your top level domain (e.g. `example.com`) can be registered at any service provider, for uniformity and taking advantage of the integration provided between Route53, ACM, and Application Load Balancer, you will create a separate [subdomain](https://en.wikipedia.org/wiki/Subdomain) (e.g. `platform.example.com`) to host Kubeflow and a corresponding hosted zone in Route53 to route traffic for this subdomain. To get TLS support, you will need certificates for both the root domain (`*.example.com`) and subdomain (`*.platform.example.com`) in the region where your platform will run (your EKS cluster region).
+
+#### Create a subdomain
+
+1. Register a domain in any domain provider like [Route 53](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/domain-register.html) or GoDaddy. For this guide, we assume that this domain is `example.com`. It is handy to have a domain managed by Route53 to deal with all the DNS records that you will have to add (wildcard for ALB DNS, validation for the certificate manager, etc).
+1. Go to Route53 and create a subdomain to host Kubeflow:
+ 1. Create a hosted zone for the desired subdomain e.g. `platform.example.com`.
+ 1. Copy the value of NS type record from the subdomain hosted zone (`platform.example.com`).
+ 1. ![subdomain-NS](./files/subdomain-NS.png)
+ 1. Create an `NS` type of record in the root `example.com` hosted zone for the subdomain `platform.example.com`.
+ 1. ![root-domain-NS-creating-NS](./files/root-domain-NS-creating-NS.png) + 1. The following image is a screenshot of the record after creation in `example.com` hosted zone. + 1. ![root-domain-NS-created](./files/root-domain-NS-created.png) + +From this point onwards, we will create and update the DNS records **only in the subdomain**. All of the images of the hosted zone in the following steps of this guide are for the subdomain. + +#### Create certificates for domain + +To create the certificates for the domains in the region where your platform will run (i.e. EKS cluster region), follow the steps in the [Request a public certificate using the console](https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html#request-public-console) guide. + +> Note: The certificates are valid only after successful [validation of domain ownership](https://docs.aws.amazon.com/acm/latest/userguide/domain-ownership-validation.html). + +The following image is a screenshot showing that a certificate has been issued. +> Note: Status turns to `Issued` after a few minutes of validation. +![successfully-issued-certificate](./files/successfully-issued-certificate.png) + +If you choose DNS validation for the validation of the certificates, you will be asked to create a CNAME type record in the hosted zone. The following image is a screenshot of the CNAME record of the certificate in the `platform.example.com` hosted zone for DNS validation: +![DNS-record-for-certificate-validation](./files/DNS-record-for-certificate-validation.png) + +1. Create a certificate for `*.example.com` in the region where your platform will run. +1. Create a certificate for `*.platform.example.com` in the region where your platform will run. + +### Configure Ingress + +1. Export the ARN of the certificate created for `*.platform.example.com`: + 1. `export certArn=<>` +1. Configure the parameters for [ingress](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/common/istio-ingress/overlays/https/params.env) with the certificate ARN of the subdomain. + 1. ```bash + printf 'certArn='$certArn'' > awsconfigs/common/istio-ingress/overlays/https/params.env + ``` +### Configure Load Balancer controller + +Set up resources required for the Load Balancer controller: + +1. Make sure that all the subnets(public and private) corresponding to the EKS cluster are tagged according to the `Prerequisites` section in the [Application load balancing on Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html) guide. Ignore the requirement to have an existing ALB provisioned on the cluster. We will deploy Load Balancer controller version 1.1.5 later on. + 1. Check if the following tags exist on the subnets: + 1. `kubernetes.io/cluster/cluster-name` (replace `cluster-name` with your cluster name e.g. `kubernetes.io/cluster/my-k8s-cluster`). Add this tag in both private and public subnets. If you created the cluster using `eksctl`, you might be missing only this tag. Use the following command to tag all subnets by substituting the value of `TAG_VALUE` variable(`owned` or `shared`). 
Use `shared` as the tag value if you have more than one cluster using the subnets: + - ```bash + export TAG_VALUE=<> + export CLUSTER_SUBNET_IDS=$(aws ec2 describe-subnets --region $CLUSTER_REGION --filters Name=tag:alpha.eksctl.io/cluster-name,Values=$CLUSTER_NAME --output json | jq -r '.Subnets[].SubnetId') + for i in "${CLUSTER_SUBNET_IDS[@]}" + do + aws ec2 create-tags --resources ${i} --tags Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=${TAG_VALUE} + done + ``` + 1. `kubernetes.io/role/internal-elb`. Add this tag only to private subnets + 1. `kubernetes.io/role/elb`. Add this tag only to public subnets +1. The Load balancer controller will use [IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)(IRSA) to access AWS services. An OIDC provider must exist for your cluster to use IRSA. Create an OIDC provider and associate it with your EKS cluster by running the following command if your cluster doesn’t already have one: + 1. ```bash + eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} --region ${CLUSTER_REGION} --approve + ``` +1. Create an IAM role with [the necessary permissions](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/infra_configs/iam_alb_ingress_policy.json) for the Load Balancer controller to use via a service account to access AWS services. + 1. ```bash + export LBC_POLICY_NAME=alb_ingress_controller_${CLUSTER_REGION}_${CLUSTER_NAME} + export LBC_POLICY_ARN=$(aws iam create-policy --policy-name $LBC_POLICY_NAME --policy-document file://awsconfigs/infra_configs/iam_alb_ingress_policy.json --output text --query 'Policy.Arn') + eksctl create iamserviceaccount --name aws-load-balancer-controller --namespace kube-system --cluster ${CLUSTER_NAME} --region ${CLUSTER_REGION} --attach-policy-arn ${LBC_POLICY_ARN} --override-existing-serviceaccounts --approve + ``` +1. Configure the parameters for [load balancer controller](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/common/aws-alb-ingress-controller/base/params.env) with the cluster name + 1. ```bash + printf 'clusterName='$CLUSTER_NAME'' > awsconfigs/common/aws-alb-ingress-controller/base/params.env + ``` + +### Build Manifests and deploy components +Run the following command to build and install the components specified in the Load Balancer [kustomize](https://github.com/awslabs/kubeflow-manifests/blob/main/docs/deployment/add-ons/load-balancer/kustomization.yaml) file. +```bash +while ! kustomize build docs/deployment/add-ons/load-balancer | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done +``` + +### Update the domain with ALB address + +1. Check if ALB is provisioned. This may take a few minutes. + 1. ``` + kubectl get ingress -n istio-system istio-ingress + NAME CLASS HOSTS ADDRESS PORTS AGE + istio-ingress * xxxxxx-istiosystem-istio-2af2-1100502020.us-west-2.elb.amazonaws.com 80 15d + ``` + 2. If `ADDRESS` is empty after a few minutes, check the logs of controller by following [this guide](https://www.kubeflow.org/docs/distributions/aws/troubleshooting-aws/#alb-fails-to-provision) +1. When ALB is ready, copy the DNS name of that load balancer and create a CNAME entry to it in Route53 under subdomain (`platform.example.com`) for `*.platform.example.com` + 1. ![subdomain-*.platform-record](./files/subdomain-*.platform-record.png) + +1. The central dashboard should now be available at `https://kubeflow.platform.example.com`. Open a browser and navigate to this URL. 
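+
+Optionally, before opening the browser, you can check from the command line that the new record resolves and that the endpoint answers over HTTPS. This is a minimal sketch that assumes the example subdomain `platform.example.com` used throughout this guide; substitute your own domain. DNS changes can take a few minutes to propagate.
+```bash
+# Confirm that the CNAME created above resolves (it should point at the ALB DNS name)
+dig +short kubeflow.platform.example.com
+# Confirm that the endpoint responds over HTTPS using the ACM certificate
+curl -I https://kubeflow.platform.example.com
+```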
+ +### Automated script + +1. Install dependencies for the script + ```bash + cd tests/e2e + pip install -r requirements.txt + ``` +1. Substitute values in `tests/e2e/utils/load_balancer/config.yaml`. + 1. Register root domain in `route53.rootDomain.name`. For this example, assume that this domain is `example.com`. + 1. If your domain is managed in Route53, enter the Hosted zone ID found under Hosted zone details in `route53.rootDomain.hostedZoneId`. Skip this step if your domain is managed by other domain provider. + 1. Name of the sudomain that you want to use to host Kubeflow (e.g. `platform.example.com`) in `route53.subDomain.name`. + 1. Cluster name and region where Kubeflow is deployed in `cluster.name` and `cluster.region` (e.g. `us-west-2`), respectively. + 1. The Config file will look something like: + 1. ```yaml + cluster: + name: kube-eks-cluster + region: us-west-2 + route53: + rootDomain: + hostedZoneId: XXXX + name: example.com + subDomain: + name: platform.example.com + ``` +1. Run the script to create the resources. + 1. ```bash + PYTHONPATH=.. python utils/load_balancer/setup_load_balancer.py + ``` +1. The script will update the Config file with the resource names, IDs, and ARNs that it created. Refer to the following example for more information: + 1. ```yaml + kubeflow: + alb: + dns: xxxxxx-istiosystem-istio-2af2-1100502020.us-west-2.elb.amazonaws.com + serviceAccount: + name: alb-ingress-controller + namespace: kubeflow + policyArn: arn:aws:iam::123456789012:policy/alb_ingress_controller_kube-eks-clusterxxx + cluster: + name: kube-eks-cluster + region: us-west-2 + route53: + rootDomain: + certARN: arn:aws:acm:us-west-2:123456789012:certificate/9d8c4bbc-3b02-4a48-8c7d-d91441c6e5af + hostedZoneId: XXXXX + name: example.com + subDomain: + certARN: arn:aws:acm:us-west-2:123456789012:certificate/d1d7b641c238-4bc7-f525-b7bf-373cc726 + hostedZoneId: XXXXX + name: platform.example.com + ``` +1. The central dashboard should now be available at `https://kubeflow.platform.example.com`. Open a browser and navigate to this URL. + +## Clean up + +To delete the resources created in this guide, run the following commands from the root of your repository: +> Note: Make sure that you have the configuration file created by the script in `tests/e2e/utils/load_balancer/config.yaml`. If you did not use the script, plug in the name, ARN, or ID of the resources that you created in the configuration file by referring to the sample in Step 4 of the [previous section](#automated-script). +```bash +cd tests/e2e +PYTHONPATH=.. python utils/load_balancer/lb_resources_cleanup.py +cd - +``` \ No newline at end of file diff --git a/docs/deployment/add-ons/storage/efs/README.md b/docs/deployment/add-ons/storage/efs/guide.md similarity index 77% rename from docs/deployment/add-ons/storage/efs/README.md rename to docs/deployment/add-ons/storage/efs/guide.md index 0cca91e5f2..f34c79bf88 100644 --- a/docs/deployment/add-ons/storage/efs/README.md +++ b/docs/deployment/add-ons/storage/efs/guide.md @@ -1,55 +1,61 @@ -# Using Amazon EFS as Persistent Storage with Kubeflow ++++ +title = "EFS" +description = "Use Amazon EFS as persistent storage with Kubeflow on AWS" +weight = 10 ++++ This guide describes how to use Amazon EFS as Persistent storage on top of an existing Kubeflow deployment. ## 1.0 Prerequisites -1. 
For this README, we will assume that you already have an EKS Cluster with Kubeflow installed since the EFS CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. You can follow any of the other guides to complete these steps - choose one of the [AWS managed service integrated offering](../../../README.md#deployment-options) or [vanilla distribution](../../../vanilla/README.md).
+For this guide, we assume that you already have an EKS Cluster with Kubeflow installed. The EFS CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. See the [deployment options](/docs/deployment/) and [general prerequisites](/docs/deployment/vanilla/guide/) for more information.
-**Important :**
-You must make sure you have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) for your cluster and that it was added from `eksctl` >= `0.56` or if you already have an OIDC provider in place, then you must make sure you have the tag `alpha.eksctl.io/cluster-name` with the cluster name as its value. If you don't have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags.
+1. Check that you have the necessary [prerequisites](/docs/deployment/vanilla/guide/).
-2. At this point, you have likely cloned this repo and checked out the right branch. Let's save this path to help us naviagte to different paths in the rest of this doc -
-```
+> Important: You must make sure you have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) for your cluster and that it was added from `eksctl` >= `0.56` or if you already have an OIDC provider in place, then you must make sure you have the tag `alpha.eksctl.io/cluster-name` with the cluster name as its value. If you don't have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags.
+
+2. At this point, you have likely cloned the necessary repository and checked out the right branch. Save this path to help navigate to different paths in the rest of this guide.
+```bash
export GITHUB_ROOT=$(pwd)
export GITHUB_STORAGE_DIR="$GITHUB_ROOT/docs/deployment/add-ons/storage/"
```
-3. Make sure the following environment variables are set.
-```
+3. Make sure that the following environment variables are set.
+```bash
export CLUSTER_NAME=
export CLUSTER_REGION=
```
-4. Also, based on your setup, export the name of the user namespace you are planning to use -
-```
+4. Based on your setup, export the name of the user namespace you are planning to use.
+```bash
export PVC_NAMESPACE=kubeflow-user-example-com
```
-5. And finally, choose a name for the EFS claim that we will create. In this guide we will use this variable as the name for the PV as well the PVC.
-```
+
+5. Choose a name for the EFS claim that we will create. In this guide we will use this variable as the name for the PV as well as the PVC.
+```bash
export CLAIM_NAME=
```
-
-## 2.0 Setup EFS
+## 2.0 Set up EFS
You can either use Automated or Manual setup to set up the resources required. If you choose the manual route, you get another choice between **static and dynamic provisioning**, so pick whichever suits you. On the other hand, for the automated script we currently only support **dynamic provisioning**. Whichever combination you pick, be sure to continue picking the appropriate sections through the rest of this guide.
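+
+Whichever option you choose, the rest of this guide assumes that the environment variables exported in the prerequisites are set in your current shell. As an optional sanity check (a minimal sketch that only uses the variable names defined above), you can print them before continuing:
+```bash
+# An empty value means the variable still needs to be exported
+for var in CLUSTER_NAME CLUSTER_REGION PVC_NAMESPACE CLAIM_NAME GITHUB_STORAGE_DIR; do
+  echo "$var=${!var}"
+done
+```
+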
### 2.1 [Option 1] Automated setup The script automates all the manual resource creation steps but is currently only available for **Dynamic Provisioning** option. It performs the required cluster configuration, creates an EFS file system and it also takes care of creating a storage class for dynamic provisioning. Once done, move to section 3.0. -1. Run the following commands from the `tests/e2e` directory as - -``` +1. Run the following commands from the `tests/e2e` directory: +```bash cd $GITHUB_ROOT/tests/e2e ``` -2. Install the script dependencies -``` +2. Install the script dependencies. +```bash pip install -r requirements.txt ``` -3. Run the automated script as follows - +3. Run the automated script. -Note: If you want the script to create a new security group for EFS, specify a name here. On the other hand, if you want to use an existing Security group, you can specify that name too. We have used the same name as the claim we are going to create - -``` +> Note: If you want the script to create a new security group for EFS, specify a name here. On the other hand, if you want to use an existing Security group, you can specify that name too. We have used the same name as the claim we are going to create. + +```bash export SECURITY_GROUP_TO_CREATE=$CLAIM_NAME python utils/auto-efs-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --efs_file_system_name $CLAIM_NAME --efs_security_group_name $SECURITY_GROUP_TO_CREATE @@ -57,12 +63,13 @@ python utils/auto-efs-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME 4. The script above takes care of creating the `StorageClass (SC)` which is a cluster scoped resource. In order to create the `PersistentVolumeClaim (PVC)` you can either use the yaml file provided in this directory or use the Kubeflow UI directly. The PVC needs to be in the namespace you will be accessing it from. Replace the `kubeflow-user-example-com` namespace specified the below with the namespace for your kubeflow user and edit the `efs/dynamic-provisioning/pvc.yaml` file accordingly. -``` +```bash yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml ``` + #### **Advanced customization** The script applies some default values for the file system name, performance mode etc. If you know what you are doing, you can see which options are customizable by executing `python utils/auto-efs-setup.py --help`. @@ -73,14 +80,14 @@ If you prefer to manually setup each component then you can follow this manual g export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text) ``` -#### 1. Install the EFS CSI Driver -We recommend installing the EFS CSI Driver v1.3.4 directly from the [the aws-efs-csi-driver github repo](https://github.com/kubernetes-sigs/aws-efs-csi-driver) as follows - +#### 1. Install the EFS CSI driver +We recommend installing the EFS CSI Driver v1.3.4 directly from the [the aws-efs-csi-driver github repo](https://github.com/kubernetes-sigs/aws-efs-csi-driver) as follows: ``` kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=tags/v1.3.4" ``` -You can confirm that EFS CSI Driver was installed into the default kube-system namespace for you. 
You can check using the following command - +You can confirm that EFS CSI Driver was installed into the default kube-system namespace for you. You can check using the following command: ``` kubectl get csidriver @@ -88,23 +95,23 @@ NAME ATTACHREQUIRED PODINFOONMOUNT MODES AGE efs.csi.aws.com false false Persistent 5d17h ``` -#### 2. Create the IAM Policy for the CSI Driver +#### 2. Create the IAM Policy for the CSI driver The CSI driver's service account (created during installation) requires IAM permission to make calls to AWS APIs on your behalf. Here, we will be annotating the Service Account `efs-csi-controller-sa` with an IAM Role which has the required permissions. -1. Download the IAM policy document from GitHub as follows - +1. Download the IAM policy document from GitHub as follows. ``` curl -o iam-policy-example.json https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/v1.3.4/docs/iam-policy-example.json ``` -2. Create the policy - +2. Create the policy. ``` aws iam create-policy \ --policy-name AmazonEKS_EFS_CSI_Driver_Policy \ --policy-document file://iam-policy-example.json ``` -3. Create an IAM role and attach the IAM policy to it. Annotate the Kubernetes service account with the IAM role ARN and the IAM role with the Kubernetes service account name. You can create the role using eksctl as follows - +3. Create an IAM role and attach the IAM policy to it. Annotate the Kubernetes service account with the IAM role ARN and the IAM role with the Kubernetes service account name. You can create the role using eksctl as follows: ``` eksctl create iamserviceaccount \ @@ -117,20 +124,20 @@ eksctl create iamserviceaccount \ --region $CLUSTER_REGION ``` -4. You can verify by describing the specified service account to check if it has been correctly annotated - +4. You can verify by describing the specified service account to check if it has been correctly annotated: ``` kubectl describe -n kube-system serviceaccount efs-csi-controller-sa ``` -#### 3. Manually Create an Instance of the EFS Filesystem +#### 3. Manually create an instance of the EFS filesystem Please refer to the official [AWS EFS CSI Document](https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-filesystem) for detailed instructions on creating an EFS filesystem. -Note: For this README, we have assumed that you are creating your EFS Filesystem in the same VPC as your EKS Cluster. +> Note: For this guide, we assume that you are creating your EFS Filesystem in the same VPC as your EKS Cluster. #### Choose between dynamic and static provisioning In the following section, you have to choose between setting up [dynamic provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/) or setting up static provisioning. -#### 4. [Option 1] Dynamic Provisioning +#### 4. [Option 1] Dynamic provisioning 1. Use the `$file_system_id` you recorded in section 3 above or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit the `dynamic-provisioning/sc.yaml` file by chaning `` with your `fs-xxxxxx` file system id. You can also change it using the following command : ``` file_system_id=$file_system_id yq e '.parameters.fileSystemId = env(file_system_id)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/sc.yaml @@ -155,7 +162,7 @@ kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml Note : The `StorageClass` is a cluster scoped resource which means we only need to do this step once per cluster. #### 4. 
[Option 2] Static Provisioning -[Using this sample from official AWS Docs](https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/multiple_pods) we have provided the required spec files in the sample subdirectory but you can create the PVC another way. +Using [this sample](https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/multiple_pods), we provided the required spec files in the sample subdirectory. However, you can create the PVC another way. 1. Use the `$file_system_id` you recorded in section 3 above or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit the last line of the static-provisioning/pv.yaml file to specify the `volumeHandle` field to point to your EFS filesystem. Replace `$file_system_id` if it is not already set. ``` @@ -176,7 +183,7 @@ kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/pvc.yaml ``` -### 2.3 Check your Setup +### 2.3 Check your setup Use the following commands to ensure all resources have been deployed as expected and the PersistentVolume is correctly bound to the PersistentVolumeClaim ``` # Only for Static Provisioning @@ -194,12 +201,13 @@ NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE efs-claim Bound efs-pv 5Gi RWX efs-sc 5d16h ``` -## 3.0 Using EFS Storage in Kubeflow +## 3.0 Using EFS storage in Kubeflow In the following two sections we will be using this PVC to create a notebook server with Amazon EFS mounted as the workspace volume, download training data into this filesystem and then deploy a TFJob to train a model using this data. -### 3.1 Connect to the Kubeflow Dashboard +### 3.1 Connect to the Kubeflow dashboard Once you have everything setup, Port Forward as needed and Login to the Kubeflow dashboard. At this point, you can also check the `Volumes` tab in Kubeflow and you should be able to see your PVC is available for use within Kubeflow. -For more details on how to access your Kubeflow dashboard, refer to one of the deployment READMEs based on your setup. If you used the vanilla deployment, you can follow this [README](https://github.com/awslabs/kubeflow-manifests/tree/main/docs/deployment/vanilla#connect-to-your-kubeflow-cluster). + +For more details on how to access your Kubeflow dashboard, refer to one of the [deployment option guides](/docs/deployment/) based on your setup. If you used the vanilla deployment, see [Connect to your Kubeflow cluster](/docs/deployment/install/vanilla/guide/#connect-to-your-kubeflow-cluster). ### 3.2 Changing the default Storage Class After installing Kubeflow, you can change the default Storage Class from `gp2` to the efs storage class you created during the setup. For instance, if you followed the automatic or manual steps, you should have a storage class named `efs-sc`. You can check your storage classes by running `kubectl get sc`. @@ -220,7 +228,7 @@ kubectl patch storageclass efs-sc -p '{"metadata": {"annotations":{"storageclass Note: As mentioned, make sure to change your default storage class only after you have completed your Kubeflow deployment. The default Kubeflow components may not work well with a different storage class. -### 3.3 Note about Permissions +### 3.3 Note about permissions This step may not be necessary but you might need to specify some additional directory permissions on your worker node before you can use these as mount points. 
By default, new Amazon EFS file systems are owned by root:root, and only the root user (UID 0) has read-write-execute permissions. If your containers are not running as root, you must change the Amazon EFS file system permissions to allow other users to modify the file system. The set-permission-job.yaml is an example of how you could set these permissions to be able to use the efs as your workspace in your kubeflow notebook. Modify it accordingly if you run into similar permission issues with any other job pod. ``` @@ -231,16 +239,16 @@ yq e '.spec.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM kubectl apply -f $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml ``` -### 3.4 Using existing EFS volume as workspace or data volume for a notebook +### 3.4 Use existing EFS volume as workspace or data volume for a Notebook -Spin up a new Kubeflow notebook server and specify the name of the PVC to be used as the workspace volume or the data volume and specify your desired mount point. We'll assume you created a PVC with the name `efs-claim` via Kubeflow Volumes UI or via the manual setup step [Static Provisioning](./README.md#4-option-2-static-provisioning). For our example here, we are using the AWS Optimized Tensorflow 2.6 CPU image provided in the notebook configuration options - `public.ecr.aws/c9e4w0g3/notebook-servers/jupyter-tensorflow`. Additionally, use the existing `efs-claim` volume as the workspace volume at the default mount point `/home/jovyan`. The server might take a few minutes to come up. +Spin up a new Kubeflow notebook server and specify the name of the PVC to be used as the workspace volume or the data volume and specify your desired mount point. We'll assume you created a PVC with the name `efs-claim` via Kubeflow Volumes UI or via the manual setup step [Static Provisioning](#4-option-2-static-provisioning). For our example here, we are using the AWS Optimized Tensorflow 2.6 CPU image provided in the Notebook configuration options (`public.ecr.aws/c9e4w0g3/notebook-servers/jupyter-tensorflow`). Additionally, use the existing `efs-claim` volume as the workspace volume at the default mount point `/home/jovyan`. The server might take a few minutes to come up. -In case the server does not start up in the expected time, do make sure to check - +In case the server does not start up in the expected time, do make sure to check: 1. The Notebook Controller Logs 2. The specific notebook server instance pod's logs -### 3.6 Using EFS volume for a TrainingJob using TFJob Operator +### 3.6 Use EFS volume for a TrainingJob using TFJob Operator The following section re-uses the PVC and the Tensorflow Kubeflow Notebook created in the previous steps to download a dataset to the EFS Volume. Then we spin up a TFjob which runs a image classification job using the data from the shared volume. Source: https://www.tensorflow.org/tutorials/load_data/images @@ -258,8 +266,8 @@ data_dir = tf.keras.utils.get_file(origin=dataset_url, data_dir = pathlib.Path(data_dir) ``` -#### 2. Build and Push the Docker image -In the `training-sample` directory, we have provided a sample training script and Dockerfile which you can use as follows to build a docker image. Be sure to point the `$IMAGE_URI` to your registry and specify an appropriate tag - +#### 2. Build and push the Docker image +In the `training-sample` directory, we have provided a sample training script and Dockerfile which you can use as follows to build a docker image. 
Be sure to point the `$IMAGE_URI` to your registry and specify an appropriate tag. ``` export IMAGE_URI= cd training-sample @@ -270,16 +278,16 @@ docker push $IMAGE_URI cd - ``` -#### 3. Configure the tfjob spec file +#### 3. Configure the TFjob spec file Once the docker image is built, replace the `` in the `tfjob.yaml` file, line #17. ``` yq e '.spec.tfReplicaSpecs.Worker.template.spec.containers[0].image = env(IMAGE_URI)' -i training-sample/tfjob.yaml ``` -Also, specify the name of the PVC you created - +Also, specify the name of the PVC you created. ``` yq e '.spec.tfReplicaSpecs.Worker.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i training-sample/tfjob.yaml ``` -Make sure to run it in the same namespace as the claim - +Make sure to run it in the same namespace as the claim: ``` yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i training-sample/tfjob.yaml ``` @@ -290,14 +298,14 @@ At this point, we are ready to train the model using the `training-sample/traini kubectl apply -f training-sample/tfjob.yaml ``` -In order to check that the training job is running as expected, you can check the events in the TFJob describe response as well as the job logs as - +In order to check that the training job is running as expected, you can check the events in the TFJob describe response as well as the job logs. ``` kubectl describe tfjob image-classification-pvc -n $PVC_NAMESPACE kubectl logs -n $PVC_NAMESPACE image-classification-pvc-worker-0 -f ``` ## 4.0 Cleanup -This section cleans up the resources created in this README, to cleanup other resources such as the Kubeflow deployment, please refer to the high level README files. +This section cleans up the resources created in this guide. To clean up other resources, such as the Kubeflow deployment, see [Uninstall Kubeflow](/docs/deployment/uninstall-kubeflow/). ### 4.1 Clean up the TFJob ``` @@ -305,26 +313,26 @@ kubectl delete tfjob -n $PVC_NAMESPACE image-classification-pvc ``` ### 4.2 Delete the Kubeflow Notebook -Login to the dashboard to stop and/or terminate any kubeflow notebooks you created for this session or use the following command - +Login to the dashboard to stop and/or terminate any kubeflow notebooks you created for this session or use the following command: ``` kubectl delete notebook -n $PVC_NAMESPACE ``` -Use the following command to delete the permissions job - +Use the following command to delete the permissions job: ``` kubectl delete pod -n $PVC_NAMESPACE $CLAIM_NAME ``` -### 4.3 Delete PVC, PV and SC in the following order +### 4.3 Delete PVC, PV, and SC in the following order ``` kubectl delete pvc -n $PVC_NAMESPACE $CLAIM_NAME kubectl delete pv efs-pv kubectl delete sc efs-sc ``` -### 4.4 Delete the EFS mount targets, filesystem and security group +### 4.4 Delete the EFS mount targets, filesystem, and security group Use the steps in this [AWS Guide](https://docs.aws.amazon.com/efs/latest/ug/delete-efs-fs.html) to delete the EFS filesystem that you created. -## 5.0 Known Issues: -1. When you rerun the `eksctl create iamserviceaccount` to create and annotate the same service account multiple times, sometimes the role does not get overwritten. In such a case you may need to do one or both of the following - - a. Delete the cloudformation stack associated with this add-on role. - b. Delete the `efs-csi-controller-sa` service account and then re-run the required steps. If you used the auto-script, you can rerun it by specifying the same `filesystem-name` such that a new one is not created. 
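If you run into the stale-role situation described in the known issues for the EFS CSI driver, one way to reset it is to remove both the `efs-csi-controller-sa` service account and the CloudFormation stack that `eksctl` created for it before re-running the `eksctl create iamserviceaccount` step. The following is a minimal sketch; the stack name assumes `eksctl`'s usual `eksctl-<cluster>-addon-iamserviceaccount-<namespace>-<name>` naming convention, so confirm the exact name in the CloudFormation console before deleting anything.
```bash
# List the IAM service accounts that eksctl manages for this cluster and confirm the stale entry.
eksctl get iamserviceaccount --cluster $CLUSTER_NAME --region $CLUSTER_REGION --namespace kube-system

# Remove the Kubernetes service account so the next run can recreate and re-annotate it.
kubectl delete serviceaccount -n kube-system efs-csi-controller-sa

# Delete the CloudFormation stack backing the role (name assumes eksctl's default convention).
aws cloudformation delete-stack \
  --stack-name eksctl-${CLUSTER_NAME}-addon-iamserviceaccount-kube-system-efs-csi-controller-sa \
  --region $CLUSTER_REGION
```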
+## 5.0 Known issues +1. When you re-run the `eksctl create iamserviceaccount` to create and annotate the same service account multiple times, sometimes the role does not get overwritten. In this case, you may need to do one or both of the following: + a. Delete the CloudFormation stack associated with this add-on role. + b. Delete the `efs-csi-controller-sa` service account and then re-run the required steps. If you used the auto-script, you can re-run it by specifying the same `filesystem-name` such that a new one is not created. diff --git a/docs/deployment/add-ons/storage/fsx-for-lustre/README.md b/docs/deployment/add-ons/storage/fsx-for-lustre/guide.md similarity index 73% rename from docs/deployment/add-ons/storage/fsx-for-lustre/README.md rename to docs/deployment/add-ons/storage/fsx-for-lustre/guide.md index bd55a360cc..534df0d825 100644 --- a/docs/deployment/add-ons/storage/fsx-for-lustre/README.md +++ b/docs/deployment/add-ons/storage/fsx-for-lustre/guide.md @@ -1,31 +1,37 @@ -# Using Amazon FSx as Persistent Storage with Kubeflow ++++ +title = "FSx for Lustre" +description = "Use Amazon FSx as persistent storage with Kubeflow on AWS" +weight = 20 ++++ This guide describes how to use Amazon FSx as Persistent storage on top of an existing Kubeflow deployment. ## 1.0 Prerequisites -1. For this README, we will assume that you already have an EKS Cluster with Kubeflow installed since the FSx CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. You can follow any of the other guides to complete these steps - choose one of the [AWS managed service integrated offering](../../../README.md#deployment-options) or [vanilla distribution](../../../vanilla/README.md). +For this guide, we assume that you already have an EKS Cluster with Kubeflow installed. The FSx CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. See the [deployment options](/docs/deployment/) and [general prerequisites](/docs/deployment/vanilla/guide/) for more information. -**Important :** -You must make sure you have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) for your cluster and that it was added from `eksctl` >= `0.56` or if you already have an OIDC provider in place, then you must make sure you have the tag `alpha.eksctl.io/cluster-name` with the cluster name as its value. If you don't have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags. +1. Check that you have the necessary [prerequisites](/docs/deployment/vanilla/guide/). -2. At this point, you have likely cloned this repo and checked out the right branch. Let's save this path to help us navigate to different paths in the rest of this doc - -``` +> Important: You must make sure you have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) for your cluster and that it was added from `eksctl` >= `0.56` or if you already have an OIDC provider in place, then you must make sure you have the tag `alpha.eksctl.io/cluster-name` with the cluster name as its value. If you don't have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags. + +2. At this point, you have likely cloned the necessary repository and checked out the right branch. Save this path to help us navigate to different paths in the rest of this guide. 
+```bash export GITHUB_ROOT=$(pwd) export GITHUB_STORAGE_DIR="$GITHUB_ROOT/docs/deployment/add-ons/storage/" ``` 3. Make sure the following environment variables are set. -``` +```bash export CLUSTER_NAME= export CLUSTER_REGION= ``` -4. Also, based on your setup, export the name of the user namespace you are planning to use - -``` +4. Based on your setup, export the name of the user namespace you are planning to use. +```bash export PVC_NAMESPACE=kubeflow-user-example-com ``` -5. And finally, choose a name for the fsx claim that we will create. In this guide we will use this variable as the name for the PV as well the PVC. -``` + +5. Choose a name for the FSx claim that we will create. In this guide, we will use this variable as the name for the PV as well the PVC. +```bash export CLAIM_NAME= ``` @@ -35,7 +41,7 @@ You can either use Automated or Manual setup. We currently only support **Static ### 2.1 [Option 1] Automated setup The script automates all the manual resource creation steps but is currently only available for **Static Provisioning** option. It performs the required cluster configuration, creates an FSx file system and it also takes care of creating a storage class for static provisioning. Once done, move to section 3.0. -1. Run the following commands from the `tests/e2e` directory as - +1. Run the following commands from the `tests/e2e` directory: ``` cd $GITHUB_ROOT/tests/e2e ``` @@ -69,13 +75,13 @@ The script applies some default values for the file system name, performance mod If you prefer to manually setup each component then you can follow this manual guide. #### 1. Install the FSx CSI Driver -We recommend installing the FSx CSI Driver v0.7.1 directly from the [the aws-fsx-csi-driver github repo](https://github.com/kubernetes-sigs/aws-fsx-csi-driver) as follows - +We recommend installing the FSx CSI Driver v0.7.1 directly from the [the aws-fsx-csi-driver GitHub repository](https://github.com/kubernetes-sigs/aws-fsx-csi-driver) as follows: ``` kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=tags/v0.7.1" ``` -You can confirm that FSx CSI Driver was installed using the following command - +You can confirm that FSx CSI Driver was installed using the following command: ``` kubectl get csidriver -A @@ -86,14 +92,14 @@ fsx.csi.aws.com false false Persistent 14s #### 2. Create the IAM Policy for the CSI Driver The CSI driver's service account (created during installation) requires IAM permission to make calls to AWS APIs on your behalf. Here, we will be annotating the Service Account `fsx-csi-controller-sa` with an IAM Role which has the required permissions. -1. Create the policy using the json file provided as follows - +1. Create the policy using the json file provided as follows: ``` aws iam create-policy \ --policy-name Amazon_FSx_Lustre_CSI_Driver \ --policy-document file://fsx-for-lustre/fsx-csi-driver-policy.json ``` -2. Create an IAM role and attach the IAM policy to it. Annotate the Kubernetes service account with the IAM role ARN and the IAM role with the Kubernetes service account name. You can create the role using eksctl as follows - +2. Create an IAM role and attach the IAM policy to it. Annotate the Kubernetes service account with the IAM role ARN and the IAM role with the Kubernetes service account name. You can create the role using eksctl as follows: ``` eksctl create iamserviceaccount \ @@ -106,17 +112,17 @@ eksctl create iamserviceaccount \ --override-existing-serviceaccounts ``` -3. 
You can verify by describing the specified service account to check if it has been correctly annotated - +3. You can verify by describing the specified service account to check if it has been correctly annotated: ``` kubectl describe -n kube-system serviceaccount fsx-csi-controller-sa ``` -#### 3. Create an Instance of the FSx Filesystem -Please refer to the official [AWS FSx CSI Document](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started-step1.html) for detailed instructions on creating an FSx filesystem. +#### 3. Create an instance of the FSx Filesystem +Please refer to the official [AWS FSx CSI documentation](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started-step1.html) for detailed instructions on creating an FSx filesystem. -Note: For this README, we have assumed that you are creating your FSx Filesystem in the same VPC as your EKS Cluster. +Note: For this guide, we assume that you are creating your FSx Filesystem in the same VPC as your EKS Cluster. -#### 4. Static Provisioning +#### 4. Static provisioning [Using this sample from official Kubeflow Docs](https://www.kubeflow.org/docs/distributions/aws/customizing-aws/storage/#amazon-fsx-for-lustre) 1. Use the AWS Console to get the filesystem id of the FSx volume you want to use. You could also use the following command to list all the volumes available in your region. Either way, make sure that `file_system_id` is set. @@ -155,7 +161,7 @@ kubectl apply -f $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pv.yaml kubectl apply -f $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml ``` -### 2.3 Check your Setup +### 2.3 Check your setup Use the following commands to ensure all resources have been deployed as expected and the PersistentVolume is correctly bound to the PersistentVolumeClaim ``` kubectl get pv @@ -171,14 +177,14 @@ NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE fsx-claim Bound fsx-pv 1200Gi RWX 83s ``` -## 3.0 Using FSx Storage in Kubeflow +## 3.0 Using FSx storage in Kubeflow In the following two sections we will be using this PVC to create a notebook server with Amazon FSx mounted as the workspace volume, download training data into this filesystem and then deploy a TFJob to train a model using this data. -### 3.1 Connect to the Kubeflow Dashboard +### 3.1 Connect to the Kubeflow dashboard Once you have everything setup, Port Forward as needed and Login to the Kubeflow dashboard. At this point, you can also check the `Volumes` tab in Kubeflow and you should be able to see your PVC is available for use within Kubeflow. -For more details on how to access your Kubeflow dashboard, refer to one of the deployment READMEs based on your setup. If you used the vanilla deployment, you can follow this [README](https://github.com/awslabs/kubeflow-manifests/tree/main/docs/deployment/vanilla#connect-to-your-kubeflow-cluster). +For more details on how to access your Kubeflow dashboard, refer to one of the [deployment option guides](/docs/deployment/) based on your setup. If you used the vanilla deployment, see [Connect to your Kubeflow cluster](/docs/deployment/install/vanilla/guide/#connect-to-your-kubeflow-cluster). -### 3.2 Note about Permissions +### 3.2 Note about permissions This step may not be necessary but you might need to specify some additional directory permissions on your worker node before you can use these as mount points. By default, new Amazon FSx file systems are owned by root:root, and only the root user (UID 0) has read-write-execute permissions. 
If your containers are not running as root, you must change the Amazon FSx file system permissions to allow other users to modify the file system. The set-permission-job.yaml is an example of how you could set these permissions to be able to use the fsx as your workspace in your kubeflow notebook. Modify it accordingly if you run into similar permission issues with any other job pod. ``` @@ -190,9 +196,9 @@ kubectl apply -f $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml ``` ### 3.2 Using FSx volume as workspace or data volume for a notebook server -Spin up a new Kubeflow notebook server and specify the name of the PVC to be used as the workspace volume or the data volume and specify your desired mount point. For our example here, we are using the AWS Optimized Tensorflow 2.6 CPU image provided in the notebook configuration options - **`public.ecr.aws/c9e4w0g3/notebook-servers/jupyter-tensorflow`**. Additionally, use the existing PVC as the workspace volume at the default mount point `/home/jovyan`. The server might take a few minutes to come up. +Spin up a new Kubeflow notebook server and specify the name of the PVC to be used as the workspace volume or the data volume and specify your desired mount point. For our example here, we are using the AWS-optimized Tensorflow 2.6 CPU image provided in the Notebook configuration options (`public.ecr.aws/c9e4w0g3/notebook-servers/jupyter-tensorflow`). Additionally, use the existing PVC as the workspace volume at the default mount point `/home/jovyan`. The server might take a few minutes to come up. -In case the server does not start up in the expected time, do make sure to check - +In case the server does not start up in the expected time, do make sure to check: 1. The Notebook Controller Logs 2. The specific notebook server instance pod's logs @@ -214,8 +220,8 @@ data_dir = tf.keras.utils.get_file(origin=dataset_url, data_dir = pathlib.Path(data_dir) ``` -### 2. Build and Push the Docker image -In the `training-sample` directory, we have provided a sample training script and Dockerfile which you can use as follows to build a docker image. Be sure to point the `$IMAGE_URI` to your registry and specify an appropriate tag - +### 2. Build and push the Docker image +In the `training-sample` directory, we have provided a sample training script and Dockerfile which you can use as follows to build a docker image. Be sure to point the `$IMAGE_URI` to your registry and specify an appropriate tag: ``` export IMAGE_URI= cd training-sample @@ -226,35 +232,35 @@ docker push $IMAGE_URI cd - ``` -### 3. Configure the tfjob spec file +### 3. Configure the TFjob spec file Once the docker image is built, replace the `` in the `tfjob.yaml` file, line #17. ``` yq e '.spec.tfReplicaSpecs.Worker.template.spec.containers[0].image = env(IMAGE_URI)' -i training-sample/tfjob.yaml ``` -Also, specify the name of the PVC you created - +Also, specify the name of the PVC you created: ``` export CLAIM_NAME=fsx-claim yq e '.spec.tfReplicaSpecs.Worker.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i training-sample/tfjob.yaml ``` -Make sure to run it in the same namespace as the claim - +Make sure to run it in the same namespace as the claim: ``` yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i training-sample/tfjob.yaml ``` ### 4. 
Create the TFjob and use the provided commands to check the training logs -At this point, we are ready to train the model using the `training-sample/training.py` script and the data available on the shared volume with the Kubeflow TFJob operator as - +At this point, we are ready to train the model using the `training-sample/training.py` script and the data available on the shared volume with the Kubeflow TFJob operator. ``` kubectl apply -f training-sample/tfjob.yaml ``` -In order to check that the training job is running as expected, you can check the events in the TFJob describe response as well as the job logs as - +In order to check that the training job is running as expected, you can check the events in the TFJob describe response as well as the job logs. ``` kubectl describe tfjob image-classification-pvc -n $PVC_NAMESPACE kubectl logs -n $PVC_NAMESPACE image-classification-pvc-worker-0 -f ``` ## 4.0 Cleanup -This section cleans up the resources created in this README, to cleanup other resources such as the Kubeflow deployment, please refer to the high level README files. +This section cleans up the resources created in this guide. To clean up other resources, such as the Kubeflow deployment, see [Uninstall Kubeflow](/docs/deployment/uninstall-kubeflow/). ### 4.1 Clean up the TFJob ``` @@ -262,7 +268,7 @@ kubectl delete tfjob -n $PVC_NAMESPACE image-classification-pvc ``` ### 4.2 Delete the Kubeflow Notebook -Login to the dashboard to stop and/or terminate any kubeflow notebooks you created for this session or use the following command - +Log in to the dashboard to stop and/or terminate any Kubeflow Notebooks that you created for this session or use the following commands: ``` kubectl delete notebook -n $PVC_NAMESPACE ``` @@ -270,7 +276,7 @@ kubectl delete notebook -n $PVC_NAMESPACE kubectl delete pod -n $PVC_NAMESPACE $CLAIM_NAME ``` -### 4.3 Delete PVC, PV and SC in the following order +### 4.3 Delete PVC, PV, and SC in the following order ``` kubectl delete pvc -n $PVC_NAMESPACE $CLAIM_NAME kubectl delete pv fsx-pv @@ -280,11 +286,12 @@ kubectl delete pv fsx-pv ``` aws fsx delete-file-system --file-system-id $file_system_id ``` -Make sure to delete any other resources you have created such as security groups via the AWS Console or using awscli. +Make sure to delete any other resources that you have created such as security groups via the AWS Console or using the AWS CLI. -## 5.0 Known Issues: - - When you rerun the `eksctl create iamserviceaccount` to create and annotate the same service account multiple times, sometimes the role does not get overwritten. In such a case you may need to do one or both of the following - -1. Delete the cloudformation stack associated with this add-on role. -2. Delete the `fsx-csi-controller-sa` service account and then re-run the required steps. If you used the auto-script, you can rerun it by specifying the same `filesystem-name` such that a new one is not created. +## 5.0 Known issues + + When you re-run the `eksctl create iamserviceaccount` to create and annotate the same service account multiple times, sometimes the role does not get overwritten. In this case, you may need to do one or both of the following: + 1. Delete the CloudFormation stack associated with this add-on role. + 2. Delete the `fsx-csi-controller-sa` service account and then re-run the required steps. If you used the auto-script, you can re-run it by specifying the same `filesystem-name` so that a new one is not created. 
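If you hit the same stale-role issue with the FSx CSI driver's service account, `eksctl` can usually remove the service account and its CloudFormation stack in one step, which is a simpler reset than deleting each piece by hand. A minimal sketch, assuming the default names used in this guide:
```bash
# Delete the managed service account together with the CloudFormation stack that eksctl created for it,
# then re-run the eksctl create iamserviceaccount step from the manual setup section above.
eksctl delete iamserviceaccount \
  --name fsx-csi-controller-sa \
  --namespace kube-system \
  --cluster $CLUSTER_NAME \
  --region $CLUSTER_REGION
```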
- - When using an FSx volume in a kubeflow notebook, the same PVC claim can be mounted to the same notebook only once as either the workspace volume or the data volume. Create two seperate PVCs on your FSx volume if you need to attach it twice to the notebook. \ No newline at end of file + When using an FSx volume in a Kubeflow Notebook, the same PVC claim can be mounted to the same Notebook only once as either the workspace volume or the data volume. Create two separate PVCs on your FSx volume if you need to attach it twice to the Notebook. \ No newline at end of file diff --git a/docs/deployment/cognito-rds-s3/README.md b/docs/deployment/cognito-rds-s3/guide.md similarity index 73% rename from docs/deployment/cognito-rds-s3/README.md rename to docs/deployment/cognito-rds-s3/guide.md index 7598fc921a..e63c5348b5 100644 --- a/docs/deployment/cognito-rds-s3/README.md +++ b/docs/deployment/cognito-rds-s3/guide.md @@ -1,31 +1,35 @@ -# Deploying Kubeflow with Amazon Cognito as idP, RDS and S3 ++++ +title = "Cognito, RDS, and S3" +description = "Deploying Kubeflow with Amazon Cognito, RDS and S3" +weight = 50 ++++ -This guide describes how to deploy Kubeflow on AWS EKS using Cognito as identity provider, RDS for database and S3 for artifact storage. +This guide describes how to deploy Kubeflow on Amazon EKS using Cognito for your identity provider, RDS for your database, and S3 for your artifact storage. ## 1. Prerequisites -Follow the pre-requisites section from [this guide](../prerequisites.md) and setup RDS & S3 from [this guide](../rds-s3/README.md#20-setup-rds-s3-and-configure-secrets) to: 1. Install the CLI tools -1. Clone the repo -1. Create an EKS cluster and -1. Create S3 Bucket -1. Create RDS Instance -1. Configure AWS Secrets for RDS and S3 -1. Install AWS Secrets and Kubernetes Secrets Store CSI driver -1. Configure RDS endpoint and S3 bucket name for Kubeflow Pipelines +Refer to the [general prerequisites guide](/docs/deployment/prerequisites/) and the [RDS and S3 setup guide](/docs/deployment/rds-s3/guide/) in order to: 1. Install the CLI tools +2. Clone the repositories +3. Create an EKS cluster +4. Create an S3 Bucket +5. Create an RDS Instance +6. Configure AWS Secrets for RDS and S3 +7. Install AWS Secrets and Kubernetes Secrets Store CSI driver +8. Configure an RDS endpoint and an S3 bucket name for Kubeflow Pipelines ## Configure Custom Domain and Cognito -1. Follow the [cognito guide](../cognito/README.md) from [section 1.0(Custom Domain)](../cognito/README.md#10-custom-domain-and-certificates) upto [section 3.0(Configure Ingress)](../cognito/README.md#30-configure-ingress) to: +1. Follow the [Cognito setup guide](/docs/deployment/cognito/guide/) from [Section 1.0 (Custom domain)](/docs/deployment/cognito/guide/#10-custom-domain-and-certificates) up to [Section 3.0 (Configure ingress)](/docs/deployment/cognito/guide/#30-configure-ingress) in order to: 1. Create a custom domain 1. Create TLS certificates for the domain 1. Create a Cognito Userpool 1. Configure Ingress 2. Deploy Kubeflow. Choose one of the two options to deploy kubeflow: - 1. **[Option 1]** Install with a single command + 1. **[Option 1]** Install with a single command: ``` while ! kustomize build docs/deployment/cognito-rds-s3 | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done ``` - 1. **[Option 2]** Install individual components + 1.
**[Option 2]** Install individual components: ``` # Kubeflow namespace kustomize build upstream/common/kubeflow-namespace/base | kubectl apply -f - @@ -79,7 +83,7 @@ Follow the pre-requisites section from [this guide](../prerequisites.md) and set # Training Operator kustomize build upstream/apps/training-operator/upstream/overlays/kubeflow | kubectl apply -f - - # AWS Telemetry - This is an optional component. See usage tracking documentation for more information + # AWS Telemetry - This is an optional component. See usage tracking documentation for more information. kustomize build awsconfigs/common/aws-telemetry | kubectl apply -f - # AWS Secret Manager @@ -102,8 +106,8 @@ Follow the pre-requisites section from [this guide](../prerequisites.md) and set # Authservice kustomize build awsconfigs/common/aws-authservice/base | kubectl apply -f - ``` -1. Follow the rest of the cognito guide from [section 5.0(Updating the domain with ALB address)](../cognito/README.md#50-updating-the-domain-with-ALB-address) to: - 1. Add/Update the DNS records in custom domain with the ALB address - 1. Create a user in Cognito user pool +1. Follow the rest of the Cognito guide from [section 5.0 (Updating the domain with ALB address)](/docs/deployment/cognito/guide/#50-updating-the-domain-with-ALB-address) in order to: + 1. Add/Update the DNS records in a custom domain with the ALB address + 1. Create a user in a Cognito user pool 1. Create a profile for the user from the user pool 1. Connect to the central dashboard diff --git a/docs/deployment/cognito/README.md b/docs/deployment/cognito/guide.md similarity index 80% rename from docs/deployment/cognito/README.md rename to docs/deployment/cognito/guide.md index 6f095a9175..1099a972fe 100644 --- a/docs/deployment/cognito/README.md +++ b/docs/deployment/cognito/guide.md @@ -1,18 +1,26 @@ -# Deploying Kubeflow with AWS Cognito as idP - -This guide describes how to deploy Kubeflow on AWS EKS using Cognito as identity provider. Kubeflow uses Istio to manage internal traffic. In this guide we will be creating an Ingress to manage external traffic to the Kubernetes services and an Application Load Balancer(ALB) to provide public DNS and enable TLS authentication at the load balancer. We will also be creating a custom domain to host Kubeflow since certificates(needed for TLS) for ALB's public DNS names are not supported. ++++ +title = "Cognito" +description = "Deploying Kubeflow with AWS Cognito as identity provider" +weight = 30 ++++ + +This guide describes how to deploy Kubeflow on Amazon EKS using Cognito as your identity provider. Kubeflow uses Istio to manage internal traffic. In this guide, we will: +- create an Ingress to manage external traffic to the Kubernetes services +- create an Application Load Balancer (ALB) to provide public DNS +- enable TLS authentication for the Load Balancer +- create a custom domain to host Kubeflow (because the certificates needed for TLS are not supported for ALB's public DNS names) ## Prerequisites -Follow the pre-requisites section from [this guide](../prerequisites.md) +Check to make sure that you have the necessary [prerequisites](/docs/deployment/prerequisites/). ## Background -Read the [background section](../add-ons/load-balancer/README.md#background) in the load balancer guide for information on the requirements for exposing Kubeflow over a load balancer. 
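Once the DNS records for your subdomain point at the ALB (as described in the Cognito sections of these guides), a quick sanity check before trying to log in is to resolve the Kubeflow host yourself. This is a sketch using the example domain from these guides; substitute the subdomain you configured.
```bash
# Check that the Kubeflow host resolves to the ALB's DNS name.
# Replace platform.example.com with your own subdomain.
nslookup kubeflow.platform.example.com
```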
+Read the [background section](/docs/deployment/add-ons/load-balancer/guide/#background) in the Load Balancer guide for information on the requirements for exposing Kubeflow over a Load Balancer. -Read the [create domain and cerificate section](../add-ons/load-balancer/README.md#create-domain-and-certificates) for information on why we use subdomain for hosting Kubeflow. +Read the [create domain and certificate section](/docs/deployment/add-ons/load-balancer/guide/#create-domain-and-certificates) for information on why we use a subdomain for hosting Kubeflow. ## (Optional) Automated setup -The rest of the sections in this guide walks you through each step for setting up domain, certificates and Cognito userpool using AWS Console and is good for a new user to understand the design and details. If you prefer to use automated scripts and avoid human error for setting up the resources for deploying Kubeflow with Cognito, follow this [README](./README-automated.md) instead. +The rest of the sections in this guide walk you through each step for setting up domain, certificates, and a Cognito userpool using the AWS Console. This guide is intended for a new user to understand the design and details of these setup steps. If you prefer to use automated scripts and avoid human error for setting up the resources for deploying Kubeflow with Cognito, follow the [automated setup guide](https://github.com/awslabs/kubeflow-manifests/blob/main/docs/deployment/cognito/README-automated.md). ## 1.0 Custom domain and certificates @@ -20,6 +28,7 @@ The rest of the sections in this guide walks you through each step for setting u 1. Follow the [Create certificates for domain](../add-ons/load-balancer/README.md#create-certificates-for-domain) section of the load balancer guide to create certificates required for TLS. From this point onwards, we will be creating/updating the DNS records **only in the subdomain**. All the screenshots of hosted zone in the following sections/steps of this guide are for the subdomain. + ## 2.0 Cognito User Pool 1. Create a user pool in Cognito in the same region as your EKS cluster. Type a pool name and choose `Review defaults`. @@ -51,16 +60,16 @@ From this point onwards, we will be creating/updating the DNS records **only in ## 3.0 Configure Ingress -1. Take note of the following values from the previous step or `awsconfigs/infra_configs/scripts/config.yaml` if you used automated guide(./README-automated.md): +1. Take note of the following values from the previous step or `awsconfigs/infra_configs/scripts/config.yaml` if you used automated guide(https://github.com/awslabs/kubeflow-manifests/blob/main/docs/deployment/cognito/README-automated.md): 1. The Pool ARN of the user pool found in Cognito general settings. 1. The App client id, found in Cognito App clients. 1. The custom user pool domain (e.g. `auth.platform.example.com`), found in the Cognito domain name. - 1. The ARN of the certificate from the Certificate Manager in the region where your platform (for the subdomain) in the region where your platform is running. - 1. signOutURL is the domain which you provided as the Sign out URL(s). - 1. CognitoLogoutURL is comprised of your CognitoUserPoolDomain, CognitoAppClientId, and your domain which you provided as the Sign out URL(s). + 1. The ARN of the certificate from the Certificate Manager in the region where your platform (for the subdomain) is running. + 1. signOutURL is the domain that you provided as the Sign out URL(s). + 1. 
CognitoLogoutURL is comprised of your CognitoUserPoolDomain, CognitoAppClientId, and your domain that you provided as the Sign out URL(s). 1. Export the values: 1. - ``` + ```bash export CognitoUserPoolArn="" export CognitoAppClientId="" export CognitoUserPoolDomain="" @@ -69,7 +78,7 @@ From this point onwards, we will be creating/updating the DNS records **only in export CognitoLogoutURL="https://$CognitoUserPoolDomain/logout?client_id=$CognitoAppClientId&logout_uri=$signOutURL" ``` 1. Substitute values for setting up Ingress. - 1. ``` + 1. ```bash printf ' CognitoUserPoolArn='$CognitoUserPoolArn' CognitoAppClientId='$CognitoAppClientId' @@ -78,21 +87,22 @@ From this point onwards, we will be creating/updating the DNS records **only in ' > awsconfigs/common/istio-ingress/overlays/cognito/params.env ``` 1. Substitute values for setting up AWS authservice. - 1. ``` + 1. ```bash printf ' LOGOUT_URL='$CognitoLogoutURL' ' > awsconfigs/common/aws-authservice/base/params.env ``` 1. Follow the [Configure Load Balancer Controller](../add-ons/load-balancer/README.md#configure-load-balancer-controller) section of the load balancer guide to setup the resources required the load balancer controller. + ## 4.0 Building manifests and deploying Kubeflow -1. Deploy Kubeflow. Choose one of the two options to deploy kubeflow: - 1. **[Option 1]** Install with a single command - ``` +1. Choose one of the two options to deploy kubeflow: + 1. **[Option 1]** Install with a single command: + ```bash while ! kustomize build docs/deployment/cognito | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done ``` - 1. **[Option 2]** Install individual components - ``` + 1. **[Option 2]** Install individual components: + ```bash # Kubeflow namespace kustomize build common/kubeflow-namespace/base | kubectl apply -f - @@ -167,14 +177,14 @@ From this point onwards, we will be creating/updating the DNS records **only in ## 5.0 Updating the domain with ALB address -1. Check if ALB is provisioned. It takes around 3-5 minutes - 1. ``` +1. Check if ALB is provisioned. This may take a few minutes. + 1. ```bash kubectl get ingress -n istio-system Warning: extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress NAME CLASS HOSTS ADDRESS PORTS AGE istio-ingress * ebde55ee-istiosystem-istio-2af2-1100502020.us-west-2.elb.amazonaws.com 80 15d ``` - 2. If `ADDRESS` is empty after a few minutes, check the logs of alb-ingress-controller by following [this guide](https://www.kubeflow.org/docs/distributions/aws/troubleshooting-aws/#alb-fails-to-provision) + 2. If `ADDRESS` is empty after a few minutes, see [ALB fails to provision](/docs/troubleshooting-aws/#alb-fails-to-provision) in the troubleshooting guide. 1. When ALB is ready, copy the DNS name of that load balancer and create a CNAME entry to it in Route53 under subdomain (`platform.example.com`) for `*.platform.example.com` 1. ![subdomain-*.platform-and-*.default-records](./images/subdomain-*.platform-and-*.default-records.png) 1. Update the type `A` record created in section for `platform.example.com` using ALB DNS name. Change from `127.0.0.1` → ALB DNS name. You have to use alias form under `Alias to application and classical load balancer` and select region and your ALB address. @@ -182,13 +192,13 @@ From this point onwards, we will be creating/updating the DNS records **only in 1. Screenshot of all the record sets in hosted zone for reference 1. 
![subdomain-records-summary](./images/subdomain-records-summary.png) -## 6.0 Connecting to Central dashboard +## 6.0 Connecting to central dashboard 1. The central dashboard should now be available at [https://kubeflow.platform.example.com](https://kubeflow.platform.example.com/). Before connecting to the dashboard: - 1. Head over to the Cognito console and create some users in `Users and groups`. These are the users who will login to the central dashboard. + 1. Head over to the Cognito console and create some users in `Users and groups`. These are the users who will log in to the central dashboard. 1. ![cognito-user-pool-created](./images/cognito-user-pool-created.png) - 1. Create a profile for a user created in previous step by [following this guide](https://www.kubeflow.org/docs/components/multi-tenancy/getting-started/#manual-profile-creation). Following is a sample profile for reference: - 1. ``` + 1. Create a Profile for a user by following the steps in the [Manual Profile Creation](https://www.kubeflow.org/docs/components/multi-tenancy/getting-started/#manual-profile-creation). The following is an example Profile for reference: + 1. ```bash apiVersion: kubeflow.org/v1beta1 kind: Profile metadata: @@ -201,5 +211,5 @@ From this point onwards, we will be creating/updating the DNS records **only in # replace with the email of the user name: my_user_email@kubeflow.com ``` -1. Open the central dashboard at [https://kubeflow.platform.example.com](https://kubeflow.platform.example.com/). It will redirect to Cognito for login. Use the credentials of the user for which profile was created in previous step. +1. Open the central dashboard at [https://kubeflow.platform.example.com](https://kubeflow.platform.example.com/). It will redirect to Cognito for login. Use the credentials of the user that you just created a Profile for in previous step. diff --git a/docs/deployment/prerequisites.md b/docs/deployment/prerequisites.md index fe828e5c68..3865bfc21b 100644 --- a/docs/deployment/prerequisites.md +++ b/docs/deployment/prerequisites.md @@ -1,53 +1,60 @@ -# Prerequisites ++++ +title = "Prerequisites" +description = "Everything you need to get started with Kubeflow on AWS" +weight = 20 ++++ -This guide assumes that you have: - -1. Installed the following tools on the client machine - - [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) - A command line tool for interacting with AWS services. - - [eksctl](https://eksctl.io/introduction/#installation) - A command line tool for working with EKS clusters. - - [kubectl](https://kubernetes.io/docs/tasks/tools) - A command line tool for working with Kubernetes clusters. - - [yq](https://mikefarah.gitbook.io/yq) - A command line tool for YAML processing. (For Linux environments, use the [wget plain binary installation](https://github.com/mikefarah/yq/#install)) - - [jq](https://stedolan.github.io/jq/download/) - A command line tool for processing JSON. - - [kustomize version 3.2.0](https://github.com/kubernetes-sigs/kustomize/releases/tag/v3.2.0) - A command line tool to customize Kubernetes objects through a kustomization file. - - :warning: Kubeflow is not compatible with the latest versions of of kustomize 4.x. This is due to changes in the order resources are sorted and printed. Please see [kubernetes-sigs/kustomize#3794](https://github.com/kubernetes-sigs/kustomize/issues/3794) and [kubeflow/manifests#1797](https://github.com/kubeflow/manifests/issues/1797). 
We know this is not ideal and are working with the upstream kustomize team to add support for the latest versions of kustomize as soon as we can. - [python](https://www.python.org/downloads/) - A programming language used for automated installation scripts. - [pip](https://pip.pypa.io/en/stable/installation/) - A package installer for python. +## Install the necessary tools +- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) - A command line tool for interacting with AWS services. +- [eksctl](https://eksctl.io/introduction/#installation) - A command line tool for working with EKS clusters. +- [kubectl](https://kubernetes.io/docs/tasks/tools) - A command line tool for working with Kubernetes clusters. +- [yq](https://mikefarah.gitbook.io/yq) - A command line tool for YAML processing. (For Linux environments, use the [wget plain binary installation](https://github.com/mikefarah/yq/#install)) +- [jq](https://stedolan.github.io/jq/download/) - A command line tool for processing JSON. +- [kustomize version 3.2.0](https://github.com/kubernetes-sigs/kustomize/releases/tag/v3.2.0) - A command line tool to customize Kubernetes objects through a kustomization file. +> Warning: Kubeflow is not compatible with the latest versions of kustomize 4.x. This is due to changes in the order that resources are sorted and printed. Please see [kubernetes-sigs/kustomize#3794](https://github.com/kubernetes-sigs/kustomize/issues/3794) and [kubeflow/manifests#1797](https://github.com/kubeflow/manifests/issues/1797). We know that this is not ideal and are working with the upstream kustomize team to add support for the latest versions of kustomize as soon as we can. +- [python](https://www.python.org/downloads/) - A programming language used for automated installation scripts. +- [pip](https://pip.pypa.io/en/stable/installation/) - A package installer for python. -1. Creating an EKS cluster - - If you do not have an existing cluster, run the following command to create an EKS cluster. More details about cluster creation via `eksctl` can be found [here](https://eksctl.io/usage/creating-and-managing-clusters/). - - Various controllers in Kubeflow deployment use IAM roles for service accounts(IRSA). An OIDC provider must exist for your cluster to use IRSA. - - Substitute values for the `CLUSTER_NAME` and `CLUSTER_REGION` in the script below - ``` - export CLUSTER_NAME=$CLUSTER_NAME - export CLUSTER_REGION=$CLUSTER_REGION - ``` +## Create an EKS cluster +If you do not have an existing cluster, run the following command to create an EKS cluster. + +> Note: Various controllers use IAM roles for service accounts (IRSA). An [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) must exist for your cluster to use IRSA.
+ +Change the values for the `CLUSTER_NAME` and `CLUSTER_REGION` environment variables: +```bash +export CLUSTER_NAME=$CLUSTER_NAME +export CLUSTER_REGION=$CLUSTER_REGION +``` + +Run the following command to create an EKS cluster: +```bash +eksctl create cluster \ +--name ${CLUSTER_NAME} \ +--version 1.20 \ +--region ${CLUSTER_REGION} \ +--nodegroup-name linux-nodes \ +--node-type m5.xlarge \ +--nodes 5 \ +--nodes-min 5 \ +--nodes-max 10 \ +--managed \ +--with-oidc +``` +If you are using an existing EKS cluster, create an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) and associate it with your EKS cluster by running the following command: +```bash +eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \ +--region ${CLUSTER_REGION} --approve +``` +More details about cluster creation via `eksctl` can be found in the [Creating and managing clusters](https://eksctl.io/usage/creating-and-managing-clusters/) guide. - - Run the following command to create an EKS Cluster - ``` - eksctl create cluster \ - --name ${CLUSTER_NAME} \ - --version 1.20 \ - --region ${CLUSTER_REGION} \ - --nodegroup-name linux-nodes \ - --node-type m5.xlarge \ - --nodes 5 \ - --nodes-min 5 \ - --nodes-max 10 \ - --managed \ - --with-oidc - ``` - - **Note:** If you are using an existing cluster, Create an OIDC provider and associate it with for your EKS cluster by running the following command: - ``` - eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \ - --region ${CLUSTER_REGION} --approve - ``` +## Clone the repository +Clone the [`awslabs/kubeflow-manifest`](https://github.com/awslabs/kubeflow-manifests) and the [`kubeflow/manifests`](https://github.com/kubeflow/manifests) repositories and check out the release branches of your choosing. -1. Clone the `awslabs/kubeflow-manifest` repo, `kubeflow/manifests` repo and checkout the release branches. - - Substitute the value for `KUBEFLOW_RELEASE_VERSION`(e.g. v1.4.1) and `AWS_RELEASE_VERSION`(e.g. v1.4.1-aws-b1.0.0) with the tag or branch you want to use below. Read more about [releases and versioning](../../community/releases.md#releases-and-versioning) policy if you are unsure about what these values should be. - ``` - export KUBEFLOW_RELEASE_VERSION=<> - export AWS_RELEASE_VERSION=<> - git clone https://github.com/awslabs/kubeflow-manifests.git && cd kubeflow-manifests - git checkout ${AWS_RELEASE_VERSION} - git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream - ``` +Substitute the values for `KUBEFLOW_RELEASE_VERSION` (e.g. v1.4.1) and `AWS_RELEASE_VERSION` (e.g. v1.4.1-aws-b1.0.0) with the tag or branch you want to use below. Read more about [releases and versioning](/docs/about/releases/) if you are unsure about what these values should be. +```bash +export KUBEFLOW_RELEASE_VERSION=<> +export AWS_RELEASE_VERSION=<> +git clone https://github.com/awslabs/kubeflow-manifests.git && cd kubeflow-manifests +git checkout ${AWS_RELEASE_VERSION} +git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream +``` \ No newline at end of file diff --git a/docs/deployment/rds-s3/README.md b/docs/deployment/rds-s3/README.md deleted file mode 100644 index 9d41b61403..0000000000 --- a/docs/deployment/rds-s3/README.md +++ /dev/null @@ -1,379 +0,0 @@ -# Kustomize Manifests for RDS and S3 - -## Overview - -This Kustomize Manifest can be used to deploy Kubeflow Pipelines (KFP) and Katib with RDS and S3.
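Before moving past the prerequisites above, it can be worth confirming that the cluster is reachable and that an IAM OIDC provider is associated with it, since IRSA is assumed throughout the rest of these guides. A minimal sketch using standard AWS CLI checks; if the `grep` prints nothing, no provider is associated yet and the `eksctl` command above still needs to be run.
```bash
# Print the cluster's OIDC issuer ID and check whether a matching IAM OIDC provider already exists.
oidc_id=$(aws eks describe-cluster --name ${CLUSTER_NAME} --region ${CLUSTER_REGION} \
  --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
aws iam list-open-id-connect-providers | grep $oidc_id

# Confirm that kubectl is pointed at the cluster and that the worker nodes are Ready.
aws eks update-kubeconfig --name ${CLUSTER_NAME} --region ${CLUSTER_REGION}
kubectl get nodes
```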
- -### RDS - -[Amazon Relational Database Service (RDS)](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html) is a managed relational database service that facilitates several database management tasks such as database scaling, database backups, database software patching, OS patching, and more. - -In the [default kubeflow installation](../../../docs/deployment/vanilla/kustomization.yaml), the [KFP](https://github.com/kubeflow/manifests/blob/v1.4-branch/apps/katib/upstream/components/mysql/mysql.yaml) and [Katib](https://github.com/kubeflow/manifests/blob/v1.4-branch/apps/pipeline/upstream/third-party/mysql/base/mysql-deployment.yaml) components both use their own MySQL pod to persist KFP data (such as experiments, pipelines, jobs, etc.) and Katib experiment observation logs, respectively. - -As compared to using the MySQL setup in the default installation, using RDS provides the following advantages: -- Easier to configure availability: RDS provides high availability and failover support for DB instances using Multi Availability Zone (Mulit-AZ) deployments with a single standby DB instance, increasing the availability of KFP and Katib services during unexpected network events -- Easy to configure scalability: While the default Kubeflow installation uses a EBS hosted Peristent Volume Claim that is AZ bound and does not support automatic online resizing, RDS can be configured to handle availability and scaling needs -- KFP and Katib data can be persisted beyond single Kubeflow installations: Using RDS decouples the KFP and Katib datastores from the Kubeflow deployment, allowing multiple Kubeflow installations to reuse the same RDS instance provided that the KFP component versions stores data in a format that is compatible with each other. -- Higher level of customizability and management: RDS provides management features to facilitate changing database instance types, updating SQL versions, and more. - -### S3 -[Amazon Simple Storage Service (S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is an object storage service that is highly scalable, available, secure, and performant. - -In the [default kubeflow installation](../../../docs/deployment/vanilla/kustomization.yaml), the [KFP](https://github.com/kubeflow/manifests/blob/v1.4-branch/apps/pipeline/upstream/third-party/minio/base/minio-deployment.yaml) component uses the MinIO object storage service that can be configured to store objects in S3. However, by default the installation hosts the object store service locally in the cluster. KFP stores data such as pipeline architectures and pipeline run artifacts in MinIO. - -Configuring MinIO to read and write to S3 provides the following advantages: -- Higher scalability and availability: S3 offers industry-leading scalability and availability and is more durable than the default MinIO object storage solution provided by Kubeflow. -- KFP artifacts can be persisted beyond single Kubeflow installations: Using S3 decouples the KFP artifact store from the Kubeflow deployment, allowing multiple Kubeflow installations to access the same artifacts provided that the KFP component versions stores data in a format that is compatible with each other. 
-- Higher level of customizability and management: S3 provides management features to help optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements - -To get started with configuring and installing your Kubeflow installation with RDS and S3 follow the [install](#install) steps below to configure and deploy the Kustomize manifest. - -## Install - -The following steps show how to configure and deploy Kubeflow with supported AWS services. - -### Using only RDS or only S3 - -Steps relevant only to the RDS installation will be prefixed with `[RDS]`. - -Steps relevant only to the S3 installation will be prefixed with `[S3]`. - -Steps without any prefixing are necessary for all installations. - -To install for either only RDS or S3 complete the steps relevant to your installation choice. - -To install for both RDS and S3 complete all the below steps. - -## 1.0 Prerequisites -Follow the pre-requisites section from [this guide](../prerequisites.md) -1. Verify that your are in the root of this repository by running the pwd command. The path should be - ``` - pwd - ``` - -4. Create an OIDC provider for your cluster - **Important :** - You must make sure you have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) for your cluster and that it was added from `eksctl` >= `0.56` or if you already have an OIDC provider in place, then you must make sure you have the tag `alpha.eksctl.io/cluster-name` with the cluster name as its value. If you don't have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags. - -## 2.0 Setup RDS, S3 and configure Secrets - -There are two ways to create the RDS and S3 resources before you deploy the Kubeflow manifests. Either use the automated python script we have provided by following the steps in section 2.1 or fall back on the manual setup steps as in section 2.2 below - - -### 2.1 **Option 1: Automated Setup** - -This setup performs all the manual steps in an automated fashion. -The script takes care of creating the S3 bucket, creating the S3 secrets using the secrets manager, setting up the RDS database and creating the RDS secret using the secrets manager. It also edits the required configuration files for Kubeflow pipeline to be properly configured for the RDS database during Kubeflow installation. -The script also handles cases where the resources already exist in which case it will simply skips the step. - -Note : The script will **not** delete any resource therefore if a resource already exists (eg: secret, database with the same name or S3 bucket etc), **it will skip the creation of those resources and use the existing resources instead**. This was done in order to prevent unwanted results such as accidental deletion. For instance, if a database with the same name already exists, the script will skip the database creation setup. If it's expected in your scenario, then perhaps this is fine for you, if you simply forgot to change the database name used for creation then this gives you the chance to retry the script with the proper value. See `python auto-rds-s3-setup.py --help` for the list of parameters as well as their default values. - -1. Navigate to the Navigate to `tests/e2e` directory -``` -cd tests/e2e -``` -2. Install the script dependencies `pip install -r requirements.txt` -3. 
[Create an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html#id_users_create_cliwpsapi) with permissions to get bucket location and allow read and write access to objects in an S3 bucket where you want to store the Kubeflow artifacts. Use the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` of the IAM user below: -4. Export values for `CLUSTER_REGION`, `CLUSTER_NAME`, `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` then run the script -``` -export CLUSTER_REGION= -export CLUSTER_NAME= -export S3_BUCKET= -export AWS_ACCESS_KEY_ID= -export AWS_SECRET_ACCESS_KEY= - -PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --bucket $S3_BUCKET --s3_aws_access_key_id $AWS_ACCESS_KEY_ID --s3_aws_secret_access_key $AWS_SECRET_ACCESS_KEY -``` - -### Advanced customization - -The script applies some sensible default values for the db user password, max storage, storage type, instance type etc but if you know what you are doing, you can always tweak those preferences by passing different values. -You can learn more about the different parameters by running `PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-setup.py --help`. - -### 2.2 **Option 2: Manual Setup** -If you prefer to manually setup each components then you can follow this manual guide. -1. [S3] Create an S3 Bucket - - Refer to the [S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/GetStartedWithS3.html) for steps on creating an `S3 bucket`. - To complete the following steps you will need to keep track of the `S3 bucket name`. - -2. [RDS] Create an RDS Instance - - Refer to the [RDS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.MySQL.html) for steps on creating an `RDS MySQL instance`. - - When creating the RDS instance for security and connectivity reasons we recommend that: - - The RDS instance is in the same VPC as the cluster - - The RDS instance subnets must belong to at least 2 private subnets within the VPC - - The RDS instance security group is the same security group used by the EKS node instances - - To complete the following steps you will need to keep track of the following: - - `RDS database name` (not to be confused with the `DB identifier`) - - `RDS database admin username` - - `RDS database admin password` - - `RDS database endpoint URL` - - `RDS database port` - -3. Create Secrets in AWS Secrets Manager - - 1. [RDS] Create the RDS secret and configure the secret provider: - 1. Configure a secret, for example named `rds-secret`, with the RDS DB name, RDS endpoint URL, RDS DB port, and RDS DB credentials that were configured when following the steps in Create RDS Instance. - - For example, if your database name is `kubeflow`, your endpoint URL is `rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com`, your DB port is `3306`, your DB username is `admin`, and your DB password is `Kubefl0w` your secret should look like: - - ``` - export RDS_SECRET= - aws secretsmanager create-secret --name $RDS_SECRET --secret-string '{"username":"admin","password":"Kubefl0w","database":"kubeflow","host":"rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com","port":"3306"}' --region $CLUSTER_REGION - ``` - 1. Rename the `parameters.objects.objectName` field in [the rds secret provider configuration](../../../awsconfigs/common/aws-secrets-manager/rds/secret-provider.yaml) to the name of the secret. 
- - One line command: - ``` - yq e -i '.spec.parameters.objects |= sub("rds-secret",env(RDS_SECRET))' awsconfigs/common/aws-secrets-manager/rds/secret-provider.yaml - ``` - - For example, if your secret name is `rds-secret-new`, the configuration would look like: - - ``` - apiVersion: secrets-store.csi.x-k8s.io/v1alpha1 - kind: SecretProviderClass - metadata: - name: rds-secret - - ... - - parameters: - objects: | - - objectName: "rds-secret-new" # This line was changed - objectType: "secretsmanager" - jmesPath: - - path: "username" - objectAlias: "user" - - path: "password" - objectAlias: "pass" - - path: "host" - objectAlias: "host" - - path: "database" - objectAlias: "database" - - path: "port" - objectAlias: "port" - ``` - - 1. [S3] Create the S3 secret and configure the secret provider: - 1. Configure a secret, for example named `s3-secret`, with your AWS credentials. These need to be long term credentials from an IAM user and not temporary. - - Find more details about configuring/getting your AWS credentials here: - https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html - - ``` - export S3_SECRET= - aws secretsmanager create-secret --name S3_SECRET --secret-string '{"accesskey":"AXXXXXXXXXXXXXXXXXX6","secretkey":"eXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXq"}' --region $CLUSTER_REGION - ``` - 1. Rename the `parameters.objects.objectName` field in [the s3 secret provider configuration](../../../awsconfigs/common/aws-secrets-manager/s3/secret-provider.yaml) to the name of the secret. - - One line command: - ``` - yq e -i '.spec.parameters.objects |= sub("s3-secret",env(S3_SECRET))' awsconfigs/common/aws-secrets-manager/s3/secret-provider.yaml - ``` - - For example, if your secret name is `s3-secret-new`, the configuration would look like: - - ``` - apiVersion: secrets-store.csi.x-k8s.io/v1alpha1 - kind: SecretProviderClass - metadata: - name: s3-secret - - ... - - parameters: - objects: | - - objectName: "s3-secret-new" # This line was changed - objectType: "secretsmanager" - jmesPath: - - path: "accesskey" - objectAlias: "access" - - path: "secretkey" - objectAlias: "secret" - ``` - -4. Install AWS Secrets & Configuration Provider with Kubernetes Secrets Store CSI driver - - 1. Run the following commands to enable oidc and create an iamserviceaccount with permissions to retrieve the secrets created from AWS Secrets Manager - - ``` - eksctl utils associate-iam-oidc-provider --region=$CLUSTER_REGION --cluster=$CLUSTER_NAME --approve - - eksctl create iamserviceaccount --name kubeflow-secrets-manager-sa --namespace kubeflow --cluster $CLUSTER_NAME --attach-policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess --attach-policy-arn arn:aws:iam::aws:policy/SecretsManagerReadWrite --override-existing-serviceaccounts --approve --region $CLUSTER_REGION - ``` - - 2. 
Run these commands to install AWS Secrets & Configuration Provider with Kubernetes Secrets Store CSI driver - - ``` - kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/rbac-secretproviderclass.yaml - kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/csidriver.yaml - kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store.csi.x-k8s.io_secretproviderclasses.yaml - kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store.csi.x-k8s.io_secretproviderclasspodstatuses.yaml - kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store-csi-driver.yaml - kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/rbac-secretprovidersyncing.yaml - - kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/deployment/aws-provider-installer.yaml - ``` - -5. Update the KFP configurations - 1. [RDS] Configure the [RDS params file](../../../awsconfigs/apps/pipeline/rds/params.env) with the RDS endpoint url and the metadata db name. - - For example, if your RDS endpoint URL is `rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com` and your metadata db name is `metadata_db` your `params.env` file should look like: - ``` - dbHost=rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com - mlmdDb=metadata_db - ``` - - 2. [S3] Configure the [S3 params file](../../../awsconfigs/apps/pipeline/s3/params.env) with with the `S3 bucket name`, and `S3 bucket region`.. - - For example, if your S3 bucket name is `kf-aws-demo-bucket` and s3 bucket region is `us-west-2` your `params.env` file should look like: - ``` - bucketName=kf-aws-demo-bucket - minioServiceHost=s3.amazonaws.com - minioServiceRegion=us-west-2 - ``` - -## 3.0 Build Manifests and Install Kubeflow - -Once you have the resources ready, you can continue on to deploying the Kubeflow manifests using the single line command below - - -Choose one of the deployment options from below: - -- Deploying the configuration for both RDS and S3 -- Deploying the configuration for RDS only -- Deploying the configuration for S3 only - -#### [RDS and S3] Deploy both RDS and S3 - -```sh -while ! kustomize build docs/deployment/rds-s3 | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done -``` - -#### [RDS] Deploy RDS only - -```sh -while ! kustomize build docs/deployment/rds-s3/rds-only | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done -``` - -#### [S3] Deploy S3 only - -```sh -while ! kustomize build docs/deployment/rds-s3/s3-only | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done -``` - -Once, everything is installed successfully, you can access the Kubeflow Central Dashboard [by logging in to your cluster](../vanilla/README.md#connect-to-your-kubeflow-cluster). - -Congratulations! You can now start experimenting and running your end-to-end ML workflows with Kubeflow. - - -## 4.0 Verify the installation - -### 4.1 Verify RDS - -1. 
Connect to the RDS instance from a pod within the cluster - -``` -kubectl run -it --rm --image=mysql:5.7 --restart=Never mysql-client -- mysql -h -u -p -``` - -Note that you can find your credentials by visiting [aws secrets manager](https://aws.amazon.com/secrets-manager/) or by using [AWS CLI](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/secretsmanager/get-secret-value.html) - -For instance, to retrieve the value of a secret named `rds-secret`, we would do : - -``` -aws secretsmanager get-secret-value \ - --region $CLUSTER_REGION \ - --secret-id rds-secret \ - --query 'SecretString' \ - --output text -``` - -2. Once connected verify the databases `kubeflow` and `mlpipeline` exist. - -``` -mysql> show databases; - -+--------------------+ -| Database | -+--------------------+ -| information_schema | -| kubeflow | -| mlpipeline | -| mysql | -| performance_schema | -+--------------------+ -``` - -3. Verify the database `mlpipeline` has the following tables: - -``` -mysql> use mlpipeline; show tables; - -+----------------------+ -| Tables_in_mlpipeline | -+----------------------+ -| db_statuses | -| default_experiments | -| experiments | -| jobs | -| pipeline_versions | -| pipelines | -| resource_references | -| run_details | -| run_metrics | -+----------------------+ -``` - -4. Access the Kubeflow Central Dashboard [by logging in to your cluster](../vanilla/README.md#connect-to-your-kubeflow-cluster) and navigate to Katib (under Experiments (AutoML)). - -5. Create an experiment using the following [yaml file](../../../tests/e2e/resources/custom-resource-templates/katib-experiment-random.yaml). - -6. Once the experiment is complete verify the following table exists: - -``` -mysql> use kubeflow; show tables; - -+----------------------+ -| Tables_in_kubeflow | -+----------------------+ -| observation_logs | -+----------------------+ -``` - -7. Describe `observation_logs` to verify it is being populated. - -``` -mysql> select * from observation_logs; -``` - -### 4.2 Verify S3 - -1. Access the Kubeflow Central Dashboard [by logging in to your cluster](../vanilla/README.md#connect-to-your-kubeflow-cluster) and navigate to Kubeflow Pipelines (under Pipelines). - -2. Create an experiment named `test` and create a run using the sample pipeline `[Demo] XGBoost - Iterative model training`. - -3. Once the run is completed go to the S3 AWS console and open the bucket you specified for the Kubeflow installation. - -4. Verify the bucket is not empty and was populated by outputs of the experiment. - -## 5.0 Uninstall Kubeflow - -Run the following command to uninstall: - -```sh -kustomize build docs/deployment/rds-s3 | kubectl delete -f - -``` - -Additionally, the following cleanup steps may be required: - -```sh -kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io webhook.eventing.knative.dev webhook.istio.networking.internal.knative.dev webhook.serving.knative.dev - -kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io config.webhook.eventing.knative.dev config.webhook.istio.networking.internal.knative.dev config.webhook.serving.knative.dev - -kubectl delete endpoints -n default mxnet-operator pytorch-operator tf-operator -``` - -To uninstall AWS resources created by the automated setup run the cleanup script -1. Navigate to the Navigate to `tests/e2e` directory -``` -cd tests/e2e -``` -2. Install the script dependencies `pip install -r requirements.txt` -3. 
Make sure you have the configuration file created by the script in `tests/e2e/utils/rds-s3/metadata.yaml`
-```
-PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-cleanup.py
-```
\ No newline at end of file
diff --git a/docs/deployment/rds-s3/guide.md b/docs/deployment/rds-s3/guide.md
new file mode 100644
index 0000000000..0b9003ef38
--- /dev/null
+++ b/docs/deployment/rds-s3/guide.md
@@ -0,0 +1,381 @@
++++
+title = "RDS and S3"
+description = "Deploying Kubeflow with RDS and S3"
+weight = 40
++++
+
+This guide can be used to deploy Kubeflow Pipelines (KFP) and Katib with RDS and S3.
+
+### RDS
+
+[Amazon Relational Database Service (RDS)](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html) is a managed relational database service that facilitates several database management tasks such as database scaling, database backups, database software patching, OS patching, and more.
+
+In the [default Kubeflow installation](/docs/deployment/vanilla/guide/), the [KFP](https://github.com/kubeflow/manifests/blob/v1.4-branch/apps/pipeline/upstream/third-party/mysql/base/mysql-deployment.yaml) and [Katib](https://github.com/kubeflow/manifests/blob/v1.4-branch/apps/katib/upstream/components/mysql/mysql.yaml) components both use their own MySQL pod to persist KFP data (such as experiments, pipelines, jobs, etc.) and Katib experiment observation logs, respectively.
+
+Compared to the MySQL setup in the default installation, using RDS provides the following advantages:
+- Availability: RDS provides high availability and failover support for DB instances using Multi Availability Zone (Multi-AZ) deployments with a single standby DB instance, increasing the availability of KFP and Katib services during unexpected network events.
+- Scalability: RDS can be configured to handle availability and scaling needs. The default Kubeflow installation uses an EBS-hosted Persistent Volume Claim that is AZ-bound and does not support automatic online resizing.
+- Persistent data: KFP and Katib data can persist beyond single Kubeflow installations. Using RDS decouples the KFP and Katib datastores from the Kubeflow deployment, allowing multiple Kubeflow installations to reuse the same RDS instance provided that the KFP component versions store data in a format that is compatible.
+- Customization and management: RDS provides management features to facilitate changing database instance types, updating SQL versions, and more.
+
+### S3
+[Amazon Simple Storage Service (S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is an object storage service that is highly scalable, available, secure, and performant.
+
+In the [default Kubeflow installation](/docs/deployment/vanilla/guide/), the [KFP](https://github.com/kubeflow/manifests/blob/v1.4-branch/apps/pipeline/upstream/third-party/minio/base/minio-deployment.yaml) component uses the MinIO object storage service that can be configured to store objects in S3. However, by default the installation hosts the object store service locally in the cluster. KFP stores data such as pipeline architectures and pipeline run artifacts in MinIO.
+
+Configuring MinIO to read and write to S3 provides the following advantages:
+- Scalability and availability: S3 offers industry-leading scalability and availability and is more durable than the default MinIO object storage solution provided by Kubeflow.
+- Persistent artifacts: KFP artifacts can persist beyond single Kubeflow installations. Using S3 decouples the KFP artifact store from the Kubeflow deployment, allowing multiple Kubeflow installations to access the same artifacts provided that the KFP component versions store data in a format that is compatible.
+- Customization and management: S3 provides management features to help optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements.
+
+To get started with configuring and installing your Kubeflow installation with RDS and S3, follow the [install](#install) steps below to configure and deploy the Kustomize manifest.
+
+## Install
+
+The following steps show how to configure and deploy Kubeflow with supported AWS services.
+
+### Using only RDS or only S3
+
+Steps relevant only to the RDS installation are prefixed with `[RDS]`.
+
+Steps relevant only to the S3 installation are prefixed with `[S3]`.
+
+Steps without any prefixing are necessary for all installations.
+
+To install for only RDS or only S3, complete the steps relevant to your installation choice.
+
+To install for both RDS and S3, complete all the steps below.
+
+## 1.0 Prerequisites
+Follow the steps in [Prerequisites](/docs/deployment/prerequisites/) to make sure that you have everything you need to get started.
+
+1. Verify that you are in the root of your repository by running the `pwd` command. The path should be the root of your `kubeflow-manifests` repository:
+    ```bash
+    pwd
+    ```
+
+2. Create an OIDC provider for your cluster.
+
+    **Important:**
+    You must make sure you have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) for your cluster and that it was added from `eksctl` >= `0.56`. If you already have an OIDC provider in place, then you must make sure you have the tag `alpha.eksctl.io/cluster-name` with the cluster name as its value. If you don't have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags.
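+
+    If you prefer the CLI to the console, the following sketch shows one possible way to look up the provider and add the tag. It assumes the `CLUSTER_NAME` and `CLUSTER_REGION` variables that are exported later in this guide, and `<OIDC_PROVIDER_ARN>` is a placeholder for the provider ARN whose ID matches your cluster's issuer URL:
+
+    ```bash
+    # Find the cluster's OIDC issuer URL and the IAM OIDC providers in your account.
+    aws eks describe-cluster --name $CLUSTER_NAME --region $CLUSTER_REGION \
+      --query "cluster.identity.oidc.issuer" --output text
+    aws iam list-open-id-connect-providers
+
+    # Tag the matching provider with the cluster name, as eksctl would.
+    aws iam tag-open-id-connect-provider \
+      --open-id-connect-provider-arn <OIDC_PROVIDER_ARN> \
+      --tags Key=alpha.eksctl.io/cluster-name,Value=$CLUSTER_NAME
+    ```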
+
+## 2.0 Set up RDS, S3, and configure Secrets
+
+There are two ways to create RDS and S3 resources before you deploy the Kubeflow manifests. Either use the [automated setup](#21-option-1-automated-setup) Python script that is mentioned in the following step, or follow the [manual setup instructions](#22-option-2-manual-setup).
+
+### 2.1 **Option 1: Automated Setup**
+
+This setup performs all the manual steps in an automated fashion.
+
+The script takes care of creating the S3 bucket, creating the S3 Secret using AWS Secrets Manager, setting up the RDS database, and creating the RDS Secret using AWS Secrets Manager. The script also edits the required configuration files for Kubeflow Pipelines to be properly configured for the RDS database during Kubeflow installation. The script also handles cases where the resources already exist. In this case, the script will simply skip the step.
+
+> Note: The script will **not** delete any resource. Therefore, if a resource already exists (e.g. a Secret, a database with the same name, or an S3 bucket), **it will skip the creation of those resources and use the existing resources instead**. This is by design in order to prevent unwanted results, such as accidental deletion. For example, if a database with the same name already exists, the script will skip the database creation setup. If you forgot to change the database name used for creation, then this gives you the chance to retry the script with the proper value. See `python auto-rds-s3-setup.py --help` for the list of parameters, as well as their default values.
+
+1. Navigate to the `tests/e2e` directory.
+```bash
+cd tests/e2e
+```
+2. Install the script dependencies.
+```bash
+pip install -r requirements.txt
+```
+3. [Create an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html#id_users_create_cliwpsapi) with permissions to get bucket locations and to read and write objects in the S3 bucket where you want to store the Kubeflow artifacts. Take note of the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` of the IAM user that you created to use in the following step.
+4. Export values for `CLUSTER_REGION`, `CLUSTER_NAME`, `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY`. Then, run the `auto-rds-s3-setup.py` script.
+```bash
+export CLUSTER_REGION=
+export CLUSTER_NAME=
+export S3_BUCKET=
+export AWS_ACCESS_KEY_ID=
+export AWS_SECRET_ACCESS_KEY=
+
+PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --bucket $S3_BUCKET --s3_aws_access_key_id $AWS_ACCESS_KEY_ID --s3_aws_secret_access_key $AWS_SECRET_ACCESS_KEY
+```
+
+### Advanced customization
+
+The `auto-rds-s3-setup.py` script applies default values for the user password, max storage, storage type, instance type, and more. You can customize those preferences by specifying different values.
+
+Learn more about the different parameters with the following command:
+```bash
+PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-setup.py --help
+```
+
+### 2.2 **Option 2: Manual Setup**
+Follow these steps if you prefer to manually set up each component.
+1. [S3] Create an S3 Bucket
+
+    Refer to the [S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/GetStartedWithS3.html) for steps on creating an `S3 bucket`.
+    Take note of your `S3 bucket name` to use in the following steps.
+
+2. [RDS] Create an RDS Instance
+
+    Refer to the [RDS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.MySQL.html) for steps on creating an `RDS MySQL instance`.
+
+    When creating the RDS instance, we recommend the following for security and connectivity reasons:
+    - The RDS instance is in the same VPC as the cluster
+    - The RDS instance subnets belong to at least two private subnets within the VPC
+    - The RDS instance security group is the same security group used by the EKS node instances
+
+    To complete the following steps, you will need to keep track of the following:
+    - `RDS database name` (not to be confused with the `DB identifier`)
+    - `RDS database admin username`
+    - `RDS database admin password`
+    - `RDS database endpoint URL`
+    - `RDS database port`
+
+3. Create Secrets in AWS Secrets Manager
+
+    1. [RDS] Create the RDS Secret and configure the Secret provider:
+        1. Configure a Secret (e.g. `rds-secret`) with the RDS DB name, RDS endpoint URL, RDS DB port, and RDS DB credentials that were configured when creating your RDS instance.
+            - For example, if your database name is `kubeflow`, your endpoint URL is `rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com`, your DB port is `3306`, your DB username is `admin`, and your DB password is `Kubefl0w`, your Secret should look similar to the following:
+            - ```bash
+              export RDS_SECRET=
+              aws secretsmanager create-secret --name $RDS_SECRET --secret-string '{"username":"admin","password":"Kubefl0w","database":"kubeflow","host":"rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com","port":"3306"}' --region $CLUSTER_REGION
+              ```
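+            - If you want to double-check what was stored (an optional sketch that reuses the `RDS_SECRET` and `CLUSTER_REGION` variables from above), you can read the Secret back:
+            - ```bash
+              aws secretsmanager get-secret-value --secret-id $RDS_SECRET --region $CLUSTER_REGION --query 'SecretString' --output text
+              ```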
+        2. Rename the `parameters.objects.objectName` field in [the RDS Secret provider configuration](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/common/aws-secrets-manager/rds/secret-provider.yaml) to the name of the Secret.
+            - Rename the field with the following command:
+              ```bash
+              yq e -i '.spec.parameters.objects |= sub("rds-secret",env(RDS_SECRET))' awsconfigs/common/aws-secrets-manager/rds/secret-provider.yaml
+              ```
+            - For example, if your Secret name is `rds-secret-new`, the configuration should look similar to the following:
+            - ```yaml
+              apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
+              kind: SecretProviderClass
+              metadata:
+                name: rds-secret
+
+              ...
+
+              parameters:
+                objects: |
+                  - objectName: "rds-secret-new" # This line was changed
+                    objectType: "secretsmanager"
+                    jmesPath:
+                      - path: "username"
+                        objectAlias: "user"
+                      - path: "password"
+                        objectAlias: "pass"
+                      - path: "host"
+                        objectAlias: "host"
+                      - path: "database"
+                        objectAlias: "database"
+                      - path: "port"
+                        objectAlias: "port"
+              ```
+
+    2. [S3] Create the S3 Secret and configure the Secret provider:
+        1. Configure a Secret (e.g. `s3-secret`) with your AWS credentials. These need to be long-term credentials from an IAM user and not temporary.
+            - For more details about configuring or finding your AWS credentials, see [AWS security credentials](https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html).
+            - ```bash
+              export S3_SECRET=
+              aws secretsmanager create-secret --name $S3_SECRET --secret-string '{"accesskey":"AXXXXXXXXXXXXXXXXXX6","secretkey":"eXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXq"}' --region $CLUSTER_REGION
+              ```
+        2. Rename the `parameters.objects.objectName` field in [the S3 Secret provider configuration](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/common/aws-secrets-manager/s3/secret-provider.yaml) to the name of the Secret.
+            - Rename the field with the following command:
+              ```bash
+              yq e -i '.spec.parameters.objects |= sub("s3-secret",env(S3_SECRET))' awsconfigs/common/aws-secrets-manager/s3/secret-provider.yaml
+              ```
+            - For example, if your Secret name is `s3-secret-new`, the configuration should look similar to the following:
+            - ```yaml
+              apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
+              kind: SecretProviderClass
+              metadata:
+                name: s3-secret
+
+              ...
+
+              parameters:
+                objects: |
+                  - objectName: "s3-secret-new" # This line was changed
+                    objectType: "secretsmanager"
+                    jmesPath:
+                      - path: "accesskey"
+                        objectAlias: "access"
+                      - path: "secretkey"
+                        objectAlias: "secret"
+              ```
+
+4. Install AWS Secrets & Configuration Provider with Kubernetes Secrets Store CSI driver
+
+    1. Run the following commands to enable OIDC and create an `iamserviceaccount` with permissions to retrieve the Secrets created with AWS Secrets Manager.
+
+    ```bash
+    eksctl utils associate-iam-oidc-provider --region=$CLUSTER_REGION --cluster=$CLUSTER_NAME --approve
+
+    eksctl create iamserviceaccount --name kubeflow-secrets-manager-sa --namespace kubeflow --cluster $CLUSTER_NAME --attach-policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess --attach-policy-arn arn:aws:iam::aws:policy/SecretsManagerReadWrite --override-existing-serviceaccounts --approve --region $CLUSTER_REGION
+    ```
+
+    2. 
Run the following commands to install AWS Secrets & Configuration Provider with Kubernetes Secrets Store CSI driver: + + ```bash + kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/rbac-secretproviderclass.yaml + kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/csidriver.yaml + kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store.csi.x-k8s.io_secretproviderclasses.yaml + kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store.csi.x-k8s.io_secretproviderclasspodstatuses.yaml + kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store-csi-driver.yaml + kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/rbac-secretprovidersyncing.yaml + kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/deployment/aws-provider-installer.yaml + ``` + +5. Update the KFP configurations. + 1. [RDS] Configure the [RDS params file](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/apps/pipeline/rds/params.env) with the RDS endpoint URL and the metadata DB name. + + For example, if your RDS endpoint URL is `rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com` and your metadata DB name is `metadata_db`, then your `params.env` file should look similar to the following: + ```bash + dbHost=rm12abc4krxxxxx.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com + mlmdDb=metadata_db + ``` + + 2. [S3] Configure the [S3 params file](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/apps/pipeline/s3/params.env) with the `S3 bucket name` and `S3 bucket region`. + + For example, if your S3 bucket name is `kf-aws-demo-bucket` and your S3 bucket region is `us-west-2`, then your `params.env` file should look similar to the following: + ```bash + bucketName=kf-aws-demo-bucket + minioServiceHost=s3.amazonaws.com + minioServiceRegion=us-west-2 + ``` + +## 3.0 Build Manifests and install Kubeflow + +Once you have the resources ready, you can deploy the Kubeflow manifests for one of the following deployment options: +- both RDS and S3 +- RDS only +- S3 only + +#### [RDS and S3] Deploy both RDS and S3 + +Use the following command to deploy the Kubeflow manifests for both RDS and S3: +```sh +while ! kustomize build docs/deployment/rds-s3 | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done +``` + +#### [RDS] Deploy RDS only +Use the following command to deploy the Kubeflow manifests for RDS only: +```sh +while ! kustomize build docs/deployment/rds-s3/rds-only | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done +``` + +#### [S3] Deploy S3 only +Use the following command to deploy the Kubeflow manifests for S3 only: +```sh +while ! kustomize build docs/deployment/rds-s3/s3-only | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done +``` + +Once everything is installed successfully, you can access the Kubeflow Central Dashboard [by logging in to your cluster](/docs/deployment/vanilla/guide/#connect-to-your-kubeflow-cluster). + +You can now start experimenting and running your end-to-end ML workflows with Kubeflow! + +## 4.0 Verify the installation + +### 4.1 Verify RDS + +1. 
Connect to your RDS instance from a pod within the cluster with the following command: +```bash +kubectl run -it --rm --image=mysql:5.7 --restart=Never mysql-client -- mysql -h -u -p +``` + +You can find your credentials by visiting [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) or by using the [AWS CLI](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/secretsmanager/get-secret-value.html). + +For example, use the following command to retrieve the value of a Secret named `rds-secret`: +```bash +aws secretsmanager get-secret-value \ + --region $CLUSTER_REGION \ + --secret-id rds-secret \ + --query 'SecretString' \ + --output text +``` + +2. Once you are connected to your RDS instance, verify that the databases `kubeflow` and `mlpipeline` exist. +```bash +mysql> show databases; + ++--------------------+ +| Database | ++--------------------+ +| information_schema | +| kubeflow | +| mlpipeline | +| mysql | +| performance_schema | ++--------------------+ +``` + +3. Verify that the database `mlpipeline` has the following tables: +```bash +mysql> use mlpipeline; show tables; + ++----------------------+ +| Tables_in_mlpipeline | ++----------------------+ +| db_statuses | +| default_experiments | +| experiments | +| jobs | +| pipeline_versions | +| pipelines | +| resource_references | +| run_details | +| run_metrics | ++----------------------+ +``` + +4. Access the Kubeflow Central Dashboard [by logging in to your cluster](/docs/deployment/vanilla/guide/#connect-to-your-kubeflow-cluster) and navigate to Katib (under Experiments (AutoML)). + +5. Create an experiment using the following [yaml file](https://github.com/awslabs/kubeflow-manifests/blob/main/tests/e2e/resources/custom-resource-templates/katib-experiment-random.yaml). + +6. Once the experiment is complete, verify that the following table exists: +```bash +mysql> use kubeflow; show tables; + ++----------------------+ +| Tables_in_kubeflow | ++----------------------+ +| observation_logs | ++----------------------+ +``` + +7. Describe the `observation_logs` to verify that they are being populated. +```bash +mysql> select * from observation_logs; +``` + +### 4.2 Verify S3 + +1. Access the Kubeflow Central Dashboard [by logging in to your cluster](/docs/deployment/vanilla/guide/#connect-to-your-kubeflow-cluster) and navigate to Kubeflow Pipelines (under Pipelines). + +2. Create an experiment named `test` and create a run using the sample pipeline `[Demo] XGBoost - Iterative model training`. + +3. Once the run is completed, go to the S3 AWS console and open the bucket that you specified for your Kubeflow installation. + +4. Verify that the bucket is not empty and was populated by the outputs of the experiment. + +## 5.0 Uninstall Kubeflow + +Run the following command to uninstall your Kubeflow deployment: +```sh +kustomize build docs/deployment/rds-s3 | kubectl delete -f - +``` + +The following cleanup steps may also be required: + +```sh +kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io webhook.eventing.knative.dev webhook.istio.networking.internal.knative.dev webhook.serving.knative.dev + +kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io config.webhook.eventing.knative.dev config.webhook.istio.networking.internal.knative.dev config.webhook.serving.knative.dev + +kubectl delete endpoints -n default mxnet-operator pytorch-operator tf-operator +``` + +To uninstall AWS resources created by the automated setup, run the cleanup script: +1. 
Navigate to the `tests/e2e` directory. +```bash +cd tests/e2e +``` +2. Install the script dependencies. +```bash +pip install -r requirements.txt +``` +3. Make sure that you have the configuration file created by the script in `tests/e2e/utils/rds-s3/metadata.yaml`. +``` +PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-cleanup.py +``` \ No newline at end of file diff --git a/docs/deployment/uninstall-kubeflow.md b/docs/deployment/uninstall-kubeflow.md new file mode 100644 index 0000000000..07e906515d --- /dev/null +++ b/docs/deployment/uninstall-kubeflow.md @@ -0,0 +1,38 @@ ++++ +title = "Uninstall Kubeflow" +description = "Delete Kubeflow deployments and Amazon EKS clusters" +weight = 80 ++++ + +## Uninstall Kubeflow on AWS + +First, delete all existing Kubeflow profiles. + +```bash +kubectl get profile +kubectl delete profile --all +``` + +You can delete a Kubeflow deployment by running the `kubectl delete` command on the manifest according to the deployment option you chose. For example, to delete a vanilla installation, run the following command: + +```bash +kustomize build docs/deployment/vanilla/ | kubectl delete -f - +``` + +This command assumes that you have the repository in the same state as when you installed Kubeflow. + +Cleanup steps for specific deployment options can be found in their respective [installation guides](/docs/deployment/). + +> Note: This will not delete your Amazon EKS cluster. + +## (Optional) Delete Amazon EKS cluster + +If you created a dedicated Amazon EKS cluster for Kubeflow using `eksctl`, you can delete it with the following command: + +```bash +eksctl delete cluster --region $CLUSTER_REGION --name $CLUSTER_NAME +``` + +> Note: It is possible that parts of the CloudFormation deletion will fail depending upon modifications made post-creation. In that case, manually delete the eks-xxx role in IAM, then the ALB, the EKS target groups, and the subnets of that particular cluster. Then, retry the command to delete the nodegroups and the cluster. + +For more detailed information on deletion options, see [Deleting an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/delete-cluster.html). \ No newline at end of file diff --git a/docs/deployment/vanilla/README.md b/docs/deployment/vanilla/guide.md similarity index 52% rename from docs/deployment/vanilla/README.md rename to docs/deployment/vanilla/guide.md index 3d254f72c7..71c8a82777 100644 --- a/docs/deployment/vanilla/README.md +++ b/docs/deployment/vanilla/guide.md @@ -1,51 +1,18 @@ ++++ +title = "Vanilla Installation" +description = "Deploy Kubeflow on AWS using Amazon Elastic Kubernetes Service (EKS)" +weight = 60 ++++ + # Deploying Kubeflow on EKS -This guide describes how to deploy vanilla Kubeflow on AWS EKS. This vanilla version has very minimal changes to the upstream Kubeflow manifests. Here are the changes -- Kubeflow Notebook configuration has been modified to include some [custom container images](../../../components/notebook-dockerfiles) built on [AWS DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) -- [Usage Tracking](../README.md#usage-tracking) +This guide describes how to deploy Kubeflow on AWS EKS. This vanilla version has minimal changes to the upstream Kubeflow manifests. ## Prerequisites -This guide assumes that you have: - -1. Installed the following tools on the client machine - - [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) - A command line tool for interacting with AWS services. 
- - [eksctl](https://eksctl.io/introduction/#installation) - A command line tool for working with EKS clusters. - - [kubectl](https://kubernetes.io/docs/tasks/tools) - A command line tool for working with Kubernetes clusters. - - [yq](https://mikefarah.gitbook.io/yq) - A command line tool for YAML processing. (For Linux environments, use the [wget plain binary installation](https://mikefarah.gitbook.io/yq/#wget)) - - [jq](https://stedolan.github.io/jq/download/) - A command line tool for processing JSON. - - [kustomize version 3.2.0](https://github.com/kubernetes-sigs/kustomize/releases/tag/v3.2.0) - A command line tool to customize Kubernetes objects through a kustomization file. - - :warning: Kubeflow is not compatible with the latest versions of of kustomize 4.x. This is due to changes in the order resources are sorted and printed. Please see [kubernetes-sigs/kustomize#3794](https://github.com/kubernetes-sigs/kustomize/issues/3794) and [kubeflow/manifests#1797](https://github.com/kubeflow/manifests/issues/1797). We know this is not ideal and are working with the upstream kustomize team to add support for the latest versions of kustomize as soon as we can. - -1. Created an EKS cluster - - If you do not have an existing cluster, run the following command to create an EKS cluster. More details about cluster creation via `eksctl` can be found [here](https://eksctl.io/usage/creating-and-managing-clusters/). - - Substitute values for the CLUSTER_NAME and CLUSTER_REGION in the script below - ``` - export CLUSTER_NAME=$CLUSTER_NAME - export CLUSTER_REGION=$CLUSTER_REGION - eksctl create cluster \ - --name ${CLUSTER_NAME} \ - --version 1.19 \ - --region ${CLUSTER_REGION} \ - --nodegroup-name linux-nodes \ - --node-type m5.xlarge \ - --nodes 5 \ - --nodes-min 1 \ - --nodes-max 10 \ - --managed - ``` - -1. Clone the `awslabs/kubeflow-manifest` repo, `kubeflow/manifests` repo and checkout the release branches. - - Substitute the value for `KUBEFLOW_RELEASE_VERSION`(e.g. v1.4.1) and `AWS_RELEASE_VERSION`(e.g. v1.4.1-aws-b1.0.0) with the tag or branch you want to use below. Read more about [releases and versioning](../../community/releases.md#releases-and-versioning) policy if you are unsure about what these values should be. - ``` - export KUBEFLOW_RELEASE_VERSION=<> - export AWS_RELEASE_VERSION=<> - git clone https://github.com/awslabs/kubeflow-manifests.git && cd kubeflow-manifests - git checkout ${AWS_RELEASE_VERSION} - git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream - ``` - -### Build Manifests and Install Kubeflow +Be sure that you have satisfied the [installation prerequisites](/docs/deployment/prerequisites/) before working through this guide. + +### Build Manifests and install Kubeflow There two options for installing Kubeflow official components and common services with kustomize. @@ -55,12 +22,12 @@ There two options for installing Kubeflow official components and common service Option 1 targets ease of deployment for end users. \ Option 2 targets customization and ability to pick and choose individual components. -:warning: In both options, we use a default email (`user@example.com`) and password (`12341234`). For any production Kubeflow deployment, you should change the default password by following [the relevant section](#change-default-user-password). +> Warning: In both options, we use a default email (`user@example.com`) and password (`12341234`). 
For any production Kubeflow deployment, you should change the default password by following [the relevant section](#change-default-user-password). --- **NOTE** -`kubectl apply` commands may fail on the first try. This is inherent in how Kubernetes and `kubectl` work (e.g., CR must be created after CRD becomes ready). The solution is to simply re-run the command until it succeeds. For the single-line command, we have included a bash one-liner to retry the command. +`kubectl apply` commands may fail on the first try. This is inherent in how Kubernetes and `kubectl` work (e.g., CR must be created after CRD becomes ready). The solution is to re-run the command until it succeeds. For the single-line command, we have included a bash one-liner to retry the command. --- @@ -72,25 +39,26 @@ You can install all Kubeflow official components (residing under `apps`) and all while ! kustomize build docs/deployment/vanilla | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done ``` -Once, everything is installed successfully, you can access the Kubeflow Central Dashboard [by logging in to your cluster](#connect-to-your-kubeflow-cluster). +Once everything is installed successfully, you can access the Kubeflow Central Dashboard [by logging into your cluster](#connect-to-your-kubeflow-cluster). -Congratulations! You can now start experimenting and running your end-to-end ML workflows with Kubeflow. +You can now start experimenting and running your end-to-end ML workflows with Kubeflow! ### Install individual components -In this section, we will install each Kubeflow official component (under `apps`) and each common service (under `common`) separately, using just `kubectl` and `kustomize`. +This section lists an installation command for each official Kubeflow component (under `apps`) and each common service (under `common`) using just `kubectl` and `kustomize`. -If all the following commands are executed, the result is the same as in the above section of the single command installation. The purpose of this section is to: +If you run all of the following commands, the end result is the same as installing everything through the [single command installation](#install-with-a-single-command). +The purpose of this section is to: - Provide a description of each component and insight on how it gets installed. - Enable the user or distribution owner to pick and choose only the components they need. #### cert-manager -cert-manager is used by many Kubeflow components to provide certificates for +`cert-manager` is used by many Kubeflow components to provide certificates for admission webhooks. -Install cert-manager: +Install `cert-manager`: ```sh kustomize build common/cert-manager/cert-manager/base | kubectl apply -f - @@ -100,7 +68,7 @@ kustomize build common/cert-manager/kubeflow-issuer/base | kubectl apply -f - #### Istio Istio is used by many Kubeflow components to secure their traffic, enforce -network authorization and implement routing policies. +network authorization, and implement routing policies. Install Istio: @@ -112,7 +80,7 @@ kustomize build common/istio-1-9/istio-install/base | kubectl apply -f - #### Dex -Dex is an OpenID Connect Identity (OIDC) with multiple authentication backends. In this default installation, it includes a static user with email `user@example.com`. By default, the user's password is `12341234`. For any production Kubeflow deployment, you should change the default password by following [the relevant section](#change-default-user-password). 
+Dex is an OpenID Connect Identity (OIDC) with multiple authentication backends. In this default installation, it includes a static user with the email `user@example.com`. By default, the user's password is `12341234`. For any production Kubeflow deployment, you should change the default password by following the steps in [Change default user password](#change-default-user-password). Install Dex: @@ -122,7 +90,9 @@ kustomize build common/dex/overlays/istio | kubectl apply -f - #### OIDC AuthService -The OIDC AuthService extends your Istio Ingress-Gateway capabilities, to be able to function as an OIDC client: +The OIDC AuthService extends your Istio Ingress-Gateway capabilities to be able to function as an OIDC client: + +Install OIDC AuthService: ```sh kustomize build common/oidc-authservice/base | kubectl apply -f - @@ -139,18 +109,20 @@ kustomize build common/knative/knative-serving/base | kubectl apply -f - kustomize build common/istio-1-9/cluster-local-gateway/base | kubectl apply -f - ``` -Optionally, you can install Knative Eventing which can be used for inference request logging: +Optionally, you can install Knative Eventing, which can be used for inference request logging. + +Install Knative Eventing: ```sh kustomize build common/knative/knative-eventing/base | kubectl apply -f - ``` -#### Kubeflow Namespace +#### Kubeflow namespace -Create the namespace where the Kubeflow components will live in. This namespace +Create the namespace where the Kubeflow components will live. This namespace is named `kubeflow`. -Install kubeflow namespace: +Install the `kubeflow` namespace: ```sh kustomize build common/kubeflow-namespace/base | kubectl apply -f - @@ -158,11 +130,11 @@ kustomize build common/kubeflow-namespace/base | kubectl apply -f - #### Kubeflow Roles -Create the Kubeflow ClusterRoles, `kubeflow-view`, `kubeflow-edit` and +Create the Kubeflow ClusterRoles `kubeflow-view`, `kubeflow-edit`, and `kubeflow-admin`. Kubeflow components aggregate permissions to these ClusterRoles. -Install kubeflow roles: +Install Kubeflow roles: ```sh kustomize build common/kubeflow-roles/base | kubectl apply -f - @@ -171,11 +143,11 @@ kustomize build common/kubeflow-roles/base | kubectl apply -f - #### Kubeflow Istio Resources Create the Istio resources needed by Kubeflow. This kustomization currently -creates an Istio Gateway named `kubeflow-gateway`, in namespace `kubeflow`. +creates an Istio Gateway named `kubeflow-gateway` in the `kubeflow` namespace. If you want to install with your own Istio, then you need this kustomization as well. 
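+
+If you want to inspect what this kustomization creates before applying it (for example, to see the `kubeflow-gateway` Gateway definition), you can render it first; after you apply it in the next step, you can also confirm that the Gateway exists. This is an optional sanity check, not a required step:
+
+```sh
+# Render the manifests without applying them.
+kustomize build common/istio-1-9/kubeflow-istio-resources/base
+
+# After applying the kustomization (next step), confirm that the Gateway exists.
+kubectl get gateways.networking.istio.io -n kubeflow kubeflow-gateway
+```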
-Install istio resources: +Install Istio resources: ```sh kustomize build common/istio-1-9/kubeflow-istio-resources/base | kubectl apply -f - @@ -235,9 +207,9 @@ Install the Jupyter Web App official Kubeflow component: kustomize build apps/jupyter/jupyter-web-app/upstream/overlays/istio | kubectl apply -f - ``` -#### Profiles + KFAM +#### Profiles and Kubeflow Access-Management (KFAM) -Install the Profile Controller and the Kubeflow Access-Management (KFAM) official Kubeflow +Install the Profile controller and the Kubeflow Access-Management (KFAM) official Kubeflow components: ```sh @@ -260,7 +232,7 @@ Install the Tensorboards Web App official Kubeflow component: kustomize build apps/tensorboard/tensorboards-web-app/upstream/overlays/istio | kubectl apply -f - ``` -Install the Tensorboard Controller official Kubeflow component: +Install the Tensorboard controller official Kubeflow component: ```sh kustomize build apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow | kubectl apply -f - @@ -284,21 +256,21 @@ kustomize build apps/mpi-job/upstream/overlays/kubeflow | kubectl apply -f - #### AWS Telemetry -Install the AWS Kubeflow telemetry component. This is an optional component. See the [usage tracking documentation](../README.md#usage-tracking) for more information +Install the AWS Kubeflow telemetry component. This is an optional component. See [Usage Tracking](/docs/about/usage-tracking/) for more information ```sh kustomize build awsconfigs/common/aws-telemetry | kubectl apply -f - ``` -#### User Namespace +#### User namespace -Finally, create a new namespace for the the default user (named `kubeflow-user-example-com`). +Finally, create a new namespace for the the default user. In this example, the namespace is called `kubeflow-user-example-com`. ```sh kustomize build common/user-namespace/base | kubectl apply -f - ``` -### Connect to your Kubeflow Cluster +### Connect to your Kubeflow cluster After installation, it will take some time for all Pods to become ready. Make sure all Pods are ready before trying to connect, otherwise you might get unexpected errors. To check that all Kubeflow-related Pods are ready, use the following commands: @@ -314,7 +286,7 @@ kubectl get pods -n kubeflow-user-example-com #### Port-Forward -To get started quickly you can access Kubeflow via port-forward. Run the following to port-forward Istio's Ingress-Gateway to local port `8080`: +To get started quickly, you can access Kubeflow via port-forward. Run the following to port-forward Istio's Ingress-Gateway to local port `8080`: ```sh kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 @@ -327,11 +299,11 @@ After running the command, you can access the Kubeflow Central Dashboard by doin #### Exposing Kubeflow over Load Balancer -In order to expose Kubeflow over an external address you can setup AWS Application Load Balancer. Please take a look at the [load-balancer](../add-ons/load-balancer/README.md) guide to set it up. +In order to expose Kubeflow over an external address, you can set up AWS Application Load Balancer. Please take a look at the [Load Balancer guide](/docs/deployment/add-ons/load-balancer/guide/) to set it up. ### Change default user password -For security reasons, we don't want to use the default password for the default Kubeflow user when installing in security-sensitive environments. Instead, you should define your own password before deploying. 
To define a password for the default user: +For security reasons, we do not recommend using the default password for the default Kubeflow user when installing in security-sensitive environments. Instead, you should define your own password before deploying. To define a password for the default user: 1. Pick a password for the default user, with email `user@example.com`, and hash it using `bcrypt`: