From f2fd48acdf4a050ce518e1dc8f26b55096e0892e Mon Sep 17 00:00:00 2001
From: Isaac Yang
Date: Wed, 17 Apr 2024 11:26:29 -0700
Subject: [PATCH] Add user guide on how to deploy to EKS (#2510)

* Add user guide on how to deploy to EKS

* Address comments
---
 docs/real_world_fl.rst            |   1 +
 docs/real_world_fl/kubernetes.rst | 371 ++++++++++++++++++++++++++++++
 docs/real_world_fl/overview.rst   |   6 +
 3 files changed, 378 insertions(+)
 create mode 100644 docs/real_world_fl/kubernetes.rst

diff --git a/docs/real_world_fl.rst b/docs/real_world_fl.rst
index d9867adb19..c2b0b78e62 100644
--- a/docs/real_world_fl.rst
+++ b/docs/real_world_fl.rst
@@ -30,5 +30,6 @@ to see the capabilities of the system and how it can be operated.
    real_world_fl/job
    real_world_fl/workspace
    real_world_fl/cloud_deployment
+   real_world_fl/kubernetes
    real_world_fl/notes_on_large_models
    user_guide/security/identity_security
diff --git a/docs/real_world_fl/kubernetes.rst b/docs/real_world_fl/kubernetes.rst
new file mode 100644
index 0000000000..8c52bddc8f
--- /dev/null
+++ b/docs/real_world_fl/kubernetes.rst
@@ -0,0 +1,371 @@
.. _eks_deployment:

############################################
Amazon Elastic Kubernetes Service Deployment
############################################
In this document, we describe how to run the entire NVIDIA FLARE system inside one Amazon Elastic Kubernetes Service (EKS) cluster. For information
on how to run NVIDIA FLARE inside microk8s (a local Kubernetes cluster), please refer to :ref:`helm_chart`. That document describes how to
provision one NVIDIA FLARE system, configure your microk8s cluster, deploy the servers, the overseer, and the clients to that cluster, and
control and submit jobs to that NVIDIA FLARE system from the admin console.


Start the EKS
=============
We assume that you have an AWS account that allows you to start an EKS cluster, and that you have eksctl, aws, and kubectl installed on your local machine.
Note that the versions of those CLI tools may affect the operations, so we suggest keeping them up to date.

The first step is to start the EKS cluster with eksctl. The following is a sample yaml file, ``cluster.yaml``, to create the EKS cluster with one command.

.. code-block:: yaml

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: nvflare-cluster
      region: us-west-2
      tags:
        project: nvflare

    nodeGroups:
      - name: worker-node
        instanceType: t3.large
        desiredCapacity: 2

.. code-block:: shell

    eksctl create cluster -f cluster.yaml

After this, you will have one cluster with two ``t3.large`` EC2 nodes.


Provision
=========

With NVIDIA FLARE installed on your local machine, you can create one set of startup kits easily with ``nvflare provision``. If there is a project.yml file
in your current working directory, ``nvflare provision`` will create a workspace directory. If that project.yml file does not exist, ``nvflare provision`` will
create a sample project.yml for you. For simplicity, we suggest you remove/rename any existing project.yml and workspace directory, then provision the
set of startup kits from scratch. When selecting the sample project.yml during provisioning, select the non-HA one, as most clusters support HA easily.

After provisioning, you will have a workspace/example_project/prod_00 folder, which includes server, site-1, site-2, and admin@nvidia.com folders. If you
would like to use other names instead of ``site-1``, ``site-2``, etc., you can remove the workspace folder and modify the project.yml file. After that,
you can run the ``nvflare provision`` command to get the new set of startup kits.
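
As a rough sketch of the flow described above (assuming the default ``example_project`` name from the sample project.yml; the exact prompts may vary between NVIDIA FLARE versions):

.. code-block:: shell

    # Start from a clean working directory so the kits are provisioned from scratch
    rm -rf project.yml workspace

    # With no project.yml present, nvflare offers to create a sample one;
    # choose the non-HA sample when prompted
    nvflare provision

    # Run against the (optionally edited) project.yml to generate the startup kits
    nvflare provision -p project.yml

    # The startup kits are placed under workspace/example_project/prod_00
    ls workspace/example_project/prod_00
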
Persistent Volume
=================

EKS provides several ways to create persistent volumes. Before you can create the volume,
you will need to create an OIDC provider, add a service account, and attach a policy to two roles: the role of the node instance group and the role of that service account.

.. code-block:: shell

    eksctl utils associate-iam-oidc-provider --region=us-west-2 --cluster=nvflare-cluster --approve

.. code-block:: shell

    eksctl create iamserviceaccount \
        --region us-west-2 \
        --name ebs-csi-controller-sa \
        --namespace kube-system \
        --cluster nvflare-cluster \
        --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
        --approve \
        --role-only \
        --role-name AmazonEKS_EBS_CSI_DriverRole


.. code-block:: shell

    eksctl create addon --name aws-ebs-csi-driver \
        --cluster nvflare-cluster \
        --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole \
        --force

The following is the policy JSON file that you have to attach to those roles.

.. code-block:: json

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Policy4EKS",
                "Effect": "Allow",
                "Action": [
                    "ec2:DetachVolume",
                    "ec2:AttachVolume",
                    "ec2:DeleteVolume",
                    "ec2:DescribeInstances",
                    "ec2:DescribeTags",
                    "ec2:DeleteTags",
                    "ec2:CreateTags",
                    "ec2:DescribeVolumes",
                    "ec2:CreateVolume"
                ],
                "Resource": [
                    "*"
                ]
            }
        ]
    }

The following yaml file utilizes the EKS ``gp2`` StorageClass to allocate 5 GiB of space. You
can save it as ``volume.yaml`` and run ``kubectl apply -f volume.yaml`` to make the volume available.

.. code-block:: yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nvflare-pv-claim
      labels:
        app: nvflare
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: gp2

After that, your EKS persistent volume claim should be created and waiting for its first consumer.


Start Helper Pod
================

Now you will need to copy your startup kits to your EKS cluster. Those startup kits will be copied into the volume you just created.
In order to access the volume, we deploy a helper pod that mounts that persistent volume and use ``kubectl cp`` to copy files from your
local machine to the cluster.

The following is the helper pod yaml file, ``bb8.yaml``.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: bb8
      name: bb8
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: bb8
      template:
        metadata:
          labels:
            run: bb8
        spec:
          containers:
          - args:
            - sleep
            - "50000"
            image: busybox
            name: bb8
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim


All pods can be deployed with ``kubectl apply -f``, so we just need the following command.

.. code-block:: shell

    kubectl apply -f bb8.yaml

Your helper pod should be up and running very soon. Now copy the startup kits to the cluster with

.. code-block:: shell

    kubectl cp workspace/example_project/prod_00/server <bb8-pod-name>:/workspace/nvfl/

where ``<bb8-pod-name>`` is the name of the helper pod reported by ``kubectl get pods``. Do the same for site-1, site-2, and admin@nvidia.com.

This will make the entire set of startup kits available in the nvflare-pv-claim volume of the cluster so that the NVIDIA FLARE system
can mount that nvflare-pv-claim and access the startup kits.

After copying those folders to nvflare-pv-claim, you can shut down the helper pod. The nvflare-pv-claim and its contents will stay and
remain available to the server, client, and admin pods.
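
As a minimal sketch (assuming the ``bb8`` deployment name and the ``bb8.yaml`` file from above), you can verify the copy and then shut the helper pod down like this:

.. code-block:: shell

    # Verify that the startup kits landed on the persistent volume
    kubectl exec deploy/bb8 -- ls /workspace/nvfl/

    # Shut down the helper pod; the nvflare-pv-claim and its contents remain
    kubectl delete -f bb8.yaml
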
Start Server Pod
================

In a Kubernetes cluster, the NVIDIA FLARE server consists of two parts. As you might know,
the server needs computation to handle model updates, aggregation, and other operations, and it also needs to provide a service for clients and admins
to connect to. Therefore, the following two separate yaml files work together to create the NVIDIA FLARE server in EKS: a Deployment for the computation
and a Service for the network endpoints.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: nvflare
      name: nvflare
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: nvflare
      template:
        metadata:
          labels:
            run: nvflare
        spec:
          containers:
          - args:
            - -u
            - -m
            - nvflare.private.fed.app.server.server_train
            - -m
            - /workspace/nvfl/server
            - -s
            - fed_server.json
            - --set
            - secure_train=true
            - config_folder=config
            - org=nvidia
            command:
            - /usr/local/bin/python3
            image: nvflare/nvflare:2.4.0
            imagePullPolicy: Always
            name: nvflare
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim


.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      labels:
        run: server
      name: server
    spec:
      ports:
      - port: 8002
        protocol: TCP
        targetPort: 8002
        name: flport
      - port: 8003
        protocol: TCP
        targetPort: 8003
        name: adminport
      selector:
        run: nvflare


Note that the pod will use the ``nvflare/nvflare:2.4.0`` container image from Docker Hub. This image only includes the necessary dependencies to start
the NVIDIA FLARE system. If you require additional dependencies, such as Torch or MONAI, you will need to build and publish your own image and update
the yaml file accordingly.

Start Client Pods
=================

For the client pods, we only need one yaml file for each client. The following is the deployment yaml file for site-1.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: site1
      name: site1
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: site1
      template:
        metadata:
          labels:
            run: site1
        spec:
          containers:
          - args:
            - -u
            - -m
            - nvflare.private.fed.app.client.client_train
            - -m
            - /workspace/nvfl/site-1
            - -s
            - fed_client.json
            - --set
            - secure_train=true
            - uid=site-1
            - config_folder=config
            - org=nvidia
            command:
            - /usr/local/bin/python3
            image: nvflare/nvflare:2.4.0
            imagePullPolicy: Always
            name: site1
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim

Once the client is up and running, you can check the server log with ``kubectl logs``, and the log should show that the clients have registered.
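
Before moving on to the admin console, a quick sanity check from your local machine could look like the sketch below; the label selectors come from the deployment yaml files above, and the exact registration messages in the server log may differ between NVIDIA FLARE versions.

.. code-block:: shell

    # Confirm that the server and client pods are running
    kubectl get pods -l run=nvflare
    kubectl get pods -l run=site1

    # Tail the server log and watch for the clients registering
    kubectl logs -l run=nvflare --tail=100
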
Start and Connect to Admin Pods
===============================

We can also run the admin console inside the EKS cluster to submit jobs to the NVIDIA FLARE system running in the EKS cluster. Start the admin pod
with the following yaml file.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: admin
      name: admin
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: admin
      template:
        metadata:
          labels:
            run: admin
        spec:
          containers:
          - args:
            - "50000"
            command:
            - /usr/bin/sleep
            image: nvflare/nvflare:2.4.0
            imagePullPolicy: Always
            name: admin
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim

Once the admin pod is running, you can enter the pod with ``kubectl exec``, cd to ``/workspace/nvfl/admin@nvidia.com/startup``, and run ``fl_admin.sh``.


Note that you need to copy the jobs from your local machine to the EKS cluster so that the ``transfer`` directory of admin@nvidia.com contains the jobs
you would like to run in that EKS cluster.

diff --git a/docs/real_world_fl/overview.rst b/docs/real_world_fl/overview.rst
index 8f0572c1ec..23d494d0b1 100644
--- a/docs/real_world_fl/overview.rst
+++ b/docs/real_world_fl/overview.rst
@@ -159,6 +159,12 @@ See how to deploy to Azure and AWS clouds can be found in :ref:`cloud_deployment
 
 Deploy to Google Cloud will be made available in a future release.
 
+Kubernetes Deployment
+=====================
+As mentioned above, you can run NVIDIA FLARE in the public cloud. If you prefer to deploy NVIDIA FLARE in Amazon Elastic Kubernetes Service (EKS),
+you can find the deployment guide in :ref:`eks_deployment`.
+
+
 Starting Federated Learning Servers
 =============================================
 The FL Server will coordinate the federated learning training and be the main hub all clients and admin