.. _eks_deployment:

############################################
Amazon Elastic Kubernetes Service Deployment
############################################

In this document, we describe how to run the entire NVIDIA FLARE system inside one Amazon Elastic Kubernetes Service (EKS) cluster. For information
on how to run NVIDIA FLARE inside microk8s (a local Kubernetes cluster), please refer to :ref:`helm_chart`. That document describes how to
provision one NVIDIA FLARE system, configure your microk8s cluster, deploy the servers, the overseer, and the clients to that cluster, and
control and submit jobs to that NVIDIA FLARE system from the admin console.


Start the EKS
=============

We assume that you have an AWS account that allows you to start an EKS cluster, and that you have eksctl, aws, and kubectl installed on your local machine.
Note that the versions of those CLIs may affect the operations. We suggest keeping them up to date.
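
If you want to confirm that the tools are in place before starting, a quick check such as the following works (the exact version output format differs between tools):

.. code-block:: shell

    # Confirm the required CLIs are installed and print their versions
    eksctl version
    aws --version
    kubectl version --client
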

The first step is to start the EKS cluster with eksctl. The following is a sample yaml file, ``cluster.yaml``, to create the EKS cluster with one command.

.. code-block:: yaml

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: nvflare-cluster
      region: us-west-2
      tags:
        project: nvflare

    nodeGroups:
      - name: worker-node
        instanceType: t3.large
        desiredCapacity: 2

.. code-block:: shell

    eksctl create cluster -f cluster.yaml


After this, you will have one cluster with two ``t3.large`` EC2 nodes.
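
Before moving on, you may want to verify that the cluster and its worker nodes are up (this assumes eksctl updated your kubeconfig, which it does by default):

.. code-block:: shell

    # List the cluster and confirm both worker nodes are Ready
    eksctl get cluster --region us-west-2
    kubectl get nodes
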


Provision
=========

With NVIDIA FLARE installed on your local machine, you can create one set of startup kits easily with ``nvflare provision``. If there is a project.yml file
in your current working directory, ``nvflare provision`` will create a workspace directory. If that project.yml file does not exist, ``nvflare provision`` will
create a sample project.yml for you. For simplicity, we suggest you remove/rename any existing project.yml and workspace directory, and then provision the
set of startup kits from scratch. When selecting the sample project.yml during provisioning, select the non-HA one, as most clusters support HA easily.

After provisioning, you will have a workspace/example_project/prod_00 folder, which includes the server, site-1, site-2, and admin@nvidia.com folders. If you
would like to use other names instead of ``site-1``, ``site-2``, etc., you can remove the workspace folder and modify the project.yml file. After that,
you can run the ``nvflare provision`` command to get the new set of startup kits.
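
The whole flow, as a minimal sketch (the folder names assume the default sample project.yml):

.. code-block:: shell

    # Provision from scratch; follow the prompts to generate the sample project.yml
    # and pick the non-HA option
    nvflare provision

    # The startup kits end up under the workspace directory
    ls workspace/example_project/prod_00/
    # server  site-1  site-2  admin@nvidia.com
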

Persistent Volume
=================

EKS provides several ways to create persistent volumes. Before you can create the volume,
you will need to create one OIDC provider, add one service account, and attach a policy to two roles: the node instance group role and the service account role.

.. code-block:: shell

    eksctl utils associate-iam-oidc-provider --region=us-west-2 --cluster=nvflare-cluster --approve

.. code-block:: shell

    eksctl create iamserviceaccount \
      --region us-west-2 \
      --name ebs-csi-controller-sa \
      --namespace kube-system \
      --cluster nvflare-cluster \
      --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
      --approve \
      --role-only \
      --role-name AmazonEKS_EBS_CSI_DriverRole

.. code-block:: shell

    eksctl create addon --name aws-ebs-csi-driver \
      --cluster nvflare-cluster \
      --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole \
      --force


The following is the policy JSON file that you have to attach to the roles.

.. code-block:: json

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Policy4EKS",
                "Effect": "Allow",
                "Action": [
                    "ec2:DetachVolume",
                    "ec2:AttachVolume",
                    "ec2:DeleteVolume",
                    "ec2:DescribeInstances",
                    "ec2:DescribeTags",
                    "ec2:DeleteTags",
                    "ec2:CreateTags",
                    "ec2:DescribeVolumes",
                    "ec2:CreateVolume"
                ],
                "Resource": [
                    "*"
                ]
            }
        ]
    }
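
One way to attach it, sketched with the AWS CLI (the policy file name ``ebs-policy.json``, the policy name, and the node instance role placeholder are assumptions; substitute your own values):

.. code-block:: shell

    # Create a customer-managed policy from the JSON file above
    aws iam create-policy --policy-name Policy4EKS --policy-document file://ebs-policy.json

    # Attach it to the node instance group role and the service account role
    aws iam attach-role-policy --role-name <node-instance-role> \
      --policy-arn arn:aws:iam::<account-id>:policy/Policy4EKS
    aws iam attach-role-policy --role-name AmazonEKS_EBS_CSI_DriverRole \
      --policy-arn arn:aws:iam::<account-id>:policy/Policy4EKS
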

The following yaml file will utilize the EKS gp2 StorageClass to allocate 5 GiB of space. You
can run ``kubectl apply -f volume.yaml`` to make the volume available.

.. code-block:: yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nvflare-pv-claim
      labels:
        app: nvflare
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: gp2


After that, your EKS persistent volume should be waiting for the first claim.
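
You can check the claim's status at any time (with the gp2 StorageClass, the STATUS typically stays Pending until the first pod mounts it):

.. code-block:: shell

    kubectl get pvc nvflare-pv-claim
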


Start Helper Pod
================

Now you will need to copy your startup kits to your EKS cluster. Those startup kits will be copied into the volume you just created.
In order to access the volume, we deploy a helper pod that mounts that persistent volume, and then use ``kubectl cp`` to copy files from your
local machine to the cluster.

The following is the helper pod yaml file.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: bb8
      name: bb8
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: bb8
      template:
        metadata:
          labels:
            run: bb8
        spec:
          containers:
          - args:
            - sleep
            - "50000"
            image: busybox
            name: bb8
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim


All pods can be deployed with ``kubectl apply -f``, so we just need the following command.

.. code-block:: shell

    kubectl apply -f bb8.yaml


Your helper pod should be up and running very soon. Now copy the startup kits to the cluster with

.. code-block:: shell

    kubectl cp workspace/example_project/prod_00/server <helper-pod>:/workspace/nvfl/

and do the same for site-1, site-2, and admin@nvidia.com.
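
Put together, and assuming the helper pod name is looked up from the bb8 deployment (the pod name suffix will differ on your cluster), the copies look roughly like this:

.. code-block:: shell

    # Find the helper pod name created by the bb8 deployment
    HELPER_POD=$(kubectl get pods -l run=bb8 -o jsonpath='{.items[0].metadata.name}')

    # Copy each startup kit folder into the persistent volume
    for folder in server site-1 site-2 admin@nvidia.com; do
        kubectl cp "workspace/example_project/prod_00/$folder" "$HELPER_POD:/workspace/nvfl/"
    done
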

This will make the entire set of startup kits available in nvflare-pv-claim on the cluster, so that the NVIDIA FLARE system
can mount nvflare-pv-claim and access the startup kits.

After copying those folders to nvflare-pv-claim, you can shut down the helper pod. The nvflare-pv-claim volume and its contents will stay and remain
available to the server/client/admin pods.
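
For example, removing the helper deployment (the PVC and its data are left untouched):

.. code-block:: shell

    kubectl delete -f bb8.yaml
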

Start Server Pod
================

On a Kubernetes cluster, the NVIDIA FLARE server consists of two parts. As you might know,
the server needs computation to handle model updates, aggregation, and other operations. It also needs to provide a service for clients and admins
to connect to. Therefore, the following two separate yaml files work together to create the NVIDIA FLARE server in EKS.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: nvflare
      name: nvflare
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: nvflare
      template:
        metadata:
          labels:
            run: nvflare
        spec:
          containers:
          - args:
            - -u
            - -m
            - nvflare.private.fed.app.server.server_train
            - -m
            - /workspace/nvfl/server
            - -s
            - fed_server.json
            - --set
            - secure_train=true
            - config_folder=config
            - org=nvidia
            command:
            - /usr/local/bin/python3
            image: nvflare/nvflare:2.4.0
            imagePullPolicy: Always
            name: nvflare
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      labels:
        run: server
      name: server
    spec:
      ports:
      - port: 8002
        protocol: TCP
        targetPort: 8002
        name: flport
      - port: 8003
        protocol: TCP
        targetPort: 8003
        name: adminport
      selector:
        run: nvflare


Note that the pod will use the nvflare/nvflare:2.4.0 container image from Docker Hub. This image only includes the necessary dependencies to start
the NVIDIA FLARE system. If you require additional dependencies, such as Torch or MONAI, you will need to build and publish your own image and update
the yaml file accordingly.
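
Assuming the two yaml files above are saved as ``server.yaml`` and ``service.yaml`` (the file names are your choice), the server can be deployed and checked like this:

.. code-block:: shell

    kubectl apply -f server.yaml -f service.yaml

    # Confirm the server pod is Running, then follow its log
    kubectl get pods -l run=nvflare
    kubectl logs -f deployment/nvflare
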

Start Client Pods
=================

For the client pods, we only need one yaml file for each client. The following is the deployment yaml file for site-1.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: site1
      name: site1
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: site1
      template:
        metadata:
          labels:
            run: site1
        spec:
          containers:
          - args:
            - -u
            - -m
            - nvflare.private.fed.app.client.client_train
            - -m
            - /workspace/nvfl/site-1
            - -s
            - fed_client.json
            - --set
            - secure_train=true
            - uid=site-1
            - config_folder=config
            - org=nvidia
            command:
            - /usr/local/bin/python3
            image: nvflare/nvflare:2.4.0
            imagePullPolicy: Always
            name: site1
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim


Once the client is up and running, you can check the server log with ``kubectl logs``, and the log should show that the clients have registered.
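
For example, assuming the client yaml above is saved as ``site1.yaml`` and a similar file exists for site-2 (the exact wording of the registration log lines may vary):

.. code-block:: shell

    kubectl apply -f site1.yaml
    kubectl apply -f site2.yaml

    # Look for the client registration messages in the server log
    kubectl logs deployment/nvflare | grep -i register
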

Start and Connect to Admin Pods
===============================

We can also run the admin console inside the EKS cluster to submit jobs to the NVIDIA FLARE system running in that cluster. Start the admin pod
with the following yaml file.

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        run: admin
      name: admin
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: admin
      template:
        metadata:
          labels:
            run: admin
        spec:
          containers:
          - args:
            - "50000"
            command:
            - /usr/bin/sleep
            image: nvflare/nvflare:2.4.0
            imagePullPolicy: Always
            name: admin
            volumeMounts:
            - name: nvfl
              mountPath: /workspace/nvfl/
          volumes:
          - name: nvfl
            persistentVolumeClaim:
              claimName: nvflare-pv-claim


Once the admin pod is running, you can enter the pod with ``kubectl exec``, cd to ``/workspace/nvfl/admin@nvidia.com/startup``, and run ``fl_admin.sh``.
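
A minimal sketch of that interaction, assuming a bash shell is available in the image:

.. code-block:: shell

    kubectl exec -it deployment/admin -- /bin/bash

    # Inside the pod:
    cd /workspace/nvfl/admin@nvidia.com/startup
    ./fl_admin.sh
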

Note that you need to copy the job from your local machine to the EKS cluster so that the ``transfer`` directory of admin@nvidia.com contains the jobs
you would like to run in that EKS cluster.
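
One way to do that is the same ``kubectl cp`` approach used for the startup kits, this time targeting the admin pod (the local job folder ``jobs/hello-world`` is only an example):

.. code-block:: shell

    ADMIN_POD=$(kubectl get pods -l run=admin -o jsonpath='{.items[0].metadata.name}')
    kubectl cp jobs/hello-world "$ADMIN_POD:/workspace/nvfl/admin@nvidia.com/transfer/"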