kustomize deployments and skaffolding for Label_Microservice (#93)

* This PR provides a kustomize package to deploy the label microservice
* Also add a skaffolding config for the Label_Microservice
* Remove the old YAML deployment files for the Label Microservice.
* Edit the worker Dockerfile
  * Use TensorFlow 1.15.0 rather than using the "latest" image
  * We can also use a regular TensorFlow image and not a GPU version since this is just for inference and so we shouldn't need GPUs
  * Create a new requirements.worker.txt to only include the libraries that are needed in the worker. This should be much smaller than the uber set of python libraries (e.g. we don't need Jupyter, fairing, etc...)
* Create requirements.universal_model.txt to contain some of the required python dependencies for the universal model.
  * Universal model is using ktext and some other libraries.
* Add a prod overlay for the issue_embedding service.
* create_secrets.py is a helper script for creating the required secrets in the clusters based on files in GCS.

Related to #70 ensemble models.
kubeflow · Jan 4, 2020 · 244b6eb
1 parent 262bfb8
commit 244b6eb
Showing 24 changed files with 524 additions and 227 deletions.
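
The commit message describes `create_secrets.py` as a helper that creates cluster secrets from files in GCS, but the script itself is not shown in this part of the diff. The sketch below is not that script. It is a minimal illustration, under stated assumptions, of the manifest-building step only: the function name `build_secret_manifest` is hypothetical, the GCS download is left out, and the secret name and key are borrowed from the worker deployment further down.

```python
import base64


def build_secret_manifest(name, namespace, files):
    """Return a Kubernetes Secret manifest (as a dict) for the given files.

    `files` maps a key (the file name inside the secret) to raw bytes.
    Hypothetical helper; the real create_secrets.py may work differently.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Values under "data" must be base64-encoded strings.
        "data": {
            key: base64.b64encode(content).decode("ascii")
            for key, content in sorted(files.items())
        },
    }
```

The resulting dict could be serialized to YAML or JSON and piped to `kubectl apply -f -`; the namespace to use is an assumption that depends on the overlay being deployed.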
diff --git a/Issue_Embeddings/deployment/README.md b/Issue_Embeddings/deployment/README.md
@@ -10,22 +10,30 @@ This is currently running on a GKE cluster.
 There is a dedicated instance running in
 
 * **GCP project**: issue-label-bot-dev
-* **cluster**: github-api-cluster
-* **namespace**: issuefeat
+* **cluster**: issue-label-bot
+* **namespace**: label-bot-prod
+
+
+See [kubeflow/code-intelligence#70](https://github.com/kubeflow/code-intelligence/issues/70) for a log of how it was set up.
 
 Deploying it
 
-1. Create the deployment
+1. Use skaffold to build a new image.
 
    ```
-   kustomize build deployment/overlays/dev | kubectl apply -f -
+   skaffold build
    ```
 
-   * TODO(jlewi): We should probably define suitable prod and possibly staging environments as well
+1. Edit the image
 
-1. You can also follow the [developer_guide.md](../developer_guide.md) to deploy it using skaffold
+   ```
+   cd deployment/overlays/prod
+   kustomize edit set image gcr.io/issue-label-bot-dev/issue-embedding=gcr.io/issue-label-bot-dev/issue-embedding:${TAG}@${SHA}
+   ```
 
-1. TODO(jlewi): Add instructions for how to build and update the images; one way to do this would be to use
-   `skaffold build` followed by `kustomize edit`
+1. Create the deployment
 
-   * We may need/want to use skaffold profiles to define GCR buckets corresponding to dev, staging, and prod
+   ```
+   cd Issue_Embeddings/deployment/overlays/prod
+   kustomize build | kubectl apply -f -
+   ```
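
The "Edit the image" step in the README above substitutes `${TAG}` and `${SHA}` by hand. As a rough sketch of that step, the helper below composes the `name:tag@digest` reference and the argument list for `kustomize edit set image`. The function names are hypothetical and the digest check is an assumption about the expected format, not something the README specifies.

```python
def image_reference(name, tag, digest):
    """Compose name:tag@digest, the form used in the README step."""
    if not digest.startswith("sha256:"):
        raise ValueError("digest should look like sha256:<hex>")
    return f"{name}:{tag}@{digest}"


def kustomize_set_image_args(name, tag, digest):
    """Return the argv for `kustomize edit set image <name>=<new image>`."""
    return [
        "kustomize", "edit", "set", "image",
        f"{name}={image_reference(name, tag, digest)}",
    ]
```

The list can be passed to `subprocess.run(args, cwd=overlay_dir, check=True)`, where `overlay_dir` is the prod overlay directory.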
diff --git a/Issue_Embeddings/deployment/base/deployments.yaml b/Issue_Embeddings/deployment/base/deployments.yaml
@@ -21,7 +21,7 @@ spec:
           httpGet:
             path: /healthz
             port: 80
-          initialDelaySeconds: 10
+          initialDelaySeconds: 30
           periodSeconds: 3
         env:
         - name: FLASK_ENV
@@ -31,4 +31,7 @@ spec:
         - name: authors
           value: 'f'
         ports:
-        - containerPort: 80
+        - containerPort: 80
+      # We need to set a service account corresponding to workload
+      # identity
+      serviceAccountName: default-editor
diff --git a/Issue_Embeddings/deployment/overlays/dev/deployments.yaml b/Issue_Embeddings/deployment/overlays/dev/deployments.yaml
@@ -10,12 +10,6 @@ spec:
         containers:
         - name: app
           env:
-            - name: PROJECT
-              value: issue-label-bot-dev
-            - name: ISSUE_EVENT_TOPIC
-              value: "TEST_event_queue"
-            - name: ISSUE_EVENT_SUBSCRIPTION
-              value: "TEST_subscription_for_event_queue"
             # Flask environment variables
             # TODO: Unfortunately it looks like if we enable debugging we hit the following error
             # https://stackoverflow.com/questions/53522052/flask-app-valueerror-signal-only-works-in-main-thread

diff --git a/Issue_Embeddings/deployment/overlays/dev/kustomization.yaml b/Issue_Embeddings/deployment/overlays/dev/kustomization.yaml
@@ -4,6 +4,6 @@ bases:
 - ../../base
 commonLabels:
   environment: dev
-namespace: jlewi-dev
+namespace: label-bot-dev
 patchesStrategicMerge:
 - deployments.yaml
diff --git a/Issue_Embeddings/deployment/overlays/prod/deployments.yaml b/Issue_Embeddings/deployment/overlays/prod/deployments.yaml
@@ -0,0 +1,11 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+    name: server
+spec:
+    # Run three replicas in prod
+    replicas: 3
+    template:
+      spec:
+        containers:
+        - name: app    
diff --git a/Issue_Embeddings/deployment/overlays/prod/kustomization.yaml b/Issue_Embeddings/deployment/overlays/prod/kustomization.yaml
@@ -0,0 +1,13 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+commonLabels:
+  environment: prod
+namespace: label-bot-prod
+patchesStrategicMerge:
+- deployments.yaml
+resources:
+- ../../base
+images:
+- digest: sha256:292e6af3214b3a3dc499fe08a1873b986b77ba9e201ca57afd9d6736f513fe40
+  name: gcr.io/issue-label-bot-dev/issue-embedding
+  newName: gcr.io/issue-label-bot-dev/issue-embedding:3191fea
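
The `images` entry above carries the tag inside `newName` and pins the image with `digest`. The sketch below is a simplified model, not kustomize's actual implementation, of how such an entry maps to the image reference that ends up in the rendered Deployment. The function name `resolve_image` is hypothetical, and the precedence of `digest` over `newTag` is stated here as an assumption about kustomize's behavior.

```python
def resolve_image(entry):
    """Return the image reference an `images:` entry rewrites to (simplified).

    Simplification: the tag on the original image is not modelled, so an
    entry with neither `digest` nor `newTag` yields the bare name.
    """
    name = entry.get("newName", entry["name"])
    if "digest" in entry:
        # Assumed precedence: a digest wins over newTag.
        return f"{name}@{entry['digest']}"
    if "newTag" in entry:
        return f"{name}:{entry['newTag']}"
    return name
```

Under this model the prod entry resolves to `gcr.io/issue-label-bot-dev/issue-embedding:3191fea@sha256:292e...`, which matches the `name:${TAG}@${SHA}` form used in the README.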
diff --git a/Issue_Embeddings/skaffold.yaml b/Issue_Embeddings/skaffold.yaml
@@ -14,9 +14,11 @@ build:
     # TODO(https://github.com/GoogleContainerTools/skaffold/issues/3448): We use manual sync
     # because inferred sync doesn't work
     #
-    # TODO(https://github.com/kubeflow/code-intelligence/issues/78): To use skaffold filesync
+    # TODO(https://github.com/kubeflow/code-intelligence/issues/78): 
+    # To use skaffold filesync
     # I think we will need to use a custom program to autorestart the server on file changes
-    # because we can't run flask in debug mode and rely on its auto-loader.
+    # because we can't run flask in debug mode and rely on its auto-loader..
+    # We created autorestart for that so we just have to use it.
     #sync:
     #    manual:
     #    - src: 'py/code_intelligence/*.py'
@@ -29,14 +31,17 @@ build:
       buildContext:
         gcsBucket: issue-label-bot-dev_skaffold-kaniko
       env: 
+        # TODO(GoogleContainerTools/skaffold#3468) skaffold doesn't
+        # appear to work with workload identity
         - name: GOOGLE_APPLICATION_CREDENTIALS
           value: /secret/user-gcp-sa.json
       cache: {}
   cluster:
     # pullSecret can be set to a local file from which the pull secret should be created.
     pullSecretName: user-gcp-sa
-    # TODO(jlewi): This should be changed for each developer; or maybe we should create a reusable one?
-    namespace: jlewi-dev
+    # Build in the kaniko namespace because we need to disable ISTIO sidecar injection
+    # see  GoogleContainerTools/skaffold#3442
+    namespace: kaniko
     resources:
       requests:
         cpu: 8

diff --git a/Label_Microservice/deployment/Dockerfile.worker b/Label_Microservice/deployment/Dockerfile.worker
@@ -1,72 +1,37 @@
-# borrowed from hamelsmu/ml-gpu-lite
+# Dockerfile for prediction workers
+FROM tensorflow/tensorflow:1.15.0-py3
 
-FROM tensorflow/tensorflow:latest-gpu-py3
 ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
 
-RUN add-apt-repository -y ppa:git-core/ppa
-RUN add-apt-repository -y ppa:jonathonf/python-3.6
-
-RUN apt-get update --fix-missing && apt-get install -y wget bzip2 ca-certificates \
-    build-essential \
-    byobu \
-    ca-certificates \
-    git-core git \
-    htop \
-    libglib2.0-0 \
-    libjpeg-dev \
-    libpng-dev \
-    libxext6 \
-    libsm6 \
-    libxrender1 \
-    libcupti-dev \
-    openssh-server \
-    python3.6 \
-    python3.6-dev \
-    software-properties-common \
-    vim \
-    && \
-apt-get clean && \
-rm -rf /var/lib/apt/lists/*
-
-RUN apt-get -y update
-
-#  Setup Python 3.6 (Need for other dependencies)
-RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1
-RUN apt-get install -y python3-setuptools
-RUN easy_install pip
-RUN pip install --upgrade pip
-
-# Fastai dependencies
-RUN pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
-
 # install python packages
-COPY Label_Microservice/requirements.txt .
-RUN pip --no-cache-dir install -r requirements.txt
-
-#For Fairseq-py
-ENV NVIDIA_VISIBLE_DEVICES all
-ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
-ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
-
-# Open Ports for TensorBoard, Jupyter, and SSH
-EXPOSE 6006
-EXPOSE 7654
-EXPOSE 22
-
-#Setup File System
-RUN mkdir ds
-ENV HOME=/ds
-ENV SHELL=/bin/bash
-VOLUME /ds
-WORKDIR /ds
+COPY Label_Microservice/deployment/requirements.worker.txt .
+COPY Label_Microservice/deployment/requirements.universal_model.txt .
+RUN pip --no-cache-dir install -r requirements.worker.txt
+RUN pip --no-cache-dir install -r requirements.universal_model.txt
 
 # Copy needed files for worker
-COPY py/label_microservice/worker.py /ds/worker.py
 COPY py /py
 ENV PYTHONPATH=/py
 
+# Skaffold hack
+# Skaffold infers the files to watch for changes by parsing the dockerfile 
+# and looking for COPY statements. Skaffold v1.1.0 doesn't appear to detect
+# changes to directories so we add explicit COPY statements for the files that
+# we want to retrigger skaffold on when they are modified
+# TODO(jlewi): Need to try removing this. I think the problem might have been I was out of
+# notify resources on my local machine. When I switched skaffold to use --notify=polling
+# it started to detect changes.
+COPY py/label_microservice/mlp.py /py/label_microservice/mlp.py
+COPY py/label_microservice/models.py /py/label_microservice/models.py
+COPY py/label_microservice/repo_config.py /py/label_microservice/repo_config.py
+COPY py/label_microservice/repo_specific_model.py /py/label_microservice/repo_specific_model.py
+COPY py/label_microservice/universal_kind_label_model.py /py/label_microservice/universal_kind_label_model.py
+COPY py/label_microservice/worker.py /py/label_microservice/worker.py
+
 # Add helper files
+# TODO(jlewi): What is this for?
 RUN pip freeze > container_requirements.txt
 
 # Run the shell
+# TODO(jlewi): Why is the default command tail?
 CMD [ "/bin/bash", "-c", "tail -f /dev/null" ]
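
The "Skaffold hack" comment in the Dockerfile above says Skaffold infers which files to watch from `COPY` statements. To make that idea concrete, here is a small sketch that lists the source paths of `COPY` instructions. It is not Skaffold's parser: the function name `copy_sources` is hypothetical, and line continuations and the JSON-array form of `COPY` are not handled.

```python
def copy_sources(dockerfile_text):
    """List the source paths named in COPY instructions of a Dockerfile."""
    sources = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if parts[0].upper() != "COPY":
            continue
        # Drop flags such as --from=builder or --chown=user:group.
        args = [p for p in parts[1:] if not p.startswith("--")]
        # The last argument is the destination; the rest are sources.
        sources.extend(args[:-1])
    return sources
```

Run against the new `Dockerfile.worker`, this would return both the `py` directory and each explicitly copied module, which is the set of paths the comment expects Skaffold to watch.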
diff --git a/Label_Microservice/deployment/README.md b/Label_Microservice/deployment/README.md
@@ -10,43 +10,32 @@ This is currently running on a GKE cluster.
 There is a dedicated instance running in
 
 * **GCP project**: issue-label-bot-dev
-* **cluster**: workers
+* **cluster**: issue-label-bot
+
+See [kubeflow/code-intelligence#70](https://github.com/kubeflow/code-intelligence/issues/70) for a log of how it was set up.
 
 Deploying it
 
-1. Create the deployment
+1. Use skaffold to build a new image.
 
    ```
-   kubectl apply -f deployments.yaml  
+   skaffold build
    ```
 
-1. Create the secret
+1. Edit the image
 
    ```
-   gsutil cp gs://github-probots_secrets/ml-app-inference-secret.yaml /tmp
-   kubectl apply -f /tmp/ml-app-inference-secret.yaml
+   cd deployment/overlays/prod
+   kustomize edit set image gcr.io/issue-label-bot-dev/bot-worker=gcr.io/issue-label-bot-dev/bot-worker:${TAG}@${SHA}
    ```
 
-
-## Testing
-
-There is a staging cluster running in
-
-* **GCP project**: issue-label-bot-dev
-* **cluster**: github-mlapp-test
-
-Deploying it
-
 1. Create the deployment
 
    ```
-   kubectl apply -f deployments-test.yaml  
+   cd Label_Microservice/deployment/overlays/prod
+   kustomize build | kubectl apply -f -
    ```
 
-1. Create the secret
-
-   ```
-   gsutil cp gs://github-probots_secrets/ml-app-inference-secret-test.yaml /tmp
-   kubectl apply -f /tmp/ml-app-inference-secret-test.yaml
-   ```
+## Staging/Dev
 
+There is a staging/dev instance running in a different namespace.
diff --git a/Label_Microservice/deployment/base/deployments.yaml b/Label_Microservice/deployment/base/deployments.yaml
@@ -0,0 +1,51 @@
+apiVersion: extensions/v1beta1
+kind: Deployment
+metadata:
+    name: worker
+spec:
+    replicas: 5
+    template:
+      metadata:
+        labels:
+          app: worker          
+      spec:
+        volumes:
+        - name: github-app
+          secret:
+            secretName: github-app
+        containers:
+        - name: app
+          image: gcr.io/issue-label-bot-dev/bot-worker
+          command: 
+            - python3
+            - -m 
+            - label_microservice.worker
+            - subscribe_from_env
+          resources:
+            requests:
+              memory: "4Gi"
+              cpu: "4"                      
+          volumeMounts:
+            - name: github-app
+              mountPath: /var/secrets/github
+          env:
+            - name: PORT
+              value: "80"
+            # This should be the name of the in-cluster K8s service running issue embeddings
+            - name: ISSUE_EMBEDDING_SERVICE
+              value: "http://issue-embedding-server"
+            - name: PROJECT
+              value: issue-label-bot-dev      
+            # The values for the Kubeflow kf-label-bot-dev application
+            # See kubeflow/code-intelligence#84
+            - name: GITHUB_APP_ID
+              value: "50112"         
+            - name: GITHUB_APP_PEM_KEY
+              value: /var/secrets/github/kf-label-bot-dev.private-key.pem
+            # TODO(jlewi):Not needed because we use workload identity
+            #- name: GOOGLE_APPLICATION_CREDENTIALS
+            #  value: /var/secrets/google/user-gcp-sa.json
+        restartPolicy: Always
+        # We need to set a service account corresponding to workload
+        # identity
+        serviceAccountName: default-editor
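
The Deployment above passes all worker settings through environment variables and starts the worker with `subscribe_from_env`. The actual `worker.py` is not part of this diff, so the following is only a sketch of how those five variables might be read and validated. `WorkerConfig` and `load_worker_config` are hypothetical names, and treating every variable as required is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    port: int
    issue_embedding_service: str
    project: str
    github_app_id: str
    github_app_pem_key: str


def load_worker_config(env=None):
    """Build a WorkerConfig from the variables set in deployments.yaml."""
    env = os.environ if env is None else env
    required = [
        "PORT",
        "ISSUE_EMBEDDING_SERVICE",
        "PROJECT",
        "GITHUB_APP_ID",
        "GITHUB_APP_PEM_KEY",
    ]
    missing = [key for key in required if key not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return WorkerConfig(
        port=int(env["PORT"]),
        issue_embedding_service=env["ISSUE_EMBEDDING_SERVICE"],
        project=env["PROJECT"],
        github_app_id=env["GITHUB_APP_ID"],
        github_app_pem_key=env["GITHUB_APP_PEM_KEY"],
    )
```

Failing fast on a missing variable surfaces a misconfigured overlay at container start rather than on the first incoming message.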
diff --git a/Label_Microservice/deployment/base/kustomization.yaml b/Label_Microservice/deployment/base/kustomization.yaml
@@ -0,0 +1,12 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namePrefix: label-bot-
+commonLabels:
+  service: label-bot
+  app: label-bot
+images:
+- name: gcr.io/issue-label-bot-dev/bot-worker
+  newName: gcr.io/issue-label-bot-dev/bot-worker
+resources:
+  - service.yaml
+  - deployments.yaml
diff --git a/Label_Microservice/deployment/base/service.yaml b/Label_Microservice/deployment/base/service.yaml
@@ -0,0 +1,13 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: worker
+spec:
+  selector:
+    app: worker
+  ports:
+  - name: http
+    port: 80
+    protocol: TCP
+    targetPort: 80
+  type: ClusterIP
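
The base `kustomization.yaml` sets `namePrefix: label-bot-` and a `commonLabels` entry with `app: label-bot`, while the Service and Deployment are written with `app: worker`. The sketch below is a simplified model, not kustomize itself, of how those two settings rewrite a resource. It assumes that `commonLabels` overwrite existing labels and selectors with the same key, which is why the selector and pod labels should still match after rendering. The function name is hypothetical.

```python
import copy


def apply_base_kustomization(resource, name_prefix, common_labels):
    """Simplified model of namePrefix and commonLabels on a Service or Deployment."""
    out = copy.deepcopy(resource)
    meta = out.setdefault("metadata", {})
    meta["name"] = name_prefix + meta["name"]
    meta.setdefault("labels", {}).update(common_labels)
    spec = out.setdefault("spec", {})
    if out["kind"] == "Service":
        # The Service selector picks up the common labels.
        spec.setdefault("selector", {}).update(common_labels)
    elif out["kind"] == "Deployment":
        # Both the selector and the pod template labels are rewritten.
        selector = spec.setdefault("selector", {})
        selector.setdefault("matchLabels", {}).update(common_labels)
        template_meta = spec.setdefault("template", {}).setdefault("metadata", {})
        template_meta.setdefault("labels", {}).update(common_labels)
    return out
```

Under this model the Service is rendered as `label-bot-worker` with selector `app: label-bot, service: label-bot`, and the pod template carries the same two labels.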