This repository has been archived by the owner on Jan 31, 2022. It is now read-only.

kustomize deployments and skaffolding for Label_Microservice (#93)
* This PR provides a kustomize package to deploy the label microservice

* Also add a skaffolding config for the Label_Microservice

* Remove the old YAML deployment files for the Label Microservice.

* Edit the worker Dockerfile

  * Use TensorFlow 1.15.0 rather than using the "latest" image
  * We can also use a regular TensorFlow image and not a GPU version
    since this is just for inference and so we shouldn't need GPUs

  * Create a new requirements.worker.txt to only include the libraries that
    are needed in the worker. This should be much smaller than the uber
    set of python libraries (e.g. we don't need Jupyter, fairing, etc...)

  * Create requirements.universal_model.txt to contain some of the required
    python dependencies for the universal model.

    * Universal model is using ktext and some other libraries.

* Add a prod overlay for the issue_embedding service.

* create_secrets.py is a helper script for creating the required secrets
  in the clusters based on files in GCS.

Related to #70 ensemble models.
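
The `create_secrets.py` helper described above is not shown in this part of the diff. As a rough sketch only, assuming each secret is a single file in GCS that is copied locally with `gsutil cp` and then loaded with `kubectl create secret generic`, the commands such a helper needs could be assembled as follows. The function name, the bucket path, and the secret and namespace values are hypothetical and chosen for illustration.

```python
import posixpath
from typing import List


def secret_commands(gcs_uri: str, secret_name: str, namespace: str,
                    tmp_dir: str = "/tmp") -> List[List[str]]:
    """Return the argv lists that copy one file from GCS and load it into a secret.

    Hypothetical helper for illustration; it is not the create_secrets.py
    added by this commit.
    """
    # Keep the object's file name so the key inside the secret matches it.
    local_path = posixpath.join(tmp_dir, posixpath.basename(gcs_uri))
    return [
        ["gsutil", "cp", gcs_uri, local_path],
        ["kubectl", "create", "secret", "generic", secret_name,
         "--namespace", namespace, "--from-file", local_path],
    ]


if __name__ == "__main__":
    # Placeholder bucket and file names, used only for this example.
    for argv in secret_commands("gs://example-bucket/github-app.pem",
                                "github-app", "label-bot-dev"):
        print(" ".join(argv))
```

Returning argv lists rather than running the commands keeps the sketch easy to inspect; each list could be passed to `subprocess.run(argv, check=True)` to actually execute it.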
jlewi authored and k8s-ci-robot committed Jan 4, 2020
1 parent 262bfb8 commit 244b6eb
Showing 24 changed files with 524 additions and 227 deletions.
26 changes: 17 additions & 9 deletions Issue_Embeddings/deployment/README.md
@@ -10,22 +10,30 @@ This is currently running on a GKE cluster.
There is a dedicated instance running in

* **GCP project**: issue-label-bot-dev
* **cluster**: github-api-cluster
* **namespace**: issuefeat
* **cluster**: issue-label-bot
* **namespace**: label-bot-prod


See [kubeflow/code-intelligence#70](https://github.com/kubeflow/code-intelligence/issues/70) for a log of how it was set up.

Deploying it

1. Create the deployment
1. Use skaffold to build a new image.

```
kustomize build deployment/overlays/dev | kubectl apply -f -
skaffold build
```

* TODO(jlewi): We should probably define suitable prod and possibly staging environments as well
1. Edit the image

1. You can also follow the [developer_guide.md](../developer_guide.md) to deploy it using skaffold
```
cd deployment/overlays/prod
kustomize edit set image gcr.io/issue-label-bot-dev/issue-embedding=gcr.io/issue-label-bot-dev/issue-embedding:${TAG}@${SHA}
```

1. TODO(jlewi): Add instructions for how to build and update the images; one way to do this would be to use
`skaffold build` followed by `kustomize edit`
1. Create the deployment

* We may need/want to use skaffold profiles to define GCR buckets corresponding to dev, staging, and prod
```
cd Label_Microservice/deployment/overlays/prod
kustomize build | kubectl apply -f -
```
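
The build, pin, and apply steps above can be strung together in a small script. The following is a minimal sketch under assumptions: the tag and digest are read from the `skaffold build` output by hand, and the helper names `image_override` and `deploy_commands` are invented for this example.

```python
from typing import List


def image_override(image: str, tag: str, digest: str) -> str:
    """Format the NAME=NEWNAME:TAG@DIGEST argument for `kustomize edit set image`."""
    return f"{image}={image}:{tag}@{digest}"


def deploy_commands(image: str, tag: str, digest: str) -> List[str]:
    """Shell commands for the three steps, in order.

    `skaffold build` runs where skaffold.yaml lives; the two kustomize
    commands run from the overlay directory.
    """
    return [
        "skaffold build",
        f"kustomize edit set image {image_override(image, tag, digest)}",
        "kustomize build | kubectl apply -f -",
    ]


if __name__ == "__main__":
    # Placeholder tag and digest; substitute the values printed by the build.
    for cmd in deploy_commands("gcr.io/issue-label-bot-dev/issue-embedding",
                               "abc1234", "sha256:0123abcd"):
        print(cmd)
```

Pinning by digest as well as tag means the overlay keeps pointing at the same image even if the tag is later reused.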
7 changes: 5 additions & 2 deletions Issue_Embeddings/deployment/base/deployments.yaml
@@ -21,7 +21,7 @@ spec:
httpGet:
path: /healthz
port: 80
initialDelaySeconds: 10
initialDelaySeconds: 30
periodSeconds: 3
env:
- name: FLASK_ENV
@@ -31,4 +31,7 @@ spec:
- name: authors
value: 'f'
ports:
- containerPort: 80
- containerPort: 80
# We need to set a service account corresponding to workload
# identity
serviceAccountName: default-editor
6 changes: 0 additions & 6 deletions Issue_Embeddings/deployment/overlays/dev/deployments.yaml
@@ -10,12 +10,6 @@ spec:
containers:
- name: app
env:
- name: PROJECT
value: issue-label-bot-dev
- name: ISSUE_EVENT_TOPIC
value: "TEST_event_queue"
- name: ISSUE_EVENT_SUBSCRIPTION
value: "TEST_subscription_for_event_queue"
# Flask environment variables
# TODO: Unfortunately it looks like if we enable debugging we hit the following error
# https://stackoverflow.com/questions/53522052/flask-app-valueerror-signal-only-works-in-main-thread
@@ -4,6 +4,6 @@ bases:
- ../../base
commonLabels:
environment: dev
namespace: jlewi-dev
namespace: label-bot-dev
patchesStrategicMerge:
- deployments.yaml
11 changes: 11 additions & 0 deletions Issue_Embeddings/deployment/overlays/prod/deployments.yaml
@@ -0,0 +1,11 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
spec:
  # Run three replicas in prod
  replicas: 3
  template:
    spec:
      containers:
      - name: app
13 changes: 13 additions & 0 deletions Issue_Embeddings/deployment/overlays/prod/kustomization.yaml
@@ -0,0 +1,13 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
commonLabels:
  environment: prod
namespace: label-bot-prod
patchesStrategicMerge:
- deployments.yaml
resources:
- ../../base
images:
- digest: sha256:292e6af3214b3a3dc499fe08a1873b986b77ba9e201ca57afd9d6736f513fe40
  name: gcr.io/issue-label-bot-dev/issue-embedding
  newName: gcr.io/issue-label-bot-dev/issue-embedding:3191fea
13 changes: 9 additions & 4 deletions Issue_Embeddings/skaffold.yaml
@@ -14,9 +14,11 @@ build:
# TODO(https://github.com/GoogleContainerTools/skaffold/issues/3448): We use manual sync
# because inferred sync doesn't work
#
# TODO(https://github.com/kubeflow/code-intelligence/issues/78): To use skaffold filesync
# TODO(https://github.com/kubeflow/code-intelligence/issues/78):
# To use skaffold filesync
# I think we will need to use a custom program to autorestart the server on file changes
# because we can't run flask in debug mode and rely on its auto-loader.
# because we can't run flask in debug mode and rely on its auto-loader..
# We created autorestart for that so we just have to use it.
#sync:
# manual:
# - src: 'py/code_intelligence/*.py'
@@ -29,14 +31,17 @@ build:
buildContext:
gcsBucket: issue-label-bot-dev_skaffold-kaniko
env:
# TODO(GoogleContainerTools/skaffold#3468) skaffold doesn't
# appear to work with workload identity
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /secret/user-gcp-sa.json
cache: {}
cluster:
# pullSecret can be set to a local file from which the pull secret should be created.
pullSecretName: user-gcp-sa
# TODO(jlewi): This should be changed for each developer; or maybe we should create a reusable one?
namespace: jlewi-dev
# Build in the kaniko namespace because we need to disable ISTIO sidecar injection
# see GoogleContainerTools/skaffold#3442
namespace: kaniko
resources:
requests:
cpu: 8
81 changes: 23 additions & 58 deletions Label_Microservice/deployment/Dockerfile.worker
@@ -1,72 +1,37 @@
# borrowed from hamelsmu/ml-gpu-lite
# Dockerfile for prediction workers
FROM tensorflow/tensorflow:1.15.0-py3

FROM tensorflow/tensorflow:latest-gpu-py3
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

RUN add-apt-repository -y ppa:git-core/ppa
RUN add-apt-repository -y ppa:jonathonf/python-3.6

RUN apt-get update --fix-missing && apt-get install -y wget bzip2 ca-certificates \
build-essential \
byobu \
ca-certificates \
git-core git \
htop \
libglib2.0-0 \
libjpeg-dev \
libpng-dev \
libxext6 \
libsm6 \
libxrender1 \
libcupti-dev \
openssh-server \
python3.6 \
python3.6-dev \
software-properties-common \
vim \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

RUN apt-get -y update

# Setup Python 3.6 (Need for other dependencies)
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1
RUN apt-get install -y python3-setuptools
RUN easy_install pip
RUN pip install --upgrade pip

# Fastai dependencies
RUN pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html

# install python packages
COPY Label_Microservice/requirements.txt .
RUN pip --no-cache-dir install -r requirements.txt

#For Fairseq-py
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

# Open Ports for TensorBoard, Jupyter, and SSH
EXPOSE 6006
EXPOSE 7654
EXPOSE 22

#Setup File System
RUN mkdir ds
ENV HOME=/ds
ENV SHELL=/bin/bash
VOLUME /ds
WORKDIR /ds
COPY Label_Microservice/deployment/requirements.worker.txt .
COPY Label_Microservice/deployment/requirements.universal_model.txt .
RUN pip --no-cache-dir install -r requirements.worker.txt
RUN pip --no-cache-dir install -r requirements.universal_model.txt

# Copy needed files for worker
COPY py/label_microservice/worker.py /ds/worker.py
COPY py /py
ENV PYTHONPATH=/py

# Skaffold hack
# Skaffold infers the files to watch for changes by parsing the dockerfile
# and looking for COPY statements. Skaffold v1.1.0 doesn't appear to detect
# changes to directories so we add explicit COPY statements for the files that
# we want to retrigger skaffold on when they are modified
# TODO(jlewi): Need to try removing this. I think the problem might have been I was out of
# notify resources on my local machine. When I switched skaffold to use --notify=polling
# it started to detect changes.
COPY py/label_microservice/mlp.py /py/label_microservice/mlp.py
COPY py/label_microservice/models.py /py/label_microservice/models.py
COPY py/label_microservice/repo_config.py /py/label_microservice/repo_config.py
COPY py/label_microservice/repo_specific_model.py /py/label_microservice/repo_specific_model.py
COPY py/label_microservice/universal_kind_label_model.py /py/label_microservice/universal_kind_label_model.py
COPY py/label_microservice/worker.py /py/label_microservice/worker.py

# Add helper files
# TODO(jlewi): What is this for?
RUN pip freeze > container_requirements.txt

# Run the shell
# TODO(jlewi): Why is the default command tail?
CMD [ "/bin/bash", "-c", "tail -f /dev/null" ]
35 changes: 12 additions & 23 deletions Label_Microservice/deployment/README.md
@@ -10,43 +10,32 @@ This is currently running on a GKE cluster.
There is a dedicated instance running in

* **GCP project**: issue-label-bot-dev
* **cluster**: workers
* **cluster**: issue-label-bot

See [kubeflow/code-intelligence#70](https://github.com/kubeflow/code-intelligence/issues/70) for a log of how it was set up.

Deploying it

1. Create the deployment
1. Use skaffold to build a new image.

```
kubectl apply -f deployments.yaml
skaffold build
```

1. Create the secret
1. Edit the image

```
gsutil cp gs://github-probots_secrets/ml-app-inference-secret.yaml /tmp
kubectl apply -f /tmp/ml-app-inference-secret.yaml
cd deployment/overlays/prod
kustomize edit set image gcr.io/issue-label-bot-dev/bot-worker=gcr.io/issue-label-bot-dev/bot-worker:${TAG}@${SHA}
```


## Testing

There is a staging cluster running in

* **GCP project**: issue-label-bot-dev
* **cluster**: github-mlapp-test

Deploying it

1. Create the deployment

```
kubectl apply -f deployments-test.yaml
cd Label_Microservice/deployment/overlays/prod
kustomize build | kubectl apply -f -
```

1. Create the secret

```
gsutil cp gs://github-probots_secrets/ml-app-inference-secret-test.yaml /tmp
kubectl apply -f /tmp/ml-app-inference-secret-test.yaml
```
## Staging/Dev

There is a staging/dev instance running in a different namespace
51 changes: 51 additions & 0 deletions Label_Microservice/deployment/base/deployments.yaml
@@ -0,0 +1,51 @@
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: worker
    spec:
      volumes:
      - name: github-app
        secret:
          secretName: github-app
      containers:
      - name: app
        image: gcr.io/issue-label-bot-dev/bot-worker
        command:
        - python3
        - -m
        - label_microservice.worker
        - subscribe_from_env
        resources:
          requests:
            memory: "4Gi"
            cpu: "4"
        volumeMounts:
        - name: github-app
          mountPath: /var/secrets/github
        env:
        - name: PORT
          value: "80"
        # This should be the name of the in-cluster K8s service running issue embeddings
        - name: ISSUE_EMBEDDING_SERVICE
          value: "http://issue-embedding-server"
        - name: PROJECT
          value: issue-label-bot-dev
        # The values for the Kubeflow kf-label-bot-dev application
        # See kubeflow/code-intelligence#84
        - name: GITHUB_APP_ID
          value: "50112"
        - name: GITHUB_APP_PEM_KEY
          value: /var/secrets/github/kf-label-bot-dev.private-key.pem
        # TODO(jlewi):Not needed because we use workload identity
        #- name: GOOGLE_APPLICATION_CREDENTIALS
        #  value: /var/secrets/google/user-gcp-sa.json
      restartPolicy: Always
      # We need to set a service account corresponding to workload
      # identity
      serviceAccountName: default-editor
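
The Deployment above passes the worker its settings through environment variables. The worker's own code is not part of this hunk, so the following is only a sketch of how a process might read those variables and fail fast when one is missing. `WorkerConfig` and `load_config` are hypothetical names; only the variable names come from the manifest.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the `env` section of the Deployment above.
REQUIRED_VARS = ("PORT", "ISSUE_EMBEDDING_SERVICE", "PROJECT",
                 "GITHUB_APP_ID", "GITHUB_APP_PEM_KEY")


@dataclass(frozen=True)
class WorkerConfig:
    """Hypothetical container for the worker's settings; not the real worker code."""
    port: int
    issue_embedding_service: str
    project: str
    github_app_id: str
    github_app_pem_key: str


def load_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Read the settings from `env`, raising KeyError if any variable is missing."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return WorkerConfig(
        port=int(env["PORT"]),
        issue_embedding_service=env["ISSUE_EMBEDDING_SERVICE"],
        project=env["PROJECT"],
        github_app_id=env["GITHUB_APP_ID"],
        github_app_pem_key=env["GITHUB_APP_PEM_KEY"],
    )
```

Accepting the environment as a parameter lets the same function be exercised with a plain dictionary in tests, without touching the real process environment.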
12 changes: 12 additions & 0 deletions Label_Microservice/deployment/base/kustomization.yaml
@@ -0,0 +1,12 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namePrefix: label-bot-
commonLabels:
  service: label-bot
  app: label-bot
images:
- name: gcr.io/issue-label-bot-dev/bot-worker
  newName: gcr.io/issue-label-bot-dev/bot-worker
resources:
- service.yaml
- deployments.yaml
13 changes: 13 additions & 0 deletions Label_Microservice/deployment/base/service.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: Service
metadata:
  name: worker
spec:
  selector:
    app: worker
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  type: ClusterIP