In this repository, we deploy multiple models on EKS. We will configure scale-to-zero, which KServe provides via Knative. We will also deploy an Inference Graph, which lets a request be routed through one or more models based on predefined conditions.
We will progress one step at a time. The steps are:
- choose 5 models from HF and run inference for each model
- test on TorchServe (with CPU)
- setup minikube and test on KServe (with CPU)
- create an Inference Graph on minikube (CPU)
- test one model on GPU (EKS+Knative+scale down to 0)
- create an Inference Graph on EKS (GPU)
The following image classification models have been chosen for inference.
- nateraw/vit-base-patch16-224-cifar10
- Sena/dog
- dima806/man_woman_face_image_detection
- dima806/faces_age_detection
- PriyamSheta/EmotionClassModel
We download these models:

```shell
python download_all.py
```

To test inference on the models, we can run our test script:

```shell
python test_hf_models.py
```

Start the TorchServe docker container:

```shell
docker run -it --rm --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 -p 8085:8085 -v `pwd`:/opt/src pytorch/torchserve:0.9.0-cpu bash
```

We need to create the MAR archives first:
```shell
cd /opt/src/02_torchserve/
python create_mar.py
```

Then we start the TorchServe server:

```shell
torchserve --model-store cifar10/model-store --start --models all --ts-config cifar10/config/config.properties --foreground
```

To see the list of models:

```shell
curl http://localhost:8085/models
```

We can run an inference as follows. Note that for TorchServe, the request body is raw binary (unlike KServe, where it is base64-encoded):
```shell
curl http://localhost:8085/predictions/cifar10 -T img.jpg
# OR
python test_ts.py
```

We will upload the model store to S3 for the next steps:
```shell
rm -rf model-store/logs
aws s3 cp --recursive model-store s3://emlo-s22/kserve-mar/
```

See the setup steps here
We will create an IAM user with read-only access to the bucket specified in step 2, then copy the access key and secret access key into the YAML and apply it.
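The manifest presumably follows KServe's standard S3 credential convention: a Secret annotated for S3 plus a ServiceAccount that references it. The names and region below are assumptions, not copied from the repo:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-secret            # assumed name
  annotations:
    serving.kserve.io/s3-endpoint: s3.amazonaws.com
    serving.kserve.io/s3-usehttps: "1"
    serving.kserve.io/s3-region: us-east-1   # assumed region
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <access-key>
  AWS_SECRET_ACCESS_KEY: <secret-access-key>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-sa                # assumed name, referenced by the InferenceServices
secrets:
  - name: s3-secret
```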
```shell
kubectl apply -f s3_sa.yaml
```

Then deploy all the models:
```shell
kubectl apply -f all-classifiers_cpu.yaml
```

To enable public access to our cluster:
```shell
minikube tunnel --bind-address 0.0.0.0
```

Test using Postman or the Python script.
For KServe, the request is base64-encoded (unlike TorchServe, where it is raw binary).
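A sketch of the difference, using only the standard library. The endpoint path and JSON shape are assumed from TorchServe's REST API and KServe's v1 protocol, not taken from the repo's scripts:

```python
import base64
import json
import urllib.request

def torchserve_request(host: str, model: str, image_bytes: bytes) -> urllib.request.Request:
    """TorchServe: the image travels as the raw request body."""
    return urllib.request.Request(
        url=f"http://{host}/predictions/{model}",
        data=image_bytes,  # raw binary, no encoding
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )

def kserve_payload(image_bytes: bytes) -> str:
    """KServe v1 protocol: the image is base64-encoded inside a JSON body."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"instances": [{"data": b64}]})

img = b"<jpeg bytes>"  # in practice: open("img.jpg", "rb").read()
req = torchserve_request("localhost:8085", "cifar10", img)
payload = kserve_payload(img)
# urllib.request.urlopen(req) would send the TorchServe request; the KServe
# payload would be POSTed to /v1/models/cifar10:predict on the ingress.
```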
Then we will deploy the inference graph, which looks as follows.
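A KServe InferenceGraph manifest has roughly the following shape; the node layout and service names here are hypothetical, and the actual routing and conditions live in inf_graph.yaml:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: classifier-graph     # hypothetical name
spec:
  nodes:
    root:
      routerType: Sequence   # Switch, Splitter, and Ensemble routers also exist
      steps:
        - serviceName: cifar10   # hypothetical step
          name: cifar10
```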
```shell
kubectl apply -f inf_graph.yaml
```

Test using Postman or the Python script:
```shell
python test_ks_ig.py
```

We can also run the GPU deployments on minikube running on a g4dn instance via the following steps:
```shell
sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
minikube start --driver docker --container-runtime docker --gpus all
```

See the setup steps here
Let's see how many GPUs we have:

```shell
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, gpu_capacity: {key: "nvidia.com/gpu", value: .status.capacity."nvidia.com/gpu"}}'
```
Only 1! Too bad...:worried:
Let's enable time-slicing to simulate multiple GPUs.
```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
```

Find the name of the GPU node and apply a label to it:

```shell
kubectl label node {} eks-node=gpu
```

Install the NVIDIA device plugin:

```shell
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace kube-system \
    -f nvdp-values.yaml \
    --version 0.14.0
```

Apply the time-slicing config and upgrade the plugin to pick it up:

```shell
kubectl apply -f slice.yaml

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace kube-system \
    -f nvdp-values.yaml \
    --version 0.14.0 \
    --set config.name=nvidia-device-plugin \
    --force
```

We can see that the node now has 16 GPUs 😉.
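For reference, slice.yaml presumably wraps the NVIDIA device plugin's time-slicing config in a ConfigMap whose name matches `--set config.name` above; `replicas: 16` would explain the 16 GPUs reported. The data key name is an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin   # must match config.name in the helm command
  namespace: kube-system
data:
  config.yaml: |-              # key name assumed
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 16       # one physical GPU advertised as 16
```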
```shell
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, gpu_capacity: {key: "nvidia.com/gpu", value: .status.capacity."nvidia.com/gpu"}}'
```

Then deploy any model:

```shell
kubectl apply -f cifar10_gpu.yaml
```

We should see no inference service pod (or at most one that is created briefly and then terminated because there is no traffic). Then we perform load testing using Locust; pods get created and are terminated again once the test ends.
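Scale-to-zero comes from the predictor's `minReplicas`; cifar10_gpu.yaml likely resembles the following sketch (storage path taken from the S3 upload step above, service-account name assumed):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: cifar10
spec:
  predictor:
    minReplicas: 0              # lets Knative scale the pod to zero when idle
    serviceAccountName: s3-sa   # assumed name from the S3 credential setup
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://emlo-s22/kserve-mar/cifar10
      resources:
        limits:
          nvidia.com/gpu: "1"   # one (time-sliced) GPU per pod
```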
Pods: (time-jumped screenshots)

Locust: (time-jumped screenshots)

Deploy all classifiers:

```shell
kubectl apply -f all-classifiers_gpu.yaml
```

Then deploy the inference graph:

```shell
kubectl apply -f inf_graph.yaml
```

Finally, to test the Inference Graph, run:

```shell
python test_ks_ig.py
```

