The pipeline uses SparseML to optimize the model; a KServe InferenceService/ServingRuntime then runs the DeepSparse runtime with the optimized model.
Create a namespace for the object store if you don't already have one:

```
oc new-project object-datastore
```

Deploy MinIO:

```
oc apply -f minio.yaml
```
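For reference, a minimal sketch of what such a MinIO manifest could contain is shown below. The image tag, credentials, and storage choice are illustrative placeholders; the repository's `minio.yaml` may differ.

```yaml
# Illustrative sketch only: a single-instance MinIO with ephemeral storage.
# Credentials and image tag are placeholders; see minio.yaml for the real manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: object-datastore
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: quay.io/minio/minio:latest
          args: ["server", "/data", "--console-address", ":9090"]
          env:
            - name: MINIO_ROOT_USER
              value: minio
            - name: MINIO_ROOT_PASSWORD
              value: minio123
          ports:
            - containerPort: 9000   # S3 API
            - containerPort: 9090   # web console
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: object-datastore
spec:
  selector:
    app: minio
  ports:
    - name: api
      port: 9000
      targetPort: 9000
    - name: console
      port: 9090
      targetPort: 9090
```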
And create a couple of buckets, one for the pipeline (e.g., named `mlops`) and one for the models (e.g., named `models`).
Create a pipeline server, pointing to an S3 bucket.
Import the pipeline (`sparseml_pipeline.yaml`) into the pipeline server. This can be generated by running:

```
python openshift-ai/pipeline.py
```
- NOTE: if some of the steps may take longer than one hour, you either need to change the default timeout for taskRuns in OpenShift AI or add a `timeout: Xh` per taskRun, as in the fragment below. You can also see `sparseml_simplified_pipeline.yaml` and search for `timeout: 5h` for an example.
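For illustration, a per-task timeout in the generated Tekton pipeline looks roughly like this; the task and step names and the image are made up for the example.

```yaml
# Illustrative fragment of a generated PipelineRun: the timeout is set per task.
# Task/step names and the image are placeholders.
spec:
  pipelineSpec:
    tasks:
      - name: sparsify
        timeout: 5h
        taskSpec:
          steps:
            - name: main
              image: quay.io/USER/neural-magic:sparseml
```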
Create a cluster storage named `models-shared`, so that a shared volume is created.
Create a data connection named `models`, pointing to the S3 bucket that will store the resulting model.
- NOTE: the cluster storage and the data connection can have any name, as long as the same names are given later in the pipeline parameters.
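If you prefer to create the data connection outside the dashboard, it is stored as a plain S3 secret. The sketch below is only an approximation: the endpoint, credentials, and the dashboard labels/annotations are assumptions and may vary with the OpenShift AI version.

```yaml
# Illustrative sketch of a data connection created as an S3 secret.
# Values and dashboard labels/annotations are placeholders/assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: aws-connection-models
  labels:
    opendatahub.io/dashboard: "true"
  annotations:
    opendatahub.io/connection-type: s3
    openshift.io/display-name: models
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: minio
  AWS_SECRET_ACCESS_KEY: minio123
  AWS_S3_ENDPOINT: http://minio.object-datastore.svc:9000
  AWS_DEFAULT_REGION: us-east-1
  AWS_S3_BUCKET: models
```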
Build the container images for the sparsification and the evaluation steps:

```
podman build -t quay.io/USER/neural-magic:sparseml -f openshift-ai/sparseml_Dockerfile .
podman build -t quay.io/USER/neural-magic:sparseml_eval -f openshift-ai/sparseml_eval_Dockerfile .
podman build -t quay.io/USER/neural-magic:nm_vllm_eval -f openshift-ai/nm_vllm_eval_Dockerfile .
podman build -t quay.io/USER/neural-magic:base_eval -f openshift-ai/base_eval_Dockerfile .
```
And push them to a registry:

```
podman push quay.io/USER/neural-magic:sparseml
podman push quay.io/USER/neural-magic:sparseml_eval
podman push quay.io/USER/neural-magic:nm_vllm_eval
podman push quay.io/USER/neural-magic:base_eval
```
This is the process to create the `PipelineRun` YAML file from the Python script. It requires `kfp_tekton` version 1.5.9:

```
pip install kfp_tekton==1.5.9
python pipeline_simplified.py
```
- NOTE: there is another option for a more complex/flexible pipeline in `pipeline_nmvllm.py`, but the rest of this document assumes the simplified one.
This is the process to create the pipeline YAML file from the Python script. It requires `kfp.kubernetes`:

```
pip install kfp[kubernetes]
python pipeline_v2_cpu.py
python pipeline_v2_gpu.py
```
- NOTE: there are two different pipelines for V2, one for GPU and one for CPU. It would be straightforward to merge them into a single pipeline with a parameter to choose between them.
Run the pipeline, selecting the model and the options:
- Evaluate or not
- GPU (Quantized) or CPU (Sparsified: Quantized + Pruned). Note that for GPU inferencing, combining pruning and quantization is not supported yet.
Run the optimized model with DeepSparse
Build the container with:

```
podman build -t quay.io/USER/neural-magic:deepsparse -f deepsparse_Dockerfile .
```
And push it to a registry:

```
podman push quay.io/USER/neural-magic:deepsparse
```
Note that DeepSparse requires write access to the mounted volume with the model, so as a workaround the model is copied to an extra mount with `ReadOnly` set to `False`.
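The relevant part of the ServingRuntime looks roughly like the sketch below; the volume name, paths, and the exact server command are illustrative, and `serving_runtime_deepsparse.yaml` contains the actual definition.

```yaml
# Illustrative fragment: copy the read-only KServe model mount (/mnt/models)
# into a writable emptyDir before starting the server.
# Names, paths, and the server command are placeholders.
spec:
  containers:
    - name: kserve-container
      image: quay.io/USER/neural-magic:deepsparse
      command: ["/bin/sh", "-c"]
      args:
        - >-
          cp -r /mnt/models/* /tmp/model/ &&
          exec <deepsparse-server-command> /tmp/model
      volumeMounts:
        - name: model-rw
          mountPath: /tmp/model
          readOnly: false
  volumes:
    - name: model-rw
      emptyDir: {}
```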
Apply the ServingRuntime:

```
oc apply -f openshift-ai/serving_runtime_deepsparse.yaml
```
And then, from OpenShift AI, you can deploy a model using it and pointing to the `models` DataConnection.
Create a secret and a Service Account that point to the S3 endpoint (sketched below). Modify them as needed:

```
oc apply -f openshift-ai/secret.yaml
oc apply -f openshift-ai/sa.yaml
oc apply -f openshift-ai/inference.yaml
```
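As a rough sketch, the secret carries KServe's S3 annotations and the Service Account references it. The names, endpoint, region, and credentials below are placeholders; `secret.yaml` and `sa.yaml` are the actual files.

```yaml
# Illustrative sketch only: S3 credentials for KServe plus a Service Account that uses them.
apiVersion: v1
kind: Secret
metadata:
  name: storage-config
  annotations:
    serving.kserve.io/s3-endpoint: minio.object-datastore.svc:9000
    serving.kserve.io/s3-usehttps: "0"
    serving.kserve.io/s3-region: us-east-1
stringData:
  AWS_ACCESS_KEY_ID: minio
  AWS_SECRET_ACCESS_KEY: minio123
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: models-sa
secrets:
  - name: storage-config
```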
Run the optimized model with nm-vLLM
Build the container with:

```
podman build -t quay.io/USER/neural-magic:nm-vllm -f nmvllm_Dockerfile .
```
And push it to a registry:

```
podman push quay.io/USER/neural-magic:nm-vllm
```
Note that, as with DeepSparse, the runtime requires write access to the mounted volume with the model, so the same workaround is used to copy the model to an extra mount with `ReadOnly` set to `False`.

```
oc apply -f openshift-ai/serving_runtime_vllm.yaml
oc apply -f openshift-ai/serving_runtime_vllm_marlin.yaml
```
And then, from OpenShift AI, you can deploy a model using one of them and pointing to the `models` DataConnection. Use one or the other depending on whether you are serving sparsified models or quantized (with `marlin`) models.
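For reference, an InferenceService selecting one of the runtimes could look roughly like this; the runtime name, model format, and storage path are illustrative, and `inference.yaml` in the repository is the actual example.

```yaml
# Illustrative sketch: point the InferenceService at the model in S3 and pick the runtime.
# Runtime name, model format, and storageUri are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm
spec:
  predictor:
    serviceAccountName: models-sa
    model:
      runtime: nm-vllm            # or the marlin variant for quantized models
      modelFormat:
        name: pytorch             # must match a format declared by the ServingRuntime
      storageUri: s3://models/<path-to-optimized-model>
```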
Run `request.py` and access the Gradio server deployed locally at `127.0.0.1:7860`. Update the URL in the script with the one from the deployed runtime (the `ksvc` route):

```
python openshift-ai/request.py
```