How-To Setup Observability for OPEA Workload in Kubernetes

This guide provides a step-by-step approach to setting up observability for the OPEA workload in a Kubernetes environment. We will cover the setup of Prometheus and Grafana, as well as the collection of metrics for Gaudi hardware, OPEA/chatqna including TGI, TEI-Embedding, TEI-Reranking and other microservices, and PCM.

For monitoring Helm installed OPEA applications, see Helm monitoring option.

Prepare

git clone https://github.com/opea-project/GenAIInfra.git
cd kubernetes-addons/Observability

1. Setup Prometheus & Grafana

Setting up Prometheus and Grafana is essential for monitoring and visualizing your workloads. Follow these steps to get started:

Step 1: Install Prometheus&Grafana

kubectl create ns monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack --version 55.5.1 -n monitoring

Step 2: Verify the installation

kubectl get pods -n monitoring

Step 3: Port-forward to access Grafana

kubectl port-forward service/grafana 3000:80

Step 4: Access Grafana

Open your browser and navigate to http://localhost:3000. Use "admin/prom-operator" as the username and the password to login.

2. Metrics for Gaudi Hardware (v1.16.2)

To monitor Gaudi hardware metrics, you can use the following steps:

Step 1: Install daemonset

kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-daemonset.yaml

Step 2: Install metric-exporter

kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-service.yaml

Step 3: Install service-monitor

kubectl apply -f ./habana/metric-exporter-serviceMonitor.yaml

Step 4: Verify the metrics

# To get the metric endpoints, e.g. to get first endpoint to test
habana_metric_url=`kubectl -n monitoring get ep metric-exporter -o jsonpath="{.subsets[].addresses[0].ip}:{..subsets[].ports[0].port}"`
# Fetch the metrics
curl ${habana_metric_url}/metrics

# you will see the habana metric data  like this:
process_resident_memory_bytes 2.9216768e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71394960963e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.862641152e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 125
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

Step 5: Import the dashboard into Grafana

Manually import the Dashboard-Gaudi-HW.json file into Grafana

3. Metrics for OPEA applications

To monitor OPEA application metrics including TGI-gaudi, TEI, TEI-Reranking and other micro services, you can use the following steps:

Step 1: Install application with Helm

Install Helm (version >= 3.15) first. Refer to the Helm Installation Guide for more information.

Install OPEA application as described in Helm charts README.

For example, to install ChatQnA, follow ChatQnA helm chart for instructions on deploying it to Kubernetes.

Make sure to enable Helm monitoring option.

Step 2: Install dashboards

Here are few Grafana dashboards for monitoring different aspects of OPEA applications:

queue_size_embedding_rerank_tgi.json: queue size of TGI-gaudi, TEI-Embedding, TEI-reranking
tgi_grafana.json: tgi-gaudi text generation inferencing service utilization
opea-scaling.json: scaling, request rates and failures for OPEA application megaservice, TEI-reranking, TEI-embedding, and TGI

You can either:

Import them manually to Grafana,
Use update-dashboards.sh script to add them to Kubernetes as Grafana dashboard configMaps
- (Script assumes Prometheus / Grafana to be installed according to above instructions)
Or create your own dashboards based on them

Note: when dashboard is imported to Grafana, you can directly save changes to it, but those dashboards go away if Grafana is removed / re-installed.

Whereas with dashboard configMaps, Grafana saves changes to a selected file, but you need to remember to re-apply them to Kubernetes / Grafana, for your changes to be there when that dashboard is reloaded.

4. Metrics for PCM (Intel® Performance Counter Monitor)

Step 1: Install PCM

Please refer to this repo to install Intel® PCM

Step 2: Modify & Install pcm-service

modify the pcm/pcm-service.yaml file to set the addresses

kubectl apply -f pcm/pcm-service.yaml

Step 3: Install PCM serviceMonitor

kubectl apply -f pcm/pcm-serviceMonitor.yaml

Step 4: Install the PCM dashboard

manually import the pcm-dashboard.json file into the Grafana

More dashboards

GenAIEval repository includes additional dashboards.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

How-To Setup Observability for OPEA Workload in Kubernetes

Prepare

1. Setup Prometheus & Grafana

Step 1: Install Prometheus&Grafana

Step 2: Verify the installation

Step 3: Port-forward to access Grafana

Step 4: Access Grafana

2. Metrics for Gaudi Hardware (v1.16.2)

Step 1: Install daemonset

Step 2: Install metric-exporter

Step 3: Install service-monitor

Step 4: Verify the metrics

Step 5: Import the dashboard into Grafana

3. Metrics for OPEA applications

Step 1: Install application with Helm

Step 2: Install dashboards

4. Metrics for PCM (Intel® Performance Counter Monitor)

Step 1: Install PCM

Step 2: Modify & Install pcm-service

Step 3: Install PCM serviceMonitor

Step 4: Install the PCM dashboard

More dashboards

Files

README.md

Latest commit

History

README.md

File metadata and controls

How-To Setup Observability for OPEA Workload in Kubernetes

Prepare

1. Setup Prometheus & Grafana

Step 1: Install Prometheus&Grafana

Step 2: Verify the installation

Step 3: Port-forward to access Grafana

Step 4: Access Grafana

2. Metrics for Gaudi Hardware (v1.16.2)

Step 1: Install daemonset

Step 2: Install metric-exporter

Step 3: Install service-monitor

Step 4: Verify the metrics

Step 5: Import the dashboard into Grafana

3. Metrics for OPEA applications

Step 1: Install application with Helm

Step 2: Install dashboards

4. Metrics for PCM (Intel® Performance Counter Monitor)

Step 1: Install PCM

Step 2: Modify & Install pcm-service

Step 3: Install PCM serviceMonitor

Step 4: Install the PCM dashboard

More dashboards