Add Grafana dashboard for monitoring OPEA application scaling in k8s #541

eero-t · 2024-11-08T18:02:28Z

Description

Adds a Grafana dashboard for monitoring OPEA application scaling:

How many pods of the application and of its TGI + TEI services are created, ready and in use
How many requests they are processing (min and max across all replicas)
How many failures they are reporting (sum across replicas)

Also adds a helper script for installing the dashboards as k8s ConfigMaps for Grafana.
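As a rough illustration of what such a helper has to produce, the sketch below wraps a dashboard in a ConfigMap manifest. It is not the script from this PR: the function name, the ConfigMap name and namespace, and the `grafana_dashboard: "1"` label (the label the Grafana dashboard sidecar is commonly configured to watch) are assumptions that depend on how Grafana was installed.

```python
import json


def dashboard_configmap(name, namespace, dashboard, label="grafana_dashboard"):
    """Wrap a Grafana dashboard (dict) in a k8s ConfigMap manifest (dict)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: the Grafana sidecar picks up ConfigMaps with this label
            "labels": {label: "1"},
        },
        # The dashboard JSON is stored as a single string value
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


if __name__ == "__main__":
    # Hypothetical placeholder dashboard, not the one added by this PR
    demo = {"title": "OPEA scaling (demo)", "panels": []}
    manifest = dashboard_configmap("opea-scaling", "monitoring", demo)
    # kubectl accepts JSON manifests, so this can be piped to `kubectl apply -f -`
    print(json.dumps(manifest, indent=2))
```

Re-applying the same manifest updates the ConfigMap in place, which is how a create/update helper can stay idempotent.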

Unlike the earlier ChatQnA dashboard, this one handles multiple OPEA applications that have the same name but run in separate namespaces. The user selects a namespace and then an OPEA application from it. If the cluster has only one application running, the dashboard defaults to that.

(Therefore it does not make sense to install the dashboard with application-specific Helm charts, as it can cover all apps that use TGI for LLM, i.e. most of them.)
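The namespace-then-application selection could be expressed with two chained Grafana template variables. The following is a minimal sketch, not the dashboard JSON from this PR: the variable names, the `service` label and the use of `tgi_queue_size` as the lookup metric are assumptions for illustration. `label_values()` is the Grafana Prometheus data source function for listing label values.

```python
def chained_variables(metric="tgi_queue_size"):
    """Return a Grafana "templating" section with two dependent variables."""
    return {
        "list": [
            {
                # First variable: every namespace that exposes the metric
                "name": "namespace",
                "type": "query",
                "query": f"label_values({metric}, namespace)",
            },
            {
                # Second variable: only services within the chosen namespace
                "name": "service",
                "type": "query",
                "query": f'label_values({metric}{{namespace="$namespace"}}, service)',
            },
        ]
    }


if __name__ == "__main__":
    for variable in chained_variables()["list"]:
        print(variable["name"], "<-", variable["query"])
```

Because the second query references `$namespace`, two applications with the same name in different namespaces never show up as one entry.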

Issues

n/a.

Type of change

New feature (non-breaking change which adds new functionality)

Dependencies

n/a.

Tests

Manually tested that the script and the dashboard work.

eero-t · 2024-11-08T18:06:55Z

Currently the dashboard relies on the HTTP inprogress metric for how many pending requests the application has: opea-project/GenAIComps#845

But depending on whether the following PR is merged for v1.1, that particular metric may need to be changed before v1.1: opea-project/GenAIComps#864

eero-t · 2024-11-08T18:16:44Z

I can also add a blurb about this to the README, but scaling is currently a bit of a corner case, so IMHO it could also come in the next release.

A larger question about the Observability README, and the things it refers to, is what to do with the chatqna/ sub-directory content here, now that the Helm charts have more generic monitoring support for OPEA applications.

Regarding the dashboards under that:

queue_size_embedding_rerank_tgi.json: some of its queries have no selectors, some use a service selector
tgi_grafana.json: queries use a container selector (container="$service")

I.e. neither properly handles the case where the cluster is running multiple OPEA applications with TGI instances. The new dashboard covers the first one to some extent. The TGI details dashboard could be updated to use selectors similar to those in this new dashboard.
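To illustrate the difference, a query scoped by both namespace and service cannot mix series from two applications that share a name, whereas an unscoped query matches every instance in the cluster. A minimal Python sketch that renders such PromQL strings follows. The label names (`namespace`, `service`) depend on how Prometheus relabels the scraped targets and are assumptions here, as is the choice of `tgi_queue_size` as the example metric.

```python
def selector(**labels):
    """Render a PromQL label selector, e.g. {namespace="a",service="b"}."""
    inner = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + inner + "}" if inner else ""


def scoped_query(agg, metric, **labels):
    """Aggregate a metric across the replicas matched by the given labels."""
    return f"{agg}({metric}{selector(**labels)})"


if __name__ == "__main__":
    # Grafana substitutes these variables before sending the query
    scope = {"namespace": "$namespace", "service": "$service"}
    for agg in ("min", "max"):
        print(scoped_query(agg, "tgi_queue_size", **scope))
    # No labels: matches every TGI instance in the cluster
    print(scoped_query("max", "tgi_queue_size"))
```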

eero-t · 2024-11-12T15:45:14Z

FYI: I'm going to change the dashboard "Failures" heading to "Incomplete requests". I do not think half of the TGI requests are failures; rather, the frontend needs to request the rest of the reply with another query before TGI deems the request "complete" (successful).
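One way to read the renamed panel: if its value is derived as requests started minus requests completed successfully, summed across replicas, then a request whose reply is still being fetched counts as incomplete rather than failed. The sketch below shows only that arithmetic. How the dashboard actually computes the value is an assumption here, and the counter readings are made-up illustration data, not measurements.

```python
def incomplete_requests(replicas):
    """Sum (started - succeeded) over replicas.

    Each item is a (started, succeeded) pair of counter readings.
    """
    return sum(started - succeeded for started, succeeded in replicas)


if __name__ == "__main__":
    # Hypothetical counter readings for three replicas
    sample = [(10, 5), (8, 4), (6, 3)]
    print(incomplete_requests(sample))
```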

eero-t · 2024-11-12T16:56:38Z

Dashboard changes:

Changed the HTTP metric to the one that will come with GenAIComps#864 (Replace HTTP "inprogress" gauge with megaservice "request_pending" one)
Lowered the scaling & failures rows so that the dashboard fits better on screen
"Failures" -> "Incomplete requests" header change

eero-t requested a review from daisy-ycguo as a code owner November 8, 2024 18:02

poussa added this to the v1.1 milestone Nov 8, 2024

poussa requested review from poussa and lianhao and removed request for daisy-ycguo November 12, 2024 15:19

jfding approved these changes Nov 12, 2024

eero-t added 2 commits November 12, 2024 18:49

Add helper script to create/update Grafana dashboards

a093caf

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>

Add Grafana dashboard for OPEA application scaling metrics

db4407b

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>

eero-t force-pushed the grafana branch from f3e67fa to db4407b November 12, 2024 16:52

[pre-commit.ci] auto fixes from pre-commit.com hooks

bf7e0b3

for more information, see https://pre-commit.ci

lianhao approved these changes Nov 13, 2024

poussa approved these changes Nov 13, 2024

poussa merged commit 691bbc5 into opea-project:main Nov 13, 2024
6 checks passed

eero-t mentioned this pull request Nov 13, 2024

Update observability README + fix typos #556

Merged

1 task

eero-t deleted the grafana branch November 14, 2024 13:13
