@NickLucche NickLucche commented Oct 14, 2025

Follow-up to the work and discussion in PR #22188.
Here we expose the metrics we're currently tracking for the NIXL connector to Prometheus.

cc @markmc

Update:
@markmc generalized the PR to work for KVConnectorStats more broadly

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc left a comment

Hey @NickLucche

I wanted to make sure the per-connector metrics were going to work out. I didn't feel good about adding a bunch of NIXL-specific metrics into vllm/v1/metrics, so I worked it out at NickLucche/pull/4

Couple of small inline comments on buckets too 👍
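
To illustrate the per-connector idea described above, here is a minimal sketch of how connector-specific metrics could be kept out of a shared metrics module. It is not the code from this PR: the class names `ConnectorMetrics`, `NixlLikeMetrics`, and `MetricsDispatcher` are hypothetical, and plain Python lists stand in for Prometheus histograms so the example runs with the standard library only.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ConnectorMetrics(ABC):
    """Hypothetical base class: each KV connector owns its metric definitions."""

    @abstractmethod
    def observe(self, stats: dict) -> None:
        """Record one batch of connector stats."""


class NixlLikeMetrics(ConnectorMetrics):
    """Illustrative stand-in for a connector-specific implementation."""

    def __init__(self) -> None:
        # Plain lists stand in for Prometheus histograms in this sketch.
        self.samples = defaultdict(list)

    def observe(self, stats: dict) -> None:
        for name, values in stats.items():
            self.samples[name].extend(values)


class MetricsDispatcher:
    """Routes stats to the metrics object registered under each connector name."""

    def __init__(self) -> None:
        self._by_connector = {}

    def register(self, connector: str, metrics: ConnectorMetrics) -> None:
        self._by_connector[connector] = metrics

    def observe(self, stats_by_connector: dict) -> None:
        for connector, stats in stats_by_connector.items():
            metrics = self._by_connector.get(connector)
            if metrics is not None:  # connectors without metrics are skipped
                metrics.observe(stats)


# Usage: only the registered connector receives its stats.
nixl = NixlLikeMetrics()
dispatcher = MetricsDispatcher()
dispatcher.register("nixl", nixl)
dispatcher.observe({
    "nixl": {"post_time_seconds": [0.002, 0.004]},
    "unregistered": {"x": [1.0]},
})
```

The point of the shape is that the shared layer only knows about the abstract interface, while the metric names and bucket choices stay with the connector that understands them.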

nixl_histogram_post_time, engine_indexes, model_name
)
# uniform 2kb to 16gb range
buckets = [2**10 + i for i in range(1, 24, 2)]

This comment was marked as resolved.

name="vllm:nixl_post_time_seconds",
documentation="Histogram of transfer post time for NIXL KV"
" Cache transfers.",
buckets=buckets[1:],

This comment was marked as resolved.

[NIXL][Metrics] Add abstraction for per-connector Prometheus metrics
It's post times that need the smaller bucket size, not
transfer duration.

Uniform 2kb to 16gb range:

```
>>> def human_size(bytes, units=[' bytes','KB','MB','GB','TB', 'PB', 'EB']):
...     """ Returns a human readable string representation of bytes """
...     return str(bytes) + units[0] if bytes < 1024 else human_size(bytes>>10, units[1:])
...
>>> [human_size(2**(10+i)) for i in range(1, 25, 2)]
['2KB', '8KB', '32KB', '128KB', '512KB', '2MB', '8MB', '32MB', '128MB', '512MB', '2GB', '8GB']
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
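
For readers following the bucket discussion, the sketch below reproduces the power-of-two bounds from the commit message and shows which bound a given transfer size would fall under. The helper `bucket_upper_bound` is a hypothetical name introduced only for illustration; it mimics the "less than or equal" semantics of Prometheus histogram buckets rather than calling any Prometheus API. For comparison, it also evaluates the additive expression `2**10 + i` quoted in the review snippet earlier in this thread.

```python
from bisect import bisect_left

# Upper bounds in bytes: 2**11 (2 KB) up to 2**33 (8 GB), each 4x the previous.
BYTE_BUCKETS = tuple(2 ** (10 + i) for i in range(1, 25, 2))


def bucket_upper_bound(size_bytes, buckets=BYTE_BUCKETS):
    """Return the smallest bound >= size_bytes, or +inf if none is large enough."""
    idx = bisect_left(buckets, size_bytes)
    return buckets[idx] if idx < len(buckets) else float("inf")


# The additive form stays clustered just above 1 KB instead of spanning KB to GB.
ADDITIVE = [2**10 + i for i in range(1, 24, 2)]

print(BYTE_BUCKETS[0], BYTE_BUCKETS[-1])  # 2048 8589934592
print(ADDITIVE[0], ADDITIVE[-1])          # 1025 1047
print(bucket_upper_bound(3000))           # 8192
```

Note that the largest finite bound produced this way is 8 GB; anything larger would land in the implicit `+Inf` bucket that Prometheus histograms always include.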
NickLucche and others added 3 commits October 24, 2025 17:02
…ctor

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
[KV Connector][Metrics] Add prometheus metrics support to multi-connector
@NickLucche NickLucche changed the title [Nixl] Add metrics to Prometheus-Grafana dashboard [KVConnector] Add metrics to Prometheus-Grafana dashboard Oct 28, 2025
@simon-mo simon-mo left a comment

@markmc reviewed

@simon-mo simon-mo enabled auto-merge (squash) October 28, 2025 17:50
@github-actions github-actions bot added the ready label Oct 28, 2025
@simon-mo simon-mo merged commit accb8fa into vllm-project:main Oct 29, 2025
51 checks passed
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Oct 30, 2025
…ct#26811)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Labels

kv-connector, ready, v1

3 participants