Conversation

@njhill (Member) commented Feb 24, 2025

Update the custom msgpack encoding/decoding to work with lists of buffers so that the backing data of tensors/numpy arrays contained in messages is sent directly by zmq without copying.

This is a first step; we still need to update the message schemas that we are using to exploit this. In particular we need to add some custom serialization logic for MultiModalKwargs / NestedTensors used for image data since msgpack doesn't work natively with recursive types.

This is also only enabled for the communication between the engine core and front end process so far; we'll also probably want to look at exploiting it between the engine core process and distributed workers in the TP case.

A step beyond this to explore would be to use tensors/ndarrays in shared mem, which can also hopefully be pinned so that large image tensors can be propagated intra-node without sending via zmq, and then can be copied directly to GPU mem.

Tests and benchmarks to follow.
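
To make the shape of the change concrete, here is a minimal sketch of the multi-buffer idea; it is not the actual serial_utils implementation in this PR, and the send_array/recv_array helpers and header layout are made up for illustration. The msgpack payload carries only metadata, while the array's backing buffer travels as its own zmq frame.

import msgspec
import numpy as np
import zmq

def send_array(sock: zmq.Socket, meta: dict, arr: np.ndarray) -> None:
    # Frame 0: small msgpack header with metadata only.
    # Frame 1: the (contiguous) array's backing buffer, handed to zmq without copying.
    header = msgspec.msgpack.encode(
        {"meta": meta, "dtype": str(arr.dtype), "shape": list(arr.shape)})
    sock.send_multipart([header, arr.data], copy=False)

def recv_array(sock: zmq.Socket) -> tuple[dict, np.ndarray]:
    header_frame, data_frame = sock.recv_multipart(copy=False)
    header = msgspec.msgpack.decode(header_frame.buffer)
    arr = np.frombuffer(data_frame.buffer, dtype=header["dtype"])
    return header["meta"], arr.reshape(header["shape"])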

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only fastcheck CI will run, which executes a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Feb 24, 2025
@njhill njhill added the needs-tests Tests needed for this PR label Mar 26, 2025
mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@njhill njhill removed the needs-tests Tests needed for this PR label Apr 8, 2025
@njhill njhill requested a review from russellb April 8, 2025 19:36
@njhill njhill marked this pull request as ready for review April 8, 2025 19:36
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 8, 2025
@russellb russellb requested a review from Copilot April 8, 2025 20:04
Copilot AI left a comment

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

@p88h (Contributor) commented Apr 8, 2025

Looking at this, I have a couple of questions.
While zero-copy sounds nice in theory, Torch does complain about creating non-writable tensors - which may end up in undefined behavior.
While trying to make this work, though, I think there may be a simpler approach - assuming we want to remove pickle, one could potentially wrap the tensors themselves as MsgSpec structs with Raw fields - that implements the zero-copy behavior out of the box, without having to leak multi-buffers out (which becomes rather complex with NestedTensors).
An implementation of that would look sth like this:
https://gist.github.com/p88h/1daec6374c35293f6bced9333d6f2c4c

... but for now, that doesn't work: when deserializing, msgpack complains about malformed data.
It does work with pickle serialization on top rather than msgpack, and it even triggers the 'readonly buffer' behavior mentioned above (accessing the Raw field apparently marks it as read-only).

@njhill (Member, Author) commented Apr 8, 2025

Thanks @p88h

While zero-copy sounds nice in theory, Torch does complain about creating non-writable tensors - which may end up in undefined behavior.

I actually didn't notice such a warning - in what context did you see it? All the buffers we're using here should be writable I think.

While trying to make this work, though, I think there may be a simpler approach - assuming we want to remove pickle, one could potentially wrap the tensors themselves as MsgSpec structs with Raw fields - that implements the zero-copy behavior out of the box, without having to leak multi-buffers out (which becomes rather complex with NestedTensors).
An implementation of that would look sth like this:
https://gist.github.com/p88h/1daec6374c35293f6bced9333d6f2c4c

This wouldn't be zero-copy on the encode side though - the msgspec encode method still produces a single contiguous buffer which must contain copies of all the original tensor data. We want to transmit the tensor data directly from its backing buffer.
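
To illustrate (a hypothetical snippet, not code from the gist or this PR): even if the raw bytes are embedded in the message, msgspec's encode() returns one new contiguous bytes object, so the array data gets copied into it on the encode side.

import msgspec
import numpy as np

arr = np.arange(1 << 20, dtype=np.float32)
# tobytes() copies the array data, and encode() then builds a single new
# contiguous bytes object containing that copy alongside the metadata.
payload = msgspec.msgpack.encode({"shape": list(arr.shape), "data": arr.tobytes()})
assert isinstance(payload, bytes)
assert len(payload) > arr.nbytes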

@DarkLight1337 (Member) commented Apr 9, 2025

In particular we need to add some custom serialization logic for MultiModalKwargs / NestedTensors used for image data since msgpack doesn't work natively with recursive types.

I'm thinking of this "hack" to flatten/unflatten the nested tensors using our existing JSONTree helper functions:

import json

import torch

# json_reduce_leaves / json_map_leaves / JSONTree are our existing JSONTree
# helpers; NestedTensors is the usual nested-tensor alias.

def serialize(tensors: NestedTensors) -> tuple[list[torch.Tensor], str]:
    tensors_flat = json_reduce_leaves(lambda acc, e: acc + [e], tensors, [])

    i_next = 0

    def visit_leaf(leaf: torch.Tensor):
        nonlocal i_next
        i_current = i_next
        i_next += 1
        return i_current

    tensors_idx = json_map_leaves(visit_leaf, tensors)
    assert i_next == len(tensors_flat)

    # msgpack can't serialize JSONTree[int] recursively so we convert it to a string
    tensors_idx_str = json.dumps(tensors_idx)

    return tensors_flat, tensors_idx_str

def deserialize(tensors_flat: list[torch.Tensor], tensors_idx_str: str) -> NestedTensors:
    tensors_idx: JSONTree[int] = json.loads(tensors_idx_str)
    return json_map_leaves(lambda idx: tensors_flat[idx], tensors_idx)
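
A tiny usage sketch of the proposal above (hypothetical; it assumes the two JSONTree helpers visit leaves in the same order, while the assert in serialize only checks the count):

nested = [torch.zeros(2, 3), [torch.ones(4), torch.arange(5)]]
flat, idx_str = serialize(nested)        # three tensors plus a JSON index tree
restored = deserialize(flat, idx_str)    # same nesting, same tensor objects
assert len(flat) == 3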

@p88h (Contributor) commented Apr 9, 2025

In particular we need to add some custom serialization logic for MultiModalKwargs / NestedTensors used for image data since msgpack doesn't work natively with recursive types.

I'm thinking of this "hack" to flatten/unflatten the nested tensors using our existing JSONTree helper functions:

I guess this is similar to what's implemented in PR #16279: you need to serialize a bit more than tensor indexes (that per-tensor state is encapsulated in CustomArray), and that can be represented as JSON, but it can also be represented via a recursive object that's serialized a bit more efficiently with msgpack.

njhill added 2 commits April 9, 2025 09:36
@njhill (Member, Author) commented Apr 9, 2025

Thanks @DarkLight1337, but I think we can avoid resorting to JSON ... the MultiModalKwargs serialization is being discussed in separate PR #16279, which will follow this one.

@ywang96 (Member) commented Apr 10, 2025

A step beyond this to explore would be to use tensors/ndarrays in shared mem

A quick note - this is something I thought about as well for the preprocessing cache, but I was a little concerned about whether it would overcomplicate the deployment workflow.

@p88h (Contributor) commented Apr 10, 2025

@ywang96 managing tensors in shared memory is rather unwieldy - torch doesn't support that natively; the only option is either copying to shm (which is not that different from what this setup is already doing via zmq) or allocating all tensors into pre-allocated buffers / via ndarrays, but that's a massive overcomplication. This performs quite well already (combined with #16279).

@njhill re: the read-only Tensor message - this was only happening with the msgpack raw buffers version. With this implementation, AFAICT zmq copies the buffers into and out of shm anyways (so it's not really zero copy) and it allows read-write access to them.

@njhill (Member, Author) commented Apr 10, 2025

the only option is either copying to shm (which is not that different from what this setup is already doing via zmq)

Actually zmq doesn't use shm ... the IPC transport uses unix domain sockets.

allocating all tensors into pre-allocated buffers / via ndarrays, but that's a massive overcomplication

This is what I was referring to as a later possibility. I agree that it would be additional complexity though, and not worth considering until the currently planned changes are in, which will be a huge improvement.

AFAICT zmq copies the buffers into and out of shm anyways (so it's not really zero copy) and it allows read-write access to them.

It doesn't use shm, but yes, there is still a "copy" in that the data needs to be transferred via the socket from the client buffer to the server buffer. Zero-copy often doesn't mean literally zero, more like minimal/fewer copies :) we are reducing the number of copies here.

The advantage of using shm would be to eliminate this final copy and also the transfer overhead. Something related to explore is receiving directly into pinned buffers on the engine/worker side to avoid another copy during the cpu->gpu transfer.
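
As a rough sketch of that last idea (not part of this PR; recv_to_gpu, the fixed float32 dtype, and the single-frame layout are all hypothetical), the receive side could stage the frame in a pinned tensor so the host-to-GPU copy can run asynchronously:

import numpy as np
import torch
import zmq

def recv_to_gpu(sock: zmq.Socket, shape: tuple[int, ...]) -> torch.Tensor:
    frame = sock.recv(copy=False)                   # zmq frame, no extra copy here
    # from_numpy may warn if the frame buffer is read-only; copy_ only reads from it.
    src = np.frombuffer(frame.buffer, dtype=np.float32).reshape(shape)
    pinned = torch.empty(shape, dtype=torch.float32, pin_memory=True)
    pinned.copy_(torch.from_numpy(src))             # one CPU copy into pinned memory
    return pinned.to("cuda", non_blocking=True)     # async H2D from the pinned buffer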

@russellb (Member) left a comment

This looks great. I added some comments reflecting my understanding. Feel free to take some/all/none of them!

njhill and others added 6 commits April 10, 2025 09:33
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill enabled auto-merge (squash) April 10, 2025 17:03
@njhill njhill merged commit dd143ef into vllm-project:main Apr 10, 2025
43 checks passed
@njhill njhill deleted the tensor-nocopy branch April 10, 2025 19:39
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025