Conversation

@njhill (Member) commented Feb 24, 2025

Update the custom msgpack encoding/decoding to work with lists of buffers so that the backing data of tensors/numpy arrays contained in messages is sent directly by zmq without copying.

This is a first step; we still need to update the message schemas that we are using to exploit this. In particular we need to add some custom serialization logic for MultiModalKwargs / NestedTensors used for image data since msgpack doesn't work natively with recursive types.

This is also only enabled for the communication between the engine core and front end process so far; we'll also probably want to look at exploiting it between the engine core process and distributed workers in the TP case.

A step beyond this to explore would be to use tensors/ndarrays in shared mem, which can also hopefully be pinned so that large image tensors can be propagated intra-node without sending via zmq, and then can be copied directly to GPU mem.

Tests and benchmarks to follow.
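
To make the shape of the change concrete, here is a minimal sketch of the multi-buffer idea; it is not the actual serial_utils implementation in this PR, and the send_array/recv_array helpers and header layout are made up for illustration. The msgpack payload carries only metadata, while the array's backing buffer travels as its own zmq frame.

import msgspec
import numpy as np
import zmq

def send_array(sock: zmq.Socket, meta: dict, arr: np.ndarray) -> None:
    # Frame 0: small msgpack header with metadata only.
    # Frame 1: the (contiguous) array's backing buffer, handed to zmq without copying.
    header = msgspec.msgpack.encode(
        {"meta": meta, "dtype": str(arr.dtype), "shape": list(arr.shape)})
    sock.send_multipart([header, arr.data], copy=False)

def recv_array(sock: zmq.Socket) -> tuple[dict, np.ndarray]:
    header_frame, data_frame = sock.recv_multipart(copy=False)
    header = msgspec.msgpack.decode(header_frame.buffer)
    arr = np.frombuffer(data_frame.buffer, dtype=header["dtype"])
    return header["meta"], arr.reshape(header["shape"])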

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only fastcheck CI will run, which executes a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Feb 24, 2025
@njhill njhill added the needs-tests Tests needed for this PR label Mar 26, 2025
mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@njhill njhill removed the needs-tests Tests needed for this PR label Apr 8, 2025
@njhill njhill requested a review from russellb April 8, 2025 19:36
@njhill njhill marked this pull request as ready for review April 8, 2025 19:36
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 8, 2025
@russellb russellb requested a review from Copilot April 8, 2025 20:04
Copilot AI left a comment

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

@p88h (Contributor) commented Apr 8, 2025

Looking at this, I have a couple of questions.
While zero-copy sounds nice in theory, Torch does complain about creating non-writable tensors - which may end up in undefined behavior.
While trying to make this work, though, I think there may be a simpler approach - assuming we want to remove pickle, one could potentially wrap the tensors themselves as MsgSpec structs with Raw fields - that implements the zero-copy behavior out of the box, without having to leak multi-buffers out (which becomes rather complex with NestedTensors).
An implementation of that would look sth like this:
https://gist.github.com/p88h/1daec6374c35293f6bced9333d6f2c4c

... but for now, that doesn't work: when deserializing, msgpack complains about malformed data.
It does work with pickle serialization on top rather than msgpack, and it even triggers the 'readonly buffer' behavior mentioned above (accessing the Raw field apparently marks it as read-only).

@njhill (Member, Author) commented Apr 8, 2025

Thanks @p88h

While zero-copy sounds nice in theory, Torch does complain about creating non-writable tensors - which may end up in undefined behavior.

I actually didn't notice such a warning - in what context did you see it? All the buffers we're using here should be writable I think.

While trying to make this work, though, I think there may be a simpler approach - assuming we want to remove pickle, one could potentially wrap the tensors themselves as MsgSpec structs with Raw fields - that implements the zero-copy behavior out of the box, without having to leak multi-buffers out (which becomes rather complex with NestedTensors).
An implementation of that would look sth like this:
https://gist.github.com/p88h/1daec6374c35293f6bced9333d6f2c4c

This wouldn't be zero-copy on the encode side though - the msgspec encode method still produces a single contiguous buffer which must contain copies of all the original tensor data. We want to transmit the tensor data directly from its backing buffer.
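
To illustrate (a hypothetical snippet, not code from the gist or this PR): even if the raw bytes are embedded in the message, msgspec's encode() returns one new contiguous bytes object, so the array data gets copied into it on the encode side.

import msgspec
import numpy as np

arr = np.arange(1 << 20, dtype=np.float32)
# tobytes() copies the array data, and encode() then builds a single new
# contiguous bytes object containing that copy alongside the metadata.
payload = msgspec.msgpack.encode({"shape": list(arr.shape), "data": arr.tobytes()})
assert isinstance(payload, bytes)
assert len(payload) > arr.nbytes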

@DarkLight1337 (Member) commented Apr 9, 2025

In particular we need to add some custom serialization logic for MultiModalKwargs / NestedTensors used for image data since msgpack doesn't work natively with recursive types.

I'm thinking of this "hack" to flatten/unflatten the nested tensors using our existing JSONTree helper functions:

import json

import torch

# json_reduce_leaves / json_map_leaves / JSONTree are our existing JSONTree
# helpers; NestedTensors is the usual nested-tensor alias.

def serialize(tensors: NestedTensors) -> tuple[list[torch.Tensor], str]:
    tensors_flat = json_reduce_leaves(lambda acc, e: acc + [e], tensors, [])

    i_next = 0

    def visit_leaf(leaf: torch.Tensor):
        nonlocal i_next
        i_current = i_next
        i_next += 1
        return i_current

    tensors_idx = json_map_leaves(visit_leaf, tensors)
    assert i_next == len(tensors_flat)

    # msgpack can't serialize JSONTree[int] recursively so we convert it to a string
    tensors_idx_str = json.dumps(tensors_idx)

    return tensors_flat, tensors_idx_str

def deserialize(tensors_flat: list[torch.Tensor], tensors_idx_str: str) -> NestedTensors:
    tensors_idx: JSONTree[int] = json.loads(tensors_idx_str)
    return json_map_leaves(lambda idx: tensors_flat[idx], tensors_idx)
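
A tiny usage sketch of the proposal above (hypothetical; it assumes the two JSONTree helpers visit leaves in the same order, while the assert in serialize only checks the count):

nested = [torch.zeros(2, 3), [torch.ones(4), torch.arange(5)]]
flat, idx_str = serialize(nested)        # three tensors plus a JSON index tree
restored = deserialize(flat, idx_str)    # same nesting, same tensor objects
assert len(flat) == 3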

@p88h (Contributor) commented Apr 9, 2025

In particular we need to add some custom serialization logic for MultiModalKwargs / NestedTensors used for image data since msgpack doesn't work natively with recursive types.

I'm thinking of this "hack" to flatten/unflatten the nested tensors using our existing JSONTree helper functions:

I guess this is similar to what's implemented in PR #16279: you need to serialize a bit more than tensor indexes (that per-tensor state is encapsulated in CustomArray), and that can be represented as JSON, but it can also be represented via a recursive object that's serialized a bit more efficiently with msgpack.

njhill added 2 commits April 9, 2025 09:36
@njhill (Member, Author) commented Apr 9, 2025

Thanks @DarkLight1337, but I think we can avoid resorting to JSON ... the MultiModalKwargs serialization is being discussed in separate PR #16279, which will follow this one.

@ywang96 (Member) commented Apr 10, 2025

A step beyond this to explore would be to use tensors/ndarrays in shared mem

A quick note - this is something I thought about as well for the preprocessing cache, but I was a little concerned about whether it would overcomplicate the deployment workflow.

@p88h (Contributor) commented Apr 10, 2025

@ywang96 managing tensors in shared memory is rather unwieldy - torch doesn't support that natively; the only option is either copying to shm (which is not that different from what this setup is already doing via zmq) or allocating all tensors into pre-allocated buffers / via ndarrays, but that's a massive overcomplication. This performs quite well already (combined with #16279).

@njhill re: the read-only Tensor message - this was only happening with the msgpack raw buffers version. With this implementation, AFAICT zmq copies the buffers into and out of shm anyways (so it's not really zero copy) and it allows read-write access to them.

@njhill (Member, Author) commented Apr 10, 2025

the only option is either copying to shm (which is not that different from what this setup is already doing via zmq)

Actually zmq doesn't use shm ... the IPC transport uses unix domain sockets.

allocating all tensors into pre-allocated buffers / via ndarrays, but that's a massive overcomplication

This is what I was referring to as a later possibility. I agree that it would be additional complexity though, and not worth considering until the currently planned changes are in, which will be a huge improvement.

AFAICT zmq copies the buffers into and out of shm anyways (so it's not really zero copy) and it allows read-write access to them.

It doesn't use shm, but yes, there is still a "copy" in that the data needs to be transferred via the socket from the client buffer to the server buffer. Zero-copy often doesn't mean literally zero, more like minimal/fewer copies :) we are reducing the number of copies here.

The advantage of using shm would be to eliminate this final copy and also the transfer overhead. Something related to explore is receiving directly into pinned buffers on the engine/worker side to avoid another copy during the cpu->gpu transfer.
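
As a rough sketch of that last idea (not part of this PR; recv_to_gpu, the fixed float32 dtype, and the single-frame layout are all hypothetical), the receive side could stage the frame in a pinned tensor so the host-to-GPU copy can run asynchronously:

import numpy as np
import torch
import zmq

def recv_to_gpu(sock: zmq.Socket, shape: tuple[int, ...]) -> torch.Tensor:
    frame = sock.recv(copy=False)                   # zmq frame, no extra copy here
    # from_numpy may warn if the frame buffer is read-only; copy_ only reads from it.
    src = np.frombuffer(frame.buffer, dtype=np.float32).reshape(shape)
    pinned = torch.empty(shape, dtype=torch.float32, pin_memory=True)
    pinned.copy_(torch.from_numpy(src))             # one CPU copy into pinned memory
    return pinned.to("cuda", non_blocking=True)     # async H2D from the pinned buffer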

@russellb (Member) left a comment

This looks great. I added some comments reflecting my understanding. Feel free to take some/all/none of them!

njhill and others added 6 commits April 10, 2025 09:33
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill enabled auto-merge (squash) April 10, 2025 17:03
@njhill njhill merged commit dd143ef into vllm-project:main Apr 10, 2025
43 checks passed
@njhill njhill deleted the tensor-nocopy branch April 10, 2025 19:39
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025