[V1] Zero-copy tensor/ndarray serialization/transmission #13790
Conversation
Update the custom msgpack encoding/decoding to work with lists of buffers so that the backing data of tensors/numpy arrays contained in messages is sent directly by zmq without copying. Signed-off-by: Nick Hill <nhill@redhat.com>
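As an illustration of the approach (a minimal sketch with invented names, not the PR's actual serial_utils code): an msgspec `enc_hook` replaces each tensor/ndarray with a small metadata placeholder and stashes its backing buffer in a side list, and the header plus buffers are then sent as separate zmq frames with `copy=False`.

```python
# Hypothetical sketch only - names and message layout are illustrative.
import msgspec
import numpy as np
import torch
import zmq


class ZeroCopyEncoder:
    def __init__(self) -> None:
        self.aux_buffers: list = []
        self._encoder = msgspec.msgpack.Encoder(enc_hook=self._enc_hook)

    def _enc_hook(self, obj):
        # called by msgspec for types it doesn't natively support
        if isinstance(obj, torch.Tensor):
            obj = obj.contiguous().numpy()  # zero-copy view for CPU tensors
        if isinstance(obj, np.ndarray):
            index = len(self.aux_buffers)
            self.aux_buffers.append(obj.data)  # stash the raw backing buffer
            # only small metadata plus the buffer index goes into the payload
            return {"__nd__": index, "dtype": str(obj.dtype),
                    "shape": list(obj.shape)}
        raise NotImplementedError(f"unsupported type: {type(obj)}")

    def encode(self, msg) -> list:
        self.aux_buffers = []
        header = self._encoder.encode(msg)
        # header frame first, then one frame per tensor/ndarray buffer
        return [header, *self.aux_buffers]


def send_zero_copy(socket: zmq.Socket, msg) -> None:
    # copy=False lets zmq transmit directly from each buffer's memory
    socket.send_multipart(ZeroCopyEncoder().encode(msg), copy=False)
```

The decode side would do the inverse, rebuilding each array with np.frombuffer over the corresponding received frame.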
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
# Conflicts:
#   vllm/v1/engine/core_client.py
…ocopy Signed-off-by: Nick Hill <nhill@redhat.com>
This pull request has merge conflicts that must be resolved before it can be merged.
…ocopy Signed-off-by: Nick Hill <nhill@redhat.com>
# Conflicts:
#   vllm/v1/engine/core.py
#   vllm/v1/engine/core_client.py
#   vllm/v1/serial_utils.py
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
Looking at this, I have a couple of questions perhaps. ... but for now, that doesn't work. When deserializing, msgpack complains about malformed data.
Thanks @p88h
I actually didn't notice such a warning - in what context did you see it? All the buffers we're using here should be writable, I think.
This wouldn't be zero-copy on the encode side though - the msgspec encode method still produces a single contiguous buffer, which must contain copies of all the original tensor data. We want to transmit the tensor data directly from its backing buffer.
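To illustrate the distinction (hypothetical, simplified code; `encoder` stands for any plain msgspec msgpack encoder):

```python
import msgspec
import numpy as np
import zmq

encoder = msgspec.msgpack.Encoder()


def send_with_copy(socket: zmq.Socket, arr: np.ndarray) -> None:
    # arr.tobytes() copies the data, and it gets copied again into the
    # single contiguous msgpack payload produced by encode()
    payload = encoder.encode({"dtype": str(arr.dtype),
                              "shape": list(arr.shape),
                              "data": arr.tobytes()})
    socket.send(payload)


def send_zero_copy(socket: zmq.Socket, arr: np.ndarray) -> None:
    # only the small metadata header is msgpack-encoded; the array's backing
    # buffer is handed to zmq as-is and transmitted directly from its memory
    header = encoder.encode({"dtype": str(arr.dtype), "shape": list(arr.shape)})
    socket.send_multipart([header, arr.data], copy=False)
```

In the second case the msgpack header stays tiny regardless of tensor size, which is the point being made above.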
Signed-off-by: Nick Hill <nhill@redhat.com>
I'm thinking of this "hack" to flatten/unflatten the nested tensors using our existing JSONTree helpers:

def serialize(tensors: NestedTensors) -> tuple[list[torch.Tensor], str]:
tensors_flat = json_reduce_leaves(lambda acc, e: acc + [e], tensors, [])
i_next = 0
def visit_leaf(leaf: torch.Tensor):
nonlocal i_next
i_current = i_next
i_next += 1
return i_current
tensors_idx = json_map_leaves(visit_leaf, tensors)
assert i_next == len(tensors_flat)
# msgpack can't serialize JSONTree[int] recursively so we convert it to a string
tensors_idx_str = json.dumps(tensors_idx)
return tensors_flat, tensors_idx_str
def deserialize(tensors_flat: list[torch.Tensor], tensors_idx_str: str) -> NestedTensors:
tensors_idx: JSONTree[int] = json.loads(tensors_idx_str)
return json_map_leaves(lambda idx: tensors_flat[idx], tensors_idx)
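For illustration, a round trip with the sketch above could look like this (assuming the JSONTree helpers traverse leaves in the same depth-first order in both functions):

```python
nested = [torch.zeros(2), [torch.ones(3), torch.arange(4)]]

flat, idx_str = serialize(nested)      # flat has 3 tensors; idx_str == "[0, [1, 2]]"
restored = deserialize(flat, idx_str)  # same nesting, leaves looked up by index

assert torch.equal(restored[1][1], nested[1][1])
```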
I guess this is similar to what's implemented in this PR: #16279 - you need to serialize a bit more than tensor indexes (that per-tensor state is encapsulated in CustomArray), and that can be represented as JSON, but it can also be represented via a recursive object that's serialized a bit more efficiently with msgpack.
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Thanks @DarkLight1337, but I think we can avoid resorting to JSON ... discussing the MultiModalKwargs serialization in a separate PR #16279 to follow this one.
A quick note - this is something I thought about as well for the preprocessing cache, but I was a little concerned about whether this would overcomplicate the deployment workflow.
@ywang96 managing tensors in shared memory is rather unwieldy - torch doesn't support that natively, so the only options are either copying to shm (which is not that different from what this setup is already doing via zmq) or allocating all tensors into pre-allocated buffers / via ndarrays, but that's a massive overcomplication. This performs quite well already (combined with #16279).

@njhill re: the read-only Tensor message - this was only happening with the msgpack raw-buffers version. With this implementation, AFAICT zmq copies the buffers into and out of shm anyway (so it's not really zero copy), and it allows read-write access to them.
Actually zmq doesn't use shm ... the IPC transport uses unix domain sockets.
This is what I was referring to as a later possibility. I agree that it would be additional complexity though, and not worth considering until the current planned changes are in, which will be a huge improvement.
It doesn't use shm, but yes, there is still a "copy" in that the data needs to be transferred via the socket from the client buffer to the server buffer. Zero-copy often doesn't mean literally zero - more like minimal/fewer copies :) we are reducing the number of copies here. The advantage of using shm would be to eliminate this final copy and also the transfer overhead. Something related to explore is receiving directly into pinned buffers on the engine/worker side to avoid another copy during the cpu->gpu transfer.
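A rough sketch of that pinned-buffer idea (not something this PR implements; this version still stages through one explicit copy - receiving straight into the pinned buffer would remove that copy as well):

```python
# Hypothetical sketch: land a received tensor frame in pinned (page-locked)
# host memory so the CPU->GPU copy can run asynchronously.
import numpy as np
import torch
import zmq


def recv_image_tensor(socket: zmq.Socket, shape: tuple[int, ...],
                      device: str = "cuda") -> torch.Tensor:
    # in practice dtype/shape would come from the decoded message header;
    # float16 is hardcoded here just to keep the sketch short
    frame = socket.recv(copy=False)          # no intermediate bytes object
    arr = np.frombuffer(frame.buffer, dtype=np.float16).reshape(shape)

    # stage into pinned host memory so the host->device copy below can
    # proceed asynchronously with respect to the CPU
    pinned = torch.empty(shape, dtype=torch.float16, pin_memory=True)
    pinned.numpy()[:] = arr
    return pinned.to(device, non_blocking=True)
```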
This looks great. I added some comments reflecting my understanding. Feel free to take some/all/none of them!
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
…t#13790) Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Yang Wang <elainewy@meta.com>
…t#13790) Signed-off-by: Nick Hill <nhill@redhat.com>
…t#13790) Signed-off-by: Nick Hill <nhill@redhat.com>
…t#13790) Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Update the custom msgpack encoding/decoding to work with lists of buffers so that the backing data of tensors/numpy arrays contained in messages is sent directly by zmq without copying.
This is a first step; we still need to update the message schemas that we are using in order to exploit this. In particular, we need to add some custom serialization logic for MultiModalKwargs/NestedTensors used for image data, since msgpack doesn't work natively with recursive types. This is also only enabled for the communication between the engine core and the front-end process so far; we'll probably also want to look at exploiting it between the engine core process and the distributed workers in the TP case.
A step beyond this to explore would be to use tensors/ndarrays in shared memory, which can hopefully also be pinned, so that large image tensors can be propagated intra-node without being sent via zmq and can then be copied directly to GPU memory.
Tests and benchmarks to follow.
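For context on the shared-memory direction described above, here is a bare-bones sketch using Python's stdlib multiprocessing.shared_memory (purely illustrative, not part of this PR; segment lifecycle/cleanup is ignored):

```python
# Hypothetical sketch: place tensor data in a named shared-memory segment and
# send only the segment name plus metadata over zmq, so the payload itself
# never travels through the socket.
from multiprocessing import shared_memory

import numpy as np
import torch


def put_in_shm(t: torch.Tensor) -> dict:
    arr = t.detach().contiguous().numpy()
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    dst = np.frombuffer(shm.buf, dtype=arr.dtype, count=arr.size).reshape(arr.shape)
    dst[...] = arr  # one copy into the shared segment
    # only this small metadata dict needs to be serialized and sent
    return {"name": shm.name, "dtype": str(arr.dtype), "shape": list(arr.shape)}


def get_from_shm(meta: dict) -> torch.Tensor:
    shm = shared_memory.SharedMemory(name=meta["name"])
    count = int(np.prod(meta["shape"]))
    arr = np.frombuffer(shm.buf, dtype=meta["dtype"], count=count).reshape(meta["shape"])
    # copy out so the segment can be closed/unlinked afterwards
    return torch.from_numpy(arr).clone()
```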