Conversation

@tqchen
Member

@tqchen tqchen commented Sep 12, 2025

This PR adds support for four C functions to speed up DLPack exchange. As of now, DLPack exchange relies on Python functions such as tensor.__dlpack__().

While these work well for common cases, the general overhead of such an exchange is on the level of 0.2-0.3 us for a very well-optimized version, and can go up to 0.4-1 us for less optimized implementations.

For a function f(a, b, c) that takes three arguments, assuming we run a DLPack exchange for each argument, the overall conversion overhead usually reaches around 1 us and sometimes 3 us.

While such overhead is acceptable in many settings, in GPU applications the extra 1-3 us can still be significant.

This PR proposes four functions for fast exchange of DLPack tensors without going through the Python interpreter.

  • DLPackManagedTensorFromPyObjectNoSync for fast exchange of owned tensors into the consumer
  • DLPackDLTensorFromPyObjectNoSync for fast exchange of non-owned tensors into the consumer
  • DLPackManagedTensorToPyObjectNoSync for fast exchange of consumer tensors back to the producer (return values)
  • DLPackCurrentWorkStream to query the current work stream from the stream context

Our preliminary results show that these functions, when incorporated correctly via native extensions such as C/C++, can bring the exchange cost down to the level of 30-80 ns, about one order of magnitude speedup. That means the API overhead of a function like f(a, b, c) will be at the 0.2-0.4 us level (including exchange), which is close to the overhead of a native C++ extension call without any exchange.
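To make the intended consumer-side usage concrete, here is a minimal C sketch of the fast path with fallback. The struct layout and stand-in types below are simplified illustrations (the real definitions live in dlpack.h, and PyObjectStub stands in for PyObject); only the function-pointer names follow the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real DLPack types in dlpack.h. */
typedef struct { void* data; } DLTensor;
typedef struct { DLTensor dl_tensor; } DLManagedTensorVersioned;
typedef void PyObjectStub;  /* placeholder for PyObject in this sketch */

/* Function table modeled on the proposed DLPackExchangeAPI entries. */
typedef struct {
    int (*managed_tensor_from_py_object_no_sync)(PyObjectStub* obj,
                                                 DLManagedTensorVersioned** out);
    int (*dltensor_from_py_object_no_sync)(PyObjectStub* obj, DLTensor** out);
} DLPackExchangeAPI;

/* Consumer fast path: prefer the optional non-owning view; fall back to the
 * owned exchange when the producer leaves the optional entry NULL. */
int consume_tensor(const DLPackExchangeAPI* api, PyObjectStub* obj,
                   DLTensor** view) {
    if (api->dltensor_from_py_object_no_sync != NULL) {
        return api->dltensor_from_py_object_no_sync(obj, view);
    }
    DLManagedTensorVersioned* owned = NULL;
    int rc = api->managed_tensor_from_py_object_no_sync(obj, &owned);
    if (rc == 0) *view = &owned->dl_tensor;  /* caller later runs the deleter */
    return rc;
}

/* Mock producers standing in for a real framework. */
static DLTensor g_view_tensor;
static DLManagedTensorVersioned g_owned_tensor;
static int mock_view(PyObjectStub* obj, DLTensor** out) {
    (void)obj; *out = &g_view_tensor; return 0;
}
static int mock_owned(PyObjectStub* obj, DLManagedTensorVersioned** out) {
    (void)obj; *out = &g_owned_tensor; return 0;
}
```

Because both entries return plain C error codes and touch no Python state, a call like this costs only a couple of indirect calls, which is where the tens-of-nanoseconds figure comes from.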

@tqchen
Member Author

tqchen commented Sep 12, 2025

RFC #175

@tqchen
Member Author

tqchen commented Sep 14, 2025

Updated to incorporate suggestions by @dalcinl.

This PR adds support for three C functions to speed up DLPack exchange.
As of now, DLPack exchange relies on Python functions such as tensor.__dlpack__().

While these work well for common cases, the general overhead of such an exchange is
on the level of 0.2-0.3 us for a very well-optimized version, and can go up to
0.4-1 us for less optimized implementations.

For a function f(a, b, c) that takes three arguments, assuming we run a DLPack
exchange for each argument, the overall conversion overhead usually reaches
around 1 us and sometimes 3 us.

While such overhead is acceptable in many settings, in GPU applications
the extra 1-3 us can still be significant.

This PR proposes three functions for fast exchange of DLPack tensors without
going through the Python interpreter.

- DLPackFromPyObject: exports a PyObject tensor to a DLManagedTensorVersioned
- DLPackToPyObject: converts a DLManagedTensorVersioned to a PyObject tensor
- DLPackTensorAllocator: exposes one package's tensor allocator to another package
  - This allows, for example, implementing libraries that allocate intermediate
    tensors using the caller's specified tensor allocator.

Our preliminary results show that these functions, when incorporated correctly
via native extensions such as C/C++, can bring the exchange cost down to the
level of 30-80 ns, about one order of magnitude speedup. That means functions
like f(a, b, c) can finish at the 0.2-0.4 us level, which is close to the
overhead of a native C++ extension call without any exchange.
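The caller-supplied-allocator idea can be sketched in C as follows. The types and the routine `make_intermediate` are simplified stand-ins for illustration (the real DLTensor/DLManagedTensorVersioned definitions live in dlpack.h); only the allocator-callback shape follows the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the real DLPack types in dlpack.h. */
typedef struct { void* data; int ndim; } DLTensor;
typedef struct DLManagedTensorVersioned {
    DLTensor dl_tensor;
    void (*deleter)(struct DLManagedTensorVersioned* self);
} DLManagedTensorVersioned;

/* Allocator signature modeled on the proposed DLPackTensorAllocator. */
typedef int (*DLPackTensorAllocator)(
    DLTensor* prototype, DLManagedTensorVersioned** out, void* error_ctx,
    void (*SetError)(void* error_ctx, const char* kind, const char* message));

/* A library routine that needs an intermediate tensor: it allocates through
 * the caller-supplied allocator, so the result lives in the caller
 * framework's memory pool rather than the library's own. */
int make_intermediate(const DLTensor* input, DLPackTensorAllocator alloc,
                      void* error_ctx,
                      void (*SetError)(void*, const char*, const char*),
                      DLManagedTensorVersioned** out) {
    DLTensor proto = *input;  /* request the same shape/dtype as the input */
    if (alloc(&proto, out, error_ctx, SetError) != 0) return -1;
    /* ... a kernel would now write into (*out)->dl_tensor.data ... */
    return 0;
}

/* A malloc-based allocator a CPU framework might register. */
static void free_managed(DLManagedTensorVersioned* self) { free(self); }
static int malloc_allocator(DLTensor* prototype, DLManagedTensorVersioned** out,
                            void* error_ctx,
                            void (*SetError)(void*, const char*, const char*)) {
    DLManagedTensorVersioned* m = malloc(sizeof(*m));
    if (m == NULL) {
        SetError(error_ctx, "MemoryError", "allocation failed");
        return -1;
    }
    m->dl_tensor = *prototype;
    m->deleter = free_managed;
    *out = m;
    return 0;
}
static void ignore_error(void* ctx, const char* kind, const char* msg) {
    (void)ctx; (void)kind; (void)msg;
}
```

The design choice is that ownership flows back to the caller: the allocator fills in the deleter, so whichever framework allocated the intermediate also frees it.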
@tqchen
Member Author

tqchen commented Sep 22, 2025

This PR is updated to reflect all the suggestions in #175

Naming (updated per suggestion from @oleksandr-pavlyk @kkraus14):

  • DLPackManagedTensorAllocator
  • DLPackManagedTensorToPyObject
  • DLPackManagedTensorFromPyObject

Clarified that the dunder should be attached to the class type (per @gbonik and @seberg).

@kkraus14 kkraus14 left a comment


Everything LGTM except for the ongoing discussions related to stream / synchronization handling

@tqchen
Member Author

tqchen commented Oct 6, 2025

Thanks everyone for the comments. We have updated the proposal to include a non-owned version, while explicitly stating that the intent is fast calling for framework, library, and DSL call use cases.

Summary of the current functions

  • DLPackManagedTensorFromPyObjectNoSync for fast exchange of owned tensors into the consumer
  • DLPackDLTensorFromPyObjectNoSync for fast exchange of non-owned tensors into the consumer
  • DLPackManagedTensorToPyObjectNoSync for fast exchange of consumer tensors back to the producer (return values)
  • DLPackCurrentWorkStream to query the current work stream from the stream context

I think it is getting to a mergeable state, but it would be good to also get final inputs, if any.

@yongwww

yongwww commented Oct 6, 2025

I’m in favor of this RFC. Adding C APIs for DLPack exchange and stream querying is a very compelling direction; achieving exchange latency down to tens of nanoseconds could remove a significant bottleneck. This RFC is a valuable addition to DLPack. I’m excited to see it land and get deployed in production!

Co-authored-by: Sebastian Berg <sebastianb@nvidia.com>
@tqchen
Member Author

tqchen commented Oct 8, 2025

Incorporated suggestions from @seberg into the PR. I think we are in good shape; going to merge this in two days if there are no further comments.

Thanks everyone for suggestions so far

Collaborator

@seberg seberg left a comment


I like this approach in general. I am still a bit unsure about the allocation function and the filling one, but if nobody else chimes in, then so be it...

One question about the stream sync; we should maybe clarify that a bit.

One curiosity: Should there be an entry (or function?) that allows discovering which device types are supported?


I think, in a sense, I still need to see how this actually looks for a real-world consumer/producer pair, but in the near future it is maybe still malleable enough.

typedef int (*DLPackManagedTensorAllocator)(
    DLTensor* prototype, DLManagedTensorVersioned** out, void* error_ctx,
    void (*SetError)(void* error_ctx, const char* kind, const char* message));
Collaborator


One small thing is, I am not sure how we will use kind. Let's say this typically gets converted to a Python exception; then we would have to somewhat agree on e.g. MemoryError or so to translate it.

Maybe this will just settle itself; in the end I bet this pretty much only ever raises MemoryError anyway.

(I think this function is important! I know I am slow to settle on something, but I haven't quite settled on loving this approach. But I suspect it's true that asking to allocate e.g. the torch tensor here -- requiring the GIL -- and then viewing that is also not really better.)

Member Author


Yes, MemoryError is likely the most common one.
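On the consumer side, the SetError callback from the allocator typedef would typically record the kind string so the caller can raise the matching host-language exception afterwards. A minimal C sketch (the ErrorCtx struct and helper names here are illustrative, not part of the proposal):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A consumer-side error slot threaded through as error_ctx. */
typedef struct {
    char kind[64];
    char message[256];
} ErrorCtx;

/* Matches the SetError callback shape in the proposed allocator typedef:
 * record the error so the caller can raise the right exception later. */
static void set_error(void* error_ctx, const char* kind, const char* message) {
    ErrorCtx* ctx = (ErrorCtx*)error_ctx;
    snprintf(ctx->kind, sizeof(ctx->kind), "%s", kind);
    snprintf(ctx->message, sizeof(ctx->message), "%s", message);
}

/* After a failed call, the kind string picks the exception type, e.g.
 * "MemoryError" would map onto PyExc_MemoryError when raising into Python. */
static int is_memory_error(const ErrorCtx* ctx) {
    return strcmp(ctx->kind, "MemoryError") == 0;
}
```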

DLPackManagedTensorToPyObjectNoSync managed_tensor_to_py_object_no_sync;
/*!
* \brief Producer function pointer for DLPackDLTensorFromPyObject
* This function can be NULL when the producer does not support this function.
Collaborator


I like this approach as such, but as I just mentioned elsewhere, I want to point out that if this is optional and commonly not supported, it means all consumers will need:

if (api->managed_tensor_to_py_object_no_sync == NULL) {
    // do complicated thing
}

which means that we can get a small speed improvement when it is available (not having to allocate the DLManagedTensor, etc.), but unfortunately we may need to support both paths as a consumer.

I suppose part of the solution here may be that you really want a C++ convenience layer that is easy to vendor...

This isn't a deal breaker! It made me wonder whether filling in a DLManagedTensor is much worse -- it would differ from the current design by not owning its own allocation.

* and the producer can simply return NULL when queried.
* The consumer does not have to do anything about stream sync or setting.
* So a CPU-only framework can just provide a dummy implementation that
* always sets out_current_stream[0] to NULL.
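The dummy CPU implementation described in that comment is a one-liner; here is a sketch, assuming a plain int device_type as a stand-in for the real DLDeviceType enum:

```c
#include <assert.h>
#include <stddef.h>

/* Signature modeled on the proposed DLPackCurrentWorkStream entry; the int
 * device_type stands in for the DLDeviceType enum from dlpack.h. */
typedef int (*DLPackCurrentWorkStream)(int device_type, int device_id,
                                       void** out_current_stream);

/* CPU-only frameworks have no stream concept: always report the NULL
 * (default) stream and succeed, so consumers perform no synchronization. */
static int cpu_current_work_stream(int device_type, int device_id,
                                   void** out_current_stream) {
    (void)device_type;
    (void)device_id;
    out_current_stream[0] = NULL;
    return 0;
}
```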
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about the CPU part; allowing the function pointer to be NULL makes sense. But I am also happy to just implement an always-NULL return, or an always -1 with a "NumPy doesn't have streams, what's going on" error.

*
* \param device_type The device type.
* \param device_id The device id.
* \param out_current_stream The output current work stream.
Collaborator


I am a bit unsure whether this is specified as well as it needs to be. I.e., I think NULL would be the default stream.

The question is: is there any need (or not) for an "undefined or no synchronization" return value (such as -1)?
If not, we are all good, but if some producer might need this (for whatever reason), then we need to specify it here.

The alternative is that the producer just has to return the default stream (otherwise the consumer probably has to guess a stream anyway in the kernel use case!).

Member Author


I don't think there is a need in this particular context; returning the default stream is likely more well defined.

tqchen and others added 2 commits October 9, 2025 13:31
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@tqchen
Member Author

tqchen commented Oct 9, 2025

Thanks @seberg. On an entry function to discover which devices are supported: as of now, no, but if there is a need we could add such an API, although I am not sure it is strictly needed.

*
* \sa DLPackExchangeAPI
*/
struct DLPackExchangeAPI* prev_version_api;
Contributor


In the future, a version v2 may have a DLPackExchangeAPI struct with more/different entries than v1; therefore the struct corresponding to v2 may not have the same layout (and thus type name) as the older v1.

I assume that the intention here is that DLPackExchangeAPI will only ever grow, if at all, at the end; additionally, the entries currently defined below will never change (neither their relative position nor the layout of each struct).
Is that correct? If not, this will require some nasty and inconvenient fixes.

Long story short: this is a great idea, but the implementation as it is now is not flexible to future changes in the layout of the various structures. Just double-checking we are all on the same page here.

Member Author


This is a good point. I think we can say only the first two fields remain unchanged; I will do a patch to clarify this.

Member Author


Updated to move the stable part of the exchange API into a new struct, DLPackExchangeAPIHeader.
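The stable-header idea can be sketched as follows: only the leading header fields are guaranteed to keep their layout across versions, so a consumer can always read them and walk back to an older table it understands. Field names here are illustrative (the actual header layout is defined in the PR, not reproduced here):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the stable-header idea: these leading fields keep their layout
 * across versions, so a consumer can read them from any version of the full
 * table the producer exposes. Field names are illustrative. */
typedef struct DLPackExchangeAPIHeader {
    int version;                            /* ABI version of this table */
    const struct DLPackExchangeAPIHeader*
        prev_version_api;                   /* older table, or NULL */
} DLPackExchangeAPIHeader;

/* Walk the prev_version_api chain until a version the consumer understands. */
static const DLPackExchangeAPIHeader*
find_api_version(const DLPackExchangeAPIHeader* api, int wanted) {
    while (api != NULL && api->version != wanted) {
        api = api->prev_version_api;
    }
    return api;  /* NULL when no compatible version is exposed */
}
```

Because consumers only ever dereference through the header type, later versions can reshape the rest of the struct without the strict-aliasing hazard raised above.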

Contributor


If things ever change after the first two fields, then the C code users write to deal with older versions will violate the C language strict aliasing rules, entering undefined behavior territory.

Contributor


Oh, I see you changed things to use a header.

Member Author


@dalcinl if it works for you, it would be great if you could explicitly approve, thanks.

@tqchen
Member Author

tqchen commented Oct 10, 2025

Thanks everyone for the valuable feedback so far; planning to merge in 24 hours.

Co-authored-by: Lisandro Dalcin <dalcinl@gmail.com>
@tqchen tqchen merged commit 1117366 into dmlc:main Oct 11, 2025
3 checks passed
@tqchen
Member Author

tqchen commented Oct 11, 2025

Thanks everyone, this is merged

tqchen pushed a commit to apache/tvm-ffi that referenced this pull request Oct 11, 2025
## Summary of Changes

This PR introduces a unified `DLPackExchangeAPI` struct as described in
proposal [175](dmlc/dlpack#175). This new
convention replaces the previous mechanism of separate function
pointers, and aligns with the latest DLPack standard as shown in PR
[174](dmlc/dlpack#174).

The new `DLPackExchangeAPI` struct also includes a
`current_work_stream` function pointer that allows more robust and
integrated querying of the current device stream (e.g., CUDA stream)
during DLPack tensor exchanges. All conversions from/to DLPack have
been updated to `_no_sync` variants, meaning consumers should use
`current_work_stream` to handle stream synchronization explicitly. A
non-owning DLTensor conversion is also included to avoid unnecessary
reference counting.

Following this change, the Python FFI for PyTorch has been updated to
expose the new `DLPackExchangeAPI` struct via
`__c_dlpack_exchange_api__` on torch.Tensor.

The `3rdparty/dlpack` has been updated to incorporate the latest commit.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Dec 4, 2025
…165483)

## Addressed Issue

Issue #162845

## Summary of Changes

This PR introduces a unified `DLPackExchangeAPI` struct as described in proposal [175](dmlc/dlpack#175). This new convention replaces the previous mechanism of separate function pointers, and aligns with the latest DLPack standard as shown in PR [174](dmlc/dlpack#174).

Specifically, the new `DLPackExchangeAPI` struct is exposed as `torch.Tensor.__c_dlpack_exchange_api__`, which stores and exposes the following function pointers:

* `managed_tensor_allocator`
* `managed_tensor_from_py_object_no_sync`
* `managed_tensor_to_py_object_no_sync`
* `dltensor_from_py_object_no_sync`
* `current_work_stream`

Within the new `DLPackExchangeAPI` struct, the new `current_work_stream` function pointer allows more robust and integrated querying of the current device stream (e.g., CUDA stream) during DLPack tensor exchanges. All conversions from/to DLPack have been updated to `_no_sync` variants, meaning you should use `current_work_stream` to explicitly handle stream synchronization. A non-owning DLTensor conversion, `dltensor_from_py_object_no_sync`, is also included to avoid unnecessary reference counting.

Following this change, the `dlpack.h` has been updated to the latest DLPack.

Unit tests are added using `torch.utils.cpp_extension.load_inline` to avoid GIL release issues
when calling `THPVariable_Wrap`.
Pull Request resolved: #165483
Approved by: https://github.com/tqchen, https://github.com/albanD
umechand-amd pushed a commit to ROCm/pytorch that referenced this pull request Dec 8, 2025
JacobSzwejbka pushed a commit to pytorch/pytorch that referenced this pull request Dec 8, 2025