
RFC: DLpack support for interoperability with other GPU frameworks #180

Merged: 20 commits into tensorflow:master on Apr 15, 2020

Conversation

@EvenOldridge (Contributor) commented Nov 25, 2019

Comment period is open until 2019-12-13.

DLpack support for interoperability with other GPU frameworks

Status: Proposed
RFC #: 180
Author(s): eoldridge@nvidia.com, wmjlyjemaine@gmail.com, zhoujinjing09@gmail.com
Sponsor: apassos@google.com, sanjoy@google.com
Updated: 2019-11-26

Objective

This document proposes the adoption of dlpack as a way of passing tensor data to other frameworks without leaving the GPU and without a copy, per issue #24453. dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. dlpack is currently supported by cuPy, cuDF, DGL, TGL, PyTorch, and MxNet.

The interoperability provided by dlpack would allow for fast on-GPU communication between TensorFlow and these frameworks, opening up the wide range of use cases outlined below. It would further enable __cuda_array_interface__ interoperability through cuPy/cuDF, which support both methods, providing a way to transfer data to Numba, PyArrow, and other frameworks that have adopted that method. A similar request has been made to support __cuda_array_interface__ directly, and ideally both would be supported.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@alextp (Contributor) commented Nov 25, 2019

@sanjoy

@ewilderj (Contributor)

cc/ @brijk7

Thanks @EvenOldridge -- we'll need you to have signed the CLA in order to proceed. If you believe you have and that the bot is in error, let me know and I can bump this through.

One nit: do @alextp and @sanjoy want their personal emails in the RFC, vs their work ones?

To @alextp and @sanjoy: thanks for sponsoring this. You'll be responsible for booking the design review meeting as usual, with the change that it will include the RFC authors (plus any other suggested community members) as well as the TF team.

Once the CLA is sorted out, @brijk7 can push forward with announcing and setting a review period.

@alextp (Contributor) commented Nov 25, 2019 via email

@EvenOldridge (Contributor, Author)

@googlebot I signed it!

@googlebot

CLAs look good, thanks!


@jermainewang

@EvenOldridge Thanks for the RFC. Shall @VoVAllen and I sign the CLA as well?

Finally, to achieve maximal efficiency, we want the conversion to happen without a memory copy.

For to_dlpack, the returned DLPack tensor shares the same memory address as the input TensorFlow tensor and holds a reference to it. Upon destruction of the DLPack tensor, it dereferences the TensorFlow tensor so that it can be collected by TensorFlow's memory management (inspired by PyTorch's DLPack implementation).
For from_dlpack, it first creates an allocator object (subclassing TensorFlow's allocator interface) that holds a reference to the DLPack tensor. The AllocateRaw function directly returns the memory it holds without creating any new buffer. Upon destruction, the DeallocateRaw function just calls the deleter of the DLPack tensor (inspired by TensorFlow's immutable_constant_op).
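
For illustration, here is a minimal C++ sketch of the from_dlpack allocator idea described above. This is a sketch only, not the actual implementation: the class name and include paths are assumptions, while tensorflow::Allocator (AllocateRaw/DeallocateRaw) and DLManagedTensor are the real interfaces involved.

```cpp
// Sketch only: an Allocator that exposes the memory already owned by an
// incoming DLManagedTensor, so a TF tensor can be built over it without a copy.
#include <cstddef>
#include <string>

#include "dlpack/dlpack.h"                        // DLManagedTensor (path assumed)
#include "tensorflow/core/framework/allocator.h"  // tensorflow::Allocator

class DLPackAllocator : public tensorflow::Allocator {  // hypothetical name
 public:
  explicit DLPackAllocator(DLManagedTensor* dlmt) : dlmt_(dlmt) {}

  std::string Name() override { return "dlpack_allocator"; }

  // Return the buffer the DLPack tensor already owns; no new allocation.
  void* AllocateRaw(size_t /*alignment*/, size_t /*num_bytes*/) override {
    return static_cast<char*>(dlmt_->dl_tensor.data) +
           dlmt_->dl_tensor.byte_offset;
  }

  // "Deallocation" only hands ownership back to the producing framework by
  // invoking the DLPack deleter; the producer performs the real free.
  void DeallocateRaw(void* /*ptr*/) override {
    if (dlmt_->deleter != nullptr) dlmt_->deleter(dlmt_);
    delete this;
  }

 private:
  DLManagedTensor* dlmt_;
};
```
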
Contributor

I believe the deallocate call will have to do a host-device sync as well, since the dlpack tensor could have users enqueued on arbitrary streams, and freeing it without waiting for those kernels to finish will cause data races.

Contributor

Basically, DeallocateRaw won't delete the tensor; it only drops the reference to the buffer. The data race / premature free issue is handled by the original framework that produced this tensor.

Detail: https://github.com/VoVAllen/tf-dlpack/blob/master/src/to_dlpack_kernel.cc#L66
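
As a rough sketch of that pattern (the linked kernel is the authoritative version; TFDLPackContext and TFDLPackDeleter are hypothetical names, and the include paths are assumptions), the to_dlpack deleter only drops a reference rather than freeing memory:

```cpp
// Sketch only: the to_dlpack side keeps the TF buffer alive via a
// TensorReference stored in manager_ctx; the DLPack deleter drops that
// reference instead of freeing the memory itself.
#include "dlpack/dlpack.h"                               // DLManagedTensor (path assumed)
#include "tensorflow/core/framework/tensor.h"            // tensorflow::Tensor
#include "tensorflow/core/framework/tensor_reference.h"  // tensorflow::TensorReference

struct TFDLPackContext {  // hypothetical helper
  explicit TFDLPackContext(const tensorflow::Tensor& t) : ref(t) {}
  tensorflow::TensorReference ref;  // pins the underlying TensorBuffer
  DLManagedTensor tensor;           // filled in by to_dlpack, handed to the consumer
};

// Installed as DLManagedTensor::deleter; called by the consuming framework
// when it is done with the capsule.
void TFDLPackDeleter(DLManagedTensor* arg) {
  auto* ctx = static_cast<TFDLPackContext*>(arg->manager_ctx);
  ctx->ref.Unref();  // let TF's memory management reclaim the buffer
  delete ctx;
}
```
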

Contributor

That can work, but we need to be clear on the contract that TF will unref the dlpack tensor as soon as all uses have been enqueued, and won't wait for the kernels to actually finish. As long as this is part of dlpack's contract, all is good.

@byronyi (Contributor) commented Dec 23, 2019

Echoing the concern here. Take the following example:

Imagine another framework that mirrors the current TF design, say X.

  1. TF_ToDLPackOp increments the TF TensorBuffer ref count, and the tensor produced by the upstream TF_Op is ready for use in stream context A;
  2. X_FromDLPackOp executes in stream context B;
  3. a downstream op in X consumes this DLPack tensor in stream context B, and you call TF_DeallocateRaw, which immediately decrements the ref count;
  4. TF reuses the TensorBuffer and starts to write new data onto it in stream context A.

How do you plan to sync stream A in TF and stream B in X?

Contributor

@byronyi
@byronyi Minjie's comment (#180 (comment)) partially addressed your concern. You are right about this situation, as cross-stream ordering is not guaranteed by dlpack so far. One solution is to synchronize the stream that produced the tensor to ensure the memory is ready before the downstream framework executes, as you can see in MXNet.
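
For concreteness, a minimal CUDA sketch of that "sync the producing stream" approach, assuming the producer knows the stream on which the tensor was last written (producer_stream is a placeholder):

```cpp
// Sketch only: block until all work previously enqueued on the producing
// stream has finished, so the exported buffer is valid on any other stream.
#include <cuda_runtime.h>

void SyncBeforeExport(cudaStream_t producer_stream) {
  cudaError_t err = cudaStreamSynchronize(producer_stream);
  // In real code the error would be checked and propagated rather than ignored.
  (void)err;
}
```
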

@VoVAllen (Contributor)

@EvenOldridge Thanks for the RFC.

@brijk7 brijk7 changed the title DLpack support RFC RFC: DLpack support Nov 26, 2019
@brijk7 brijk7 added the RFC: Proposed RFC Design Document label Nov 26, 2019
@brijk7 (Contributor) commented Nov 26, 2019

@EvenOldridge : once you update emails of sponsors (and address any initial comments), I will announce the RFC to developers@tensorflow.org.
thanks!
brijesh

@VoVAllen (Contributor)

@EvenOldridge My email is zhoujinjing09@gmail.com if needed. Thanks!

Updated emails and tried to better frame the RFC to reflect the package solution vs a native solution.
@byronyi (Contributor) commented Jan 22, 2020

Recently I saw that @hawkinsp from the XLA team pulled DLPack into TF core for interoperability between JAX and other GPU libraries in tensorflow/tensorflow@fc1f6fd. Not sure if it's related to this RFC per se.

@hawkinsp

@byronyi That's right; I added DLPack support to the XLA:Python bindings for use in JAX (https://github.com/google/jax). That's somewhat separate from this RFC, which pertains to TensorFlow. However, since XLA and TensorFlow live in the same repository, the PR you linked did some preparatory work that may also assist with an implementation in TensorFlow (e.g., it adds DLPack and its headers to the TF bazel workspace).

Apropos of the discussion about synchronization, I chose to synchronize the device to the host when converting a JAX array to a DLManagedTensor. This seemed like the safest choice, given that DLPack doesn't have any mechanism to describe device-side synchronization (e.g., a CUDA event describing when a particular buffer becomes valid on the device would be an obvious choice in the case of CUDA).
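
As a sketch of that event-based alternative (which DLPack itself does not currently specify), the producer could record a CUDA event after the last write and the consumer could make its stream wait on it instead of doing a full device-to-host sync; producer_stream and consumer_stream are placeholders:

```cpp
// Sketch only: order the consumer's stream after the producer's last write
// without blocking the host.
#include <cuda_runtime.h>

void OrderConsumerAfterProducer(cudaStream_t producer_stream,
                                cudaStream_t consumer_stream) {
  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
  cudaEventRecord(ready, producer_stream);         // after the last kernel writing the buffer
  cudaStreamWaitEvent(consumer_stream, ready, 0);  // consumer work waits for the event
  cudaEventDestroy(ready);                         // safe: the wait was already enqueued
}
```
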

@dynamicwebpaige (Contributor)

Thanks again for submitting the RFC, @EvenOldridge, @jermainewang, and @VoVAllen!

Is there anything that we need to do to move this along? DLPack support is an exciting feature, and I want to make sure that we keep momentum. 🙂

@sanjoy (Contributor) left a comment

Hi @VoVAllen

Let's merge this RFC (normally we merge the RFC before merging the implementation PR, but I wanted to get that PR in ASAP).

Can you go through the RFC once and make sure it reflects the end state of the discussion? For instance, there is a "Notes from @alextp:" section which should be incorporated into the main text.

@VoVAllen (Contributor)

@sanjoy I created the modification PR EvenOldridge#2 to @EvenOldridge's branch.
@EvenOldridge Could you help merge this? Thanks!

Update 20191016-dlpack-support.md with VoVAllen's changes
@sanjoy (Contributor) left a comment

Some more wordsmithing comments.

Comment on lines 20 to 24
Why this is a valuable problem to solve? What background information is needed
to show how this design addresses the problem?

Which users are affected by the problem? Why is it a problem? What data supports
this? What related work exists?
Contributor

These questions can be deleted, right?

Contributor Author

Removed

Comment on lines 40 to 41
How will users (or other contributors) benefit from this work? What would be the
headline in the release notes or blog post?
Contributor

These questions should be deleted as well.

Contributor Author

Removed

Comment on lines 51 to 56
Notes from @alextp:

AFAICT it should be easy to take cuda pointers in and out of TF and use them to build dlpack structures from tensors or vice versa. The tricky part is that TF does not use cudamalloc to allocate memory but its own allocator whose internal state is stored on the CPU and matches the head of TF's compute stream, so we need to sync TF's stream before the memory is usable from dlpack and similarly sync other cuda streams before memory is made usable by TF tensors (and similarly we need to sync the streams when trying to free the buffers).

A working version of dlpack integration has been released as a package by coauthors @jermainewang and @VoVAllen here:
https://github.com/VoVAllen/tf-dlpack/issues/3
Contributor

I would prefer rephrasing text like this and making the proposal describe what was actually implemented.

Same for other "as mentioned by alextp" comments below.

Removed the questions.

## Objective

This document proposes the adoption of dlpack (https://github.com/dmlc/dlpack) as way of passing tensor data to other frameworks without leaving the GPU and without a copy per [24453](https://github.com/tensorflow/tensorflow/issues/24453). dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. dlpack is currently supported by cuPy, cuDF, DGM, TGL, PyTorch, and MxNet.


Spelling error? Should be DGL.

Contributor

Fixed in the latest PR.

@VoVAllen (Contributor) commented Mar 4, 2020

I made a new PR to @EvenOldridge's branch. Hope this looks better.

Update 20191016-dlpack-support.md

DLPack is a community effort to define a common tensor data structure that can be shared by different frameworks allowing data to be quickly shared often with zero or minimal copy. One of the main bottlenecks when trying to achieve GPU performance when operating across different frameworks is I/O and data formatting. The transfer of data between GPU and CPU or between formats is costly to the point where many operations become faster to simply run on the CPU because of the additional costs associated with moving/transforming the data. Even when mechanisms exist to copy data without leaving the GPU, memory constraints limit the application because two copies of the data are required. By implementing dlpack within TensorFlow there would be a way to transfer data directly between frameworks, enabling the development of a range of applications that weren't previously possible.

Existing applications that take advantage of dlpack include: (adding my own and those listed in , other contributions needed)
Contributor

(adding my own and those listed in , other contributions needed) seems malformed

Removed request for contributions, added thinc.ai framework interoperability.
@sanjoy (Contributor) left a comment

Some more minor nits, looks good otherwise.

Comment on lines 101 to 105
Proposed API implementation details:
- to_dlpack
- Implementing `TFE_HandleToDLPack`, which converts tf's eager tensor handle to dlpack tensor's pointer(`DLManagedTensor*`). And wrap it into PyCapsule to adapt to the Python interface in ffi binding file. For the underlying memory liveness, `TensorReference` is used to maintain the reference counting over the underlying `TensorBuffer`, which increases when creating dlpack tensor, and decreases in the deleter of dlpack tensor.
- from_dlpack
- Implementing `TFE_HandleFromDLPack`, which converts dlpack tensor's pointer(`DLManagedTensor*`) to tf's eager tensor handle. `TFE_TensorHandleDevicePointer` is used to get the data pointer of underlying buffer, and synchronize the related device to ensures the memory readiness.
Contributor

I believe this is not formatted correctly.


## Questions and Discussion Topics

https://github.com/tensorflow/tensorflow/issues/29039#issuecomment-527520270 Outlines the key issues that need to be addressed, namely that a synch is required to ensure the tensor information is valid. Supporting \_\_cuda_array_interface\_\_ is another option as well, although cuPy and cuDF have opted to support both and ideally Tensorflow would as well.
Contributor

s/Outlines/outlines/

Also, it would probably be more readable to have text for the hyperlink, like:

The key issues that need to be addressed are outlined here.

@VoVAllen (Contributor) commented Mar 5, 2020

EvenOldridge#4
Added a new PR to address the comments above, @EvenOldridge.

Update 20191016-dlpack-support.md
@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Apr 15, 2020
@ematejska ematejska merged commit dc06c7a into tensorflow:master Apr 15, 2020