- 
                Notifications
    You must be signed in to change notification settings 
- Fork 579
RFC: DLpack support for interoperability with other GPU frameworks #180
RFC: DLpack support for interoperability with other GPU frameworks #180
Conversation
| Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with  What to do if you already signed the CLAIndividual signers
 Corporate signers
 ℹ️ Googlers: Go here for more info. | 
| cc/ @brijk7 Thanks @EvenOldridge -- we'll need you to have signed the CLA in order to proceed. If you believe you have and that the bot is in error, let me know and I can bump this through. One nit: do @alextp and @sanjoy want their personal emails in the RFC, vs their work ones? To @alextp and @sanjoy: thanks for sponsoring this. You'll be responsible for booking the design review meeting as usual, with the change that it will include the RFC authors (plus any other suggested community members) as well as the TF team. Once the CLA is sorted out, @brijk7 can push forward with announcing and setting a review period. | 
| Work email is best I think.… On Mon, Nov 25, 2019 at 12:31 PM Edd Wilder-James ***@***.***> wrote:
 cc/ @brijk7 <https://github.com/brijk7>
 Thanks @EvenOldridge <https://github.com/EvenOldridge> -- we'll need you
 to have signed the CLA in order to proceed. If you believe you have and
 that the bot is in error, let me know and I can bump this through.
 One nit: do @alextp <https://github.com/alextp> and @sanjoy
 <https://github.com/sanjoy> want their personal emails in the RFC, vs
 their work ones?
 To @alextp <https://github.com/alextp> and @sanjoy
 <https://github.com/sanjoy>: thanks for sponsoring this. You'll be
 responsible for booking the design review meeting as usual, with the change
 that it will include the RFC authors (plus any other suggested community
 members) as well as the TF team.
 Once the CLA is sorted out, @brijk7 <https://github.com/brijk7> can push
 forward with announcing and setting a review period.
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#180?email_source=notifications&email_token=AAABHRN5WFV3RPDVIISSTDTQVQYZFA5CNFSM4JRNFGZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFDWMSY#issuecomment-558327371>,
 or unsubscribe
 <https://github.com/notifications/unsubscribe-auth/AAABHRMKN2QLT3HN4K6UR3DQVQYZFANCNFSM4JRNFGZA>
 .
 -- 
 - Alex | 
| @googlebot I signed it! | 
| CLAs look good, thanks! ℹ️ Googlers: Go here for more info. | 
| @EvenOldridge Thanks for the RFC. Shall @VoVAllen and I sign the CLA as well? | 
| Finally, to achieve the maximal efficiency, we want the conversion happens without memory copy. | ||
|  | ||
| For to_dlpack, the returned DLPack tensor shares the same memory address of the input Tensorflow tensor and holds a reference to it. Upon the destruction of the DLPack tensor, it will dereference the Tensorflow tensor, so it can be collected by Tensorflow's memory management. (inspired by PyTorch's DLPack implementation). | ||
| For from_dlpack, it first creates an allocator object (subclass Tensorflow's allocator interface) that holds the reference to the DLPack tensor. The AllocateRaw function directly returns the memory it holds without creating any new buffer. Upon destruction, the DeallocateRaw function just calls the deletor of the DLPack tensor. (inspired by Tensorflow's immutable_constant_op). | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the deallocate call will have to do a host-device sync as well since the dlpack tensor could have users enqueued in arbitrary streams and free'ing it without waiting for those kernels to finish will cause data races.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically DeallocateRaw won't delete the tensor, but dereferencing the buffer. The data races/data free issue is handled by the original framework which produces this tensor.
Detail: https://github.com/VoVAllen/tf-dlpack/blob/master/src/to_dlpack_kernel.cc#L66
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That can work, but we need to be clear on the contract that TF will unref the dlpack Tensor as soon as all uses have been enqueued, and won't want for the kernels to actually finish. As long as this is part of dlpack's contract all is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Echo the concern here. Take an example as followed:
Image another framework that mirrors current TF design, say X.
- TF_ToDLPackOpincrements the TF- TensorBufferref count, and the tensors produced by the upstream TF_Op is ready for use in stream context A;
- X_FromDLPackOpexecutes in stream context B;
- A downstream Op in X consumes this DLPack tensor in stream context B, and you called TF_DeallocateRawwhich immediately decrements the ref count;
- TF reuses the TensorBufferand starting to write new data onto it in in stream context A.
How do you plan to sync stream A in TF and stream B in X?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Echo the concern here. Take an example as followed:
Image another framework that mirrors current TF design, say X.
TF_ToDLPackOpincrements the TFTensorBufferref count, and the tensors produced by the upstream TF_Op is ready for use in stream context A;
X_FromDLPackOpexecutes in stream context B;- A downstream Op in X consumes this DLPack tensor in stream context B, and you called
TF_DeallocateRawwhich immediately decrements the ref count;- TF reuses the
TensorBufferand starting to write new data onto it in in stream context A.How do you plan to sync stream A in TF and stream B in X?
@byronyi
Minjie's comment (#180 (comment)) partially addressed your concern. You are right about the situation, as the operation across stream is not guaranteed by dlpack so far. One solution is to sync the stream producing tensor to ensure the memory is ready before downstream's execution, which you can see from mxnet.
| @EvenOldridge Thanks for the RFC. | 
| @EvenOldridge : once you update emails of sponsors (and address any initial comments), I will announce the RFC to developers@tensorflow.org. | 
| @EvenOldridge My email is zhoujinjing09@gmail.com if needed. Thanks! | 
Updated emails and tried to better frame the RFC to reflect the package solution vs a native solution.
| Recently I saw @hawkinsp from the XLA team pulled in DLPack into TF core for interoperability between JAX and other GPU libraries in tensorflow/tensorflow@fc1f6fd. Not sure if it's related to this RFC per se. | 
| @byronyi That's right; I added DLPack support to the XLA:Python bindings for use in JAX (https://github.com/google/jax). That's somewhat separate to this RFC, which pertains to TensorFlow. However, since XLA and TensorFlow live in the same repository, the PR you linked did some preparatory work that may also assist with an implementation in TensorFlow (e.g., it adds DLPack and its headers to the TF bazel workspace.) Apropos of the discussion about synchronization, I chose to synchronize the device to the host when converting a JAX array to a DLManagedTensor. This seemed like the safest choice, given DLPack doesn't have any mechanisms to describe device-side synchronization (e.g., a CUDA Event that describes when a particular buffer becomes valid on device would be an obvious choice in the case of CUDA.) | 
| Thanks again for submitting the RFC, @EvenOldridge, @jermainewang, and @VoVAllen! Is there anything that we need to do to move this along? DLPack support is an exciting feature, and I want to make sure that we keep momentum. 🙂 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @VoVAllen
Let's merge this RFC (normally we merge the RFC before merging the PR, but I wanted to get the PR in ASAP).
Can you go through the RFC once and make sure it reflects the end state of the discussion? For instance, there is a "Notes from @alextp:" section which should be incorporated into the main text.
| @sanjoy I created the modification PR  EvenOldridge#2 to @EvenOldridge's branch. | 
Update 20191016-dlpack-support.md with VoVAllen's changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more wordsmithing comments.
        
          
                rfcs/20191016-dlpack-support.md
              
                Outdated
          
        
      | Why this is a valuable problem to solve? What background information is needed | ||
| to show how this design addresses the problem? | ||
|  | ||
| Which users are affected by the problem? Why is it a problem? What data supports | ||
| this? What related work exists? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These questions can be deleted right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
        
          
                rfcs/20191016-dlpack-support.md
              
                Outdated
          
        
      | How will users (or other contributors) benefit from this work? What would be the | ||
| headline in the release notes or blog post? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These questions should be deleted as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
        
          
                rfcs/20191016-dlpack-support.md
              
                Outdated
          
        
      | Notes from @alextp: | ||
|  | ||
| AFAICT it should be easy to take cuda pointers in and out of TF and use them to build dlpack structures from tensors or vice versa. The tricky part is that TF does not use cudamalloc to allocate memory but its own allocator whose internal state is stored on the CPU and matches the head of TF's compute stream, so we need to sync TF's stream before the memory is usable from dlpack and similarly sync other cuda streams before memory is made usable by TF tensors (and similarly we need to sync the streams when trying to free the buffers). | ||
|  | ||
| A working version of dlpack integration has been released as a package by coauthors @jermainewang and @VoVAllen here: | ||
| https://github.com/VoVAllen/tf-dlpack/issues/3 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer rephrasing text like this and making the proposal describe what was actually implemented.
Same for other "as mentioned by alextp" comments below.
Removed the questions.
        
          
                rfcs/20191016-dlpack-support.md
              
                Outdated
          
        
      |  | ||
| ## Objective | ||
|  | ||
| This document proposes the adoption of dlpack (https://github.com/dmlc/dlpack) as way of passing tensor data to other frameworks without leaving the GPU and without a copy per [24453](https://github.com/tensorflow/tensorflow/issues/24453). dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. dlpack is currently supported by cuPy, cuDF, DGM, TGL, PyTorch, and MxNet. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spell error? should be DGL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in the latest pr
| I made a new pr to @EvenOldridge's branch. Hope this looks better. | 
Update 20191016-dlpack-support.md
        
          
                rfcs/20191016-dlpack-support.md
              
                Outdated
          
        
      |  | ||
| DLPack is a community effort to define a common tensor data structure that can be shared by different frameworks allowing data to be quickly shared often with zero or minimal copy. One of the main bottlenecks when trying to achieve GPU performance when operating across different frameworks is I/O and data formatting. The transfer of data between GPU and CPU or between formats is costly to the point where many operations become faster to simply run on the CPU because of the additional costs associated with moving/transforming the data. Even when mechanisms exist to copy data without leaving the GPU, memory constraints limit the application because two copies of the data are required. By implementing dlpack within TensorFlow there would be a way to transfer data directly between frameworks, enabling the development of a range of applications that weren't previously possible. | ||
|  | ||
| Existing applications that take advantage of dlpack include: (adding my own and those listed in , other contributions needed) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(adding my own and those listed in , other contributions needed) seems malformed
Removed request for contributions, added thinc.ai framework interoperability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more minor nits, looks good otherwise.
        
          
                rfcs/20191016-dlpack-support.md
              
                Outdated
          
        
      | Proposed API implementation details: | ||
| - to_dlpack | ||
| - Implementing `TFE_HandleToDLPack`, which converts tf's eager tensor handle to dlpack tensor's pointer(`DLManagedTensor*`). And wrap it into PyCapsule to adapt to the Python interface in ffi binding file. For the underlying memory liveness, `TensorReference` is used to maintain the reference counting over the underlying `TensorBuffer`, which increases when creating dlpack tensor, and decreases in the deleter of dlpack tensor. | ||
| - from_dlpack | ||
| - Implementing `TFE_HandleFromDLPack`, which converts dlpack tensor's pointer(`DLManagedTensor*`) to tf's eager tensor handle. `TFE_TensorHandleDevicePointer` is used to get the data pointer of underlying buffer, and synchronize the related device to ensures the memory readiness. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is not formatted correctly.
        
          
                rfcs/20191016-dlpack-support.md
              
                Outdated
          
        
      |  | ||
| ## Questions and Discussion Topics | ||
|  | ||
| https://github.com/tensorflow/tensorflow/issues/29039#issuecomment-527520270 Outlines the key issues that need to be addressed, namely that a synch is required to ensure the tensor information is valid. Supporting \_\_cuda_array_interface\_\_ is another option as well, although cuPy and cuDF have opted to support both and ideally Tensorflow would as well. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Outlines/outlines/
Also, it would probably more readable to have a text for the hyperlink, like:
The key issues that need to be addressed are outlined here.
| EvenOldridge#4 | 
Update 20191016-dlpack-support.md
Comment period is open until 2019-12-13.
DLpack support for interoperability with other GPU frameworks
Objective
This document proposes the adoption of dlpack as way of passing tensor data to other frameworks without leaving the GPU and without a copy per 24453. dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. dlpack is currently supported by cuPy, cuDF, DGM, TGL, PyTorch, and MxNet.
The interoperability of dlpack would allow for fast on-GPU communication between TensorFlow and these frameworks opening up a wide range of use cases outlined below. It would further enable __cuda_array_interface__ interoperability through cuPy/cuDF which support both methods providing a way to transfer data to Numba, PyArrow and other frameworks that have adopted that method, although a similar request has been made to support that method of interoperability and ideally both would be supported.