
RFC: DLpack support for interoperability with other GPU frameworks #180

Merged: 20 commits into tensorflow:master on Apr 15, 2020

Conversation

@EvenOldridge (Contributor) commented Nov 25, 2019

Comment period is open until 2019-12-13.

DLpack support for interoperability with other GPU frameworks

Status: Proposed
RFC #: 180
Author(s): eoldridge@nvidia.com, wmjlyjemaine@gmail.com, zhoujinjing09@gmail.com
Sponsor: apassos@google.com, sanjoy@google.com
Updated: 2019-11-26

Objective

This document proposes the adoption of dlpack as a way of passing tensor data to other frameworks without leaving the GPU and without a copy, per issue #24453. dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. dlpack is currently supported by cuPy, cuDF, DGL, TGL, PyTorch, and MxNet.

The interoperability provided by dlpack would allow for fast on-GPU communication between TensorFlow and these frameworks, opening up the wide range of use cases outlined below. It would further enable __cuda_array_interface__ interoperability through cuPy/cuDF, which support both methods, providing a way to transfer data to Numba, PyArrow, and other frameworks that have adopted that method. A similar request has been made to support __cuda_array_interface__ directly, and ideally both would be supported.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@alextp (Contributor) commented Nov 25, 2019

@sanjoy

@ewilderj (Contributor)

cc/ @brijk7

Thanks @EvenOldridge -- we'll need you to have signed the CLA in order to proceed. If you believe you have and that the bot is in error, let me know and I can bump this through.

One nit: do @alextp and @sanjoy want their personal emails in the RFC, vs their work ones?

To @alextp and @sanjoy: thanks for sponsoring this. You'll be responsible for booking the design review meeting as usual, with the change that it will include the RFC authors (plus any other suggested community members) as well as the TF team.

Once the CLA is sorted out, @brijk7 can push forward with announcing and setting a review period.

@alextp (Contributor) commented Nov 25, 2019 via email

@EvenOldridge (Contributor, Author)

@googlebot I signed it!

@googlebot

CLAs look good, thanks!


@jermainewang

@EvenOldridge Thanks for the RFC. Shall @VoVAllen and I sign the CLA as well?

Finally, to achieve maximal efficiency, we want the conversion to happen without a memory copy.

For to_dlpack, the returned DLPack tensor shares the same memory address as the input TensorFlow tensor and holds a reference to it. Upon destruction of the DLPack tensor, it dereferences the TensorFlow tensor so that it can be collected by TensorFlow's memory management (inspired by PyTorch's DLPack implementation).
For from_dlpack, it first creates an allocator object (subclassing TensorFlow's allocator interface) that holds a reference to the DLPack tensor. The AllocateRaw function directly returns the memory it holds without creating any new buffer. Upon destruction, the DeallocateRaw function just calls the deleter of the DLPack tensor (inspired by TensorFlow's immutable_constant_op).
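
For illustration, here is a minimal C++ sketch of the from_dlpack allocator idea described above. This is a sketch only, not the actual implementation: the class name and include paths are assumptions, while tensorflow::Allocator (AllocateRaw/DeallocateRaw) and DLManagedTensor are the real interfaces involved.

```cpp
// Sketch only: an Allocator that exposes the memory already owned by an
// incoming DLManagedTensor, so a TF tensor can be built over it without a copy.
#include <cstddef>
#include <string>

#include "dlpack/dlpack.h"                        // DLManagedTensor (path assumed)
#include "tensorflow/core/framework/allocator.h"  // tensorflow::Allocator

class DLPackAllocator : public tensorflow::Allocator {  // hypothetical name
 public:
  explicit DLPackAllocator(DLManagedTensor* dlmt) : dlmt_(dlmt) {}

  std::string Name() override { return "dlpack_allocator"; }

  // Return the buffer the DLPack tensor already owns; no new allocation.
  void* AllocateRaw(size_t /*alignment*/, size_t /*num_bytes*/) override {
    return static_cast<char*>(dlmt_->dl_tensor.data) +
           dlmt_->dl_tensor.byte_offset;
  }

  // "Deallocation" only hands ownership back to the producing framework by
  // invoking the DLPack deleter; the producer performs the real free.
  void DeallocateRaw(void* /*ptr*/) override {
    if (dlmt_->deleter != nullptr) dlmt_->deleter(dlmt_);
    delete this;
  }

 private:
  DLManagedTensor* dlmt_;
};
```
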
Contributor

I believe the deallocate call will have to do a host-device sync as well, since the dlpack tensor could have users enqueued on arbitrary streams, and freeing it without waiting for those kernels to finish will cause data races.

Contributor

Basically, DeallocateRaw won't delete the tensor; it only drops the reference to the buffer. The data race / premature free issue is handled by the original framework that produced this tensor.

Detail: https://github.com/VoVAllen/tf-dlpack/blob/master/src/to_dlpack_kernel.cc#L66
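
As a rough sketch of that pattern (the linked kernel is the authoritative version; TFDLPackContext and TFDLPackDeleter are hypothetical names, and the include paths are assumptions), the to_dlpack deleter only drops a reference rather than freeing memory:

```cpp
// Sketch only: the to_dlpack side keeps the TF buffer alive via a
// TensorReference stored in manager_ctx; the DLPack deleter drops that
// reference instead of freeing the memory itself.
#include "dlpack/dlpack.h"                               // DLManagedTensor (path assumed)
#include "tensorflow/core/framework/tensor.h"            // tensorflow::Tensor
#include "tensorflow/core/framework/tensor_reference.h"  // tensorflow::TensorReference

struct TFDLPackContext {  // hypothetical helper
  explicit TFDLPackContext(const tensorflow::Tensor& t) : ref(t) {}
  tensorflow::TensorReference ref;  // pins the underlying TensorBuffer
  DLManagedTensor tensor;           // filled in by to_dlpack, handed to the consumer
};

// Installed as DLManagedTensor::deleter; called by the consuming framework
// when it is done with the capsule.
void TFDLPackDeleter(DLManagedTensor* arg) {
  auto* ctx = static_cast<TFDLPackContext*>(arg->manager_ctx);
  ctx->ref.Unref();  // let TF's memory management reclaim the buffer
  delete ctx;
}
```
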

Contributor

That can work, but we need to be clear on the contract that TF will unref the dlpack tensor as soon as all uses have been enqueued, and won't wait for the kernels to actually finish. As long as this is part of dlpack's contract, all is good.

@byronyi (Contributor) commented Dec 23, 2019

Echoing the concern here. Take the following example:

Imagine another framework that mirrors the current TF design, say X.

  1. TF_ToDLPackOp increments the TF TensorBuffer ref count, and the tensor produced by the upstream TF_Op is ready for use in stream context A;
  2. X_FromDLPackOp executes in stream context B;
  3. a downstream op in X consumes this DLPack tensor in stream context B, and you call TF_DeallocateRaw, which immediately decrements the ref count;
  4. TF reuses the TensorBuffer and starts to write new data onto it in stream context A.

How do you plan to sync stream A in TF and stream B in X?

Contributor

@byronyi
@byronyi Minjie's comment (#180 (comment)) partially addressed your concern. You are right about this situation, as cross-stream ordering is not guaranteed by dlpack so far. One solution is to synchronize the stream that produced the tensor to ensure the memory is ready before the downstream framework executes, as you can see in MXNet.
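
For concreteness, a minimal CUDA sketch of that "sync the producing stream" approach, assuming the producer knows the stream on which the tensor was last written (producer_stream is a placeholder):

```cpp
// Sketch only: block until all work previously enqueued on the producing
// stream has finished, so the exported buffer is valid on any other stream.
#include <cuda_runtime.h>

void SyncBeforeExport(cudaStream_t producer_stream) {
  cudaError_t err = cudaStreamSynchronize(producer_stream);
  // In real code the error would be checked and propagated rather than ignored.
  (void)err;
}
```
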

@VoVAllen (Contributor)

@EvenOldridge Thanks for the RFC.

@brijk7 brijk7 changed the title DLpack support RFC RFC: DLpack support Nov 26, 2019
@brijk7 brijk7 added the RFC: Proposed RFC Design Document label Nov 26, 2019
@brijk7 (Contributor) commented Nov 26, 2019

@EvenOldridge : once you update emails of sponsors (and address any initial comments), I will announce the RFC to developers@tensorflow.org.
thanks!
brijesh

@VoVAllen (Contributor)

@EvenOldridge My email is zhoujinjing09@gmail.com if needed. Thanks!

Updated emails and tried to better frame the RFC to reflect the package solution vs a native solution.
@byronyi (Contributor) commented Jan 22, 2020

Recently I saw that @hawkinsp from the XLA team pulled DLPack into TF core for interoperability between JAX and other GPU libraries in tensorflow/tensorflow@fc1f6fd. Not sure if it's related to this RFC per se.

@hawkinsp

@byronyi That's right; I added DLPack support to the XLA:Python bindings for use in JAX (https://github.com/google/jax). That's somewhat separate from this RFC, which pertains to TensorFlow. However, since XLA and TensorFlow live in the same repository, the PR you linked did some preparatory work that may also assist with an implementation in TensorFlow (e.g., it adds DLPack and its headers to the TF bazel workspace).

Apropos of the discussion about synchronization, I chose to synchronize the device to the host when converting a JAX array to a DLManagedTensor. This seemed like the safest choice, given that DLPack doesn't have any mechanism to describe device-side synchronization (e.g., a CUDA event describing when a particular buffer becomes valid on the device would be an obvious choice in the case of CUDA).
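
As a sketch of that event-based alternative (which DLPack itself does not currently specify), the producer could record a CUDA event after the last write and the consumer could make its stream wait on it instead of doing a full device-to-host sync; producer_stream and consumer_stream are placeholders:

```cpp
// Sketch only: order the consumer's stream after the producer's last write
// without blocking the host.
#include <cuda_runtime.h>

void OrderConsumerAfterProducer(cudaStream_t producer_stream,
                                cudaStream_t consumer_stream) {
  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
  cudaEventRecord(ready, producer_stream);         // after the last kernel writing the buffer
  cudaStreamWaitEvent(consumer_stream, ready, 0);  // consumer work waits for the event
  cudaEventDestroy(ready);                         // safe: the wait was already enqueued
}
```
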

@dynamicwebpaige (Contributor)

Thanks again for submitting the RFC, @EvenOldridge, @jermainewang, and @VoVAllen!

Is there anything that we need to do to move this along? DLPack support is an exciting feature, and I want to make sure that we keep momentum. 🙂

@sanjoy (Contributor) left a comment

Hi @VoVAllen

Let's merge this RFC (normally we merge the RFC before merging the implementation PR, but I wanted to get that PR in ASAP).

Can you go through the RFC once and make sure it reflects the end state of the discussion? For instance, there is a "Notes from @alextp:" section which should be incorporated into the main text.

@VoVAllen (Contributor)

@sanjoy I created the modification PR EvenOldridge#2 to @EvenOldridge's branch.
@EvenOldridge Could you help merge this? Thanks!

Update 20191016-dlpack-support.md with VoVAllen's changes
@sanjoy (Contributor) left a comment

Some more wordsmithing comments.

Comment on lines 20 to 24
Why this is a valuable problem to solve? What background information is needed
to show how this design addresses the problem?

Which users are affected by the problem? Why is it a problem? What data supports
this? What related work exists?
Contributor

These questions can be deleted, right?

Contributor Author

Removed

Comment on lines 40 to 41
How will users (or other contributors) benefit from this work? What would be the
headline in the release notes or blog post?
Contributor

These questions should be deleted as well.

Contributor Author

Removed

Comment on lines 51 to 56
Notes from @alextp:

AFAICT it should be easy to take cuda pointers in and out of TF and use them to build dlpack structures from tensors or vice versa. The tricky part is that TF does not use cudamalloc to allocate memory but its own allocator whose internal state is stored on the CPU and matches the head of TF's compute stream, so we need to sync TF's stream before the memory is usable from dlpack and similarly sync other cuda streams before memory is made usable by TF tensors (and similarly we need to sync the streams when trying to free the buffers).

A working version of dlpack integration has been released as a package by coauthors @jermainewang and @VoVAllen here:
https://github.com/VoVAllen/tf-dlpack/issues/3
Contributor

I would prefer rephrasing text like this and making the proposal describe what was actually implemented.

Same for other "as mentioned by alextp" comments below.

Removed the questions.

## Objective

This document proposes the adoption of dlpack (https://github.com/dmlc/dlpack) as way of passing tensor data to other frameworks without leaving the GPU and without a copy per [24453](https://github.com/tensorflow/tensorflow/issues/24453). dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. dlpack is currently supported by cuPy, cuDF, DGM, TGL, PyTorch, and MxNet.


Spelling error? Should be DGL.

Contributor

Fixed in the latest PR.

@VoVAllen (Contributor) commented Mar 4, 2020

I made a new PR to @EvenOldridge's branch. Hope this looks better.

Update 20191016-dlpack-support.md

DLPack is a community effort to define a common tensor data structure that can be shared by different frameworks allowing data to be quickly shared often with zero or minimal copy. One of the main bottlenecks when trying to achieve GPU performance when operating across different frameworks is I/O and data formatting. The transfer of data between GPU and CPU or between formats is costly to the point where many operations become faster to simply run on the CPU because of the additional costs associated with moving/transforming the data. Even when mechanisms exist to copy data without leaving the GPU, memory constraints limit the application because two copies of the data are required. By implementing dlpack within TensorFlow there would be a way to transfer data directly between frameworks, enabling the development of a range of applications that weren't previously possible.

Existing applications that take advantage of dlpack include: (adding my own and those listed in , other contributions needed)
Contributor

(adding my own and those listed in , other contributions needed) seems malformed

Removed request for contributions, added thinc.ai framework interoperability.
@sanjoy (Contributor) left a comment

Some more minor nits, looks good otherwise.

Comment on lines 101 to 105
Proposed API implementation details:
- to_dlpack
- Implementing `TFE_HandleToDLPack`, which converts tf's eager tensor handle to dlpack tensor's pointer(`DLManagedTensor*`). And wrap it into PyCapsule to adapt to the Python interface in ffi binding file. For the underlying memory liveness, `TensorReference` is used to maintain the reference counting over the underlying `TensorBuffer`, which increases when creating dlpack tensor, and decreases in the deleter of dlpack tensor.
- from_dlpack
- Implementing `TFE_HandleFromDLPack`, which converts dlpack tensor's pointer(`DLManagedTensor*`) to tf's eager tensor handle. `TFE_TensorHandleDevicePointer` is used to get the data pointer of underlying buffer, and synchronize the related device to ensures the memory readiness.
Contributor

I believe this is not formatted correctly.


## Questions and Discussion Topics

https://github.com/tensorflow/tensorflow/issues/29039#issuecomment-527520270 Outlines the key issues that need to be addressed, namely that a synch is required to ensure the tensor information is valid. Supporting \_\_cuda_array_interface\_\_ is another option as well, although cuPy and cuDF have opted to support both and ideally Tensorflow would as well.
Contributor

s/Outlines/outlines/

Also, it would probably be more readable to have text for the hyperlink, like:

The key issues that need to be addressed are outlined here.

@VoVAllen (Contributor) commented Mar 5, 2020

EvenOldridge#4
Added a new PR to address the comments above, @EvenOldridge.

Update 20191016-dlpack-support.md
@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Apr 15, 2020
@ematejska ematejska merged commit dc06c7a into tensorflow:master Apr 15, 2020