|
| 1 | +# dlpack support for interoperability with other GPU frameworks |
| 2 | + |
| 3 | +| Status | Accepted | |
| 4 | +:-------------- |:---------------------------------------------------- | |
| 5 | +| **RFC #** | 180 (https://github.com/tensorflow/community/pull/180) | |
| 6 | +| **Author(s)** | eoldridge@nvidia.com, wmjlyjemaine@gmail.com, zhoujinjing09@gmail.com | |
| 7 | +| **Sponsor** | apassos@google.com, sanjoy@google.com | |
| 8 | +| **Updated** | 2019-11-26 | |
| 9 | + |
| 10 | +## Objective |
| 11 | + |
| 12 | +This document proposes adopting dlpack (https://github.com/dmlc/dlpack) as a way of passing tensor data to other frameworks without leaving the GPU and without a copy, as requested in issue [24453](https://github.com/tensorflow/tensorflow/issues/24453). dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. It is currently supported by CuPy, cuDF, DGL, TGL, PyTorch, and MXNet. |
| 13 | + |
| 14 | +Supporting dlpack would allow fast on-GPU communication between TensorFlow and these frameworks, opening up the wide range of use cases outlined below. It would also enable \_\_cuda_array_interface\_\_ interoperability by way of CuPy and cuDF, which support both methods, providing a path to Numba, PyArrow, and other frameworks that have adopted that interface. [A similar request has been made to support \_\_cuda_array_interface\_\_ directly](https://github.com/tensorflow/tensorflow/issues/29039), and ideally both would be supported. |
| 15 | + |
| 16 | +A solution has already been developed by @VoVAllen and @jermainewang (coauthors above) as an external Python package. This RFC would see the concepts from that package integrated into TensorFlow Core, and reviewed and enhanced by the TF team, so that dlpack support is native. |
| 17 | + |
| 18 | +## Motivation |
| 19 | + |
| 20 | +DLPack is a community effort to define a common tensor data structure that can be shared by different frameworks, allowing data to be exchanged quickly, often with zero or minimal copying. One of the main bottlenecks when trying to achieve GPU performance across different frameworks is I/O and data formatting. Transferring data between GPU and CPU, or converting between formats, is costly to the point where many operations are faster to simply run on the CPU once the cost of moving and transforming the data is included. Even when mechanisms exist to copy data without leaving the GPU, memory constraints limit the application because two copies of the data are required. Implementing dlpack within TensorFlow would provide a way to transfer data directly between frameworks, enabling a range of applications that weren't previously possible. |
| 21 | + |
| 22 | +Existing applications that take advantage of dlpack include: |
| 23 | + - Inline on-GPU preprocessing of tabular data using cuDF to prepare it for deep learning models (continuous normalization, categorical encoding, etc.), improving preprocessing performance by 10x over pandas on the CPU |
| 24 | + - A larger-than-CPU-memory dataloader that iterates over parquet files and batch loads tensors, providing a significant speedup over traditional dataloaders for tabular data |
| 25 | + - [End to end acceleration of training on GPU](https://medium.com/rapids-ai/accelerating-deep-learning-recommender-systems-by-15x-using-rapids-fastai-and-pytorch-b50b4d8568d1) |
| 26 | + - Use of TensorFlow in conjunction with [tvm](https://github.com/dmlc/tvm); [TF custom op implementation of TVM](https://github.com/tobegit3hub/tftvm) |
| 27 | + - Use of TensorFlow in conjunction with [dgl](https://github.com/dmlc/dgl) |
| 28 | + - Zero copy transfer of data in [DALI](https://github.com/NVIDIA/DALI), reducing memory requirements |
| 29 | + - [thinc.ai](https://thinc.ai/docs/usage-frameworks) framework interoperability. |
| 30 | + |
| 31 | +Beyond the benefit to specific applications, TensorFlow's adoption of dlpack would further encourage other frameworks that are considering it, as all three major DL frameworks would then support it. Finally, it would make applications that operate upstream and downstream of deep learning frameworks easier to develop, as a single framework-agnostic method could be used in conjunction with all DL frameworks. |
| 32 | + |
| 33 | +## User Benefit |
| 34 | + |
| 35 | +Users who wish to use other GPU-accelerated frameworks such as cuDF and CuPy would be able to do so without expensive copy operations. By doing data loading, feature engineering, and preprocessing directly on the GPU, we see 10-15x speedups in other frameworks over traditional CPU-based workflows for getting data model-ready; the same gains would be immediately available in TensorFlow. |
| 36 | + |
| 37 | +More generally, users would be able to develop preprocessing or other GPU-based functionality once and integrate it with all DL frameworks, simplifying development of solutions that sit upstream or downstream of deep learning models. |
| 38 | + |
| 39 | +A blog post or release notes headline could read "TensorFlow now supports dlpack, enabling interoperability with other GPU-powered frameworks like CuPy, cuDF, DGL, TGL, PyTorch, and MXNet." |
| 40 | + |
| 41 | +## Design Proposal |
| 42 | + |
| 43 | +A working version of dlpack integration has been released as a package by coauthors @jermainewang and @VoVAllen here: |
| 44 | +https://github.com/VoVAllen/tf-dlpack/issues/3 |
| 45 | + |
| 46 | +This proposal would leverage that solution and integrate it into TF so that the operations could be performed natively. |
| 47 | + |
| 48 | +### User experience |
| 49 | +We plan to release a Python package, tfdlpack, containing two APIs: |
| 50 | +``` |
| 51 | +to_dlpack: Given a tensorflow tensor, return a DLPack-compatible python capsule that wraps it. |
| 52 | +from_dlpack: Given a DLPack-compatible python capsule, return a tensorflow tensor. |
| 53 | +``` |
| 54 | + |
| 55 | +Example code for converting a TensorFlow tensor to a Torch tensor and back through DLPack, using the package: |
| 56 | +```python |
| 57 | +import numpy as np |
| 58 | +import tensorflow as tf |
| 59 | +import torch.utils.dlpack as thdlpack |
| 60 | +import tfdlpack |
| 61 | + |
| 62 | +t1 = tf.constant([1, 2, 3], dtype=np.float32) |
| 63 | +dlpack = tfdlpack.to_dlpack(t1) # tf tensor -> dlpack |
| 64 | +t2 = thdlpack.from_dlpack(dlpack) # dlpack -> th tensor |
| 65 | +print(t2) |
| 66 | +dlpack = thdlpack.to_dlpack(t2) # th tensor -> dlpack |
| 67 | +t3 = tfdlpack.from_dlpack(dlpack) # dlpack -> tf tensor |
| 68 | +print(t3) |
| 69 | +``` |
| 70 | +You will find that t1, t2 and t3 all have the same values, shape, and device contexts. |
| 71 | +Package dependency: tensorflow>=2.0 |
| 72 | + |
| 73 | +Proposed code for converting a TensorFlow tensor to a Torch tensor and back through DLPack natively: |
| 74 | +```python |
| 75 | +import numpy as np |
| 76 | +import tensorflow as tf |
| 77 | +import tensorflow.experimental.dlpack as tfdlpack |
| 78 | +import torch.utils.dlpack as thdlpack |
| 79 | + |
| 80 | + |
| 81 | +t1 = tf.constant([1, 2, 3], dtype=np.float32) |
| 82 | +dlpack = tfdlpack.to_dlpack(t1) # tf tensor -> dlpack |
| 83 | +t2 = thdlpack.from_dlpack(dlpack) # dlpack -> th tensor |
| 84 | +print(t2) |
| 85 | +dlpack = thdlpack.to_dlpack(t2) # th tensor -> dlpack |
| 86 | +t3 = tfdlpack.from_dlpack(dlpack) # dlpack -> tf tensor |
| 87 | +print(t3) |
| 88 | +``` |
| 89 | + |
| 90 | +Potential technical problems for this API: |
| 91 | +1. Memory usability on async device (to_dlpack) |
| 92 | +As mentioned by @alextp |
| 93 | +> TF does not use cudamalloc to allocate memory but its own allocator whose internal state is stored on the CPU and matches the head of TF's compute stream, so we need to sync TF's stream before the memory is usable from dlpack and similarly sync other cuda streams before memory is made usable by TF tensors (and similarly we need to sync the streams when trying to free the buffers). |
| 94 | +We therefore manually sync the device when exporting a TF tensor to dlpack. The sync is performed inside the `TFE_TensorHandleDevicePointer` API, which returns the pointer to the underlying memory. |
| 95 | + |
| 96 | +2. Memory management (avoid leak) (to_dlpack/from_dlpack) |
| 97 | +By the design of dlpack, the framework that constructs a tensor from a dlpack object is responsible for calling the dlpack object's deleter when the constructed tensor is destroyed. The deleter usually drops a reference to the underlying buffer. |
| 98 | +For `from_dlpack`, a deleter function is registered when the TF tensor is constructed, and it is called when that tensor is destroyed. |
| 99 | +For `to_dlpack`, the dlpack data structure holds a reference (through `TensorReference`) to the underlying buffer and calls `unref` on it in the dlpack deleter function. |
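The ownership contract described above can be modeled in plain Python. This is only an illustrative sketch, not the TensorFlow implementation: `Buffer`, `ManagedTensor`, `ImportedTensor`, and the `to_dlpack` function below are hypothetical stand-ins for `TensorBuffer`, `DLManagedTensor`, the consumer framework's tensor, and the export path, and an explicit counter stands in for `TensorReference`.

```python
class Buffer:
    """Hypothetical stand-in for a framework-owned, reference-counted buffer."""

    def __init__(self, data):
        self.data = data
        self.refcount = 1      # the producing tensor's own reference
        self.freed = False

    def ref(self):
        self.refcount += 1

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # memory would return to the allocator here


class ManagedTensor:
    """Hypothetical stand-in for DLManagedTensor: shared data plus a deleter."""

    def __init__(self, buffer):
        buffer.ref()           # exporting takes a reference on the buffer
        self.buffer = buffer
        self._deleted = False

    def deleter(self):
        # Drop the reference taken at export time, at most once.
        if not self._deleted:
            self._deleted = True
            self.buffer.unref()


def to_dlpack(buffer):
    """Export without copying: the managed tensor shares the same buffer."""
    return ManagedTensor(buffer)


class ImportedTensor:
    """Hypothetical consumer-side tensor built from a managed tensor."""

    def __init__(self, managed):
        self.managed = managed
        self.data = managed.buffer.data  # same object, no copy

    def release(self):
        # The consumer is responsible for calling the deleter on destruction.
        self.managed.deleter()


producer_buffer = Buffer([1.0, 2.0, 3.0])
imported = ImportedTensor(to_dlpack(producer_buffer))

producer_buffer.unref()   # the producing tensor goes away first
still_alive = not producer_buffer.freed
imported.release()        # the consumer finishes and calls the deleter
```

In this model the buffer outlives the producing tensor for as long as the consumer holds the managed tensor, and it is released only after the consumer calls the deleter.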
| 100 | + |
| 101 | +Proposed API implementation details: |
| 102 | +- to_dlpack |
| 103 | + - Implement `TFE_HandleToDLPack`, which converts TF's eager tensor handle to a dlpack tensor pointer (`DLManagedTensor*`), and wrap the result in a PyCapsule in the ffi binding file to fit the Python interface. To keep the underlying memory alive, `TensorReference` maintains the reference count on the underlying `TensorBuffer`: the count increases when the dlpack tensor is created and decreases in the dlpack tensor's deleter. |
| 104 | +- from_dlpack |
| 105 | + - Implement `TFE_HandleFromDLPack`, which converts a dlpack tensor pointer (`DLManagedTensor*`) to TF's eager tensor handle. `TFE_TensorHandleDevicePointer` is used to get the data pointer of the underlying buffer and to synchronize the related device to ensure the memory is ready. |
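The PyCapsule wrapping mentioned in the list above can be tried from pure Python through `ctypes.pythonapi`. The sketch below is an illustration under stated assumptions, not the proposed binding code: it wraps the address of a dummy `c_int64` rather than a real `DLManagedTensor*`, `wrap` and `consume` are hypothetical helper names, and it registers no capsule destructor. The capsule names `dltensor` and `used_dltensor` follow the convention used by DLPack's Python exchange and are not specified in this RFC.

```python
import ctypes

CAPSULE_NAME = b"dltensor"        # name of an unconsumed DLPack capsule
USED_NAME = b"used_dltensor"      # name after a consumer has taken ownership

# Declare the CPython capsule functions used below.
_api = ctypes.pythonapi
_api.PyCapsule_New.restype = ctypes.py_object
_api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
_api.PyCapsule_IsValid.restype = ctypes.c_int
_api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
_api.PyCapsule_GetPointer.restype = ctypes.c_void_p
_api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]
_api.PyCapsule_SetName.restype = ctypes.c_int
_api.PyCapsule_SetName.argtypes = [ctypes.py_object, ctypes.c_char_p]


def wrap(pointer):
    """Wrap a raw address in a capsule (no destructor in this sketch)."""
    return _api.PyCapsule_New(pointer, CAPSULE_NAME, None)


def consume(capsule):
    """Take the pointer out of the capsule and mark the capsule as used."""
    if not _api.PyCapsule_IsValid(capsule, CAPSULE_NAME):
        raise ValueError("not a DLPack capsule, or already consumed")
    pointer = _api.PyCapsule_GetPointer(capsule, CAPSULE_NAME)
    _api.PyCapsule_SetName(capsule, USED_NAME)
    return pointer


# Dummy payload standing in for a DLManagedTensor; it must outlive the capsule.
payload = ctypes.c_int64(42)
capsule = wrap(ctypes.addressof(payload))
address = consume(capsule)
recovered = ctypes.c_int64.from_address(address).value
```

Renaming the capsule after consumption is what lets a second `consume` call on the same capsule fail instead of handing the same pointer to two owners.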
| 106 | + |
| 107 | + |
| 108 | +## Questions and Discussion Topics |
| 109 | + |
| 110 | +https://github.com/tensorflow/tensorflow/issues/29039#issuecomment-527520270 outlines the key issues that need to be addressed, namely that a sync is required to ensure the tensor information is valid. Supporting [\_\_cuda_array_interface\_\_](https://github.com/tensorflow/tensorflow/issues/29039) is another option, although CuPy and cuDF have opted to support both, and ideally TensorFlow would as well. |
| 111 | + |
| 112 | +## Reference |
| 113 | + |
| 114 | +### tfdlpack package implementation detail |
| 115 | + |
| 116 | +The first design consideration is that we want to avoid any modification to the main TensorFlow library, to get around the potentially long delay of the PR, code review, and release cycle of the main TensorFlow package. Inspired by the solution from https://github.com/tobegit3hub/tftvm, we decided to implement the functionality as two custom tensor ops: to_dlpack and from_dlpack. |
| 117 | + |
| 118 | +We also want this feature to plug into other projects easily. For example, any project that relies on it should be able to run without compiling against TensorFlow's header files. An extra dependency usually means extra effort, and that maintenance is repetitive and should be handled by the feature developers (i.e., us) alone. To this end, we release it as a Python package. The question is then how to invoke the two custom tensor ops from Python. The challenge is that TensorFlow's custom op interface supports only a limited set of argument and return types, while to_dlpack and from_dlpack need an argument or return type that is a DLPack object. We work around this by encoding the address of a DLPack object as an integer, so that it can be accepted or returned by the custom op interface. We then decode it in Python or C depending on whether we return it (to_dlpack) or consume it (from_dlpack). |
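The address-as-integer workaround described above can be illustrated with `ctypes` alone. This is a minimal sketch, not the tfdlpack code: `FakeDLTensor` is a hypothetical two-field struct rather than the real DLPack layout, and `encode_address` and `decode_address` are illustrative names.

```python
import ctypes


class FakeDLTensor(ctypes.Structure):
    """Hypothetical stand-in struct; not the real DLPack layout."""
    _fields_ = [("ndim", ctypes.c_int), ("first_value", ctypes.c_float)]


def encode_address(obj):
    """Return the object's memory address as a plain Python int."""
    return ctypes.addressof(obj)


def decode_address(address):
    """Rebuild a view of the struct from its integer address (no copy)."""
    return FakeDLTensor.from_address(address)


# The original object must stay alive for as long as the integer is in use.
original = FakeDLTensor(ndim=1, first_value=3.0)
handle = encode_address(original)   # an int, so an integer-typed op can carry it
view = decode_address(handle)
view.first_value = 7.0              # writes through to the same memory
```

The integer carries no ownership information, which is why the lifetime rules described in the rest of this section have to be enforced separately through the deleter.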
| 119 | + |
| 120 | +Finally, for maximal efficiency, we want the conversion to happen without a memory copy. |
| 121 | + |
| 122 | +For to_dlpack, the returned DLPack tensor shares the memory address of the input TensorFlow tensor and holds a reference to it. When the DLPack tensor is destroyed, it drops its reference to the TensorFlow tensor so that the tensor can be collected by TensorFlow's memory management (inspired by PyTorch's DLPack implementation). |
| 123 | +For from_dlpack, it first creates an allocator object (a subclass of TensorFlow's allocator interface) that holds a reference to the DLPack tensor. The AllocateRaw function directly returns the memory it holds without creating any new buffer. Upon destruction, the DeallocateRaw function simply calls the deleter of the DLPack tensor (inspired by TensorFlow's immutable_constant_op). |