Sharing mData between device and staging tensors #130

Closed
wants to merge 1 commit

Conversation

@alexander-g (Contributor) commented Feb 5, 2021

Currently, when creating a tensor, the data is copied several times:

  • in the Tensor constructor
  • in OpTensorCreate when creating a staging tensor
  • the staging tensor additionally allocates a host-side Vulkan buffer to copy the data into

Optionally for input/output tensors:

  • in OpTensorSyncDevice/OpTensorSyncLocal when creating another staging tensor
  • which creates another host-side Vulkan buffer

This is 5x the required memory. Additionally, those staging tensors are not destroyed because sequences are never deleted (#36), which creates a memory leak. This is unacceptable when working with large amounts of data; I have already run out of host memory several times.

To mitigate this, I've converted Tensor::mData from a std::vector to a std::shared_ptr<std::vector> so that the data can be shared between device tensors and staging tensors, reducing the memory footprint somewhat.
In the long term, larger refactoring is needed (especially #36 and #14).
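
A minimal sketch of the idea (hypothetical, simplified names rather than the actual Kompute classes): the device tensor and the staging tensor hold the same shared pointer, so only one host-side copy of the data exists and it is freed once the last tensor referencing it is destroyed.

#include <memory>
#include <vector>

// Simplified stand-in for the Tensor class; mData is a shared_ptr
// instead of a plain std::vector, as proposed in this PR.
class Tensor {
public:
    // Device tensor: wraps the user's data in a shared vector (one copy).
    explicit Tensor(std::vector<float> data)
        : mData(std::make_shared<std::vector<float>>(std::move(data))) {}

    // Staging tensor: shares the existing vector, no additional host copy.
    explicit Tensor(std::shared_ptr<std::vector<float>> shared)
        : mData(std::move(shared)) {}

    std::vector<float>& data() { return *mData; }
    std::shared_ptr<std::vector<float>> dataSharedPtr() const { return mData; }

private:
    std::shared_ptr<std::vector<float>> mData;
};

int main() {
    Tensor device(std::vector<float>(1 << 20, 1.0f)); // ~4 MB of floats
    Tensor staging(device.dataSharedPtr());            // shares, does not copy
    return staging.data().size() == device.data().size() ? 0 : 1;
}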

@axsaucedo (Member)
Interesting @alexander-g - ok, so just as a heads up: currently, the only reason the data function was changed from return-by-value to return-by-reference (and not return-by-const-ref) is to simplify the Python examples, which can then perform operations like tensor[0]++ or tensor[0] = .... Without this, that would become harder.

This is something I have been thinking about: really, the tensor data should only be modifiable through setters and getters instead of directly as a reference, mainly because this would allow users to know when the Tensor has been modified. In that case, we could actually have the data() function return a const ref, and data_sp() could be a friend function so that a Tensor could, for example, be initialised by passing another tensor.
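
For illustration, a rough sketch of that setter/getter direction (hypothetical names, not an actual Kompute API): data() becomes read-only, setData() is the single write path so the tensor can record that its host-side data changed, and the underlying vector stays a shared_ptr so a staging tensor can still share it as in this PR.

#include <memory>
#include <stdexcept>
#include <vector>

class Tensor {
public:
    explicit Tensor(std::vector<float> data)
        : mData(std::make_shared<std::vector<float>>(std::move(data))) {}

    // Read-only view for users; direct element writes go away.
    const std::vector<float>& data() const { return *mData; }

    // The single write path, so the Tensor can flag that its host data
    // changed and a sync to the device is needed before the next dispatch.
    void setData(const std::vector<float>& newData) {
        if (newData.size() != mData->size())
            throw std::runtime_error("setData: size mismatch");
        *mData = newData;
        mHostDataStale = true;
    }

    bool hostDataStale() const { return mHostDataStale; }

private:
    std::shared_ptr<std::vector<float>> mData; // still shareable with a staging tensor
    bool mHostDataStale = false;
};

int main() {
    Tensor t(std::vector<float>(16, 0.0f));
    t.setData(std::vector<float>(16, 3.0f)); // marks host data as stale
    return t.hostDataStale() && t.data()[0] == 3.0f ? 0 : 1;
}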

I do understand that your proposal in a parallel discussion is to explore removing the vector altogether from inside the Tensor object, and only having the GPU memory as the main memory location. This is something that we could explore, but my only concern is what the cost of accessing host-visible memory is compared to storing it in a vector - do you know what I mean?

In theory, if the cost of returning the value currently located in host-visible memory is acceptable, this could be a way to get rid of the std::vector: when data() is called, the vector could be created from the underlying data pointer and the memory size (offset).
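
A sketch of that alternative, with the same caveat that the names are hypothetical: no std::vector lives inside the Tensor at all, and data() builds one on demand from the mapped host-visible pointer and the known element count (the pointer is assumed to come from an earlier vkMapMemory call on the staging buffer's memory).

#include <cstddef>
#include <vector>

class Tensor {
public:
    // mappedPtr stands in for the persistently mapped host-visible memory of
    // the staging buffer, e.g. the pointer returned by vkMapMemory.
    Tensor(float* mappedPtr, std::size_t elementCount)
        : mMappedPtr(mappedPtr), mElementCount(elementCount) {}

    // No vector member: a vector is materialised only when asked for,
    // straight from the mapped pointer and the stored size.
    std::vector<float> data() const {
        return std::vector<float>(mMappedPtr, mMappedPtr + mElementCount);
    }

private:
    float* mMappedPtr;
    std::size_t mElementCount;
};

int main() {
    float mapped[4] = {1, 2, 3, 4};    // stand-in for mapped Vulkan memory
    Tensor t(mapped, 4);
    std::vector<float> out = t.data(); // copies out of the "mapped" region
    return out.size() == 4 ? 0 : 1;
}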

What are your thoughts on this? We could explore merging this initially, as I do understand the need for sharing the std::vector across staging and device tensors as a starting point - it's key to address the memory leak you identified - but I would actually be very keen to explore this discussion further.

Let me know your thoughts and then we'll proceed from there.

@alexander-g (Contributor, Author) commented Feb 7, 2021

This is something that we could explore, but my only concern is what the cost of accessing host-visible memory is compared to storing it in a vector - do you know what I mean?

No, I'm not completely sure what you mean.
I want to go even further than you think: I suggest removing not only the vector from tensors but also operations like tensor[0]++ or tensor[0] = ..., because (no offense) this is toy functionality compared to what numpy (or Eigen, or xtensor) offers. I've never used these operations so far, because I'd use numpy on the CPU anyway: loading data from disk: numpy, displaying an image: numpy, slicing images: numpy. Extending these operations would basically mean reinventing the wheel.

I propose maybe even renaming Tensor to something like Buffer, which would only be a data transfer interface to/from the GPU. The API might look like this:

A = mgr.create_buffer(256) #256 bytes
B = mgr.create_buffer(256) #immediately allocated on the device
C = mgr.create_buffer(256)
D = mgr.create_buffer(256)

#no need for OpTensorCreate

sequence.record_transfer_to_device([A,B])
sequence.record_algo(mult_shader, [C,A,B]) #C = A*B
sequence.record_algo(add_shader,  [D,C,B]) #D = C+B
sequence.record_transfer_to_host([D])

#assign np.ones to buffer A, np.arange to buffer B, execute, receive D
d = sequence.eval(np.ones(256), np.arange(256))

assert isinstance(d, np.ndarray)

Buffer C would be device-only (what eStorage currently is)


We could explore merging this initially

Sure

@axsaucedo (Member)
@alexander-g I've done some further research this weekend and have opened an issue with the details, which basically encompasses a redesigned approach to what this PR addresses - it would be good to hear your thoughts: #136

@axsaucedo (Member)
@alexander-g closing this PR in favour of the proposal in #136 - it would be good to hear your thoughts on that proposal as well.

@axsaucedo closed this Feb 8, 2021