ENH: Transfer data to cuda with non_blocking=True #31

Merged: 2 commits merged into davidtvs:master from scottclowe:enh_lazy-loading on May 1, 2020
Conversation

scottclowe
Contributor

@scottclowe scottclowe commented Apr 22, 2020

Transferring the target data to the cuda device with non_blocking=True should speed things up because the targets can transfer while the model is training.

I was considering adding an __init__ argument to control whether to use a non-blocking transfer, but as far as I am aware there are no negative consequences of always using non_blocking=True, so I have changed it to be always enabled.
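
For reference, the change amounts to something like the following sketch (the tensor and device names here are illustrative, not the library's actual code):

```python
import torch

device = torch.device("cuda")
targets = torch.randint(0, 10, (64,))

# Synchronous copy: the call blocks until the transfer completes.
targets_sync = targets.to(device)

# Asynchronous copy: the call returns immediately, so the transfer can
# overlap with GPU work that is already queued (effective when the
# source tensor lives in pinned memory).
targets_async = targets.to(device, non_blocking=True)
```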

Transferring the target data to the cuda device with
non_blocking=True should speed things up because the targets can
transfer while the model is training.
@davidtvs
Owner

Thanks for the PR.

I googled around and found no one complaining about negative side effects, and under the right conditions it improves performance; which raises the question of why non_blocking defaults to False in the first place.

I created a post in the PyTorch discussion forum asking the question and will wait a few days to see if someone can answer it before merging this PR. I'm guessing that there's no particular reason, but I just want to make sure.

@scottclowe
Contributor Author

Yes, I also searched around to see if there were possible downsides. I also don't understand why the default is False - I always turn it on for my own projects.

@NaleRaphael
Contributor

I haven't found any issue directly related to using non_blocking=True yet.

However, according to the official documentation since v0.4.0, the non_blocking parameter of tensor.cuda()/tensor.to() should only take effect when pin_memory=True is passed to the DataLoader. The default value of pin_memory is False, and I found the PyTorch developer smth explaining in this post why it is False by default:

pinned memory is page-locked memory. It is easy for users to shoot themselves in the foot if they enable page-locked memory for everything, because it cant be pre-empted. That is why we did not make it default True.

For a more detailed explanation, I found this post, which says:

It is possible for pinned memory allocation to fail, so you should always check for errors. The following code excerpt demonstrates allocation of pinned memory with error checking.

And there are some issues about setting pin_memory=True; here is one relating to a race condition: pytorch/pytorch#24927


Given the things mentioned above, I guess this is why non_blocking is set to False by default:

  • Performance can only be improved by non_blocking=True when pin_memory=True (i.e. it has no effect when pin_memory=False)
  • There is a potential risk in setting pin_memory=True (though it can speed up loading data onto the device)

Nevertheless, I second this idea since it benefits users.
But I have a small question: would it be better to add this argument to LRFinder.range_test() instead? In my opinion, non_blocking is an argument more related to the DataLoader than to LRFinder itself.
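
For reference, a minimal sketch of the pairing discussed above, with pin_memory=True on the DataLoader so that non_blocking=True can actually take effect (the dataset and sizes are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

# pin_memory=True makes the DataLoader collate each batch into page-locked
# (pinned) host memory, which is what allows asynchronous copies to the GPU.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

device = torch.device("cuda")
for inputs, targets in loader:
    # Because the batches are pinned, these copies can be issued
    # asynchronously and overlap with GPU work already in flight.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
```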

@scottclowe
Contributor Author

However, according to the official documentation since v0.4.0, the non_blocking parameter of tensor.cuda()/tensor.to() should only take effect when pin_memory=True is passed to the DataLoader.

Thanks for finding this! Not an obvious place for it to be documented.

But I have a small question: would it be better to add this argument to LRFinder.range_test() instead? In my opinion, non_blocking is an argument more related to the DataLoader than to LRFinder itself.

Yes, we could do that. It does indeed make more sense to add it to LRFinder.range_test() than LRFinder.__init__().

However, there is always a small cost to adding more user-facing parameters to the API (more documentation for the user to read, and a small cognitive burden of considering whether to change this extra parameter). If you think that nobody will ever have a need to change the parameter from a default value of True, perhaps it is not worth paying the cost of adding the ability to customise it. I would be happy to go either way on this though - if you would prefer to err on the side of flexibility, I'll add it as a parameter to LRFinder.range_test().

@NaleRaphael
Contributor

NaleRaphael commented Apr 25, 2020

Yeah, you're right. Since non_blocking=True should improve performance only when a pinned memory buffer is used, it seems fine to set it to True by default without exposing it as a controllable argument.

And I did a quick search and found some interesting things:

  1. fastai does the same thing (in both v1 and v2).
  2. catalyst also does the same thing.
  3. pytorch-lightning is also planning to implement this feature.
  4. But ignite makes it a keyword argument and keeps it False by default (just like PyTorch does).

Honestly, it's a bit hard to tell which approach is the right one to apply. But considering the things you mentioned, it seems better to keep the API as simple as possible. Also, no issue related to using non_blocking=True has been found so far. So it seems fine not to add non_blocking to either LRFinder.__init__() or LRFinder.range_test(). 😉

Besides, I wrote a simple script for this topic, and there are some interesting things worth digging into further: https://colab.research.google.com/drive/12hE2b9chFeIeHtqcbW0ZCSGLQK0GyFF6
(note that peak memory consumption is not currently considered)

  1. Using non_blocking=True without setting pin_memory=True still slightly improves data-loading time.
  2. Setting pin_memory=True without using non_blocking=True makes performance slightly worse. (Maybe I missed something needed to make CUDA DMA (direct memory access) work properly...) (UPDATE: it seems this result is not consistent)

Since it's not a benchmark made under strict conditions, it's for reference only.
BTW, I remember that fastai has done some hard work to make data loading faster. Here is a script you might be interested in, too: https://github.com/fastai/imagenet-fast/blob/faa0f9d/imagenet_nv/fastai_imagenet.py#L171-L206
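
For anyone who wants to reproduce this kind of comparison locally, here is a rough sketch (not the Colab script itself; the dataset size, batch size, and dummy computation are all arbitrary assumptions):

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_epoch(pin_memory, non_blocking, device="cuda"):
    # Dummy dataset, large enough that host-to-device copies matter.
    dataset = TensorDataset(torch.randn(512, 3, 224, 224),
                            torch.randint(0, 10, (512,)))
    loader = DataLoader(dataset, batch_size=64, pin_memory=pin_memory)

    torch.cuda.synchronize()
    start = time.time()
    for inputs, targets in loader:
        inputs = inputs.to(device, non_blocking=non_blocking)
        targets = targets.to(device, non_blocking=non_blocking)
        # A dummy computation so the next copy has GPU work to overlap with.
        _ = (inputs * 2).sum()
    torch.cuda.synchronize()
    return time.time() - start


for pin in (False, True):
    for nb in (False, True):
        print(f"pin_memory={pin}, non_blocking={nb}: {time_epoch(pin, nb):.3f}s")
```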

@veritas9872

In the CUDA programming model, asynchronous data transfer always requires pinned memory. Therefore, setting non_blocking=True always sends that data to pinned memory first. This might strain pinned memory if using 3D data or other very large mini-batches.

@davidtvs
Owner

@veritas9872 could you give more details on what "strain pinned memory" means?

@veritas9872

@davidtvs Pinned memory is non-pageable memory allocated by CUDA. Allocating too much of it causes the system to slow down and, in extreme cases, even to crash. Most 2D deep learning applications do not allocate that much pinned memory, so it is not much of a problem. However, the same trade-off applies to pre-fetching: the cases where pinned memory can be overloaded (a large amount of data has to be sent to the device) are also the cases where pre-fetching matters most.

@davidtvs
Owner

@veritas9872 thanks for the explanation.

So if I understand correctly, there should be no downside to non_blocking=True in the vast majority of use cases, whether pin_memory is True or False. And in cases where this might prove to be a problem, the user can still set pin_memory=False on the DataLoader, in which case non_blocking=True has no effect and it's as if it was set to False.

So in summary, it should be safe for us to always use non_blocking=True. If my logic is wrong, let me know; otherwise, I'll merge this as-is (non_blocking=True) in the coming days.

@NaleRaphael
Contributor

So if I understand correctly, there should be no downside to `non_blocking=True` in the vast majority of use cases, whether `pin_memory` is `True` or `False`. And in cases where this might prove to be a problem, the user can still set `pin_memory=False` on the `DataLoader`, in which case `non_blocking=True` has no effect and it's as if it was set to `False`.

That's exactly what I was thinking. Also thanks @veritas9872 for the explanation.

@veritas9872

So if I understand correctly, there should be no downside to `non_blocking=True` in the vast majority of use cases, whether `pin_memory` is `True` or `False`. And in cases where this might prove to be a problem, the user can still set `pin_memory=False` on the `DataLoader`, in which case `non_blocking=True` has no effect and it's as if it was set to `False`.

That's exactly what I was thinking. Also thanks @veritas9872 for the explanation.

I agree that non_blocking=True is a good idea but I would like to correct one misconception.

Even with pin_memory=False, pinned memory will still be allocated before an asynchronous transfer. There is simply no getting around pinning memory before asynchronous transfer. See cudaMemcpy for details.

There may be an advantage in that only the main process will be able to set pinned memory instead of all of the DataLoader's worker processes. I am not sure how DataLoader handles it. If DataLoader only allocates pinned memory for the main process anyway, there will be no difference.

Finally, there must be an option to set non_blocking=False and disable asynchronous transfer and pre-fetch for extremely large 3D volumes or other such cases. While this causes severe delays, I have had several cases where asynchronous transfer simply could not be made to work due to the size of the data.

Training slowly is better than having the computer go dark again and again.

@scottclowe
Contributor Author

Okay, thanks for the feedback everyone, especially @veritas9872.

I've now added a non_blocking_transfer argument to range_test (default True). It is part of the user-facing API and can be set to False to disable non_blocking for the obj.to(device) step.

This should provide the best user experience, as the transfer will automatically be asynchronous if possible, but the behaviour can be disabled in the rare cases that it causes a problem.
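
For reference, a usage sketch of the new argument (the toy model, optimizer, criterion, and loader below are made up to keep the example self-contained; only non_blocking_transfer is the new piece from this PR):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

from torch_lr_finder import LRFinder

# Toy setup purely for illustration.
model = nn.Linear(16, 2)
optimizer = optim.SGD(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,))),
    batch_size=32, pin_memory=True)

lr_finder = LRFinder(model, optimizer, criterion, device="cuda")

# Default behaviour after this PR: transfers use non_blocking=True.
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)
lr_finder.reset()

# Opt out in the rare cases where pinned-memory pressure is a problem.
lr_finder.range_test(train_loader, end_lr=10, num_iter=100,
                     non_blocking_transfer=False)
```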

This parameter controls whether to use the non_blocking
flag when transferring data from the CPU to the CUDA device.

In general, it is fine to use a non_blocking transfer, but
the option to disable it is provided in case the user has
problems with very large tensors allocated to pinned memory.
@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@1d1e3e7).
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master      #31   +/-   ##
=========================================
  Coverage          ?   66.66%           
=========================================
  Files             ?        2           
  Lines             ?      219           
  Branches          ?        0           
=========================================
  Hits              ?      146           
  Misses            ?       73           
  Partials          ?        0           
Flag Coverage Δ
#unittests 66.66% <0.00%> (?)


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d1e3e7...a6f3c69. Read the comment docs.

@davidtvs davidtvs merged commit c22e808 into davidtvs:master May 1, 2020
@davidtvs
Owner

davidtvs commented May 1, 2020

Merged, thanks @scottclowe for the PR and to everyone else for the discussion 🥇

@NaleRaphael
Contributor

NaleRaphael commented May 1, 2020

Setting pin_memory=False will still allocate pinned memory before asynchronous transfer. There is simply no going around pinning memory before asynchronous transfer. See cudaMemcpy for details.

@veritas9872 Thanks for pointing that out!

From the documentation of cudaMemcpy, I found the following statement

Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.

Then I found this post, which says:

UVA enables “Zero-Copy” memory, which is pinned host memory accessible by device code directly, over PCI-Express, without a memcpy.

Going back to the post How to Optimize Data Transfers in CUDA C/C++, there is a figure showing exactly what you said:
[figure from the post: pageable vs. pinned host-to-device data transfer]


As for this doubt:

There may be an advantage in that only the main process will be able to set pinned memory instead of all of the DataLoader's worker processes. I am not sure how DataLoader handles it. If DataLoader only allocates pinned memory for the main process anyway, there will be no difference.

I've tried my best to trace the source code, and here is what I found:
There is a pin_memory_thread which is responsible for moving data to pinned memory. Here is how it is created: _MultiProcessingDataLoaderIter.__init__(); and here is how it works: _pin_memory_thread.
Note that there is a comment in it:

# This setting is thread local, and prevents the copy in pin_memory from
# consuming all CPU cores.
torch.set_num_threads(1)

It seems that this line prevents pinned memory from being allocated by other worker threads. But it's just a guess; I haven't figured out how to test it.

Anyway, many thanks to all of you. This PR makes me learn a lot. 😄

@scottclowe scottclowe deleted the enh_lazy-loading branch May 2, 2020 07:02