ENH: Transfer data to cuda with non_blocking=True #31

Merged: 2 commits merged into davidtvs:master from scottclowe:enh_lazy-loading on May 1, 2020
Conversation

scottclowe
Contributor

@scottclowe scottclowe commented Apr 22, 2020

Transferring the target data to the cuda device with non_blocking=True should speed things up because the targets can transfer while the model is training.

I was considering adding an __init__ argument to control whether to use a non-blocking transfer, but as far as I am aware there are no negative consequences of always using non_blocking=True, so I have changed it to be always enabled.
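
For reference, the change amounts to something like the following sketch (the tensor and device names here are illustrative, not the library's actual code):

```python
import torch

device = torch.device("cuda")
targets = torch.randint(0, 10, (64,))

# Synchronous copy: the call blocks until the transfer completes.
targets_sync = targets.to(device)

# Asynchronous copy: the call returns immediately, so the transfer can
# overlap with GPU work that is already queued (effective when the
# source tensor lives in pinned memory).
targets_async = targets.to(device, non_blocking=True)
```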

Transferring the target data to the cuda device with
non_blocking=True should speed things up because the targets can
transfer while the model is training.
@davidtvs
Owner

Thanks for the PR.

I googled around and found no one complaining about negative side effects, and under the right conditions it improves performance; which raises the question of why non_blocking defaults to False in the first place.

I created a post in the PyTorch discussion forum asking the question and will wait a few days to see if someone can answer it before merging this PR. I'm guessing that there's no particular reason, but I just want to make sure.

@scottclowe
Contributor Author

Yes, I also searched around to see if there were possible downsides. I also don't understand why the default is False - I always turn it on for my own projects.

@NaleRaphael
Contributor

I haven't found any issue directly related to using non_blocking=True yet.

However, according to the official documentation since v0.4.0, the non_blocking parameter of tensor.cuda()/tensor.to() should only take effect when pin_memory=True is passed to the DataLoader. The default value of pin_memory is False, and I found the PyTorch developer smth explaining in this post why it is False by default:

pinned memory is page-locked memory. It is easy for users to shoot themselves in the foot if they enable page-locked memory for everything, because it cant be pre-empted. That is why we did not make it default True.

For a more detailed explanation, I found this post, which says:

It is possible for pinned memory allocation to fail, so you should always check for errors. The following code excerpt demonstrates allocation of pinned memory with error checking.

And there are some issues about setting pin_memory=True; here is one relating to a race condition: pytorch/pytorch#24927


Given the things mentioned above, I guess this is why non_blocking is set to False by default:

  • Performance can only be improved by non_blocking=True when pin_memory=True (i.e. it has no effect when pin_memory=False)
  • There is a potential risk in setting pin_memory=True (though it can speed up loading data onto the device)

Nevertheless, I second this idea since it benefits users.
But I have a small question: would it be better to add this argument to LRFinder.range_test() instead? In my opinion, non_blocking is an argument more related to the DataLoader than to LRFinder itself.
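
For reference, a minimal sketch of the pairing discussed above, with pin_memory=True on the DataLoader so that non_blocking=True can actually take effect (the dataset and sizes are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

# pin_memory=True makes the DataLoader collate each batch into page-locked
# (pinned) host memory, which is what allows asynchronous copies to the GPU.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

device = torch.device("cuda")
for inputs, targets in loader:
    # Because the batches are pinned, these copies can be issued
    # asynchronously and overlap with GPU work already in flight.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
```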

@scottclowe
Contributor Author

However, according to the official documentation since v0.4.0, the non_blocking parameter of tensor.cuda()/tensor.to() should only take effect when pin_memory=True is passed to the DataLoader.

Thanks for finding this! Not an obvious place for it to be documented.

But I have a small question: would it be better to add this argument to LRFinder.range_test() instead? In my opinion, non_blocking is an argument more related to the DataLoader than to LRFinder itself.

Yes, we could do that. It does indeed make more sense to add it to LRFinder.range_test() than LRFinder.__init__().

However, there is always a small cost to adding more user-facing parameters to the API (more documentation for the user to read, and a small cognitive burden of considering whether to change this extra parameter). If you think that nobody will ever have a need to change the parameter from a default value of True, perhaps it is not worth paying the cost of adding the ability to customise it. I would be happy to go either way on this though - if you would prefer to err on the side of flexibility, I'll add it as a parameter to LRFinder.range_test().

@NaleRaphael
Contributor

NaleRaphael commented Apr 25, 2020

Yeah, you're right. Since non_blocking=True should improve performance only when a pinned memory buffer is used, it seems fine to set it to True by default without exposing it as a controllable argument.

And I did a quick search and found some interesting things:

  1. fastai does the same thing (in both v1 and v2).
  2. catalyst also does the same thing.
  3. pytorch-lightning is also planning to implement this feature.
  4. But ignite makes it a keyword argument and keeps it False by default (just like PyTorch does).

Honestly, it's a bit hard to tell which approach is the right one to apply. But considering the things you mentioned, it seems better to keep the API as simple as possible. Also, no issue related to using non_blocking=True has been found so far. So it seems fine not to add non_blocking to either LRFinder.__init__() or LRFinder.range_test(). 😉

Besides, I wrote a simple script for this topic, and there are some interesting things worth digging into further: https://colab.research.google.com/drive/12hE2b9chFeIeHtqcbW0ZCSGLQK0GyFF6
(note that peak memory consumption is not currently considered)

  1. Using non_blocking=True without setting pin_memory=True still slightly improves data-loading time.
  2. Setting pin_memory=True without using non_blocking=True makes performance slightly worse. (Maybe I missed something needed to make CUDA DMA (direct memory access) work properly...) (UPDATE: it seems this result is not consistent)

Since it's not a benchmark made under strict conditions, it's for reference only.
BTW, I remember that fastai has done some hard work to make data loading faster. Here is a script you might be interested in, too: https://github.com/fastai/imagenet-fast/blob/faa0f9d/imagenet_nv/fastai_imagenet.py#L171-L206
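
For anyone who wants to reproduce this kind of comparison locally, here is a rough sketch (not the Colab script itself; the dataset size, batch size, and dummy computation are all arbitrary assumptions):

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_epoch(pin_memory, non_blocking, device="cuda"):
    # Dummy dataset, large enough that host-to-device copies matter.
    dataset = TensorDataset(torch.randn(512, 3, 224, 224),
                            torch.randint(0, 10, (512,)))
    loader = DataLoader(dataset, batch_size=64, pin_memory=pin_memory)

    torch.cuda.synchronize()
    start = time.time()
    for inputs, targets in loader:
        inputs = inputs.to(device, non_blocking=non_blocking)
        targets = targets.to(device, non_blocking=non_blocking)
        # A dummy computation so the next copy has GPU work to overlap with.
        _ = (inputs * 2).sum()
    torch.cuda.synchronize()
    return time.time() - start


for pin in (False, True):
    for nb in (False, True):
        print(f"pin_memory={pin}, non_blocking={nb}: {time_epoch(pin, nb):.3f}s")
```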

@veritas9872

In the CUDA programming model, asynchronous data transfer always requires pinned memory. Therefore, setting non_blocking=True always sends that data to pinned memory first. This might strain pinned memory if using 3D data or other very large mini-batches.

@davidtvs
Owner

@veritas9872 could you give more details on what "strain pinned memory" means?

@veritas9872

@davidtvs Pinned memory is non-pageable memory allocated by CUDA. Allocating too much of it causes the system to slow down and, in extreme cases, even to crash. Most 2D deep learning applications do not allocate that much pinned memory, so it is not much of a problem. However, the same trade-off applies to pre-fetching: the cases where pinned memory can be overloaded (a large amount of data has to be sent to the device) are also the cases where pre-fetching matters most.

@davidtvs
Owner

@veritas9872 thanks for the explanation.

So if I understand correctly, there should be no downside to non_blocking=True in the vast majority of use cases, whether pin_memory is True or False. And in cases where this might prove to be a problem, the user can still set pin_memory=False on the DataLoader, in which case non_blocking=True has no effect and it's as if it was set to False.

So in summary, it should be safe for us to always use non_blocking=True. If my logic is wrong, let me know; otherwise, I'll merge this as-is (non_blocking=True) in the coming days.

@NaleRaphael
Contributor

So if I understand correctly, there should be no downside to `non_blocking=True` in the vast majority of use cases, whether `pin_memory` is `True` or `False`. And in cases where this might prove to be a problem, the user can still set `pin_memory=False` on the `DataLoader`, in which case `non_blocking=True` has no effect and it's as if it was set to `False`.

That's exactly what I was thinking. Also thanks @veritas9872 for the explanation.

@veritas9872

So if I understand correctly, there should be no downside to `non_blocking=True` in the vast majority of use cases, whether `pin_memory` is `True` or `False`. And in cases where this might prove to be a problem, the user can still set `pin_memory=False` on the `DataLoader`, in which case `non_blocking=True` has no effect and it's as if it was set to `False`.

That's exactly what I was thinking. Also thanks @veritas9872 for the explanation.

I agree that non_blocking=True is a good idea but I would like to correct one misconception.

Even with pin_memory=False, pinned memory will still be allocated before an asynchronous transfer. There is simply no getting around pinning memory before asynchronous transfer. See cudaMemcpy for details.

There may be an advantage in that only the main process will be able to set pinned memory instead of all of the DataLoader's worker processes. I am not sure how DataLoader handles it. If DataLoader only allocates pinned memory for the main process anyway, there will be no difference.

Finally, there must be an option to set non_blocking=False and disable asynchronous transfer and pre-fetch for extremely large 3D volumes or other such cases. While this causes severe delays, I have had several cases where asynchronous transfer simply could not be made to work due to the size of the data.

Training slowly is better than having the computer go dark again and again.

@scottclowe
Contributor Author

Okay, thanks for the feedback everyone, especially @veritas9872.

I've now added a non_blocking_transfer argument to range_test (default True). It is part of the user-facing API and can be set to False to disable non_blocking for the obj.to(device) step.

This should provide the best user experience, as the transfer will automatically be asynchronous if possible, but the behaviour can be disabled in the rare cases that it causes a problem.
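
For reference, a usage sketch of the new argument (the toy model, optimizer, criterion, and loader below are made up to keep the example self-contained; only non_blocking_transfer is the new piece from this PR):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

from torch_lr_finder import LRFinder

# Toy setup purely for illustration.
model = nn.Linear(16, 2)
optimizer = optim.SGD(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,))),
    batch_size=32, pin_memory=True)

lr_finder = LRFinder(model, optimizer, criterion, device="cuda")

# Default behaviour after this PR: transfers use non_blocking=True.
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)
lr_finder.reset()

# Opt out in the rare cases where pinned-memory pressure is a problem.
lr_finder.range_test(train_loader, end_lr=10, num_iter=100,
                     non_blocking_transfer=False)
```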

This parameter controls whether to use the non_blocking
flag when transferring data from the CPU to the CUDA device.

In general, it is fine to use a non_blocking transfer, but
the option to disable it is provided in case the user has
problems with very large tensors allocated to pinned memory.
@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@1d1e3e7).
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master      #31   +/-   ##
=========================================
  Coverage          ?   66.66%           
=========================================
  Files             ?        2           
  Lines             ?      219           
  Branches          ?        0           
=========================================
  Hits              ?      146           
  Misses            ?       73           
  Partials          ?        0           
Flag Coverage Δ
#unittests 66.66% <0.00%> (?)


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d1e3e7...a6f3c69. Read the comment docs.

@davidtvs davidtvs merged commit c22e808 into davidtvs:master May 1, 2020
@davidtvs
Owner

davidtvs commented May 1, 2020

Merged, thanks @scottclowe for the PR and to everyone else for the discussion 🥇

@NaleRaphael
Contributor

NaleRaphael commented May 1, 2020

Setting pin_memory=False will still allocate pinned memory before asynchronous transfer. There is simply no going around pinning memory before asynchronous transfer. See cudaMemcpy for details.

@veritas9872 Thanks for pointing that out!

From the documentation of cudaMemcpy, I found the following statement

Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing.

Then I found this post, which says:

UVA enables “Zero-Copy” memory, which is pinned host memory accessible by device code directly, over PCI-Express, without a memcpy.

Going back to the post How to Optimize Data Transfers in CUDA C/C++, there is a figure showing exactly what you said:
[figure from the post: pageable vs. pinned host-to-device data transfer]


As for this doubt:

There may be an advantage in that only the main process will be able to set pinned memory instead of all of the DataLoader's worker processes. I am not sure how DataLoader handles it. If DataLoader only allocates pinned memory for the main process anyway, there will be no difference.

I've tried my best to trace the source code, and here is what I found:
There is a pin_memory_thread which is responsible for moving data to pinned memory. Here is how it is created: _MultiProcessingDataLoaderIter.__init__(); and here is how it works: _pin_memory_thread.
Note that there is a comment in it:

# This setting is thread local, and prevents the copy in pin_memory from
# consuming all CPU cores.
torch.set_num_threads(1)

It seems that this line prevents pinned memory from being allocated by other worker threads. But it's just a guess; I haven't figured out how to test it.

Anyway, many thanks to all of you. This PR makes me learn a lot. 😄

@scottclowe scottclowe deleted the enh_lazy-loading branch May 2, 2020 07:02