
Replacing CudaAsyncBuffer with TArray to improve perf #3303

Merged · 6 commits · Mar 24, 2020

Conversation

weixingzhang (Contributor)

Description:
Use TArray instead of CudaAsyncBuffer to pass tensor metadata (mostly tensor strides/pitches) to CUDA kernels, improving performance.

Motivation and Context
With CudaAsyncBuffer, the sequence on the host side is as follows:

  1. Fill the metadata into CPU pinned memory.
  2. Call cudaMemcpyAsync to upload the metadata from CPU pinned memory to GPU memory.
  3. Launch the CUDA kernel.

On the device (GPU) side, the copy in step 2 is not executed until the kernel in step 3 is about to run on the GPU. The host side usually runs far ahead of the device side, so it is better to upload the metadata early (at kernel launch time) on the host side rather than waiting until kernel execution, which saves time on the GPU. To achieve this, we introduced TArray, which passes the data to CUDA kernels by value. On BERT-L training, we saw about a 3-5% perf improvement.
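For illustration, here is a minimal sketch of the pass-by-value idea, assuming a hypothetical fixed-capacity struct (the name `TArraySketch`, its capacity, and the example kernel are placeholders, not the actual onnxruntime TArray API): the metadata travels with the kernel launch parameters, so no separate pinned-memory upload is needed.

```cpp
// Hypothetical sketch, not the actual onnxruntime TArray: a small POD struct
// with a compile-time capacity that is passed to the kernel by value, so the
// strides/pitches arrive as ordinary kernel-launch arguments instead of being
// staged in pinned host memory and uploaded with cudaMemcpyAsync.
#include <cuda_runtime.h>
#include <cstdint>

template <typename T, int Capacity = 8>
struct TArraySketch {
  T data[Capacity];   // metadata stored inline; copied by value at launch time
  int32_t size = 0;
};

// Example kernel: gathers rows of `src` according to a small index table that
// is passed by value, so the GPU never waits on a separate metadata copy.
__global__ void GatherRows(const float* src, float* dst,
                           TArraySketch<int32_t> row_map, int32_t row_size) {
  int row = blockIdx.x;
  int col = threadIdx.x;
  if (row < row_map.size && col < row_size) {
    dst[row * row_size + col] = src[row_map.data[row] * row_size + col];
  }
}
```

Because the struct is copied into the kernel's parameter buffer at launch, the upload effectively happens on the host at launch time, which is exactly the reordering this PR relies on.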

weixingzhang requested a review from a team as a code owner on March 23, 2020 at 22:39
@ke1337 (Contributor) commented on Mar 23, 2020

" On BERT-L training, we saw about 3-5% perf improvement." I assume it's for non-fused model right? The change does not seem to touch contrib ops

@ke1337 (Contributor) commented on Mar 23, 2020

In general, I have concerns about regressions from this change that are not caught by existing tests. I have added comments on some ops where the fixed-size TArray may overflow.

@ke1337 (Contributor) left a review comment: 🕐

@weixingzhang (Contributor, Author)

" On BERT-L training, we saw about 3-5% perf improvement." I assume it's for non-fused model right? The change does not seem to touch contrib ops

Yes. In BERT-L, we only fused Layernorm, GELU, and Transpose+Matmul.

@weixingzhang (Contributor, Author)

> In general, I have concerns about regressions from this change that are not caught by existing tests. I have added comments on some ops where the fixed-size TArray may overflow.

Agreed. I just updated the PR to keep CudaAsyncBuffer for non_max_suppression, cudnn_rnn_base, concat, and split. Thanks!
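For context, here is a minimal sketch of why those ops keep the old path, under the assumption (names and the capacity of 8 are illustrative, not taken from this PR) that the pass-by-value array has a compile-time capacity: metadata whose length depends on the number of inputs, as in Concat/Split, may not fit and needs a fallback to CudaAsyncBuffer.

```cpp
// Illustrative only; FixedMetaArray and its capacity are assumptions, not the
// actual onnxruntime code. A fixed-capacity, pass-by-value array cannot hold
// metadata whose length is unbounded at compile time, so callers need a
// fits-check and a fallback to the pinned-memory + cudaMemcpyAsync path.
#include <cstddef>
#include <cstdint>

template <typename T, int Capacity = 8>
struct FixedMetaArray {
  T data[Capacity];
  int32_t size = 0;

  // Refuses to overflow: returns false so the caller can fall back to a
  // CudaAsyncBuffer-style upload instead of truncating the metadata.
  bool TrySet(const T* src, size_t n) {
    if (n > static_cast<size_t>(Capacity)) return false;
    for (size_t i = 0; i < n; ++i) data[i] = src[i];
    size = static_cast<int32_t>(n);
    return true;
  }
};
```

Ops whose metadata length is bounded by the tensor rank fit this pattern; ops whose metadata scales with the input or output count are the ones kept on CudaAsyncBuffer above.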

ke1337 previously approved these changes on Mar 24, 2020
@ke1337 (Contributor) left a review: :shipit:


weixingzhang merged commit fef7989 into master on Mar 24, 2020
weixingzhang deleted the wezhan/asyncbuffer branch on March 24, 2020 at 19:13