Compute the mask in-place, with less memory reads, and on CUDA on `XLNetLMHeadModel` #23332

lezcano · 2023-05-12T12:53:36Z

When working on TorchInductor, I realised that a part of `XLNetLMHeadModel` was being compiled to CPU code.

This PR should allow this operation to be fused with other CUDA operations under `torch.compile`. It should also be faster in eager mode, as this implementation has a lower memory footprint.

If in-place operations are not allowed even in a non-grad context, I still believe that doing ones + tril rather than ones + tril + zeros + cat should be faster, simply due to the number of memory reads/writes.

I tested that this code produces the same results for 0 <= qlen,mlen < 10 and same_length in (True, False).
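For readers who want to see the idea concretely, here is a minimal sketch of the two approaches. It is not the PR's diff: the function names `mask_with_cat` and `mask_in_place` and their signatures are hypothetical, and the masking convention (1 marks a masked position, memory columns come first) is an assumption. The first function builds the mask on the CPU out of several temporaries and moves it to the device at the end. The second allocates one tensor directly on the target device and applies the triangular cut in place.

```python
import torch


def mask_with_cat(qlen, mlen, same_length=False, device="cpu"):
    """Baseline sketch: ones + triangular cut + zeros + cat, then a device transfer."""
    ones = torch.ones(qlen, qlen)
    mask_up = torch.triu(ones, diagonal=1)   # strictly upper triangle
    pad = torch.zeros(qlen, mlen)            # memory columns are never masked
    ret = torch.cat([pad, mask_up], dim=1)   # shape: (qlen, mlen + qlen)
    if same_length:
        mask_lo = torch.tril(ones, diagonal=-1)  # strictly lower triangle
        ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1)
    return ret.to(device)


def mask_in_place(qlen, mlen, same_length=False, device="cpu"):
    """Sketch of the cheaper variant: one allocation on `device`, cut in place."""
    mask = torch.ones(qlen, qlen + mlen, device=device)
    if same_length:
        # Take the lower triangle out-of-place before `mask` is overwritten.
        mask_lo = mask[:, :qlen].tril(-1)
        mask.triu_(mlen + 1)
        mask[:, :qlen] += mask_lo
    else:
        # Shifting the diagonal by mlen replaces the zeros + cat step.
        mask.triu_(mlen + 1)
    return mask
```

A check along the lines of the one described above would compare the two with `torch.equal` over a small grid of `qlen`, `mlen` and `same_length` values.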

@ArthurZucker @younesbelkada


HuggingFaceDocBuilderDev · 2023-05-12T13:10:04Z

The documentation is not available anymore as the PR was closed or merged.

younesbelkada

Thanks a lot for this!

amyeroberts

Thanks for adding this!


younesbelkada approved these changes May 12, 2023


younesbelkada requested a review from amyeroberts May 12, 2023 13:18

amyeroberts approved these changes May 12, 2023


amyeroberts merged commit 7f8b909 into huggingface:main May 12, 2023
