
Fix Attention Runtime Error for CLIP model #17729

Merged: 2 commits merged into main on Sep 28, 2023

Conversation

tianleiwu (Contributor) commented Sep 28, 2023

Description

The condition check is incorrect:

if (is_unidirectional_ && enable_fused_causal_attention_) {  // GPT
   ...
}
else { // BERT
   ...
}

Change it to:

if (is_unidirectional_) {  // GPT
    if (enable_fused_causal_attention_) {
       ....
    }
}
else { // BERT
    ...
}

There are two workarounds when using an older ORT binary (<= 1.16):
(1) Enable fused causal attention by setting the environment variable ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1 before running stable diffusion (see the sketch below). However, the fused causal attention kernel accumulates in fp16, so this workaround might introduce precision loss.
(2) Disable attention fusion in CLIP. However, this causes a performance loss.
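
A minimal sketch of workaround (1), assuming a Python script that runs the optimized stable diffusion CLIP model with onnxruntime; the model path and provider list below are only illustrative:

```
import os

# Workaround (1) for ORT <= 1.16: opt in to the fused causal attention kernel.
# Set the variable before creating the InferenceSession so the CUDA Attention
# kernel picks it up. Note the fused kernel accumulates in fp16, which may
# introduce some precision loss.
os.environ["ORT_ENABLE_FUSED_CAUSAL_ATTENTION"] = "1"

import onnxruntime as ort

# "clip_optimized.onnx" is a placeholder path to the optimized CLIP model.
session = ort.InferenceSession("clip_optimized.onnx",
                               providers=["CUDAExecutionProvider"])
```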

Motivation and Context

Without the fix, the optimized CLIP model of stable diffusion encounters the following error when running the Attention node:

2023-09-24 16:15:31.206037898 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Attention node. Name:'Attention_0' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.cu:207 bool onnxruntime::contrib::cuda::FusedMHARunnerFP16v2::mhaImpl::is_flash_attention(int) const interface->mHasCausalMask == false was false.

Note that the bug has existed for a long time. It only surfaced because we recently added a fusion for CLIP, which triggers the error.

We will add a comprehensive test for causal attention later to avoid such corner cases.

wangyems previously approved these changes Sep 28, 2023
@tianleiwu merged commit 20f96fd into main Sep 28, 2023
@tianleiwu deleted the tlwu/fix_attention_if_causal branch September 28, 2023 21:32
snnn pushed a commit that referenced this pull request Sep 29, 2023
snnn (Member) commented Oct 2, 2023

Cherry-picked to the 1.16.1 patch release

@faxu added the triage:approved (Approved for cherrypicks for release) and sdxl_llama labels Oct 25, 2023
@tianleiwu removed the triage:approved (Approved for cherrypicks for release) and release:1.16.2 labels Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024