- 
                Notifications
    You must be signed in to change notification settings 
- Fork 530
Support multistream of MLA vector operations #1135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| This pull request has conflicts, please resolve those before we can evaluate the pull request. | 
546e6c7    to
    5b20d64      
    Compare
  
    d4eac5b    to
    4967ade      
    Compare
  
    | This pull request has conflicts, please resolve those before we can evaluate the pull request. | 
4967ade    to
    be59b9b      
    Compare
  
    | This pull request has conflicts, please resolve those before we can evaluate the pull request. | 
be59b9b    to
    dcea3aa      
    Compare
  
    dcea3aa    to
    980b356      
    Compare
  
    | This pull request has conflicts, please resolve those before we can evaluate the pull request. | 
    
      
        1 similar comment
      
    
  
    | This pull request has conflicts, please resolve those before we can evaluate the pull request. | 
980b356    to
    b7e7700      
    Compare
  
    b7e7700    to
    fd0f6fa      
    Compare
  
    | This pull request has conflicts, please resolve those before we can evaluate the pull request. | 
With the expected overlaping being:
```
              | q_rmsnorm |  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV |    matmul W_UQ     | split | matmul W_KV_T |
```
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
    fd0f6fa    to
    c89bf7b      
    Compare
  
    | there is a pr for refactor. I suggest to block this PR until that one is merged #1169 | 
| decode_k_pe, decode_k_nope = self.exec_kv( | ||
| hidden_states_or_kv_c_normed, cos, sin, kv_cache, | ||
| attn_metadata.slot_mapping) | ||
| with npu_stream_switch("mla_secondary", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this npu_stream_switch can be used inside the torchair right? if that's so, can we further optimize the dbo path of deepseek to gain more performance boost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with torchair colleagues, the with torch.npu.stream(...): scheme is not supported by torchair graph mode while npu_stream_switch only works under graph mode, so I recommend leaving this to later PRs:
- to provide a joint version of npu_stream_switchencapsulating both cases, as well as,
- adding graph mode support for dbo
| hidden_states_or_q_c = self.q_a_layernorm(ckq) | ||
| use_multistream_mla = (self.enable_multistream_mla | ||
| and attn_metadata is not None | ||
| and attn_metadata.num_decodes > 0) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, multistream mla will not be triggered if there are only prefill requests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, current multistream utilities only support graph mode, i.e. npu_stream_switch returns contextlib.nullcontext() on non-graph mode, which is the case for prefill.
…)" This reverts commit e72f94e.
…)" This reverts commit e72f94e. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.
### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
    ### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.
### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
    ### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.
### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
    ### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.
### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
    ### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.
### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
    
What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected overlaping being:
Currently, the
IndexByTensoroperators introduced by computation ofcosandsincan't be offloaded to the secondary stream due to a known bug of graph fusion optimization pass. So we instead keep it in the main stream, only requires it be computed beforematmul W_UQto avoid hindering later overlapping. The problem may be solved by later optimization (#993), which hoists the computation ofcosandsinup to the first layer.Does this PR introduce any user-facing change?
Controlled by
torchair_graph_config.enable_multistream_mla, defaulted to False.How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.