Enable fx tracing for Mistral #30209

Merged 3 commits into huggingface:main on Apr 17, 2024

Conversation

zucchini-nlp (Member)

What does this PR do?

Fixes #30083. As per the title, this enables fx tracing for Mistral. Mistral was apparently already traceable with "sdpa" attention, since it is similar to Llama, which already works. I also enabled it for "eager" attention, which was failing because Mistral uses a sliding window there.

Tests passing:

tests/models/mistral/test_modeling_mistral.py::MistralModelTest::test_torch_fx
tests/models/mistral/test_modeling_mistral.py::MistralModelTest::test_torch_fx_output_loss
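
For reference, a minimal sketch of the kind of call these tests exercise (assuming a recent transformers release; the tiny config values below are illustrative and not taken from the PR):

from transformers import MistralConfig, MistralForCausalLM
from transformers.utils.fx import symbolic_trace

# Tiny, made-up config. attn_implementation="eager" exercises the
# sliding-window masking path that previously failed to trace.
config = MistralConfig(
    vocab_size=1000,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    use_cache=False,
    attn_implementation="eager",
)
model = MistralForCausalLM(config)

# symbolic_trace uses the HF fx tracer to build a torch.fx.GraphModule,
# which can then be inspected or transformed like any fx graph.
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask"])
print(traced.graph)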

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@michaelbenayoun (Member)

Some other modeling files are based on modeling_mistral.py, so you need to update them as well. It's easy to do: can you please run make fix-copies? It should basically enable fx symbolic tracing for the other models that should support it.

@zucchini-nlp (Member, Author)

@michaelbenayoun Done! fix-copies also added tracing for the MoE models, which was a bit unexpected. Anyway, I just removed a line with dynamic control flow from the MoE models and checked that it is not necessary (even when top_x is an empty tensor).
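
For context, a minimal, self-contained sketch of why the removed guard is safe to drop (this is not the upstream modeling code; the "expert" here is just a scalar multiply): index_add_ with an empty index tensor is a no-op, so experts that receive no tokens contribute nothing with or without the early exit.

import torch

def expert_loop(expert_mask, hidden_states, keep_guard):
    final = torch.zeros_like(hidden_states)
    for expert_idx in range(expert_mask.shape[0]):
        _, top_x = torch.where(expert_mask[expert_idx])
        if keep_guard and top_x.shape[0] == 0:
            continue  # the data-dependent branch that torch.fx cannot trace
        current = hidden_states[top_x] * (expert_idx + 1)  # stand-in for the expert MLP
        final.index_add_(0, top_x, current)
    return final

tokens, hidden_dim, num_experts = 6, 4, 3
hidden_states = torch.randn(tokens, hidden_dim)
# Route every token to expert 0, so experts 1 and 2 see an empty top_x.
expert_mask = torch.zeros(num_experts, 1, tokens, dtype=torch.bool)
expert_mask[0, 0, :] = True

torch.testing.assert_close(
    expert_loop(expert_mask, hidden_states, keep_guard=True),
    expert_loop(expert_mask, hidden_states, keep_guard=False),
)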

@michaelbenayoun (Member) left a comment

LGTM on my side.
Let's see what @ArthurZucker or @amyeroberts have to say about the top_x change.

@amyeroberts (Collaborator) left a comment

Thanks for working on this!

If you force the case where top_x.shape[0] == 0, e.g. by setting

idx, top_x = torch.where(torch.zeros_like(expert_mask[expert_idx]))

in the lines above, does this still work in both the tracing and non-tracing cases?

@zucchini-nlp (Member, Author)

@amyeroberts Yes, it works fine for me when top_x is an empty tensor.
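
For what it's worth, an illustrative standalone check of the forced-empty case suggested above (not taken from the PR):

import torch

final_hidden_states = torch.zeros(6, 4)
# Force the empty selection with an all-False mask, as suggested above.
idx, top_x = torch.where(torch.zeros(2, 6, dtype=torch.bool))
assert top_x.shape[0] == 0

current_hidden_states = torch.zeros(0, 4)
final_hidden_states.index_add_(0, top_x, current_hidden_states)  # no-op
assert final_hidden_states.abs().sum() == 0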

@amyeroberts (Collaborator) left a comment

Thanks for adding and confirming top_x behaviour!

@michaelbenayoun (Member)

So, just confirming: we can merge with the top_x line removed?

@ArthurZucker (Collaborator) left a comment

Cool! Yeah, let's remove it if that still produces correct behaviour and supports fx!

@zucchini-nlp (Member, Author)

Merging now, since the removal of the top_x line is approved.

@zucchini-nlp zucchini-nlp merged commit 304c6a1 into huggingface:main Apr 17, 2024
18 checks passed
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request Apr 18, 2024
* tracing for mistral

* typo

* fix copies
ArthurZucker pushed a commit that referenced this pull request Apr 22, 2024
* tracing for mistral

* typo

* fix copies
ydshieh pushed a commit that referenced this pull request Apr 23, 2024
* tracing for mistral

* typo

* fix copies
@bozheng-hit (Contributor)

This PR introduces a bug for Qwen2MoE GPTQ models; maybe revert it for modeling_qwen2_moe.py? @ArthurZucker

@ArthurZucker (Collaborator)

Without a reproducer and a stack trace showing the error?

@bozheng-hit (Contributor)

Without a reproducer and a stack trace showing the error?

The code to reproduce the error is here:

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

The error information is here, and the model successfully generates after I revert the change for modeling_qwen2_moe.py.

Traceback (most recent call last):
  File "/home/data/roy.zb/workspace/test_auto_gptq.py", line 23, in <module>
    generated_ids = model.generate(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 1656, in generate
    result = self._sample(
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 2819, in _sample
    outputs = self(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1355, in forward
    outputs = self.model(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1224, in forward
    layer_outputs = decoder_layer(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 934, in forward
    hidden_states = self.mlp(hidden_states)
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 856, in forward
    final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@ArthurZucker (Collaborator)

This seems to be using GPTQ and quantisation. Can you open a separate issue and ping @younesbelkada and @SunMarc?

itazap pushed a commit that referenced this pull request May 14, 2024
* tracing for mistral

* typo

* fix copies

Successfully merging this pull request may close these issues.

fx symbolic tracing support for mistral
6 participants