
Mistral v0.2 flash attention issue: unsupported operand type(s) for /: 'NoneType' and 'int' #1342

Closed
2 of 4 tasks
RonanKMcGovern opened this issue Dec 13, 2023 · 14 comments

Comments

@RonanKMcGovern

System Info

TGI v1.3.0 running on RunPod with an A6000 (48 GB VRAM)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the TGI Docker image with this config:

--model-id Trelis/Mistral-7B-Instruct-v0.2 --trust-remote-code --port 8080

It seems that sliding_window being set to null in config.json is causing the issue.

Error:
2023-12-13T14:17:01.083701660Z The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
2023-12-13T14:17:01.083705387Z The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
2023-12-13T14:17:01.083707872Z Traceback (most recent call last):
2023-12-13T14:17:01.083711038Z
2023-12-13T14:17:01.083713232Z File "/opt/conda/bin/text-generation-server", line 8, in <module>
2023-12-13T14:17:01.083716157Z sys.exit(app())
2023-12-13T14:17:01.083719153Z
2023-12-13T14:17:01.083721136Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
2023-12-13T14:17:01.083724132Z server.serve(
2023-12-13T14:17:01.083726337Z
2023-12-13T14:17:01.083728541Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 215, in serve
2023-12-13T14:17:01.083736516Z asyncio.run(
2023-12-13T14:17:01.083739352Z
2023-12-13T14:17:01.083741466Z File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2023-12-13T14:17:01.083743830Z return loop.run_until_complete(main)
2023-12-13T14:17:01.083745944Z
2023-12-13T14:17:01.083748228Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
2023-12-13T14:17:01.083751024Z return future.result()
2023-12-13T14:17:01.083753799Z
2023-12-13T14:17:01.083770611Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 161, in serve_inner
2023-12-13T14:17:01.083774418Z model = get_model(
2023-12-13T14:17:01.083776502Z
2023-12-13T14:17:01.083778546Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 299, in get_model
2023-12-13T14:17:01.083786421Z return FlashMistral(
2023-12-13T14:17:01.083788495Z
2023-12-13T14:17:01.083790669Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 424, in __init__
2023-12-13T14:17:01.083793895Z super(FlashMistral, self).__init__(
2023-12-13T14:17:01.083795959Z
2023-12-13T14:17:01.083797872Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 318, in __init__
2023-12-13T14:17:01.083799966Z SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
2023-12-13T14:17:01.083802060Z
2023-12-13T14:17:01.083804054Z TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
2023-12-13T14:17:01.083806189Z rank=0
2023-12-13T14:17:01.087578056Z 2023-12-13T14:17:01.087255Z ERROR text_generation_launcher: Shard 0 failed to start
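
For context: the failing line divides config.sliding_window by BLOCK_SIZE without a None check, so any config.json that ships "sliding_window": null (as the Mistral v0.2 config now does) hits exactly this TypeError. A minimal Python sketch of the failure and of one possible guard (an illustration with an assumed BLOCK_SIZE value; this is not necessarily how TGI fixes it):

import math

BLOCK_SIZE = 16  # assumed constant for illustration; TGI defines its own value


class FakeConfig:
    """Stand-in for the loaded model config; Mistral v0.2 ships "sliding_window": null."""
    sliding_window = None


config = FakeConfig()

# Failing pattern from flash_mistral.py:
#   SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
# raises: TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

# One possible guard (illustration only, not necessarily the actual TGI patch):
if config.sliding_window is None:
    SLIDING_WINDOW_BLOCKS = None  # no sliding window -> skip the block computation
else:
    SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)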

Expected behavior

Expect TGI to run as it does with v0.1 of the model.

@theonesud

+1.
I'm trying to run the new Mistral 0.2 GPTQ using Hugging Face TGI's Docker container (see both 1.1.0 and 1.3 outputs below) and am facing some issues. Can someone help me debug this? Is it an issue with the TGI Docker image, the model itself, or am I doing something stupid?

System Config:
Ubuntu 22.04
GPU 3060Ti - 8GB VRAM
CUDA 12.2
Docker and Nvidia container toolkit

Command:
docker run --gpus all --shm-size 1g -v $PWD/model:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --revision gptq-8bit--1g-actorder_True --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096

Error:

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/**init**.py", line 252, in get_model
return FlashMistral(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_mistral.py", line 312, in **init**
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
rank=0

I can post the entire error log if someone wants.
If I try with the newer version of TGI (1.3), I get a different error.

Command:
docker run --gpus all --shm-size 1g -v $PWD/model:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --revision gptq-8bit--1g-actorder_True --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096

Error:

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 366, in <listcomp>
    MistralLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 300, in __init__
    self.self_attn = MistralAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 177, in __init__
    self.query_key_value = load_attention(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 106, in load_attention
    return _load_gqa(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 139, in _load_gqa
    get_linear(weight, bias=None, quantize=config.quantize)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 330, in get_linear
    linear = ExllamaQuantLinear(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/gptq/exllamav2.py", line 145, in __init__
    assert qzeros.shape == (
AssertionError

@gameveloster

Seeing the same error on 1.1.1 when using TheBloke/Mistral-7B-Instruct-v0.2-AWQ

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

@maziyarpanahi
Contributor

maziyarpanahi commented Dec 14, 2023

Seeing the same error with mistralai/Mixtral-8x7B-Instruct-v0.1.

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 318, in __init__
    SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

It was working fine, but I believe it broke after one of the recent commits to the model repo (screenshot of the commit history omitted).

UPDATE: falling back to revision f1ca00645f0b1565c7f9a1c863d2be6ebf896b04 fixes the issue. There must be something wrong with the new config, where sliding_window was changed to null.

@oakkas84

Having the same issue. Will there be a fix for this?

@OlivierDehaene
Member

#1348 fixes the issue.

@oakkas84

I am still getting this error.

@maziyarpanahi
Contributor

There is something going on with sliding_window being set to null; I would recommend waiting for now until it's either resolved or reverted: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/37

This is the last revision (commit) where the model works without any issue: f1ca00645f0b1565c7f9a1c863d2be6ebf896b04
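
If you want to verify what a given revision actually puts in the config before pointing TGI at it, a quick check with transformers works (a sketch; the model name and revision hash are the ones mentioned above):

from transformers import AutoConfig

# Load the config at the "last known good" commit referenced above.
cfg = AutoConfig.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    revision="f1ca00645f0b1565c7f9a1c863d2be6ebf896b04",
)
# Should print an integer at this revision; the later revisions discussed here load it as None.
print(cfg.sliding_window)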

@OlivierDehaene
Member

There was indeed an issue with the null value, but this is fixed in 1.3.3.

@existme

existme commented Dec 18, 2023

@OlivierDehaene 1.3.3 is not available through the SageMaker SDK. Would you happen to know how we can deploy 1.3.3 as a SageMaker inference endpoint?

@oakkas84

@OlivierDehaene 1.3.3 is not available through the SageMaker SDK. Would you happen to know how we can deploy 1.3.3 as a SageMaker inference endpoint?

I have the same question.

@MikeWinkelmannXL2

Until 1.3.3 is available on SageMaker, you could use 1.3.1 in conjunction with the REVISION parameter to revert to an older commit of Mixtral; for instance, this one worked for me. However, the wrong "sliding_window": 32768 parameter might have some strange effects.

@oakkas84

Looks like we don't have access to newer versions of the TGI image through SageMaker yet, as the supported huggingface-llm version(s) are: 0.6.0, 0.8.2, 0.9.3, 1.0.3, 1.1.0, 1.2.0, 1.3.1, 0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3.

@existme

existme commented Dec 19, 2023

@MikeWinkelmannXL2 Thank you very much Mike. The revision did the trick for now until they release the DLC.
I managed to deploy with the following config:

    config = {
'HF_MODEL_ID':            'mistralai/Mixtral-8x7B-Instruct-v0.1',
        'SM_NUM_GPUS':            json.dumps(8),  # Number of GPU used per replica

        'REVISION': "e0bbb53cee412aba95f3b3fa4fc0265b1a0788b2", #  <=====

        'MAX_INPUT_LENGTH':       json.dumps(24000),  # Max length of input text
        'MAX_BATCH_PREFILL_TOKENS': json.dumps(32000),  # Number of tokens for the prefill operation.
        'MAX_TOTAL_TOKENS':       json.dumps(32000),  # Max length of the generation (including input text)
        'MAX_BATCH_TOTAL_TOKENS': json.dumps(512000),
    }
    llm_model = HuggingFaceModel(
            role=execution_role,
            image_uri=image_uri,
            env=config,
            model_data="",
            sagemaker_session=sagemaker_session
    )
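
For completeness, image_uri in that snippet would typically come from the SageMaker SDK helper, and deployment then follows the usual HuggingFaceModel.deploy pattern. A sketch under stated assumptions (1.3.1 as the newest version in the supported list above; the instance type is my guess for an 8-GPU Mixtral deployment, not something from this thread):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sagemaker_session = sagemaker.Session()
execution_role = sagemaker.get_execution_role()

# 1.3.1 is assumed here; it is the newest TGI version exposed through the SageMaker DLCs above.
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.3.1")

llm_model = HuggingFaceModel(
    role=execution_role,
    image_uri=image_uri,
    env=config,  # the config dict above, including the REVISION pin
    sagemaker_session=sagemaker_session,
)

# Instance type is an assumption (8 x A100, matching SM_NUM_GPUS=8 for Mixtral-8x7B).
predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    container_startup_health_check_timeout=600,
)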

@oakkas84

config = {
'HF_MODEL_ID':            'mistralai/Mixtral-8x7B-Instruct-v0.1',
        'SM_NUM_GPUS':            json.dumps(8),  # Number of GPU used per replica

        'REVISION': "e0bbb53cee412aba95f3b3fa4fc0265b1a0788b2", #  <=====

        'MAX_INPUT_LENGTH':       json.dumps(24000),  # Max length of input text
        'MAX_BATCH_PREFILL_TOKENS': json.dumps(32000),  # Number of tokens for the prefill operation.
        'MAX_TOTAL_TOKENS':       json.dumps(32000),  # Max length of the generation (including input text)
        'MAX_BATCH_TOTAL_TOKENS': json.dumps(512000),
    }
    llm_model = HuggingFaceModel(
            role=execution_role,
            image_uri=image_uri,
            env=config,
            model_data="",
            sagemaker_session=sagemaker_session
    )

Thanks a lot for the details @existme. That indeed did the trick.

ekzhang added a commit to modal-labs/modal-examples that referenced this issue Jan 3, 2024
- TEI: Issue with HuggingFace secrets again
- TGI-Mixtral: huggingface/text-generation-inference#1342
ekzhang added a commit to modal-labs/modal-examples that referenced this issue Jan 3, 2024
gongy pushed a commit to modal-labs/modal-examples that referenced this issue Jan 5, 2024