
Mistral v0.2 flash attention issue: unsupported operand type(s) for /: 'NoneType' and 'int' #1342

Closed
2 of 4 tasks
RonanKMcGovern opened this issue Dec 13, 2023 · 14 comments

Comments

@RonanKMcGovern

System Info

TGI v1.3.0 running on RunPod with an A6000 (48 GB VRAM)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the TGI Docker image with this config:

--model-id Trelis/Mistral-7B-Instruct-v0.2 --trust-remote-code --port 8080

It seems that sliding_window being set to null in config.json is causing the issue.

Error:
2023-12-13T14:17:01.083701660Z The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
2023-12-13T14:17:01.083705387Z The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
2023-12-13T14:17:01.083707872Z Traceback (most recent call last):
2023-12-13T14:17:01.083711038Z
2023-12-13T14:17:01.083713232Z File "/opt/conda/bin/text-generation-server", line 8, in <module>
2023-12-13T14:17:01.083716157Z sys.exit(app())
2023-12-13T14:17:01.083719153Z
2023-12-13T14:17:01.083721136Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
2023-12-13T14:17:01.083724132Z server.serve(
2023-12-13T14:17:01.083726337Z
2023-12-13T14:17:01.083728541Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 215, in serve
2023-12-13T14:17:01.083736516Z asyncio.run(
2023-12-13T14:17:01.083739352Z
2023-12-13T14:17:01.083741466Z File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2023-12-13T14:17:01.083743830Z return loop.run_until_complete(main)
2023-12-13T14:17:01.083745944Z
2023-12-13T14:17:01.083748228Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
2023-12-13T14:17:01.083751024Z return future.result()
2023-12-13T14:17:01.083753799Z
2023-12-13T14:17:01.083770611Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 161, in serve_inner
2023-12-13T14:17:01.083774418Z model = get_model(
2023-12-13T14:17:01.083776502Z
2023-12-13T14:17:01.083778546Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 299, in get_model
2023-12-13T14:17:01.083786421Z return FlashMistral(
2023-12-13T14:17:01.083788495Z
2023-12-13T14:17:01.083790669Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 424, in __init__
2023-12-13T14:17:01.083793895Z super(FlashMistral, self).__init__(
2023-12-13T14:17:01.083795959Z
2023-12-13T14:17:01.083797872Z File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 318, in __init__
2023-12-13T14:17:01.083799966Z SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
2023-12-13T14:17:01.083802060Z
2023-12-13T14:17:01.083804054Z TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
2023-12-13T14:17:01.083806189Z rank=0
2023-12-13T14:17:01.087578056Z 2023-12-13T14:17:01.087255Z ERROR text_generation_launcher: Shard 0 failed to start
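
For context: the failing line divides config.sliding_window by BLOCK_SIZE without a None check, so any config.json that ships "sliding_window": null (as the Mistral v0.2 config now does) hits exactly this TypeError. A minimal Python sketch of the failure and of one possible guard (an illustration with an assumed BLOCK_SIZE value; this is not necessarily how TGI fixes it):

import math

BLOCK_SIZE = 16  # assumed constant for illustration; TGI defines its own value


class FakeConfig:
    """Stand-in for the loaded model config; Mistral v0.2 ships "sliding_window": null."""
    sliding_window = None


config = FakeConfig()

# Failing pattern from flash_mistral.py:
#   SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
# raises: TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

# One possible guard (illustration only, not necessarily the actual TGI patch):
if config.sliding_window is None:
    SLIDING_WINDOW_BLOCKS = None  # no sliding window -> skip the block computation
else:
    SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)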

Expected behavior

Expect TGI to run as it does with v0.1 of the model.

@theonesud

+1.
I'm trying to run the new Mistral 0.2 GPTQ using Hugging Face TGI's Docker container (see both 1.1.0 and 1.3 outputs below) and am facing some issues. Can someone help me debug this? Is it an issue with the TGI Docker image, the model itself, or am I doing something stupid?

System Config:
Ubuntu 22.04
GPU 3060Ti - 8GB VRAM
CUDA 12.2
Docker and Nvidia container toolkit

Command:
docker run --gpus all --shm-size 1g -v $PWD/model:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --revision gptq-8bit--1g-actorder_True --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096

Error:

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/**init**.py", line 252, in get_model
return FlashMistral(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_mistral.py", line 312, in **init**
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
rank=0

I can post the entire error log if someone wants.
If I try with the newer version of TGI (1.3), I get a different error.

Command:
docker run --gpus all --shm-size 1g -v $PWD/model:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --revision gptq-8bit--1g-actorder_True --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096

Error:

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 366, in <listcomp>
    MistralLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 300, in __init__
    self.self_attn = MistralAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 177, in __init__
    self.query_key_value = load_attention(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 106, in load_attention
    return _load_gqa(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 139, in _load_gqa
    get_linear(weight, bias=None, quantize=config.quantize)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 330, in get_linear
    linear = ExllamaQuantLinear(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/gptq/exllamav2.py", line 145, in __init__
    assert qzeros.shape == (
AssertionError

@gameveloster

Seeing the same error on 1.1.1 when using TheBloke/Mistral-7B-Instruct-v0.2-AWQ

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

@maziyarpanahi
Contributor

maziyarpanahi commented Dec 14, 2023

Seeing the same error with mistralai/Mixtral-8x7B-Instruct-v0.1.

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 318, in __init__
    SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

It was working fine, but I believe it broke after one of the recent commits to the model repo (screenshot of the commit history omitted).

UPDATE: falling back to revision f1ca00645f0b1565c7f9a1c863d2be6ebf896b04 fixes the issue. There must be something wrong with the new config, where sliding_window was changed to null.

@oakkas84

Having the same issue. Will there be a fix for this?

@OlivierDehaene
Member

#1348 fixes the issue.

@oakkas84

I am still getting this error.

@maziyarpanahi
Contributor

There is something going on with sliding_window being set to null; I would recommend waiting for now until it's either resolved or reverted: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/37

This is the last revision (commit) where the model works without any issue: f1ca00645f0b1565c7f9a1c863d2be6ebf896b04
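
If you want to verify what a given revision actually puts in the config before pointing TGI at it, a quick check with transformers works (a sketch; the model name and revision hash are the ones mentioned above):

from transformers import AutoConfig

# Load the config at the "last known good" commit referenced above.
cfg = AutoConfig.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    revision="f1ca00645f0b1565c7f9a1c863d2be6ebf896b04",
)
# Should print an integer at this revision; the later revisions discussed here load it as None.
print(cfg.sliding_window)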

@OlivierDehaene
Member

There was indeed an issue with the null value, but this is fixed in 1.3.3.

@existme

existme commented Dec 18, 2023

@OlivierDehaene 1.3.3 is not available through the SageMaker SDK. Would you happen to know how we can deploy 1.3.3 as a SageMaker inference endpoint?

@oakkas84

@OlivierDehaene 1.3.3 is not available through the SageMaker SDK. Would you happen to know how we can deploy 1.3.3 as a SageMaker inference endpoint?

I have the same question.

@MikeWinkelmannXL2

Until 1.3.3 is available on SageMaker, you could use 1.3.1 in conjunction with the REVISION parameter to revert to an older commit of Mixtral; for instance, this one worked for me. However, the wrong "sliding_window": 32768 parameter might have some strange effects.

@oakkas84

Looks like we don't have access to newer versions of the TGI image through SageMaker yet, as the supported huggingface-llm version(s) are: 0.6.0, 0.8.2, 0.9.3, 1.0.3, 1.1.0, 1.2.0, 1.3.1, 0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3.

@existme

existme commented Dec 19, 2023

@MikeWinkelmannXL2 Thank you very much Mike. The revision did the trick for now until they release the DLC.
I managed to deploy with the following config:

    config = {
'HF_MODEL_ID':            'mistralai/Mixtral-8x7B-Instruct-v0.1',
        'SM_NUM_GPUS':            json.dumps(8),  # Number of GPU used per replica

        'REVISION': "e0bbb53cee412aba95f3b3fa4fc0265b1a0788b2", #  <=====

        'MAX_INPUT_LENGTH':       json.dumps(24000),  # Max length of input text
        'MAX_BATCH_PREFILL_TOKENS': json.dumps(32000),  # Number of tokens for the prefill operation.
        'MAX_TOTAL_TOKENS':       json.dumps(32000),  # Max length of the generation (including input text)
        'MAX_BATCH_TOTAL_TOKENS': json.dumps(512000),
    }
    llm_model = HuggingFaceModel(
            role=execution_role,
            image_uri=image_uri,
            env=config,
            model_data="",
            sagemaker_session=sagemaker_session
    )
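
For completeness, image_uri in that snippet would typically come from the SageMaker SDK helper, and deployment then follows the usual HuggingFaceModel.deploy pattern. A sketch under stated assumptions (1.3.1 as the newest version in the supported list above; the instance type is my guess for an 8-GPU Mixtral deployment, not something from this thread):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sagemaker_session = sagemaker.Session()
execution_role = sagemaker.get_execution_role()

# 1.3.1 is assumed here; it is the newest TGI version exposed through the SageMaker DLCs above.
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.3.1")

llm_model = HuggingFaceModel(
    role=execution_role,
    image_uri=image_uri,
    env=config,  # the config dict above, including the REVISION pin
    sagemaker_session=sagemaker_session,
)

# Instance type is an assumption (8 x A100, matching SM_NUM_GPUS=8 for Mixtral-8x7B).
predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    container_startup_health_check_timeout=600,
)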

@oakkas84

config = {
'HF_MODEL_ID':            'mistralai/Mixtral-8x7B-Instruct-v0.1',
        'SM_NUM_GPUS':            json.dumps(8),  # Number of GPU used per replica

        'REVISION': "e0bbb53cee412aba95f3b3fa4fc0265b1a0788b2", #  <=====

        'MAX_INPUT_LENGTH':       json.dumps(24000),  # Max length of input text
        'MAX_BATCH_PREFILL_TOKENS': json.dumps(32000),  # Number of tokens for the prefill operation.
        'MAX_TOTAL_TOKENS':       json.dumps(32000),  # Max length of the generation (including input text)
        'MAX_BATCH_TOTAL_TOKENS': json.dumps(512000),
    }
    llm_model = HuggingFaceModel(
            role=execution_role,
            image_uri=image_uri,
            env=config,
            model_data="",
            sagemaker_session=sagemaker_session
    )

Thanks a lot for the details @existme. That indeed did the trick.

ekzhang added a commit to modal-labs/modal-examples that referenced this issue Jan 3, 2024
- TEI: Issue with HuggingFace secrets again
- TGI-Mixtral: huggingface/text-generation-inference#1342
ekzhang added a commit to modal-labs/modal-examples that referenced this issue Jan 3, 2024
gongy pushed a commit to modal-labs/modal-examples that referenced this issue Jan 5, 2024