Error loading Llama-2-70b gptq weights from local directory #728
Comments
Yep, working as expected and getting coherent outputs.
You can just replace

```python
def _get_gptq_params(self) -> Tuple[int, int]:
    return self.gptq_bits, self.gptq_groupsize
```

because these attributes are set in advance by

```python
def _set_gptq_params(self, model_id):
    p = Path(model_id) / "quantize_config.json"
    try:
        if p.exists():
            data = json.loads(p.read_text())
        else:
            filename = hf_hub_download(model_id, filename="quantize_config.json")
            with open(filename, "r") as f:
                data = json.load(f)
        self.gptq_bits = data["bits"]
        self.gptq_groupsize = data["group_size"]
    except Exception:
        raise
```

The local-path check is needed because `hf_hub_download` doesn't work with local model IDs.
This should have fixed it: can you confirm?
Can confirm this is working with the latest docker image now.
Does your local model have it? Currently TGI expects:
OK, I will close this, and we can move the discussion over to #766.
How to get quantization_config.json?
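The file the loader actually looks for is `quantize_config.json`, and it is normally written out by the quantization tool itself (tools such as AutoGPTQ typically save one alongside the quantized weights). Based on the two keys the loader code in this thread reads, a minimal file would look like the sketch below; the values 4 and 128 are just typical examples, not requirements:

```json
{
  "bits": 4,
  "group_size": 128
}
```

Other keys may be present depending on the tool that produced the file, but `bits` and `group_size` are the ones TGI reads here.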
System Info
Docker deployment version 0.9.4
Hardware: AWS g5.12xlarge
Information
Tasks
Reproduction
Running using docker-compose with the following compose file:
and the following env variables in the tgi.env file:
Which gives the following error:
Expected behavior
Expect the model to load correctly.
I did a little digging into where the error was happening, and I can see it occurs when it tries to load the GPTQ config settings in the `_get_gptq_params` method in the `server/text_generation_server/utils/weights.py` file. I'm not entirely sure why it doesn't pick up these settings from the local dir, as the quantize_config.json file does exist there.

I modified the `_get_gptq_params` method to revert to getting these values from env variables if it errors (see below), as was the case before this last release. I rebuilt the image and this seems to successfully load the model.