**Describe the bug**
Deploying a model in a small MIG partition fails with a misleading error: the shard crashes with a PyTorch internal assert (`NVML_SUCCESS == r INTERNAL ASSERT FAILED`) instead of a clear message about the partition being too small for the model.
**To Reproduce**
Deploy a model with TGIS on a small MIG partition (one that is too small to hold the model); shard 0 then crashes while loading the model, as in the log below.
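The failure may also be reproducible outside TGIS. Below is a minimal sketch, under the assumption that the trigger is PyTorch's `expandable_segments:True` allocator mode on a MIG slice (the launcher log below shows TGIS setting this value by default); the tensor size and the `1g.5gb` slice are illustrative:

```python
# Minimal repro sketch; run inside a container whose only visible CUDA device
# is a small MIG slice (e.g. 1g.5gb). Assumption: the NVML internal assert
# comes from the expandable-segments allocator mode, not from TGIS itself.
import os

# TGIS sets this by default when unset (see the launcher log below).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Allocating on the MIG device exercises the caching allocator; on an affected
# setup this raises the same `NVML_SUCCESS == r INTERNAL ASSERT FAILED` error.
x = torch.empty((4096, 4096), dtype=torch.float16, device="cuda")
print(x.shape)
```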
**Expected output**
The inference service runs, or TGIS reports a detailed error explaining why the model cannot be loaded.
**Actual error**
```
2024-06-24T10:33:54.484964Z  INFO text_generation_launcher: TGIS Commit hash:
2024-06-24T10:33:54.484984Z  INFO text_generation_launcher: Launcher args: Args { model_name: "/mnt/models/", revision: None, deployment_framework: "hf_transformers", dtype: None, dtype_str: None, quantize: None, num_shard: None, max_concurrent_requests: 512, max_sequence_length: Some(448), max_new_tokens: 384, max_batch_size: 64, max_prefill_padding: 0.2, batch_safety_margin: 20, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0, default_include_stop_seqs: true, otlp_endpoint: None, otlp_service_name: None }
2024-06-24T10:33:54.484997Z  INFO text_generation_launcher: Inferring num_shard = 1 from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
2024-06-24T10:33:54.485049Z  INFO text_generation_launcher: Saving fast tokenizer for `/mnt/models/` to `/tmp/74657ff2-73b1-45f2-b8d5-a7302a63f862`
/opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2024-06-24T10:33:56.397996Z  INFO text_generation_launcher: Using configured max_sequence_length: 448
2024-06-24T10:33:56.398022Z  INFO text_generation_launcher: Setting PYTORCH_CUDA_ALLOC_CONF to default value: expandable_segments:True
2024-06-24T10:33:56.398340Z  INFO text_generation_launcher: Starting shard 0
Shard 0: /opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
Shard 0:   warnings.warn(
Shard 0: HAS_BITS_AND_BYTES=False, HAS_GPTQ_CUDA=True, EXLLAMA_VERSION=2, GPTQ_CUDA_TYPE=exllama
Shard 0: supports_causal_lm = True, supports_seq2seq_lm = False
Shard 0: Traceback (most recent call last):
Shard 0:   File "/opt/tgis/bin/text-generation-server", line 8, in <module>
Shard 0:     sys.exit(app())
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 75, in serve
Shard 0:     raise e
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 56, in serve
Shard 0:     server.serve(
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 388, in serve
Shard 0:     asyncio.run(
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 190, in run
Shard 0:     return runner.run(main)
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 118, in run
Shard 0:     return self._loop.run_until_complete(task)
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
Shard 0:     return future.result()
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 267, in serve_inner
Shard 0:     model = get_model(
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 126, in get_model
Shard 0:     return CausalLM(model_name, revision, deployment_framework, dtype, quantize, model_config, max_sequence_length)
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/causal_lm.py", line 558, in __init__
Shard 0:     inference_engine = get_inference_engine_class(deployment_framework)(
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/inference_engine/hf_transformers.py", line 76, in __init__
Shard 0:     self.model = model_class.from_pretrained(**kwargs).requires_grad_(False).eval()
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
Shard 0:     return model_class.from_pretrained(
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
Shard 0:     model = cls(config, *model_args, **model_kwargs)
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1103, in __init__
Shard 0:     self.model = LlamaModel(config)
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 701, in __init__
Shard 0:     self.mlp = LlamaMLP(config)
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
Shard 0:     self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 98, in __init__
Shard 0:     self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/utils/_device.py", line 77, in __torch_function__
Shard 0:     return func(*args, **kwargs)
Shard 0: RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":830, please report a bug to PyTorch.
2024-06-24T10:34:00.379801Z ERROR text_generation_launcher: Shard 0 failed: ExitStatus(unix_wait_status(256))
2024-06-24T10:34:00.400918Z  INFO text_generation_launcher: Shutting down shards
```
**Workaround**
Deploy the model in a bigger MIG partition.
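When sizing the partition, a quick estimate of whether it can hold the model may help. A rough sketch, assuming fp16 weights (~2 bytes per parameter) and an illustrative overhead margin; `fits_on_device` is a hypothetical helper, not part of TGIS:

```python
# Rough sizing check for a MIG partition. Assumptions: fp16 weights
# (~2 bytes/param) and that torch.cuda reports memory for the visible
# MIG slice, which it does for the current device.
import torch

def fits_on_device(num_params: int, overhead_frac: float = 0.2) -> bool:
    free, total = torch.cuda.mem_get_info()  # bytes on the visible device
    need = int(num_params * 2 * (1 + overhead_frac))  # weights + margin
    print(f"free={free / 1e9:.1f} GB, estimated need={need / 1e9:.1f} GB")
    return need < free

# e.g. a 7B-parameter Llama model (a Llama model is implied by the
# modeling_llama.py frames in the traceback above):
fits_on_device(7_000_000_000)
```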