
[Bug] TGI doesn't start due to permission denied #861

Open
ksandowi opened this issue Sep 22, 2024 · 5 comments
Labels: bug (Something isn't working), Dev Kube

@ksandowi

ksandowi commented Sep 22, 2024

Priority

P1-Stopper

OS type

Ubuntu

Hardware type

Xeon-other (Please let us know in description)

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

tag v1.0

Description

The issue affects Xeon on both SPR and EMR.
After modifying chatqna.yaml to run all services in TDs (protected by TDX), all services start successfully except the TGI service, which fails during model download. It worked fine in previous versions (v0.9 and v0.8).

Reproduce steps

On a platform with TDX enabled, modify ~/GenAIExamples/ChatQnA/kubernetes/intel/cpu/xeon/manifest/chatqna.yaml so that it runs all services in TDs, by adding `runtimeClassName: kata-qemu-tdx` to each pod spec next to the existing `securityContext`:

```yaml
spec:
  runtimeClassName: kata-qemu-tdx
  securityContext:
    # (existing settings unchanged)
```
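For context, a minimal sketch of where that line ends up in one of the Deployments; the Deployment name is inferred from the failing pod name in the log below, and the labels, container name, and image tag are placeholders rather than values copied from chatqna.yaml:

```yaml
# Illustrative fragment only, not copied verbatim from chatqna.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatqna-tgi                            # inferred from pod chatqna-tgi-69d8bd845-xpnvj
spec:
  selector:
    matchLabels:
      app: chatqna-tgi
  template:
    metadata:
      labels:
        app: chatqna-tgi
    spec:
      runtimeClassName: kata-qemu-tdx          # the added line: run this pod in a TD
      containers:
        - name: tgi                            # placeholder container name
          image: ghcr.io/huggingface/text-generation-inference:2.2.0   # placeholder tag
          securityContext:                     # the manifest's securityContext stays as-is
            runAsNonRoot: true
            runAsUser: 1000
```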
    

Raw log

│ {"timestamp":"2024-09-22T09:31:24.526652Z","level":"INFO","fields":{"message":"Args {\n    model_id: \"Intel/neural-chat-7b-v3-3\",\n    revision: None,\n    validation_workers: 2,\n    sharded: None,\n  │
│    num_shard: None,\n    quantize: None,\n    speculate: None,\n    dtype: None,\n    trust_remote_code: false,\n    max_concurrent_requests: 128,\n    max_best_of: 2,\n    max_stop_sequences: 4,\n    ma │
│ x_top_n_tokens: 5,\n    max_input_tokens: None,\n    max_input_length: None,\n    max_total_tokens: None,\n    waiting_served_ratio: 0.3,\n    max_batch_prefill_tokens: None,\n    max_batch_total_tokens: │
│  None,\n    max_waiting_tokens: 20,\n    max_batch_size: None,\n    cuda_graphs: Some(\n        [\n            0,\n        ],\n    ),\n    hostname: \"chatqna-tgi-69d8bd845-xpnvj\",\n    port: 2080,\n    │
│  shard_uds_path: \"/tmp/text-generation-server\",\n    master_addr: \"localhost\",\n    master_port: 29500,\n    huggingface_hub_cache: Some(\n        \"/data\",\n    ),\n    weights_cache_override: None │
│ ,\n    disable_custom_kernels: false,\n    cuda_memory_fraction: 1.0,\n    rope_scaling: None,\n    rope_factor: None,\n    json_output: true,\n    otlp_endpoint: None,\n    otlp_service_name: \"text-gen │
│ eration-inference.router\",\n    cors_allow_origin: [],\n    api_key: None,\n    watermark_gamma: None,\n    watermark_delta: None,\n    ngrok: false,\n    ngrok_authtoken: None,\n    ngrok_edge: None,\n │
│     tokenizer_config_path: None,\n    disable_grammar_support: false,\n    env: false,\n    max_client_batch_size: 4,\n    lora_adapters: None,\n    usage_stats: On,\n}"},"target":"text_generation_launch │
│ er"}                                                                                                                                                                                                        │
│ {"timestamp":"2024-09-22T09:31:24.527102Z","level":"INFO","fields":{"message":"Token file not found \"/tmp/.cache/huggingface/token\"","log.target":"hf_hub","log.module_path":"hf_hub","log.file":"/usr/lo │
│ cal/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs","log.line":55},"target":"hf_hub"}                                                                                          │
│ {"timestamp":"2024-09-22T09:31:24.527452Z","level":"INFO","fields":{"message":"Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts  │
│ in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`."},"target":"text_generation_launc │
│ her"}                                                                                                                                                                                                       │
│ {"timestamp":"2024-09-22T09:31:24.527476Z","level":"INFO","fields":{"message":"Default `max_input_tokens` to 4095"},"target":"text_generation_launcher"}                                                    │
│ {"timestamp":"2024-09-22T09:31:24.527490Z","level":"INFO","fields":{"message":"Default `max_total_tokens` to 4096"},"target":"text_generation_launcher"}                                                    │
│ {"timestamp":"2024-09-22T09:31:24.527504Z","level":"INFO","fields":{"message":"Default `max_batch_prefill_tokens` to 4145"},"target":"text_generation_launcher"}                                            │
│ {"timestamp":"2024-09-22T09:31:24.528473Z","level":"INFO","fields":{"message":"Starting check and download process for Intel/neural-chat-7b-v3-3"},"target":"text_generation_launcher","span":{"name":"down │
│ load"},"spans":[{"name":"download"}]}                                                                                                                                                                       │
│ {"timestamp":"2024-09-22T09:31:28.345023Z","level":"ERROR","fields":{"message":"Download encountered an error: \nOpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k\nOp │
│ enBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 5, in <module>\n    from t │
│ ext_generation_server.cli import app\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 10, in <module>\n    from text_generation_server.utils.adapter import parse_lo │
│ ra_adapters\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/__init__.py\", line 2, in <module>\n    from text_generation_server.utils.dist import initialize_torch_distribut │
│ ed\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/dist.py\", line 6, in <module>\n    from text_generation_server.utils.import_utils import SYSTEM\n  File \"/opt/conda/lib │
│ /python3.10/site-packages/text_generation_server/utils/import_utils.py\", line 59, in <module>\n    import intel_extension_for_pytorch  # noqa: F401\n  File \"/opt/conda/lib/python3.10/site-packages/inte │
│ l_extension_for_pytorch/__init__.py\", line 122, in <module>\n    from . import xpu\n  File \"/opt/conda/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/__init__.py\", line 20, in <module>\n │
│     from .utils import *\n  File \"/opt/conda/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/utils.py\", line 6, in <module>\n    from .. import frontend\n  File \"/opt/conda/lib/python3.10 │
│ /site-packages/intel_extension_for_pytorch/frontend.py\", line 9, in <module>\n    from .nn import utils\n  File \"/opt/conda/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/__init__.py\", li │
│ ne 1, in <module>\n    from .modules import FrozenBatchNorm2d\n  File \"/opt/conda/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/modules/__init__.py\", line 8, in <module>\n    from ...cpu. │
│ nn.linear_fuse_eltwise import IPEXLinearEltwise\n  File \"/opt/conda/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/nn/linear_fuse_eltwise.py\", line 3, in <module>\n    from intel_extensio │
│ n_for_pytorch.nn.utils._weight_prepack import (\n  File \"/opt/conda/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/utils/__init__.py\", line 1, in <module>\n    from intel_extension_for_pyt │
│ orch.nn.utils import _weight_prepack\n  File \"/opt/conda/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/utils/_weight_prepack.py\", line 5, in <module>\n    import pkg_resources\n  File \"/ │
│ opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 3663, in <module>\n    def _initialize_master_working_set():\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__ini │
│ t__.py\", line 3646, in _call_aside\n    f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 3675, in _initialize_master_working_set\n    working_set =  │
│ _declare_state('object', 'working_set', WorkingSet._build_master())\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 653, in _build_master\n    ws = cls()\n  File \"/op │
│ t/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 646, in __init__\n    self.add_entry(entry)\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 702, │
│  in add_entry\n    for dist in find_distributions(entry, True):\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 2333, in find_on_path\n    yield from factory(fullpath) │
│ \n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 2397, in distributions_from_metadata\n    yield Distribution.from_location(\n  File \"/opt/conda/lib/python3.10/site-p │
│ ackages/pkg_resources/__init__.py\", line 2947, in from_location\n    )._reload_version()\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 3367, in _reload_version\n    │
│  md_version = self._get_version()\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 3133, in _get_version\n    return _version_from_file(lines)\n  File \"/opt/conda/lib/ │
│ python3.10/site-packages/pkg_resources/__init__.py\", line 2892, in _version_from_file\n    line = next(iter(version_lines), '')\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.p │
│ y\", line 3129, in _get_metadata\n    yield from self.get_metadata_lines(name)\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 2209, in get_metadata_lines\n    return  │
│ yield_lines(self.get_metadata(name))\n  File \"/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py\", line 2196, in get_metadata\n    with open(self.path, encoding='utf-8', errors=\"replace │
\") as f:\nPermissionError: [Errno 13] Permission denied: '/opt/conda/lib/python3.10/site-packages/.wh.certifi-2022.12.7-py3.11.egg-info'"},"target":"text_generation_launcher","span":{"name":"download"}, │
│ "spans":[{"name":"download"}]}                                                                                                                                                                              │
│ Error: DownloadError                                                                                                                                                                                        │
@yongfengdu
Collaborator

It looks like TGI is trying to access more locations after TDX is enabled.
Could you try commenting out these lines for TGI (lines 1289-1298)?
https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/kubernetes/intel/cpu/xeon/manifest/chatqna.yaml#L1289
```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
```
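Commented out in the manifest, the TGI container section would end up roughly like this (a sketch; the container name and the surrounding indentation depend on the actual manifest):

```yaml
# Workaround sketch: container-level securityContext disabled for the TGI container
containers:
  - name: tgi                        # placeholder container name
    # securityContext:
    #   allowPrivilegeEscalation: false
    #   capabilities:
    #     drop:
    #       - ALL
    #   readOnlyRootFilesystem: true
    #   runAsNonRoot: true
    #   runAsUser: 1000
    #   seccompProfile:
    #     type: RuntimeDefault
```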

@ksandowi
Author

That is a good workaround. The TGI service downloaded all required files and started successfully.

@yongfengdu
Collaborator

In the 1.0 release, we enabled securityContext by default (#258). This runs the pod as a non-root user, with the root filesystem read-only.

When the TGI pod starts, it downloads the required model to an emptyDir-mounted volume at /data.
Without TDX/Kata this works fine; from the log, I can't tell whether the permission issue is on /data or on other parts of the filesystem.
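For reference, the model-cache mount described above looks roughly like this (a sketch; volume and container names are placeholders, not copied from chatqna.yaml):

```yaml
# Sketch: TGI caches the model under /data, backed by an emptyDir volume
containers:
  - name: tgi                  # placeholder container name
    volumeMounts:
      - name: model-volume     # placeholder volume name
        mountPath: /data       # matches huggingface_hub_cache: Some("/data") in the log
volumes:
  - name: model-volume
    emptyDir: {}
```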

An easy fix would be to disable the securityContext, but whether that's a good way to go needs discussion.

Until there is an official fix, you can use the above workaround when enabling TDX/Kata.

@ksandowi
Author

Thanks a lot. The workaround works for me.
One more piece of info:
The TGI service starts successfully if I deploy it with Kubernetes using runc or as kata-qemu (runtimeClass=kata-qemu). It only fails when it is run as kata-qemu-tdx (runtimeClass=kata-qemu-tdx).

@eero-t
Contributor

eero-t commented Oct 2, 2024

@ksandowi Could you track down which exact part of the SecurityContext conflicts with TDX, by removing security settings piece by piece and checking whether TGI still fails?

(I would assume it to be either the `runAs*`, `seccompProfile`, or `readOnlyRootFilesystem` settings.)
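For example, one could re-enable the block but relax a single field per test run (a sketch; all values except the toggled one are taken from the block quoted earlier):

```yaml
# Test 1 of a possible bisection: keep everything except readOnlyRootFilesystem
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: false   # relaxed only for this test
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
```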

@joshuayao added the bug (Something isn't working) label on Oct 30, 2024