No error is thrown when model download fails due to insufficient space #2742

Open
atbe opened this issue Jan 8, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@atbe

atbe commented Jan 8, 2025

Describe the bug

When downloading a model, file_download.py does not raise an error when there is not enough disk space.

This is problematic in environments like sglang, where the server does not exit even though the model weights never finish downloading (see sgl-project/sglang#2801).

Reproduction

Download weights on a device without enough space and observe that there is no indication of the model download failure.

Logs

No response

System info

- any version of `huggingface_hub`

At the very least, I think there should be a flag of some sort to raise an exception here, or the exception should be raised unconditionally.

@atbe atbe added the bug Something isn't working label Jan 8, 2025
@atbe
Author

atbe commented Jan 8, 2025

How this impacts sglang: sgl-project/sglang#2801

@hanouticelina
Contributor

Hi @atbe, sorry you encountered this issue!
A bit of context: we chose not to raise an exception when the user appears to lack disk space, to avoid unconditionally blocking downloads in valid setups. In some environments, the value returned by shutil.disk_usage(path).free may not accurately reflect the space actually available. By only warning, we allow more flexibility in these edge cases.

An alternative is to manually check for sufficient disk space on your side before calling snapshot_download() or any other download function. That way you have explicit control over how to handle an insufficient-space error.
Let me know what you think.
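
The manual check suggested above could look roughly like the following. This is only a sketch using the standard library: `ensure_free_space` is a hypothetical helper name, not part of `huggingface_hub`, and the caller is assumed to already know (or estimate) the total download size in bytes. As noted above, `shutil.disk_usage(path).free` may be inaccurate in some environments, so a strict check like this can reject valid setups.

```python
import errno
import os
import shutil


def ensure_free_space(target_dir: str, required_bytes: int) -> int:
    """Raise OSError(ENOSPC) if the filesystem holding target_dir looks too full.

    Hypothetical helper for illustration; returns the free byte count otherwise.
    """
    # The cache directory may not exist yet, so walk up to the nearest
    # existing ancestor before asking the OS for disk usage.
    path = os.path.abspath(target_dir)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent

    free = shutil.disk_usage(path).free
    if free < required_bytes:
        raise OSError(
            errno.ENOSPC,
            f"Need {required_bytes} bytes but only {free} are free",
            target_dir,
        )
    return free
```

A caller would run `ensure_free_space(cache_dir, expected_size)` just before `snapshot_download()`, where `cache_dir` and `expected_size` are values the application supplies.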

@atbe
Author

atbe commented Jan 9, 2025

Thanks for the reply, @hanouticelina!

> Another alternative is to manually check for sufficient disk space on your side before calling snapshot_download() or any other downloading function. That way you have explicit control over how to handle insufficient space error.

I do think it's a bit odd that the server doesn't exit when serving legitimately fails (in this case due to insufficient space), don't you feel the same? You could check for space manually, but that feels like a hack compared to having the server properly detect that it failed to start and exit.

@hanouticelina
Contributor

Hi @atbe,
I reproduced the scenario using a container where we call snapshot_download without enough disk space for the files being downloaded.
Here is the traceback I get:

Fetching 14 files:   0%|                                        | 0/14 [00:00<?, ?it/s]/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3945.44 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3421.60 MB free disk space.
  warnings.warn(
/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3864.73 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3421.60 MB free disk space.
  warnings.warn(
LICENSE: 100%|████████████████████████████████████| 11.3k/11.3k [00:00<00:00, 28.6MB/s]
generation_config.json: 100%|█████████████████████████| 243/243 [00:00<00:00, 5.31MB/s]
.gitattributes: 100%|█████████████████████████████| 1.52k/1.52k [00:00<00:00, 18.4MB/s]
README.md: 100%|██████████████████████████████████| 6.00k/6.00k [00:00<00:00, 80.2MB/s]
config.json: 100%|████████████████████████████████████| 663/663 [00:00<00:00, 6.11MB/s]
model.safetensors.index.json: 100%|████████████████| 27.8k/27.8k [00:00<00:00, 140MB/s]
/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3864.73 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3404.50 MB free disk space.
  warnings.warn(
/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3556.38 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3404.50 MB free disk space.
  warnings.warn(
tokenizer_config.json: 100%|██████████████████████| 7.30k/7.30k [00:00<00:00, 49.9MB/s]
merges.txt: 100%|█████████████████████████████████| 1.67M/1.67M [00:00<00:00, 4.44MB/s]
vocab.json: 100%|█████████████████████████████████| 2.78M/2.78M [00:00<00:00, 3.05MB/s]
tokenizer.json: 100%|█████████████████████████████| 7.03M/7.03M [00:01<00:00, 5.13MB/s]
model-00002-of-00004.safetensors:  23%|██▋         | 881M/3.86G [00:26<01:30, 33.1MB/s]
model-00004-of-00004.safetensors:  23%|██▊         | 818M/3.56G [00:26<01:27, 31.2MB/s]
model-00001-of-00004.safetensors:  23%|██▋         | 902M/3.95G [00:26<01:29, 33.9MB/s]
Fetching 14 files:  43%|█████████████▋                  | 6/14 [00:26<00:35,  4.48s/it]
model-00003-of-00004.safetensors:  22%|██▋         | 849M/3.86G [00:26<01:34, 32.0MB/s]
Traceback (most recent call last):
  File "<string>", line 7, in <module>
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py", line 296, in snapshot_download
    thread_map(
  File "/opt/venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py", line 270, in _inner_hf_hub_download
    return hf_hub_download(
           ^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 860, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1009, in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1543, in _download_to_tmp_and_move
    http_get(
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 455, in http_get
    temp_file.write(chunk)
OSError: [Errno 28] No space left on device

As you can see, the script does properly signal the failure by raising an OSError: [Errno 28] No space left on device. I may be missing something, but it's the responsibility of the server application to catch this error, handle the failure (with custom logging, for example) and shut down the server. We don't raise the error at the beginning for the reasons mentioned in the earlier comment above.
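
On the application side, catching that error might look like the sketch below. `download_or_exit` is a hypothetical wrapper, not an API of `huggingface_hub` or sglang; it takes any zero-argument callable so the example stays independent of the download function actually used.

```python
import errno
import sys
from typing import Callable


def download_or_exit(download: Callable[[], str]) -> str:
    """Run a download callable; exit with status 1 if the disk fills up.

    Hypothetical wrapper for illustration. Other OSErrors are re-raised
    unchanged so the caller can handle them separately.
    """
    try:
        return download()
    except OSError as exc:
        if exc.errno == errno.ENOSPC:  # [Errno 28] No space left on device
            print(f"Model download failed: {exc}", file=sys.stderr)
            raise SystemExit(1) from exc
        raise
```

A server could call it as `download_or_exit(lambda: snapshot_download("Qwen/Qwen2.5-7B-Instruct"))` before starting to serve, so a full disk stops the process instead of leaving it running without weights.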

@Wauplin
Contributor

Wauplin commented Jan 14, 2025

Agree with @hanouticelina here. The check you are referring to is made before the file is actually downloaded, to warn the user early. We don't want to raise an exception at this stage for the reason explained above. But in any case, an exception will be raised when the disk space actually runs out.
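
For applications that would rather fail at the early check, one possible workaround is Python's standard warnings filter, which can promote a specific warning to an exception. The sketch below assumes the warning is a `UserWarning` whose text starts with "Not enough free disk space", as in the log above; that wording is not a stable interface and could change between versions. `make_disk_space_warning_fatal` is a hypothetical helper name.

```python
import warnings


def make_disk_space_warning_fatal() -> None:
    """Turn the early disk-space UserWarning into a raised exception.

    Assumption: the message prefix matches the one shown in the log above.
    The filter is process-wide and matches the start of the warning message.
    """
    warnings.filterwarnings(
        "error",
        message="Not enough free disk space",
        category=UserWarning,
    )
```

Calling this once at startup, before any download, would make a matching warning raise `UserWarning` at the point where it is issued, at the cost of also failing in the environments where the free-space reading is inaccurate.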

@rdodev

rdodev commented Jan 23, 2025

FWIW, I agree with @atbe. This is not my application code. I shouldn't have to guess whether the error is a false positive (or even care if it is). There should be an option to force an exit on a disk space error, even if the default remains the same.

4 participants