
Model downloads just *hang* #1186

Closed
vgoklani opened this issue Oct 22, 2023 · 45 comments

Comments

@vgoklani

vgoklani commented Oct 22, 2023

System Info

tag=ghcr.io/huggingface/text-generation-inference:1.1.0
model_id=HuggingFaceH4/zephyr-7b-alpha
# model_id=mistralai/Mistral-7B-Instruct-v0.1
validation_workers=2
sharded=false
max_concurrent_requests=4
max_total_tokens=4096
hostname=0.0.0.0
port=3001
gpuDeviceString="device=0"

docker run --net=host -i -t \
    --gpus ${gpuDeviceString} \
    --shm-size 32g \
    --volume ${MODELS_DIRECTORY}/deeplearning/huggingface:/data \
    -e HUGGING_FACE_HUB_TOKEN=${HUGGINGFACE_TOKEN} \
    ${tag} \
    --model-id ${model_id} \
    --validation-workers ${validation_workers} \
    --sharded ${sharded} \
    --max-concurrent-requests ${max_concurrent_requests} \
    --max-total-tokens ${max_total_tokens} \
    --hostname ${hostname} \
    --port ${port}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the docker script above, and then check the download folder. I can see the first 3 safetensors download successfully, and then it just hangs.

Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:1.1.0
2023-10-22T13:01:16.673958Z  INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-alpha", revision: None, validation_workers: 2, sharded: Some(false), num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 4, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 3001, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-22T13:01:16.674057Z  INFO download: text_generation_launcher: Starting download process.
2023-10-22T13:01:21.453462Z  INFO text_generation_launcher: Download file: model-00001-of-00008.safetensors

2023-10-22T13:01:48.297410Z  INFO text_generation_launcher: Downloaded /data/models--HuggingFaceH4--zephyr-7b-alpha/snapshots/5d2b3629ad1799c7e5306e606e0052246098883b/model-00001-of-00008.safetensors in 0:00:26.

2023-10-22T13:01:48.297570Z  INFO text_generation_launcher: Download: [1/8] -- ETA: 0:03:02

2023-10-22T13:01:48.298299Z  INFO text_generation_launcher: Download file: model-00002-of-00008.safetensors

2023-10-22T13:02:45.464068Z  INFO text_generation_launcher: Downloaded /data/models--HuggingFaceH4--zephyr-7b-alpha/snapshots/5d2b3629ad1799c7e5306e606e0052246098883b/model-00002-of-00008.safetensors in 0:00:57.

2023-10-22T13:02:45.464139Z  INFO text_generation_launcher: Download: [2/8] -- ETA: 0:04:12

2023-10-22T13:02:45.464530Z  INFO text_generation_launcher: Download file: model-00003-of-00008.safetensors

2023-10-22T13:03:11.724879Z  INFO text_generation_launcher: Downloaded /data/models--HuggingFaceH4--zephyr-7b-alpha/snapshots/5d2b3629ad1799c7e5306e606e0052246098883b/model-00003-of-00008.safetensors in 0:00:26.

2023-10-22T13:03:11.724954Z  INFO text_generation_launcher: Download: [3/8] -- ETA: 0:03:03.333335

2023-10-22T13:03:11.725362Z  INFO text_generation_launcher: Download file: model-00004-of-00008.safetensors

When I switched the model_id to mistralai/Mistral-7B-Instruct-v0.1 the first tensor gets downloaded and then the process hangs.

Expected behavior

I expect this to just work out of the box, since it's using docker, and there is literally nothing for me to screw up ;)

@vgoklani
Author

vgoklani commented Oct 22, 2023

To expand on this further, I restarted the download for mistralai/Mistral-7B-Instruct-v0.1 and it created a new tempfile and made it to 9.3G before hanging. There are no logs, and no errors, which makes it impossible to debug. Moreover, I'm happy to manually download the files via wget (which is both transparent and fast) but the download structure is very confusing. Where do I put the downloaded pytorch tensors? Is it possible to run the tokenizer client-side and just pass in the input_ids?
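
One way to side-step the built-in downloader is to pre-populate the Hub cache with huggingface_hub and let TGI pick it up. A minimal sketch, assuming the cache directory is the host path mounted at /data in the docker command above (the path below is a placeholder):

from huggingface_hub import snapshot_download

# Pre-populate the cache that TGI mounts at /data. cache_dir is a placeholder
# for the host directory from the docker command above; adjust to your setup.
snapshot_download(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    cache_dir="/path/to/deeplearning/huggingface",
    allow_patterns=["*.safetensors", "*.json", "*.model"],
)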

@OlivierDehaene
Member

Are you sure you have enough space in the path that the ${MODELS_DIRECTORY} env var points to?

@vgoklani
Author

@OlivierDehaene There are no space issues (over 1TB free). Is it possible to enable logging so we can see the error?

@vgoklani
Author

The downloader is definitely flaky...

When downloading teknium/OpenHermes-2-Mistral-7B, which has two files, the first file downloaded immediately, but the second file hung at 4.3G.

We then killed and restarted the docker process and of course it re-downloaded the first file, and then another copy of the second file, then hung again at 4.3G.

We then deleted the tmp files and re-ran the download, and it worked correctly...

This morning we tried downloading HuggingFaceH4/zephyr-7b-beta and got stuck on the 4th file

We disabled HF_HUB_ENABLE_HF_TRANSFER:

-e HF_HUB_ENABLE_HF_TRANSFER="false" \

but still don't see any logging. It's impossible for us to debug this when we can't see the errors.

Please help.

@StephennFernandes

@vgoklani hey, were you able to fix this? I'm facing a similar issue.

@vgoklani
Author

vgoklani commented Oct 28, 2023

@StephennFernandes Something is definitely flaky; unfortunately, we can't see the logs.

One thing to do is disable HF_HUB_ENABLE_HF_TRANSFER:

-e HF_HUB_ENABLE_HF_TRANSFER="false" \

and then, when the downloads get stuck, kill the docker process, delete the tmp* files, and restart. Do this several times, cross your fingers, and hope it eventually finishes successfully.
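
Roughly, the loop looks like this (a sketch; the container name and host cache path are placeholders for whatever your setup uses):

docker stop tgi
rm -rf ${MODELS_DIRECTORY}/deeplearning/huggingface/tmp*
# then re-run the original docker run command and hope the download completes this time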

The problem is most likely server-side, but unfortunately it's hard to be sure since we can't see the logs.

@vgoklani
Author

vgoklani commented Nov 6, 2023

Any updates on this, @OlivierDehaene?

I know your team is busy, but we're trying to download this:

TheBloke/deepseek-coder-33B-instruct-AWQ

The first file downloads successfully into the temp directory (we see 6.4G), and then nothing happens. The second file does not start downloading, and there are no updates on the console. We then kill the download and restart; it re-downloads the first file into a new temp folder and then just hangs again.

Could you please help?

We removed the AWQ file and went to the full 33B model, but the same thing keeps happening.

@babeal

babeal commented Nov 16, 2023

We are seeing the same issue with vilsonrodrigues/falcon-7b-instruct-sharded. You can see where it's hanging with the following commands:

docker exec -it <container-id> /bin/bash
ps -ef

The process output

root         1     0  0 15:58 ?        00:00:00 text-generation-launcher --model-id vilsonrodrigues/falcon-7b-instruct-sharded --num-shard 1 --quantize bitsandbytes
root         8     1  3 15:58 ?        00:00:43 /opt/conda/bin/python3.9 /opt/conda/bin/text-generation-server download-weights vilsonrodrigues/falcon-7b-instruct-sharded --extension .safetensors --logger-level INFO --json-output

Then run strace to see the last output of the download-weights command

apt-get update && apt-get install strace -y && strace -p 8

Always hangs at the exact same spot

write(2, "\rDownloading (\342\200\246)of-00015.safet"..., 97

Update:
Occurs with HF_HUB_ENABLE_HF_TRANSFER: "0"; testing with "1", and it has completed the download a few times, so still testing.

@VfBfoerst

Cheers, are you all behind a corporate proxy? We are seeing the same problem (behind a corporate proxy); maybe the proxy is the problem.

@vgoklani
Author

nope, tested in multiple environments!

@erfanium

erfanium commented Nov 22, 2023

Same here, no activity from the docker container.
This is my docker-compose.yml file for the TheBloke/deepseek-coder-1.3B-base-AWQ model:

version: "3.8"
services:
  hf-inference:
    restart: always
    image: ghcr.io/huggingface/text-generation-inference:1.1.1
    command: --model-id TheBloke/deepseek-coder-1.3B-base-AWQ
    volumes:
      - "./data:/data"
    ports:
      - 8080:80
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

The model has already downloaded to a temp file (854MB of 895MB downloaded).

@erfanium

@VfBfoerst No, I use a plain internet connection without a proxy, and I get the same issue.

@poojitharamachandra

hi, any solution yet?

@Narsil
Collaborator

Narsil commented Nov 30, 2023

We have never been able to reproduce this.

Can you provide more information about the environment you are running in?
Also try BOTH

HF_HUB_ENABLE_HF_TRANSFER=0 and =1 (in the docker environment).

It really looks like some kind of environment issue, but we really cannot help if we cannot reproduce it.

@erfanium

erfanium commented Nov 30, 2023

@Narsil
This problem originates from the huggingface-hub downloader module. I've tried using both HF_HUB_ENABLE_HF_TRANSFER=0 and HF_HUB_ENABLE_HF_TRANSFER=1, but the issue persists in both cases.

Reproducing this issue is quite challenging, and the problem doesn't always occur (especially if you have a high-quality internet connection).
I think you can start by testing the downloader with an unstable internet connection.
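
For example, one rough way to simulate a flaky link on Linux is netem (just a sketch; it needs root, and eth0 is a placeholder for your interface):

tc qdisc add dev eth0 root netem delay 200ms 50ms loss 5%
# run the download test, e.g. text-generation-server download-weights <model-id>
tc qdisc del dev eth0 root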

@vgoklani
Author

vgoklani commented Dec 1, 2023

The downloads are indeed unstable, sometimes they work and sometimes they don't :)

We don't have any issues using huggingface_hub.snapshot_download and our data-center has a very stable internet connection :)

@Narsil
Collaborator

Narsil commented Dec 1, 2023

We deploy many models per day without any hitch, ever.
I don't doubt you are seeing issues, but from my perspective I cannot see the problem, and therefore I cannot help.

Everything is using hf_hub_download, not snapshot_download, although it shouldn't make any difference.
There can technically be an issue with hf_transfer if it tries to open too many file descriptors for your machine, but deactivating it doesn't seem to help, so you're on raw Python at that point.

I'm sorry, but I really cannot help with problems in your environment. At this point, just try to isolate the problem so we can actually help.

Launching text-generation-server download-weights, for instance, will run only that process, which could help in figuring it out. Please deactivate hf_transfer so we stay in pure Python and you can see which part of requests is blocking.

I strongly suspect a bug in the environment: something limiting the file descriptors, some hard CPU rate limiting on the host, some firewall issue.

@VfBfoerst

Hey, I found a workaround (at least in my case).
I started the container with podman (image: ghcr.io/huggingface/text-generation-inference:1.2). But instead of using the default entrypoint, I overrode it with an interactive bash (--entrypoint /bin/bash).
Then I started the download manually:
text-generation-server download-weights bigcode/starcoder
This download succeeds and downloads all weights as expected; it did not stop after the first weight, which was the case when downloading via text-generation-launcher.
Then, I started the launcher:
text-generation-launcher --model-id bigcode/starcoder

It detected the already downloaded weights and started the webserver. After that, the generation via API worked as expected.
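
Put together, the workaround looks roughly like this (a sketch assuming podman and the 1.2 image; docker accepts the same flags, and GPU options are omitted for brevity, so add your usual GPU flags if you also start the launcher inside this container):

podman run --rm -it \
    --entrypoint /bin/bash \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:1.2

# inside the container:
text-generation-server download-weights bigcode/starcoder
text-generation-launcher --model-id bigcode/starcoder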

@vgoklani
Author

I was watching this video on Mixtral where the speaker is discussing TGI and then he says "well there are some issues downloading weights when using TGI" ...

https://youtu.be/EXFbZfp8xCI?si=TEv3dvWgI3hZKwWJ&t=928

He's running this on RunPod, which I believe has a "very stable network"

@peterschmidt85

I'm experiencing the same issue on certain networks with MODEL_ID=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ.
The model downloads 100% but hangs at the end, before the temp file is converted into the actual model file.

MODEL_ID=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ text-generation-launcher --hostname 0.0.0.0 --port 8000 --trust-remote-code --quantize gptq
2023-12-26T20:32:14.790191Z  INFO text_generation_launcher: Args { model_id: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), speculate: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-12-26T20:32:14.790275Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ` do not contain malicious code.
2023-12-26T20:32:14.790469Z  INFO download: text_generation_launcher: Starting download process.
2023-12-26T20:32:18.089652Z  INFO text_generation_launcher: Download file: model.safetensors

Hangs forever

du -ha /data

23G     /data/tmppkarr8tk
4.0K    /data/version.txt
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/refs/main
8.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/refs
0       /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist/0f81ba4680ccd2bce163334b93305d40b9e27b09/adapter_config.json
0       /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist/0f81ba4680ccd2bce163334b93305d40b9e27b09/medusa_lm_head.pt
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist/0f81ba4680ccd2bce163334b93305d40b9e27b09
8.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/snapshots/0f81ba4680ccd2bce163334b93305d40b9e27b09
8.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/snapshots
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/blobs
32K     /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ
0       /data/.locks/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/5dec37d3dc6c82673e6d65871a7fcd581b165dbabce4eaf0a1fa83d96fd4b213.lock
4.0K    /data/.locks/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ
8.0K    /data/.locks
23G     /data

If I invoke text-generation-server download-weights TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ manually, the issue doesn't occur. Also, if I run text-generation-server download-weights, I actually see the download progress, regardless of whether I enable or disable HF_HUB_ENABLE_HF_TRANSFER. Does that mean that downloading is not accelerated?

@BEpresent

I was watching this video on Mixtral where the speaker is discussing TGI and then he says "well there are some issues downloading weights when using TGI" ...

https://youtu.be/EXFbZfp8xCI?si=TEv3dvWgI3hZKwWJ&t=928

He's running this on RunPod, which I believe has a "very stable network"

Encountering the same issue while waiting for Mixtral to download, without any feedback. It stays stuck in the download phase at Download file: model ... (waited over an hour, and tested on 3 different cloud instances).

@josephrocca

@Narsil I can reliably reproduce this on Runpod with ghcr.io/huggingface/text-generation-inference:1.3.4 and:

--quantize gptq --model-id TheBloke/goliath-120b-GPTQ --revision gptq-3bit--1g-actorder_True --max-total-tokens 4096

Please give it a try.

Also, like peterschmidt85, I find that the value of the HF_HUB_ENABLE_HF_TRANSFER environment variable has no effect on whether or not logs are shown, both when using Runpod's environment variable settings, and when adding it before text-generation-server download-weights during manual download.

@RDearnaley

RDearnaley commented Jan 5, 2024

@Narsil I'm also experiencing this issue, for downloading Mixtral-8x7B-Instruct-v0.1 (both the original, and also a duplicate I made on HF), under Kubernetes on Azure using the ghcr.io/huggingface/text-generation-inference:1.3.4 image. Let me know if you want a copy of the Kubernetes yaml files — it's extremely replicable. That model has 19 ~4.9GB .safetensor chunks, the download consistently hangs after either 12, 13, or 14 chunks. I have exec-ed to the pod and confirmed that the downloader process is still running (it eventually times out around 20 minutes later), and that the machine has plenty of disk space. Looking in the downloaded blobs directory, it appears to have all full-sized chunks with no partial chunk, so possibly the hang is during the process of starting the download of the next chunk.

I found #354 from back in May which sounded very similar. A poster on that issue thread claimed to have worked around the issue by passing the option --net=host to Docker. I tried what I believe is the Kubernetes equivalent:

spec:
  template:
    spec:
      hostNetwork: true

without success. I also confirmed that if I exec into the pod and manually run the downloader command:

/opt/conda/bin/python3.10 /opt/conda/bin/text-generation-server download-weights mistralai/Mixtral-8x7B-Instruct-v0.1 --extension .safetensors --logger-level INFO --json-output

then that completes just fine, and the text-generation-launcher spawns another weights download, which picks up the downloaded model. So now I have a running pod that I can only restart or scale via a slow manual process, thus defeating the entire point of Kubernetes.

This issue has already been open for 2 1/2 months, and is a blocker, sadly, so I'll be switching us over to using vLLM (yes, I searched, and they don't have a similar issue reported).

@OlivierDehaene
Member

@RDearnaley I wouldn't say that the issue is "extremely replicable". We have yet to see a reproducible example, and we do not have the issue in any of our prod environments, our integration tests, or even our dev VMs. It's extremely hard to make progress on this issue under these conditions.

Even in this thread, we have seen numerous talented people trying to reproduce and figure out what is happening, and nobody has a clue yet besides @peterschmidt85, who found that it might be linked to temporary files.

For example you state:

I also confirmed that if I exec into the pod and manually run the downloader command then that completes just fine.

The launcher runs the same command.

So now I have a running pod that I can only restart or scale via a slow manual process, thus defeating the entire point of Kubernetes.

so I'll be switching us over to using vLLM (yes, I searched, and they don't have a similar issue reported).

Sassy comments never lead to magic fixes; they just lead to open source dev burnout.
Be aware that vLLM also uses the huggingface_hub lib to download the weights.

@RDearnaley

RDearnaley commented Jan 13, 2024 via email

@josephrocca

josephrocca commented Jan 13, 2024

The sass was definitely uncalled for.

@OlivierDehaene @Narsil Here is a step by step replication which always produces the infinite "hang" for me, and I've tried it probably a dozen times with different machines on Runpod, including Community Cloud and Secure Cloud. This is exactly what I did several times a week ago, and I've just tested these steps again today on two different runpod machines (one in Community, and one in Secure).

  1. Visit https://www.runpod.io/console/gpu-secure-cloud and click the "deploy" button for a 2x RTX 4090 machine.
  2. It'll take you to a page with a "Customize Deployment" button - click it.
  3. Paste ghcr.io/huggingface/text-generation-inference:1.3.4 in place of the Docker Image Name input.
  4. Paste --quantize gptq --model-id TheBloke/goliath-120b-GPTQ --revision gptq-3bit--1g-actorder_True --max-total-tokens 4096 in place of the Docker Command input.
  5. Increase Volume Disk to 200GB
  6. Click "Set Overrides" to save those inputs.
  7. Click "Continue"
  8. Click "Deploy"

Here's a video of that process:

simplescreenrecorder-2024-01-13_19.34.44.mp4

And then click the "Pods" page via the button in the side bar and the machine will be listed. Once it starts up, click the logs button and go to the "Container Logs" tab. You'll see it either:

  1. Hang forever, or:
  2. (Very rarely) hang for a while and then show an error (which hints at using HF_HUB_ENABLE_HF_TRANSFER=0, but that doesn't change it for me), and then retry and hang forever.

So far as I can see this is 100% replicable with the above steps. That is of course not to say that this makes it easy to fix, but I hope this helps the investigation somewhat.

@maziyarpanahi
Contributor

maziyarpanahi commented Jan 14, 2024

I have a simple setup for my personal tests: no K8s, local storage, no proxy, and basic Docker, running TGI with several 70B models and many smaller ones. The downloads generally work smoothly. However, occasionally I encounter a model with one large model.safetensors file. It's not split into smaller files, and while downloading, the speed often drops from 900 MBit/s to 0 after 20-30GB, making it challenging to maintain a single download stream. This results in frequent stops and restarts, creating multiple tmp_ directories each time.

35G     data/tmp656691ko
23G     data/tmp_d9232pz
23G     data/tmpexwlks4h
35G     data/tmpgv2_dt0_
23G     data/tmph7k75wio

These are my attempts to download a model that has one 36GB model.safetensors file. It even died twice after downloading almost 35GB. I don't have any issue downloading even larger models if they have been sharded to a max of 10GB per file, for instance.

PS: I forgot to share my personal workarounds. If I download the model via AutoModelForCausalLM, I just copy the files to my TGI data/ volume. If it's my own model, I reshard it by saving it with the max shard size set to 9GB, roughly as sketched below. Either of these solves my problem of downloading one large file.
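
A rough sketch of that resharding workaround (the model id is a placeholder, and it assumes enough CPU RAM to load the model):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # placeholder, substitute your own repo

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Write the weights back out in <= 9GB shards so no single file is huge
model.save_pretrained("resharded-model", max_shard_size="9GB")
tokenizer.save_pretrained("resharded-model")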

@martinkozle

Maybe this comment isn't very useful as I don't have any new insights to add.
I had the same issue where I restarted the container a bunch of times and waited, and it never finished downloading. Then I decided to try the suggestion here to change the entrypoint, exec into the container, and manually run the command I found in ps -ef, and it worked the first time...
I currently only have 2 deployments so I can manually work around this, but it is definitely an annoying bug.

@lenntt

lenntt commented Jan 19, 2024

I've been struggling with this issue for a while now, getting very similar errors, while having no problem downloading via text-generation-server download-weights $model.

For me it was the following (facepalm incoming, but I hope it helps others): I blindly copied volume=$PWD/data from the docs, but ./data didn't exist. Changing the volume to an existing location seemed to start the download for me.
A clearer error would've been nice.

I still get some unclear hangs, but a restart gets me a bit further, so it's downloading bit by bit, but at least it gets me started :)

@erfanium

erfanium commented Jan 21, 2024

I'm happy to manually download the files via wget (which is both transparent and fast) but the download structure is very confusing.

I totally agree with this suggestion from @vgoklani. It seems this issue won't be resolved anytime soon, so it would be really helpful to be able to serve models from a pre-downloaded model directory, without having to use huggingface_hub.

@mssalvatore

mssalvatore commented Jan 23, 2024

The culprit

I think I've tracked this down, and it's not at all what I expected: the progress bar in huggingface_hub:

https://github.com/huggingface/huggingface_hub/blob/e80cb600406127a9eb357f1843457a9fc4675e89/src/huggingface_hub/file_download.py#L544-L551

What I did

Like others have mentioned, I ran text-generation-server download-weights manually and did not see any failures. I attempted to use pudb to track down what was failing, but as no failures were encountered that was a dead end.

With the magic of print debugging, I added my own debug statements and wrote them out to a file. Here's a diff of my changes:

diff --git a/src/huggingface_hub/file_download.py b/src/huggingface_hub/file_download.py
index abc82e1..428afe1 100644
--- a/src/huggingface_hub/file_download.py
+++ b/src/huggingface_hub/file_download.py
@@ -18,6 +18,11 @@ from pathlib import Path
 from typing import Any, BinaryIO, Dict, Generator, Literal, Optional, Tuple, Union
 from urllib.parse import quote, urlparse
 
+def write_debug(msg: str):
+    with open("/data/debug-log.txt", "a") as f:
+        f.write(msg)
+        f.flush()
+
 import requests
 from filelock import FileLock
 
@@ -375,6 +380,7 @@ def _raise_if_offline_mode_is_enabled(msg: Optional[str] = None):
 def _request_wrapper(
     method: HTTP_METHOD_T, url: str, *, follow_relative_redirects: bool = False, **params
 ) -> requests.Response:
+    write_debug(f"{method}: {url}\n")
     """Wrapper around requests methods to add several features.
 
     What it does:
@@ -422,7 +428,12 @@ def _request_wrapper(
         return response
 
     # Perform request and return if status_code is not in the retry list.
-    response = get_session().request(method=method, url=url, **params)
+    write_debug("calling get_session()\n")
+    s = get_session()
+    write_debug("finished get_session()\n")
+    write_debug("Making request\n")
+    response = s.request(method=method, url=url, **params)
+    write_debug("finished request\n")
     hf_raise_for_status(response)
     return response
 
@@ -460,6 +471,7 @@ def http_get(
                     " available in your environment. Try `pip install hf_transfer`."
                 )
 
+    #import pudb; pu.db
     initial_headers = headers
     headers = copy.deepcopy(headers) or {}
     if resume_size > 0:
@@ -512,6 +524,7 @@ def http_get(
                     "using `pip install -U hf_transfer`."
                 )
             try:
+                write_debug("before hf_transfer.download()\n")
                 hf_transfer.download(
                     url=url,
                     filename=temp_file.name,
@@ -522,7 +535,9 @@ def http_get(
                     max_retries=5,
                     **({"callback": progress.update} if supports_callback else {}),
                 )
+                write_debug("after hf_transfer.download()\n")
             except Exception as e:
+                write_debug("hf_transfer.download() runtime error\n")
                 raise RuntimeError(
                     "An error occurred while downloading using `hf_transfer`. Consider"
                     " disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling."
@@ -538,14 +553,24 @@ def http_get(
             return
         new_resume_size = resume_size
         try:
+            write_debug("before chunk iter\n")
             for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                 if chunk:  # filter out keep-alive new chunks
+                    write_debug("got chunk\n")
+                    write_debug(f"len(chunk): {len(chunk)}\n")
+                    write_debug("updating progress\n")
                     progress.update(len(chunk))
+                    write_debug("progress updated\n")
                     temp_file.write(chunk)
+                    write_debug("temp_file.write() returned\n")
                     new_resume_size += len(chunk)
                     # Some data has been downloaded from the server so we reset the number of retries.
                     _nb_retries = 5
+                else:
+                    write_debug("not chunk!!\n")
+            write_debug("after chunk iter\n")
         except (requests.ConnectionError, requests.ReadTimeout) as e:
+            write_debug(f"Chunk iter error {e}\n")
             # If ConnectionError (SSLError) or ReadTimeout happen while streaming data from the server, it is most likely
             # a transient error (network outage?). We log a warning message and try to resume the download a few times
             # before giving up. Tre retry mechanism is basic but should be enough in most cases.
@@ -1457,6 +1482,8 @@ def hf_hub_download(
                 _check_disk_space(expected_size, os.path.dirname(blob_path))
                 if local_dir is not None:
                     _check_disk_space(expected_size, local_dir)
+            else:
+                write_debug("Warning! Expected_size is None\n")
 
             http_get(
                 url_to_download,

I then ran the modified container with the debugging output and monitored both the temporary file size and the output file. I was able to see that the tiiuae/falcon-7b-instruct model consistently hung. Each time it hung, the last thing printed in my output file was "updating progress". Note that the chunk size was always exactly the same.

Hypothesis

It's my hypothesis that, because the docker container is not displaying stdout, the STDOUT buffer or some other output/tty buffer fills up and, once full, progress bar updates block, causing the download to inexplicably hang.

Why this might make sense

  1. It seems to be what my debug logging indicates, consistently.
  2. Commenting out the progress bar update line allows downloads to complete, consistently.
  3. It would explain why when text-generation-server download-weights is run manually, no one can reproduce the issue.
  4. It would explain why users on "very stable networks" can still reproduce the issue (i.e. it's not a networking issue).

Why this might not make sense

  1. If a buffer were a certain size, I would expect it to block at the same time during each download. However, my downloads hung at different points.
  2. I would expect this issue to be more widespread and reproducible than it appears to be if it really had to do with the progress bar.

UPDATE!

To further test this hypothesis, I added the following loop just before the download iterates:

@@ -538,14 +553,29 @@ def http_get(
             return
         new_resume_size = resume_size
         try:
+            write_debug("Trying to fill buffers...")
+            for i in range (0, 10_000_000):
+                write_debug(f"progress: {i}\n")
+                print("a"*64)
+                print("a"*64, file=sys.stderr)
+            write_debug("before chunk iter\n")
             for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                 if chunk:  # filter out keep-alive new chunks

The program consistently freezes after 1006 iterations of the new loop.
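
For what it's worth, 1006 iterations of 64 characters plus a newline is right around 64 KiB, which matches the default Linux pipe buffer size. A standalone sketch (not TGI code, just an illustration of the same back-pressure mechanism) reproduces the numbers:

import os

r, w = os.pipe()           # nothing ever reads from r, like an undrained container log stream
os.set_blocking(w, False)  # non-blocking so we can measure instead of hanging

written = 0
try:
    while True:
        written += os.write(w, b"a" * 65)  # 64 chars + newline, as in the loop above
except BlockingIOError:
    print(f"pipe filled after {written} bytes")  # typically just under 65536 on Linux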

@skydiablo

skydiablo commented Jan 24, 2024

Just an idea: try to run the container with --log-driver none ... it seems the default logging driver is the json-file driver, which has a maximum buffer size. That could be blocking in the end. Maybe some parameters, like running logging in a non-blocking mode or increasing the buffer size, will help? -> --log-opt mode=non-blocking --log-opt max-buffer-size=4m

see: https://docs.docker.com/config/containers/logging/configure/

@mssalvatore

@skydiablo That looks like it should work, but it doesn't seem to, which casts some doubt on my hypothesis. The documentation for --log-opt mode=non-blocking suggests it should solve the problem if my hypothesis is correct.

I did another experiment to try and confirm my hypothesis. Again, here's my loop:

@@ -538,14 +553,29 @@ def http_get(
             return
         new_resume_size = resume_size
         try:
+            write_debug("Trying to fill buffers...")
+            for i in range (0, 10_000_000):
+                write_debug(f"progress: {i}\n")
+                print("a"*64)
+                print("a"*64, file=sys.stderr)
+            write_debug("before chunk iter\n")
             for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                 if chunk:  # filter out keep-alive new chunks

  1. I removed the line printing to STDOUT and the behavior is the same (hangs after 1006 loops), so it appears that STDERR is the culprit. This is consistent with the hypothesis because tqdm displays the progress bar to STDERR by default.
  2. I reduced the number of bytes from 64 to 32 and then to 16 to see how many loops would complete. The results are as follows:

bytes per loop    # loops completed
64                1006
32                1981
16                3836

This is the behavior even with --log-driver none or --log-opt mode=non-blocking. I really would have expected non-blocking mode to solve the issue.

Unfortunately, the original issue with an unmodded container does not appear reproducible for me at the present moment. The next time I'm able to reproduce the issue with the original container I'll give these logging options a shot to see if they help.

@skydiablo

I tried this already, and in my case it didn't help :(

@mssalvatore

Same for me. Changing the log driver or mode does not seem to have an effect. The only thing I can find that's consistent is that the version where I commented out the progress bar update always works, whereas the download hangs more often than not when I allow the progress bar to update.

@skydiablo

btw, I don't see any progress either?
[screenshot]
???

@mssalvatore

btw, I don't see any progress either? [screenshot] ???

Right. The progress bar is being sent to STDERR, which isn't displayed.

@skydiablo

And there is no way to force it to show, like 2>&1 or something like that?

@puppetm4st3r

For me, with -e HF_HUB_ENABLE_HF_TRANSFER="true" at least I can download at faster speeds; the hang then occurs on the 2nd or 3rd safetensors file, so I re-run the process until all safetensors files are downloaded. But for models with no sharded files you will need a lot of bandwidth (MB/sec).

@mssalvatore

mssalvatore commented Jan 25, 2024

I submitted huggingface/huggingface_hub#2000 to huggingface_hub. Between now and when the fix makes its way into TGI, you can try running your containers with -e HF_HUB_DISABLE_PROGRESS_BARS=1 -e HF_HUB_ENABLE_HF_TRANSFER=0 and see if that resolves the issue.
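
For example, adapting the docker command from the top of this thread (a sketch; the image tag and model id are just ones already mentioned in this issue):

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    -e HF_HUB_DISABLE_PROGRESS_BARS=1 \
    -e HF_HUB_ENABLE_HF_TRANSFER=0 \
    ghcr.io/huggingface/text-generation-inference:1.3.4 \
    --model-id HuggingFaceH4/zephyr-7b-beta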

@OlivierDehaene
Member

OlivierDehaene commented Jan 25, 2024

@mssalvatore it seems that you are right.
The size of this buffer also seems to be system dependent, which is why we didn't run into it.

#1486 should fix the issue.

Wauplin pushed a commit to huggingface/huggingface_hub that referenced this issue Jan 25, 2024
* Disable tqdm progress bar if no TTY attached

When dockerized applications write to STDOUT/STDERR, the applications
can block due to logging back pressure (see
https://docs.docker.com/config/containers/logging/configure/#configure-the-delivery-mode-of-log-messages-from-container-to-log-driver)

HuggingFace's TGI container is one such example (see huggingface/text-generation-inference#1186).

Setting tqdm's `disable=None` will disable the progress bar if no tty is
attached and help to resolve TGI's issue #1186.

References:
    huggingface/text-generation-inference#1186 (comment)
    huggingface/text-generation-inference#1186 (comment)

* Disable tqdm progress bar if no TTY attached in lfs.py
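
For context, disable=None is documented tqdm behavior: the bar is rendered only when the output stream is a TTY. A tiny standalone illustration (not TGI code):

import sys
from tqdm.auto import tqdm

# With disable=None, tqdm shows the bar on an interactive terminal but stays
# silent when stderr is a captured pipe (e.g. a container log driver), so the
# progress output can no longer back up in a full log buffer.
for _ in tqdm(range(1000), disable=None, file=sys.stderr):
    pass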
@OlivierDehaene
Member

Can you guys test ghcr.io/huggingface/text-generation-inference:sha-9c320e2 and tell me if the issue is fixed?

@dtpowl

dtpowl commented Jan 25, 2024

It's fixed for me. Thank you!

@OlivierDehaene
Member

As 1.4 is out with the fix I will close this issue.
Thanks a lot to @mssalvatore, your help was essential in fixing this issue and is greatly appreciated.

helena-intel pushed a commit to helena-intel/text-generation-inference-hf that referenced this issue Feb 1, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this issue Apr 29, 2024
@lwabish

lwabish commented May 24, 2024

Hit this again with the following docker compose file:

version: "3.8"

services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:2.0.3
    volumes:
      - ./data:/data
    restart: "no"
    ports:
      - 8080:80
    command: --model-id teknium/OpenHermes-2.5-Mistral-7B
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
