not able to use push_to_hub during tpu training #2851

Closed
yiyixuxu opened this issue Mar 27, 2023 · 8 comments
Labels
bug Something isn't working


@yiyixuxu
Collaborator

Describe the bug

Not able to use the --push_to_hub option for TPU training.

I get this error:

Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:59 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].

This is not unique to the train_text_to_image_flax.py script; I'm just using it as an example. Basically, this line always fails when called during training on a TPU: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_flax.py#L584

Reproduction

run the train_text_to_image_flax script here with this command

https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-flaxjax

export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export dataset_name="lambdalabs/pokemon-blip-captions"
export OUTPUT_DIR="/pokemon"
export HUB_MODEL_ID="pokemon-lora"

python3 train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --max_train_steps=150 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub \
  --hub_model_id=${HUB_MODEL_ID} 

Logs

Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:13:38 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:13:48 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:13:58 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:08 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:18 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:28 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:38 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:48 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:14:58 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:09 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:19 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:29 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:39 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:49 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].
03/27/2023 23:15:59 - ERROR - huggingface_hub.repository - Waiting for the following commands to finish before shutting down: [[push command, status code: running, in progress. PID: 772274]].

System Info

tpu-v4-8

@yiyixuxu yiyixuxu added the bug Something isn't working label Mar 27, 2023
@yiyixuxu
Collaborator Author

ohh, sometimes it works (despite the error message)
https://huggingface.co/YiYiXu/fill-circle-controlnet

@sayakpaul
Member

@Wauplin cc.

@Wauplin
Collaborator

Wauplin commented Mar 28, 2023

Hi @yiyixuxu, I took a look at the code (both the script and huggingface_hub internals).

About the error message, here's why it's happening:

  1. The script calls repo.push_to_hub(..., blocking=False) at the end of training. blocking=False means the push runs in the background. However, since this is the last line executed, the script tries to exit right after.
  2. huggingface_hub prevents the script from exiting until all commands have completed. This is done using atexit.register(self.wait_for_commands).
  3. In wait_for_commands, a while loop checks every second whether all commands have completed. If not, it logs the error message "Waiting for the following commands to finish before shutting down (...)" and waits for 10 seconds.

=> So actually, this is not an error and the script works exactly as expected. If you wait long enough, the push_to_hub command will eventually be completed and your script will gracefully exit.
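The three steps above can be sketched in a few lines. This is a minimal illustration, not the huggingface_hub implementation: the class name `BackgroundCommands`, the `start` helper, and the `poll_interval` and `log_every` parameters are all made up for this example, and the background "push" is just a child process that sleeps.

```python
import atexit
import logging
import subprocess
import sys
import time

logger = logging.getLogger("sketch.repository")


class BackgroundCommands:
    """Hypothetical stand-in for the command tracking described above."""

    def __init__(self, poll_interval=1.0, log_every=10):
        self.poll_interval = poll_interval  # seconds between checks
        self.log_every = log_every          # log once every N checks
        self._procs = []
        # Step 2: make the interpreter wait for unfinished commands on exit.
        atexit.register(self.wait_for_commands)

    def start(self, seconds):
        # Step 1: launch a "push" in the background and return immediately.
        # A sleeping child process plays the role of the real push here.
        proc = subprocess.Popen(
            [sys.executable, "-c", f"import time; time.sleep({seconds})"]
        )
        self._procs.append(proc)
        return proc

    @property
    def commands_in_progress(self):
        # poll() returns None while the child process is still running.
        return [p for p in self._procs if p.poll() is None]

    def wait_for_commands(self):
        # Step 3: keep checking until nothing is running, logging now and then.
        polls = 0
        while self.commands_in_progress:
            if polls % self.log_every == 0:
                logger.error(
                    "Waiting for %d command(s) to finish before shutting down",
                    len(self.commands_in_progress),
                )
            polls += 1
            time.sleep(self.poll_interval)
        return polls
```

With this sketch, a script that calls `start(...)` as its last statement does not exit until the child process finishes, and the "Waiting for..." line is printed in the meantime, which matches the behaviour seen in the logs.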

=> I think the only problem is that we log the "waiting for..." message as an ERROR, which is misleading. Since it was implemented 18 months ago (huggingface/huggingface_hub#315) and is still widely used, I'm a bit reluctant to change it without a second opinion. @LysandreJik @sgugger is that still used a lot in transformers as well? Would it be ok to log only a warning to make it less scary for users?

@Wauplin
Collaborator

Wauplin commented Mar 28, 2023

Another short-term solution for diffusers is to call repo.push_to_hub(..., blocking=True), which makes the script wait for the push at the end of training instead of running it in the background.
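To make the difference between the two modes concrete, here is a small sketch. It does not call huggingface_hub: `push` is a hypothetical function whose `blocking` flag only imitates the behaviour described above, and the upload is replaced by a child process that sleeps.

```python
import subprocess
import sys


def push(blocking=True, seconds=0.2):
    """Hypothetical push: a sleeping child process stands in for the upload."""
    proc = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )
    if blocking:
        # Blocking mode: do not return until the command has finished.
        proc.wait()
    # Non-blocking mode: return right away while the command keeps running.
    return proc
```

In blocking mode the caller gets control back only after the command is done, so nothing is left running when the script reaches its end. In non-blocking mode the command is still in progress on return, which is the situation that triggers the wait at exit.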

@Wauplin
Collaborator

Wauplin commented Mar 28, 2023

I also opened a related issue (#2860) to update the training scripts. It's not about solving an issue but more about improving the UX.

@sgugger
Contributor

sgugger commented Mar 28, 2023

We don't rely on the level of the log in Transformers, so it's completely fine for me if it's downgraded to warning.

@Wauplin
Collaborator

Wauplin commented Mar 28, 2023

Ok, thanks for the quick feedback @sgugger. I think I'll update the log level then. I created an issue for it: huggingface/huggingface_hub#1412.
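As a rough illustration of what the downgrade changes, the sketch below emits the same message once at ERROR and once at WARNING using only the standard logging module. The logger name `sketch.push_status`, the `ListHandler` class, and the `report_waiting` helper are invented for this example and are not part of huggingface_hub.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log records in memory so their levels can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("sketch.push_status")
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the example output out of the root logger
handler = ListHandler()
logger.addHandler(handler)


def report_waiting(level):
    # Same text either way; only the severity attached to it differs.
    logger.log(
        level,
        "Waiting for the following commands to finish before shutting down",
    )


report_waiting(logging.ERROR)    # behaviour reported in this issue
report_waiting(logging.WARNING)  # behaviour after the proposed downgrade
```

Both records reach a handler at the default WARNING threshold, so users would still see the message; it would simply no longer be labelled as an error.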

@yiyixuxu
Collaborator Author

thanks @Wauplin for the clarification! and yeah, downgrading it to a warning will be really helpful :)
