🐛 Bug
After setting `trainer.ckpt_path` to resume fine-tuning, the fine-tuning scheduler appears to save an `Encoder_ft_schedule.yaml` file on every machine.
However, because all machines share the same remote storage during my training run, these concurrent writes result in an Input/Output error. It may be necessary to decorate the `ScheduleImplMixin.save_schedule` static method with `@rank_zero_only` so that the file is written only once, by the primary node.
As a workaround, I decorated `ScheduleImplMixin.save_schedule` with `@rank_zero_only` myself (roughly as sketched below), and training then proceeded without further I/O errors.
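For reference, a minimal sketch of that workaround. This is not the library's actual source: the method signature and body below are simplified stand-ins; only the class/method names and the decorator come from the report.

```python
from pathlib import Path

import yaml
from lightning.pytorch.utilities import rank_zero_only


class ScheduleImplMixin:
    @staticmethod
    @rank_zero_only  # becomes a no-op on every rank except global rank 0
    def save_schedule(schedule_name: str, schedule: dict, dump_loc: str) -> None:
        """Write the generated fine-tuning schedule to (shared) storage exactly once."""
        dump_path = Path(dump_loc)
        dump_path.mkdir(parents=True, exist_ok=True)
        with open(dump_path / schedule_name, "w") as fp:
            yaml.dump(schedule, fp)
```

With the decorator applied, only the process with global rank 0 performs the write, so concurrent writes to the same shared path no longer occur.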
Environment
Fine-Tuning Scheduler Version (e.g., 0.1.0): 2.1.1
Lightning Version (e.g., 1.5.0): 2.1.1
PyTorch Version (e.g., 2.0): 2.1.1
Python version (e.g., 3.11): 3.9
OS (e.g., Linux): Linux
CUDA/cuDNN version: 11.8
GPU models and configuration: A100 80GB
How you installed PyTorch (conda, pip, source): pip
… `save_schedule` from non-`rank_zero_only` guarded contexts. Explicitly guarding `save_schedule` as well as `gen_ft_schedule` themselves to ensure similar bugs surface during development if those functions are directly accessed in future non-`rank_zero_only` contexts. Includes associated test enhancements.
Thanks for finding and submitting this issue @Davidham3!
As you observed, some codepaths were incorrectly invoking `ScheduleImplMixin.save_schedule` (and, in fact, `ScheduleImplMixin.gen_ft_schedule` as well) from non-`rank_zero_only`-guarded contexts. I've explicitly guarded both methods themselves rather than relying on the rank-zero guarding of upstream codepaths. The fix is included in today's patch release (2.1.2).
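To illustrate the distinction, here is a minimal sketch (not the actual upstream codepaths; the caller below is hypothetical): with `@rank_zero_only` applied to the method itself, even a call site that isn't wrapped in its own rank-zero guard becomes a no-op on non-zero ranks, so a future unguarded caller can't reintroduce the concurrent-write error.

```python
from lightning.pytorch.utilities import rank_zero_only


@rank_zero_only
def save_schedule(schedule_name: str, schedule: dict, dump_loc: str) -> None:
    """Persist the schedule YAML; executes on global rank 0 only."""
    ...  # simplified placeholder for the actual write


def restore_and_resave(trainer):  # hypothetical, non-rank-guarded call site
    # Before the fix, a call like this had to rely on the caller being rank-zero
    # guarded; now the guard lives on save_schedule itself, so this call is safe
    # on every rank even when all nodes write to the same shared storage.
    save_schedule("Encoder_ft_schedule.yaml", {0: {"params": []}}, trainer.default_root_dir)
```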
Thanks again for your contribution. Feel free to reach out anytime if you have other issues or want to share more about your use case.