🐛 Bug
After setting `trainer.ckpt_path` to resume fine-tuning, the fine-tuning scheduler appears to save an `Encoder_ft_schedule.yaml` file on every machine.
However, because all machines share the same remote storage during my training run, these concurrent writes result in an Input/Output error. It may be necessary to decorate the `ScheduleImplMixin.save_schedule` static method with `@rank_zero_only` so that the file is written only once, by the primary node.
As a workaround, I decorated `ScheduleImplMixin.save_schedule` with `@rank_zero_only` myself (roughly as sketched below), and training then proceeded without further I/O errors.
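For reference, a minimal sketch of that workaround. This is not the library's actual source: the method signature and body below are simplified stand-ins; only the class/method names and the decorator come from the report.

```python
from pathlib import Path

import yaml
from lightning.pytorch.utilities import rank_zero_only


class ScheduleImplMixin:
    @staticmethod
    @rank_zero_only  # becomes a no-op on every rank except global rank 0
    def save_schedule(schedule_name: str, schedule: dict, dump_loc: str) -> None:
        """Write the generated fine-tuning schedule to (shared) storage exactly once."""
        dump_path = Path(dump_loc)
        dump_path.mkdir(parents=True, exist_ok=True)
        with open(dump_path / schedule_name, "w") as fp:
            yaml.dump(schedule, fp)
```

With the decorator applied, only the process with global rank 0 performs the write, so concurrent writes to the same shared path no longer occur.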
Environment
Fine-Tuning Scheduler Version (e.g., 0.1.0): 2.1.1
Lightning Version (e.g., 1.5.0): 2.1.1
PyTorch Version (e.g., 2.0): 2.1.1
Python version (e.g., 3.11): 3.9
OS (e.g., Linux): Linux
CUDA/cuDNN version: 11.8
GPU models and configuration: A100 80GB
How you installed PyTorch (conda, pip, source): pip
… `save_schedule` from non-`rank_zero_only` guarded contexts. Explicitly guarding `save_schedule` as well as `gen_ft_schedule` themselves to ensure similar bugs surface during development if those functions are directly accessed in future non-`rank_zero_only` contexts. Includes associated test enhancements.
Thanks for finding and submitting this issue @Davidham3!
As you observed, some codepaths were incorrectly invoking `ScheduleImplMixin.save_schedule` (and, in fact, `ScheduleImplMixin.gen_ft_schedule` as well) from non-`rank_zero_only`-guarded contexts. I've explicitly guarded both methods themselves rather than relying on the rank-zero guarding of upstream codepaths. The fix is included in today's patch release (2.1.2).
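To illustrate the distinction, here is a minimal sketch (not the actual upstream codepaths; the caller below is hypothetical): with `@rank_zero_only` applied to the method itself, even a call site that isn't wrapped in its own rank-zero guard becomes a no-op on non-zero ranks, so a future unguarded caller can't reintroduce the concurrent-write error.

```python
from lightning.pytorch.utilities import rank_zero_only


@rank_zero_only
def save_schedule(schedule_name: str, schedule: dict, dump_loc: str) -> None:
    """Persist the schedule YAML; executes on global rank 0 only."""
    ...  # simplified placeholder for the actual write


def restore_and_resave(trainer):  # hypothetical, non-rank-guarded call site
    # Before the fix, a call like this had to rely on the caller being rank-zero
    # guarded; now the guard lives on save_schedule itself, so this call is safe
    # on every rank even when all nodes write to the same shared storage.
    save_schedule("Encoder_ft_schedule.yaml", {0: {"params": []}}, trainer.default_root_dir)
```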
Thanks again for your contribution. Feel free to reach out anytime if you have other issues or want to share more about your use case.