add world_size in if clause in load_from_http #1396
Hi @tobiasfshr, thanks for your contribution. Please sign the CLA.
Signed it.
Hi, could you explain why |
Sometimes the environment variable LOCAL_RANK is set even before ddp is fully initialized by pytorch, so in get_dist_info the condition |
Thanks for your explanation.
Theoretically, |
Yes, one could also remove L281. But I assume there was a reason for adding it in the first place?
Your consideration is reasonable, I consulted the author, who felt that this line |
Okay, so I'll remove L281 and the explanation but keep the 'or world_size == 1' in the if condition, since this seems like better behavior to me (if world_size == 1, it doesn't make sense to execute .barrier() regardless of the rank value). Sounds good?
In fact, if the world_size is 1, the rank should be 0 once we remove L281.
Okay, removed the check. My point was more that the additional check for world_size would make the code a little more robust to unexpected values of rank, but I get your point: after removing L281, rank should always be 0 if world_size == 1.
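The disagreement between rank and world_size discussed above can be sketched as follows. This is a simplified illustration, not the mmcv code: `resolve` is a hypothetical helper, the fallback of (0, 1) for an uninitialized process group is an assumption about how get_dist_info behaves, and the LOCAL_RANK override is an assumed reading of what L281 does.

```python
def resolve(env, initialized, use_env_override):
    """Hypothetical sketch of how rank and world_size are obtained.

    env: a dict standing in for os.environ.
    initialized: whether the process group is already set up.
    use_env_override: whether the assumed L281-style LOCAL_RANK override runs.
    """
    # Assumption: without an initialized process group, a single process
    # is reported. The values (1, 2) stand in for a real second worker.
    rank, world_size = (1, 2) if initialized else (0, 1)
    if use_env_override:
        # Assumed L281 behavior: LOCAL_RANK replaces the rank, but nothing
        # updates world_size to match.
        rank = int(env.get("LOCAL_RANK", rank))
    return rank, world_size


env = {"LOCAL_RANK": "1"}  # set by the launcher before initialization

# With the override, rank and world_size disagree: rank 1 of 1 process.
print(resolve(env, initialized=False, use_env_override=True))   # (1, 1)

# Without the override, the pair is consistent again.
print(resolve(env, initialized=False, use_env_override=False))  # (0, 1)
```

Under these assumptions, dropping the override is enough to guarantee rank == 0 whenever world_size == 1, which is the argument made for removing L281.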
* add world_size in if clause
* add explanation
* remove LOCAL_RANK check
Hi @tobiasfshr! First of all, we want to express our gratitude for your significant PR in the MMCV project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project during your personal time. We believe that many developers will benefit from your PR. We would also like to invite you to join our Special Interest Group (SIG) private channel on Discord, where you can share your experiences and ideas and build connections with like-minded peers. To join the SIG channel, simply message the moderator, OpenMMLab, on Discord, or briefly share your open-source contributions in the #introductions channel and we will assist you. We look forward to seeing you there! Join us: https://discord.gg/raweFPmdzG If you are Chinese or have WeChat, you are welcome to join our community on WeChat. You can add our assistant: openmmlabwx. Please add "mmsig + Github ID" as a remark when adding friends :)
Motivation
I'm using mmcv within another codebase and I have issues loading a checkpoint over HTTP in a multi-GPU setting.
Specifically, I get the error:
This is because get_dist_info does not correctly get world_size in L280, but rank is correctly inferred via L281. Hence, the variable checkpoint will not be defined on return (rank > 0 but world_size == 1).
Modification
I can mitigate the problem by modifying this line:
mmcv/mmcv/runner/checkpoint.py
Line 282 in f22c9eb
to
if rank == 0 or world_size == 1:
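
To make the failure mode concrete, here is a minimal sketch of the control flow as described in the Motivation section. It is a simplified reconstruction, not the actual mmcv source: `fake_download`, `fake_barrier`, `load_original`, and `load_patched` are hypothetical names, and the branch structure around the barrier is an assumption inferred from the description.

```python
def fake_download():
    """Hypothetical stand-in for fetching the checkpoint over HTTP."""
    return {"state_dict": {}}


def fake_barrier():
    """Hypothetical stand-in for the distributed barrier (a no-op here)."""


def load_original(rank, world_size):
    # Assumed original flow: rank 0 downloads first, the other ranks wait
    # at the barrier and then load what rank 0 has fetched.
    if rank == 0:
        checkpoint = fake_download()
    if world_size > 1:
        fake_barrier()
        if rank > 0:
            checkpoint = fake_download()
    # With rank > 0 and world_size == 1, neither branch assigned checkpoint.
    return checkpoint


def load_patched(rank, world_size):
    # Proposed condition: a single-process run always loads directly.
    if rank == 0 or world_size == 1:
        checkpoint = fake_download()
    if world_size > 1:
        fake_barrier()
        if rank > 0:
            checkpoint = fake_download()
    return checkpoint


try:
    load_original(rank=1, world_size=1)
except UnboundLocalError as err:
    print("original:", type(err).__name__)

print("patched:", load_patched(rank=1, world_size=1))
```

In this sketch the inconsistent pair (rank 1, world_size 1) raises UnboundLocalError in the original flow and returns a checkpoint in the patched one, while consistent pairs such as (0, 1) or (1, 2) behave identically in both.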