add world_size in if clause in load_from_http #1396

Merged
3 commits merged into open-mmlab:master on Nov 19, 2021

Conversation

tobiasfshr
Contributor

Motivation

I'm using mmcv within another codebase and I have issues with loading a checkpoint from HTTP in a multi-GPU setting.
Specifically, I get the error:

File "/mmcv/runner/checkpoint.py", line 290, in load_from_http
return checkpoint
UnboundLocalError: local variable 'checkpoint' referenced before assignment

This is because get_dist_info does not correctly get world_size in L280, but rank is correctly inferred via L281. Hence, the variable checkpoint will not be defined on return (rank > 0 but world_size == 1).

Modification

I can mitigate the problem by modifying this line:

if rank == 0:

to

if rank == 0 or world_size == 1:
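
For context, here is a simplified sketch of the affected function (paraphrased from mmcv/runner/checkpoint.py around the lines referenced above; the imports and exact call signatures are approximate, not a verbatim copy of the source):

import os

import torch
from torch.utils import model_zoo

from mmcv.runner import get_dist_info


def load_from_http(filename, map_location=None, model_dir=None):
    rank, world_size = get_dist_info()              # L280: falls back to rank=0, world_size=1 if dist is not initialized
    rank = int(os.environ.get('LOCAL_RANK', rank))  # L281: overrides rank from the environment
    if rank == 0:
        checkpoint = model_zoo.load_url(
            filename, model_dir=model_dir, map_location=map_location)
    if world_size > 1:
        torch.distributed.barrier()
        if rank > 0:
            checkpoint = model_zoo.load_url(
                filename, model_dir=model_dir, map_location=map_location)
    # With rank > 0 but world_size == 1, neither download branch runs and
    # 'checkpoint' is never assigned, hence the UnboundLocalError on return.
    return checkpoint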

@CLAassistant

CLAassistant commented Oct 11, 2021

CLA assistant check
All committers have signed the CLA.

@tobiasfshr changed the title from "add world_size in if clause in load_from_htttp" to "add world_size in if clause in load_from_http" on Oct 11, 2021
@zhouzaida
Collaborator

CLA assistant check: Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You have signed the CLA already but the status is still pending? Let us recheck it.

Hi @tobiasfshr, thanks for your contribution. Please sign the CLA.

@tobiasfshr
Contributor Author

Signed it.

@zhouzaida
Collaborator

zhouzaida commented Oct 12, 2021

This is because get_dist_info does not correctly get world_size in L280, but rank is correctly inferred via L281. Hence, the variable checkpoint will not be defined on return (rank > 0 but world_size == 1).

Hi, could you explain why get_dist_info does not correctly get world_size in L280, and why the case rank > 0 but world_size == 1 happens?

@tobiasfshr
Contributor Author

tobiasfshr commented Oct 12, 2021

Sometimes the environment variable LOCAL_RANK is set even before DDP is fully initialized by PyTorch, so in get_dist_info the condition if dist.is_available() and dist.is_initialized(): evaluates to False and the function returns world_size = 1 and rank = 0. However, in L281 rank is set to LOCAL_RANK, and thus rank will be > 0 for the non-master processes.
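
For reference, get_dist_info behaves roughly as follows (a simplified sketch of mmcv.runner.get_dist_info, not a verbatim copy of the source):

import torch.distributed as dist


def get_dist_info():
    if dist.is_available() and dist.is_initialized():
        rank = dist.get_rank()
        world_size = dist.get_world_size()
    else:
        # Distributed backend not (yet) initialized: fall back to
        # single-process defaults, even if LOCAL_RANK is already set.
        rank = 0
        world_size = 1
    return rank, world_size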

@zhouzaida
Collaborator

Sometimes the environment variable LOCAL_RANK is set even before DDP is fully initialized by PyTorch, so in get_dist_info the condition if dist.is_available() and dist.is_initialized(): evaluates to False and the function returns world_size = 1 and rank = 0. However, in L281 rank is set to LOCAL_RANK, and thus rank will be > 0 for the non-master processes.

Thanks for your explanation.

@zhouzaida requested review from nbei and ZwwWayne on October 12, 2021
@zhouzaida mentioned this pull request on Oct 16, 2021
@zhouzaida mentioned this pull request on Oct 27, 2021
@zhouzaida
Collaborator

Sometimes the environment variable LOCAL_RANK is set even before DDP is fully initialized by PyTorch, so in get_dist_info the condition if dist.is_available() and dist.is_initialized(): evaluates to False and the function returns world_size = 1 and rank = 0. However, in L281 rank is set to LOCAL_RANK, and thus rank will be > 0 for the non-master processes.

Theoretically, rank = int(os.environ.get('LOCAL_RANK', rank)) is redundant because the model state of rank 0 will be broadcast to the other processes before the forward pass.
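
A minimal sketch of the DDP behaviour being referred to here (assumed setup; the NCCL backend, the toy Linear model, and 'checkpoint.pth' are placeholders, not taken from this PR):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ['LOCAL_RANK']))
model = torch.nn.Linear(4, 4).cuda()
if dist.get_rank() == 0:
    # Only rank 0 loads the downloaded weights.
    model.load_state_dict(torch.load('checkpoint.pth'))
# Constructing DDP broadcasts rank 0's parameters and buffers to every other
# process, so the non-zero ranks end up with the same weights anyway.
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])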

@tobiasfshr
Contributor Author

Yes, one could also remove L281. But I assume there was a reason for adding it in the first place?

@zhouzaida
Collaborator

Your consideration is reasonable. I consulted the author, who felt that the line rank = int(os.environ.get('LOCAL_RANK', rank)) could be left out. Given how DDP works, this line is not necessary.

@tobiasfshr
Contributor Author

Okay, so I'll remove L281 and the explanation but keep the 'or world_size == 1' in the if condition, since this seems like better behavior to me (if world_size == 1, it doesn't make sense to execute .barrier() regardless of the rank value). Sounds good?

@zhouzaida
Collaborator

Okay, so I'll remove L281 and the explanation but keep the 'or world_size == 1' in the if condition, since this seems like better behavior to me (if world_size == 1, it doesn't make sense to execute .barrier() regardless of the rank value). Sounds good?

In fact, if world_size is 1, the rank should be 0 once we remove L281.

@tobiasfshr
Contributor Author

Okay, removed the check. My point was more that the additional check for world_size would make the code a little more robust to unexpected values of rank, but I get your point: after removing L281, rank should always be 0 if world_size == 1.
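
In code terms, the invariant being agreed on is roughly the following (a sketch of the reasoning only, not the merged diff):

from mmcv.runner import get_dist_info

rank, world_size = get_dist_info()
# With the LOCAL_RANK override (old L281) removed, rank and world_size come
# from the same source: either both from torch.distributed, or the defaults
# rank = 0, world_size = 1. Hence world_size == 1 implies rank == 0, and
# 'if rank == 0' alone already covers the single-process case.
assert world_size > 1 or rank == 0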

@ZwwWayne merged commit bf6ba25 into open-mmlab:master on Nov 19, 2021
zhouzaida pushed a commit that referenced this pull request Nov 27, 2021
* add world_size in if clause

* add explanation

* remove LOCAL_RANK check
@OpenMMLab-Coodinator

Hi @tobiasfshr! First of all, we want to express our gratitude for your significant PR in the MMCV project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project in your personal time. We believe that many developers will benefit from your PR.

We would also like to invite you to join our Special Interest Group (SIG) private channel on Discord, where you can share your experiences and ideas and build connections with like-minded peers. To join the SIG channel, simply message the moderator OpenMMLab on Discord, or briefly share your open-source contributions in the #introductions channel and we will assist you. We look forward to seeing you there! Join us: https://discord.gg/raweFPmdzG

If you are Chinese or have WeChat, you are welcome to join our community on WeChat. You can add our assistant: openmmlabwx. Please add "mmsig + GitHub ID" as a note when adding friends. :)
Thank you again for your contribution ❤
