
LOCAL_RANK not being set in slurm #6797

Closed
Queuecumber opened this issue Apr 1, 2021 · 22 comments · Fixed by #6802
Labels
bug Something isn't working environment: slurm help wanted Open to be worked on priority: 0 High priority task

Comments

@Queuecumber
Contributor

🐛 Bug

A lot of the PTL tooling around multiprocessing depends on a specific environment variable, LOCAL_RANK, being set correctly. It seems that when running in SLURM this isn't set, causing it to return the default of 0 for all processes, which makes every process do things that should only be done on rank 0, like logging.

Also, I'm a little unclear about the name of that variable: if I have multiple nodes, only the global rank 0, not the local rank 0, should be logging, saving checkpoints, etc.
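
For context, the rank-zero utilities read the rank from the environment roughly like this (a simplified sketch of the pattern, not the exact Lightning source):

import os
from functools import wraps

# Simplified sketch: the rank is taken from LOCAL_RANK and falls back to 0
# when that variable is absent. SLURM exports SLURM_PROCID / SLURM_LOCALID
# instead, so every process ends up with rank == 0 and logs/checkpoints.
def rank_zero_only(fn):
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn

rank_zero_only.rank = int(os.environ.get("LOCAL_RANK", 0))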

To Reproduce

Run in SLURM (can't really do it with Colab). A good way to see it is to use the Wandb logger: each process makes a new run in the Wandb UI, which means that @rank_zero_experiment didn't work properly. You can confirm this by printing LOCAL_RANK, which is defaulted to 0 if unset; it will always give back 0.
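
A minimal way to confirm it, assuming a multi-task SLURM allocation (the script name here is just an example):

# check_rank.py - print the rank-related variables from every task
import os

print(
    "SLURM_PROCID:", os.environ.get("SLURM_PROCID"),
    "| SLURM_LOCALID:", os.environ.get("SLURM_LOCALID"),
    "| LOCAL_RANK:", os.environ.get("LOCAL_RANK", "<unset>"),
)

Running it with srun python check_rank.py shows SLURM_PROCID changing per task while LOCAL_RANK stays unset on all of them.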

Expected behavior

LOCAL_RANK is set correctly or the rest of the tooling is aware of the global rank of the process

Environment

Will update if it's really necessary

@Queuecumber Queuecumber added bug Something isn't working help wanted Open to be worked on labels Apr 1, 2021
@Queuecumber
Contributor Author

Adding the following to the SLURMEnvironment constructor does seem to work:

import os
from pytorch_lightning.utilities import rank_zero_only
os.environ["LOCAL_RANK"] = os.environ["SLURM_PROCID"]
rank_zero_only.rank = int(os.environ["SLURM_PROCID"])

but I would like to get some feedback from the developers before proposing this as a solution.

@ananthsub
Contributor

@Queuecumber
Contributor Author

Queuecumber commented Apr 1, 2021

That appears to be correct. In __init__ of SLURMEnvironment:

def __init__(self):
        super().__init__()
        os.environ["LOCAL_RANK"] = os.environ["SLURM_PROCID"]
        rank_zero_only.rank = int(os.environ["SLURM_PROCID"])
        print(f"SLURM LOCALID: {os.environ['SLURM_LOCALID']}")

prints SLURM LOCALID: 0 on rank 0, SLURM LOCALID: 3 on rank 3, etc.

@Queuecumber
Contributor Author

I think this is a fairly straightforward fix, and it makes sense to delegate this responsibility to the SLURMConnector. I just want to double check that:

  1. __init__ is the right place to do it
  2. that there isn't a better way to get rank_zero_only.rank set correctly (this follows a precedent set by the GPUAccelerator)
  3. that the semantics of LOCAL_RANK are actually supposed to be a global rank (in which case there should be some refactoring around the name of that variable but probably outside the scope of a PR related to this issue)

@ananthsub
Contributor

@awaelchli what do you think about setting the rank info on the cluster environment / training type plugin during the Trainer init? Currently we wait for setup_environment in the training plugin to be called, but if someone has launched with SLURM or torchelastic, this information is already available from the environment variables. Only in the case of spawn do we not know this yet, but we can handle that accordingly.

@awaelchli
Contributor

You mean setting rank_zero_only.rank, right? If the LOCAL_RANK variable is set, it will already default to that.
I would avoid the following line in the above snippet:
os.environ["LOCAL_RANK"] = os.environ["SLURM_PROCID"]
In my opinion, Lightning should not write environment variables, only read them, unless there is a very good reason.
If the user needs the local rank, they can access it through trainer.local_rank, and the cluster environment will conveniently translate the SLURM local ID to that.
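
In other words, the translation can stay read-only inside the cluster environment. A rough sketch of that pattern (paraphrased, not the exact SLURMEnvironment source):

import os

class SlurmRankReader:  # illustrative stand-in for the cluster environment role
    def local_rank(self) -> int:
        # read SLURM's own variable instead of exporting LOCAL_RANK
        return int(os.environ["SLURM_LOCALID"])

    def node_rank(self) -> int:
        return int(os.environ["SLURM_NODEID"])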

@awaelchli
Contributor

We can think of a way to set rank_zero_only.rank earlier so that the rank_zero_only etc. decorators work as soon as possible.

@Queuecumber
Contributor Author

We can think of a way to set rank_zero_only.rank earlier so that the rank_zero_only etc. decorators work as soon as possible.

It's not that it needs to be set sooner; it's that it's not set correctly at all. LOCAL_RANK isn't set in a SLURM environment, so if you take a look at the line you linked, it uses a default value of 0 if the environment variable isn't set, which means that every process thinks it's the rank 0 process and writes logs, saves checkpoints, etc.

As far as I can tell, there's no machinery for getting the SLURM-assigned rank into rank_zero_only.rank, so that's what we would need to fix this.

Also, I'm a little concerned about that variable being called LOCAL_RANK, which seems to imply that rank 0 on every machine should be doing logging, checkpointing, and whatever else. It should only be the global rank zero that does those things.

@awaelchli
Contributor

Yes, it defaults to 0, but that's the only assumption you can make if you don't know where else to get the rank information from.
The idea is that later on it gets overwritten, for example here in DDP:

https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/plugins/training_type/ddp.py#L174

It's not that it needs to be set sooner it's that it's not set correctly at all.

no I think what you are experiencing is really just because it's not set sooner. I wish I had slurm but I can't test it :(

@awaelchli
Contributor

In your post you mention wandb creating multiple runs.
I recently fixed this: #6380
Does your version of Lightning include this fix?

@ananthsub
Contributor

@awaelchli I'm not sure if just setting it earlier will work for all use cases, as the rank-zero utilities are used outside of the Trainer (e.g. in the loggers). A quick fix would be to cover the SLURM local rank when we set rank_zero_only.rank, but that will break when another cluster sets this differently. We could create a separate dataclass for these settings and then populate the cluster environment based on that?
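
Something along these lines, purely as an illustration (the names below are hypothetical, not an existing Lightning API):

import os
from dataclasses import dataclass

# Hypothetical container for rank information detected at launch time;
# each cluster backend (SLURM, torchelastic, ...) would populate it once
# and the cluster environment / rank-zero utilities would read from it.
@dataclass
class RankInfo:
    global_rank: int = 0
    local_rank: int = 0
    node_rank: int = 0
    world_size: int = 1

def rank_info_from_slurm() -> RankInfo:
    return RankInfo(
        global_rank=int(os.environ["SLURM_PROCID"]),
        local_rank=int(os.environ["SLURM_LOCALID"]),
        node_rank=int(os.environ["SLURM_NODEID"]),
        world_size=int(os.environ["SLURM_NTASKS"]),
    )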

@Queuecumber
Contributor Author

@awaelchli

In your post you mention wandb creating multiple runs.
I recently fixed this: #6380
Does your version of Lightning include this fix?

I'll check on that later today; it may be related. I updated recently (and I also reproduced this with the Comet logger). I'll try pulling the latest and check without my patch.

no I think what you are experiencing is really just because it's not set sooner. I wish I had slurm but I can't test it :(

I was checking it in the first call to the experiment property of the Wandb logger and it was still 0 at that point, which is pretty late. As far as I could tell, there's nothing that reads the env variable set by SLURM and assigns it.

@ananthsub

but that will break when another cluster sets this differently

Exactly, which is why I think it makes sense to encapsulate all SLURM functionality in a SLURM class, and you can make another one for another cluster; I guess SLURMEnvironment would be that class.

How do we feel about the local rank issue, though? Am I interpreting it correctly that global 0, not local 0, should be doing the logging and stuff?

@ananthsub
Contributor

@ananthsub
Contributor

ananthsub commented Apr 2, 2021

that the semantics of LOCAL_RANK are actually supposed to be a global rank (in which case there should be some refactoring around the name of that variable but probably outside the scope of a PR related to this issue)

yes the naming can be improved. in DDP we set this to be the global rank: https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/plugins/training_type/ddp.py#L173-L174

@Queuecumber
Contributor Author

that the semantics of LOCAL_RANK are actually supposed to be a global rank (in which case there should be some refactoring around the name of that variable but probably outside the scope of a PR related to this issue)

yes the naming can be improved. in DDP we set this to be the global rank:

https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/plugins/training_type/ddp.py#L173-L174

OK, it sounds like it's maybe a holdover from when the DDP spawn stuff was the only supported method of parallelism, in which case local 0 and global 0 are the same.

What did you have in mind for the quick fix? I'm happy to try to take care of it

@ananthsub
Contributor

#6802 would be the quick fix - does this resolve the issues you're seeing?

@Queuecumber
Contributor Author

Off the top of my head it looks right, but why not just let the environment classes set that field directly?

@ananthsub
Contributor

The environment classes could be initialized later than when the rank_zero_only module is used

@Queuecumber
Contributor Author

Might make sense to make sure that's one of the first things that happens

@ananthsub
Contributor

@awaelchli what do you think about adding a global_rank property to the cluster environment interface and the training type plugin interface? Currently global rank is only on the parallel plugin, but for single device it's trivial.

This can be optional so that the Lightning/spawn environment doesn't need to handle it until the child processes are spawned.

For torchelastic and SLURM, the global rank is available from the env vars. Then the cluster environment can be the source of truth, and we can go from cluster environment => training type plugin => accelerator => accelerator connector => rank_zero_only.rank setter in the Trainer init itself.
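
A sketch of what that interface addition could look like (hypothetical class and method names, not merged code):

import os

class ClusterEnvironmentSketch:  # sketch of the proposed interface addition
    def global_rank(self) -> int:
        raise NotImplementedError

class SlurmEnvironmentSketch(ClusterEnvironmentSketch):
    def global_rank(self) -> int:
        return int(os.environ["SLURM_PROCID"])

class TorchElasticEnvironmentSketch(ClusterEnvironmentSketch):
    def global_rank(self) -> int:
        # torchelastic exports RANK for each worker
        return int(os.environ["RANK"])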

@awaelchli
Contributor

It's possible, but some plugins will need to change the global rank, and so the cluster env will no longer be the source of truth.

Example: ddp2 sets global_rank = node_rank
Example 2: Horovod sets it to hvd.rank()
Example 3: TPU sets it to xm.get_ordinal()

So this would require a setter too.
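
Continuing the sketch above, that would mean something like a writable rank on the environment (again illustrative only):

class ClusterEnvironmentSketch:
    def __init__(self) -> None:
        self._global_rank = 0

    def set_global_rank(self, rank: int) -> None:
        # e.g. ddp2 passes the node rank, Horovod hvd.rank(), TPU xm.get_ordinal()
        self._global_rank = rank

    def global_rank(self) -> int:
        return self._global_rank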

@Queuecumber
Contributor Author

Queuecumber commented Apr 5, 2021

Shouldn't those then have corresponding environment classes that get initialized early? That would ensure nothing uses the wrong rank if these accelerators get initialized later
