
LOCAL_RANK not being set in slurm #6797

Closed
Queuecumber opened this issue Apr 1, 2021 · 22 comments · Fixed by #6802
Labels
bug Something isn't working environment: slurm help wanted Open to be worked on priority: 0 High priority task

Comments

@Queuecumber
Contributor

🐛 Bug

A lot of the PTL tooling around multiprocessing depends on a specific environment variable, LOCAL_RANK, being set correctly. It seems that when running in SLURM this isn't set, causing it to return the default of 0 for all processes, which makes every process do things that should only be done on rank 0, like logging.

Also, I'm a little unclear about the name of that variable: if I have multiple nodes, only the global rank 0, not the local rank 0, should be logging, saving checkpoints, etc.
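
For context, the rank-zero utilities read the rank from the environment roughly like this (a simplified sketch of the pattern, not the exact Lightning source):

import os
from functools import wraps

# Simplified sketch: the rank is taken from LOCAL_RANK and falls back to 0
# when that variable is absent. SLURM exports SLURM_PROCID / SLURM_LOCALID
# instead, so every process ends up with rank == 0 and logs/checkpoints.
def rank_zero_only(fn):
    @wraps(fn)
    def wrapped_fn(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
    return wrapped_fn

rank_zero_only.rank = int(os.environ.get("LOCAL_RANK", 0))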

To Reproduce

Run in SLURM (can't really do it with Colab). A good way to see it is to use the Wandb logger: each process makes a new run in the Wandb UI, which means that @rank_zero_experiment didn't work properly. You can confirm this by printing LOCAL_RANK, which is defaulted to 0 if unset; it will always give back 0.
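
A minimal way to confirm it, assuming a multi-task SLURM allocation (the script name here is just an example):

# check_rank.py - print the rank-related variables from every task
import os

print(
    "SLURM_PROCID:", os.environ.get("SLURM_PROCID"),
    "| SLURM_LOCALID:", os.environ.get("SLURM_LOCALID"),
    "| LOCAL_RANK:", os.environ.get("LOCAL_RANK", "<unset>"),
)

Running it with srun python check_rank.py shows SLURM_PROCID changing per task while LOCAL_RANK stays unset on all of them.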

Expected behavior

LOCAL_RANK is set correctly or the rest of the tooling is aware of the global rank of the process

Environment

Will update if it's really necessary

@Queuecumber Queuecumber added bug Something isn't working help wanted Open to be worked on labels Apr 1, 2021
@Queuecumber
Contributor Author

Adding the following to the SLURMEnvironment constructor does seem to work:

import os
from pytorch_lightning.utilities import rank_zero_only
os.environ["LOCAL_RANK"] = os.environ["SLURM_PROCID"]
rank_zero_only.rank = int(os.environ["SLURM_PROCID"])

but I would like to get some feedback from the developers before proposing this as a solution.

@ananthsub
Contributor

@Queuecumber
Contributor Author

Queuecumber commented Apr 1, 2021

That appears to be correct. In __init__ of SLURMEnvironment:

def __init__(self):
        super().__init__()
        os.environ["LOCAL_RANK"] = os.environ["SLURM_PROCID"]
        rank_zero_only.rank = int(os.environ["SLURM_PROCID"])
        print(f"SLURM LOCALID: {os.environ['SLURM_LOCALID']}")

prints SLURM LOCALID: 0 on rank 0, SLURM LOCALID: 3 on rank 3, etc.

@Queuecumber
Contributor Author

I think this is a fairly straightforward fix, and it makes sense to delegate this responsibility to the SLURMConnector. I just want to double check that:

  1. __init__ is the right place to do it
  2. that there isn't a better way to get rank_zero_only.rank set correctly (this follows a precedent set by the GPUAccelerator)
  3. that the semantics of LOCAL_RANK are actually supposed to be a global rank (in which case there should be some refactoring around the name of that variable but probably outside the scope of a PR related to this issue)

@ananthsub
Contributor

@awaelchli what do you think about setting the rank info on the cluster environment / training type plugin during the Trainer init? Currently we wait for setup_environment in the training plugin to be called, but if someone has launched with SLURM or torchelastic, this information is already available from the environment variables. Only in the case of spawn do we not know this yet, but we can handle that accordingly.

@awaelchli
Contributor

You mean setting rank_zero_only.rank, right? If the LOCAL_RANK variable is set, it will already default to that.
I would avoid the following line in the above snippet:
os.environ["LOCAL_RANK"] = os.environ["SLURM_PROCID"]
In my opinion, Lightning should not write environment variables, only read them, unless there is a very good reason.
If the user needs the local rank, they can access it through trainer.local_rank, and the cluster environment will conveniently translate the SLURM local ID to that.
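
In other words, the translation can stay read-only inside the cluster environment. A rough sketch of that pattern (paraphrased, not the exact SLURMEnvironment source):

import os

class SlurmRankReader:  # illustrative stand-in for the cluster environment role
    def local_rank(self) -> int:
        # read SLURM's own variable instead of exporting LOCAL_RANK
        return int(os.environ["SLURM_LOCALID"])

    def node_rank(self) -> int:
        return int(os.environ["SLURM_NODEID"])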

@awaelchli
Contributor

We can think of a way to set rank_zero_only.rank earlier so that the rank_zero_only etc. decorators work as soon as possible.

@Queuecumber
Contributor Author

We can think of a way to set rank_zero_only.rank earlier so that the rank_zero_only etc. decorators work as soon as possible.

It's not that it needs to be set sooner; it's that it's not set correctly at all. LOCAL_RANK isn't set in a SLURM environment, so if you take a look at the line you linked, it uses a default value of 0 if the environment variable isn't set, which means that every process thinks it's the rank 0 process and writes logs, saves checkpoints, etc.

As far as I can tell, there's no machinery for getting the SLURM-assigned rank into rank_zero_only.rank, so that's what we would need to fix this.

Also, I'm a little concerned about that variable being called LOCAL_RANK, which seems to imply that rank 0 on every machine should be doing logging, checkpointing, and whatever else. It should only be the global rank zero that does those things.

@awaelchli
Contributor

Yes, it defaults to 0, but that's the only assumption you can make if you don't know where else to get the rank information from.
The idea is that later on it gets overwritten, for example here in DDP:

https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/plugins/training_type/ddp.py#L174

It's not that it needs to be set sooner it's that it's not set correctly at all.

no I think what you are experiencing is really just because it's not set sooner. I wish I had slurm but I can't test it :(

@awaelchli
Contributor

In your post you mention wandb creating multiple runs.
I recently fixed this: #6380
Does your version of Lightning include this fix?

@ananthsub
Contributor

@awaelchli I'm not sure if just setting it earlier will work for all use cases, as the rank-zero utilities are used outside of the Trainer (e.g. in the loggers). A quick fix would be to cover the SLURM local rank when we set rank_zero_only.rank, but that will break when another cluster sets this differently. We could create a separate dataclass for these settings and then populate the cluster environment based on that?
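
Something along these lines, purely as an illustration (the names below are hypothetical, not an existing Lightning API):

import os
from dataclasses import dataclass

# Hypothetical container for rank information detected at launch time;
# each cluster backend (SLURM, torchelastic, ...) would populate it once
# and the cluster environment / rank-zero utilities would read from it.
@dataclass
class RankInfo:
    global_rank: int = 0
    local_rank: int = 0
    node_rank: int = 0
    world_size: int = 1

def rank_info_from_slurm() -> RankInfo:
    return RankInfo(
        global_rank=int(os.environ["SLURM_PROCID"]),
        local_rank=int(os.environ["SLURM_LOCALID"]),
        node_rank=int(os.environ["SLURM_NODEID"]),
        world_size=int(os.environ["SLURM_NTASKS"]),
    )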

@Queuecumber
Contributor Author

@awaelchli

In your post you mention wandb creating multiple runs.
I recently fixed this: #6380
Does your version of Lightning include this fix?

I'll check on that later today; it may be related. I updated recently (and I also reproduced this with the Comet logger). I'll try pulling the latest and check without my patch.

no I think what you are experiencing is really just because it's not set sooner. I wish I had slurm but I can't test it :(

I was checking it in the first call to the experiment property of the Wandb logger and it was still 0 at that point, which is pretty late. As far as I could tell, there's nothing that reads the env variable set by SLURM and assigns it.

@ananthsub

but that will break when another cluster sets this differently

Exactly, which is why I think it makes sense to encapsulate all SLURM functionality in a SLURM class, and you can make another one for another cluster; I guess SLURMEnvironment would be that class.

How do we feel about the local rank issue, though? Am I interpreting it correctly that global 0, not local 0, should be doing the logging and stuff?

@ananthsub
Contributor

@ananthsub
Contributor

ananthsub commented Apr 2, 2021

that the semantics of LOCAL_RANK are actually supposed to be a global rank (in which case there should be some refactoring around the name of that variable but probably outside the scope of a PR related to this issue)

yes the naming can be improved. in DDP we set this to be the global rank: https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/plugins/training_type/ddp.py#L173-L174

@Queuecumber
Contributor Author

that the semantics of LOCAL_RANK are actually supposed to be a global rank (in which case there should be some refactoring around the name of that variable but probably outside the scope of a PR related to this issue)

yes the naming can be improved. in DDP we set this to be the global rank:

https://github.com/PyTorchLightning/pytorch-lightning/blob/bb9ace43334ad50e3758d9cff08ad34216c7d4da/pytorch_lightning/plugins/training_type/ddp.py#L173-L174

OK, it sounds like it's maybe a holdover from when the DDP spawn stuff was the only supported method of parallelism, in which case local 0 and global 0 are the same.

What did you have in mind for the quick fix? I'm happy to try to take care of it

@ananthsub
Contributor

#6802 would be the quick fix - does this resolve the issues you're seeing?

@Queuecumber
Contributor Author

Off the top of my head it looks right, but why not just let the environment classes set that field directly?

@ananthsub
Contributor

The environment classes could be initialized later than when the rank_zero_only module is used

@Queuecumber
Contributor Author

Might make sense to make sure that's one of the first things that happens

@ananthsub
Contributor

@awaelchli what do you think about adding a global_rank property to the cluster environment interface and the training type plugin interface? Currently global rank is only on the parallel plugin, but for single device it's trivial.

This can be optional so that the Lightning/spawn environment doesn't need to handle it until the child processes are spawned.

For torchelastic and SLURM, the global rank is available from the env vars. Then the cluster environment can be the source of truth, and we can go from cluster environment => training type plugin => accelerator => accelerator connector => rank_zero_only.rank setter in the Trainer init itself.
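
A sketch of what that interface addition could look like (hypothetical class and method names, not merged code):

import os

class ClusterEnvironmentSketch:  # sketch of the proposed interface addition
    def global_rank(self) -> int:
        raise NotImplementedError

class SlurmEnvironmentSketch(ClusterEnvironmentSketch):
    def global_rank(self) -> int:
        return int(os.environ["SLURM_PROCID"])

class TorchElasticEnvironmentSketch(ClusterEnvironmentSketch):
    def global_rank(self) -> int:
        # torchelastic exports RANK for each worker
        return int(os.environ["RANK"])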

@awaelchli
Contributor

It's possible, but some plugins will need to change the global rank, and so the cluster env will no longer be the source of truth.

Example: ddp2 sets global_rank = node_rank
Example 2: Horovod sets it to hvd.rank()
Example 3: TPU sets it to xm.get_ordinal()

So this would require a setter too.
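
Continuing the sketch above, that would mean something like a writable rank on the environment (again illustrative only):

class ClusterEnvironmentSketch:
    def __init__(self) -> None:
        self._global_rank = 0

    def set_global_rank(self, rank: int) -> None:
        # e.g. ddp2 passes the node rank, Horovod hvd.rank(), TPU xm.get_ordinal()
        self._global_rank = rank

    def global_rank(self) -> int:
        return self._global_rank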

@Queuecumber
Contributor Author

Queuecumber commented Apr 5, 2021

Shouldn't those then have corresponding environment classes that get initialized early? That would ensure nothing uses the wrong rank if these accelerators get initialized later
