fix logging on rank 0 only #2425

awaelchli · 2020-06-30T01:06:41Z

What does this PR do?

Fixes #2131 and the issue with WandB logger discussed on slack.

On ddp loggers work like this:

logger creates experiment and files on trainer > 0
logger does nothing on rank > 0
the user code can access self.logger.experiment.some_log_method, but on rank > 0 it is a no-op.

Before submitting

Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?
Did you verify new and existing tests pass locally with your changes?
If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2020-06-30T01:27:11Z

Codecov Report

Merging #2425 into master will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #2425   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          69      69           
  Lines        5505    5527   +22     
======================================
+ Hits         4866    4888   +22     
  Misses        639     639

williamFalcon · 2020-06-30T02:22:22Z

there is no way to know... this is why we init the loggers in init and then set the rank in train.
then the loggers that are not rank 0 do nothin

awaelchli · 2020-06-30T04:49:45Z

In my current approach I added a decorator on top of Logger.experiment that returns DummyExperiment on rank > 0.
The problem that we have is logger.name and logger.version are None on rank > 0 for some loggers that have these properties dependent on the experiment, and I think that's the reason why the horovod tests fail. these tests have some code that explicitly construct paths outside the Trainer/Logger.

pep8speaks · 2020-06-30T19:15:39Z

Hello @awaelchli! Thanks for updating this PR.

In the file tests/loggers/test_all.py:

Line 151:5: E265 block comment should start with '# '

Comment last updated at 2020-06-30 21:47:43 UTC

williamFalcon · 2020-06-30T20:51:43Z

@tgaddair

trainer_options = {'default_root_dir': '/tmp/pytest-of-runner/pytest-0/test_horovod_cpu_implicit0', 'deterministic': True, 'gradient_clip_val': 1.0, 'limit_train_batches': 0.4, ...}
on_gpu = False

    def _run_horovod(trainer_options, on_gpu=False):
        """Execute the training script across multiple workers in parallel."""
        tutils.reset_seed()
        cmdline = [
            'horovodrun',
            '-np', '2',
            sys.executable, TEST_SCRIPT,
            '--trainer-options', shlex.quote(json.dumps(trainer_options))
        ]
        if on_gpu:
            cmdline += ['--on-gpu']
        exit_code = subprocess.call(' '.join(cmdline), shell=True, env=os.environ.copy())
>       assert exit_code == 0
E       assert 1 == 0
E         +1
E         -0

tests/models/test_horovod.py:53: AssertionError

any thoughts?

tgaddair · 2020-06-30T20:59:47Z

@williamFalcon looks like this error is coming from rank 1 (from the stdout capture in the CI logs):

Tue Jun 30 20:51:58 2020[1]<stderr>:Traceback (most recent call last):
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 67, in <module>
Tue Jun 30 20:51:58 2020[1]<stderr>:    run_test_from_config(json.loads(args.trainer_options))
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 54, in run_test_from_config
Tue Jun 30 20:51:58 2020[1]<stderr>:    run_model_test(trainer_options, model, on_gpu=args.on_gpu, version=0, with_hpc=False)
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/develop_pipelines.py", line 61, in run_model_test
Tue Jun 30 20:51:58 2020[1]<stderr>:    pretrained_model = load_model_from_checkpoint(logger, trainer.checkpoint_callback.dirpath)
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/develop_utils.py", line 61, in load_model_from_checkpoint
Tue Jun 30 20:51:58 2020[1]<stderr>:    hparams_path = os.path.join(path_expt_dir, TensorBoardLogger.NAME_HPARAMS_FILE)
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/opt/hostedtoolcache/Python/3.6.10/x64/lib/python3.6/posixpath.py", line 80, in join
Tue Jun 30 20:51:58 2020[1]<stderr>:    a = os.fspath(a)
Tue Jun 30 20:51:58 2020[1]<stderr>:TypeError: expected str, bytes or os.PathLike object, not NoneType

I'll need to dig into the code of this PR a bit to see what's the cause.

williamFalcon · 2020-06-30T21:00:59Z

ok awesome. maybe there's a property that we should be assigning on all ranks but now only assign on 0?

awaelchli · 2020-06-30T21:01:27Z

@williamFalcon if you look further down in the errors you will find this


Tue Jun 30 20:49:23 2020[0]<stderr>:  warnings.warn(*args, **kwargs)
--
1003 | Tue Jun 30 20:49:24 2020[1]<stderr>:Traceback (most recent call last):
1004 | Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/models/data/horovod/train_default_model.py", line 67, in <module>
1005 | Tue Jun 30 20:49:24 2020[1]<stderr>:    run_test_from_config(json.loads(args.trainer_options))
1006 | Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/models/data/horovod/train_default_model.py", line 54, in run_test_from_config
1007 | Tue Jun 30 20:49:24 2020[1]<stderr>:    run_model_test(trainer_options, model, on_gpu=args.on_gpu, version=0, with_hpc=False)
1008 | Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/base/develop_pipelines.py", line 61, in run_model_test
1009 | Tue Jun 30 20:49:24 2020[1]<stderr>:    pretrained_model = load_model_from_checkpoint(logger, trainer.checkpoint_callback.dirpath)
1010 | Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/base/develop_utils.py", line 61, in load_model_from_checkpoint
1011 | Tue Jun 30 20:49:24 2020[1]<stderr>:    hparams_path = os.path.join(path_expt_dir, TensorBoardLogger.NAME_HPARAMS_FILE)
1012 | Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/opt/conda/lib/python3.7/posixpath.py", line 80, in join
1013 | Tue Jun 30 20:49:24 2020[1]<stderr>:    a = os.fspath(a)
1014 | Tue Jun 30 20:49:24 2020[1]<stderr>:TypeError: expected str, bytes or os.PathLike object, not NoneType

It's trying to os.path.join(..) path parts that are None.
on rank > 0 the logger dir is not defined because logger.version / logger.name are None, I think that is the core of the problem.

williamFalcon · 2020-06-30T21:24:10Z

ummmm, i'm now not sure if this is our bug haha since only the global_rank=0 should have anything about disk writes.
@tgaddair sounds like horovod is assuming every process does?

Ok, let's put a TODO to fix these tests @tgaddair @awaelchli in a different PR? right now this is blocking our minor release. However, we need to take a look at this ASAP.

In the meantime merging to master to test the overall changes to loggers before releasing.

Thank you both for looking into this! excited to finally add long-awaited tests to move to a stable v 1.0.0

awaelchli · 2020-06-30T21:45:44Z

@williamFalcon I think you need to move the pytest.skip to test_horovod_cpu and test_horovod_cpu_implicit

tgaddair · 2020-06-30T21:54:22Z

@williamFalcon Horovod works differently than DDP, in that everything is run in parallel with Horovod (as opposed to only certain portions of the code). So in this case, likely something in this test that is only being executed on rank 0 for DDP is being executed on rank 1 for Horovod. Feel free to assign a ticket to me, and I can dig into this test further.

awaelchli added the bug Something isn't working label Jun 30, 2020

mergify bot requested a review from a team June 30, 2020 01:06

awaelchli linked an issue Jun 30, 2020 that may be closed by this pull request

wandLogger().name is None in DDP mode #2131

Closed

awaelchli changed the title ~~remove loggers on rank > 0~~ fix logging on rank 0 only Jun 30, 2020

awaelchli added the priority: 0 High priority task label Jun 30, 2020

awaelchli marked this pull request as ready for review June 30, 2020 14:42

williamFalcon force-pushed the bugfix/rank-zero-loggers branch from 2fa87f6 to 34f90c7 Compare June 30, 2020 20:18

awaelchli and others added 15 commits June 30, 2020 17:22

fix and test for ddp block logging rank > 0

b918e3e

rename

c39e4e5

use the dummy logger

825d179

dummy logger test

1f4a615

set the logger in model

356b0f9

decorator for rank zero experiment

023b78e

simplify check

216c994

simplify

87008ec

fix problem with None in checkpoint path

fc507d8

revert configure logger

424ecaf

unused import

542a834

offline

6a433d8

try rank 0 decorator in checkpoint

8509260

try fix test

d9a5480

imgs

c80403f

williamFalcon added 5 commits June 30, 2020 17:22

add asserts to make sure log zero only saves checkpoints

4dc63e4

add asserts to make sure log zero only saves checkpoints

97cd69b

add asserts to make sure log zero only saves checkpoints

3ce1e25

add asserts to make sure log zero only saves checkpoints

0a9529e

add asserts to make sure log zero only saves checkpoints

b08cdf6

williamFalcon force-pushed the bugfix/rank-zero-loggers branch from 7d7088b to b08cdf6 Compare June 30, 2020 21:22

fix tpu tests

6df3d22

fix tpu tests

7272ae6

awaelchli mentioned this pull request Jun 30, 2020

Fix horovod tests that try to access filepath on global rank > 0 #2436

Closed

williamFalcon merged commit 145670f into master Jun 30, 2020

Borda added this to the 0.8.x milestone Jun 30, 2020

williamFalcon deleted the bugfix/rank-zero-loggers branch July 5, 2020 00:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix logging on rank 0 only #2425

fix logging on rank 0 only #2425

awaelchli commented Jun 30, 2020 •

edited

Loading

codecov bot commented Jun 30, 2020 •

edited

Loading

williamFalcon commented Jun 30, 2020

awaelchli commented Jun 30, 2020 •

edited

Loading

pep8speaks commented Jun 30, 2020 •

edited

Loading

williamFalcon commented Jun 30, 2020

tgaddair commented Jun 30, 2020

williamFalcon commented Jun 30, 2020

awaelchli commented Jun 30, 2020

williamFalcon commented Jun 30, 2020 •

edited

Loading

awaelchli commented Jun 30, 2020 •

edited

Loading

tgaddair commented Jun 30, 2020

fix logging on rank 0 only #2425

fix logging on rank 0 only #2425

Conversation

awaelchli commented Jun 30, 2020 • edited Loading

What does this PR do?

Before submitting

PR review

Did you have fun?

codecov bot commented Jun 30, 2020 • edited Loading

Codecov Report

williamFalcon commented Jun 30, 2020

awaelchli commented Jun 30, 2020 • edited Loading

pep8speaks commented Jun 30, 2020 • edited Loading

Comment last updated at 2020-06-30 21:47:43 UTC

williamFalcon commented Jun 30, 2020

tgaddair commented Jun 30, 2020

williamFalcon commented Jun 30, 2020

awaelchli commented Jun 30, 2020

williamFalcon commented Jun 30, 2020 • edited Loading

awaelchli commented Jun 30, 2020 • edited Loading

tgaddair commented Jun 30, 2020

awaelchli commented Jun 30, 2020 •

edited

Loading

codecov bot commented Jun 30, 2020 •

edited

Loading

awaelchli commented Jun 30, 2020 •

edited

Loading

pep8speaks commented Jun 30, 2020 •

edited

Loading

williamFalcon commented Jun 30, 2020 •

edited

Loading

awaelchli commented Jun 30, 2020 •

edited

Loading