fix logging on rank 0 only #2425

Merged
merged 22 commits into from
Jun 30, 2020

Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Jun 30, 2020

What does this PR do?

Fixes #2131 and the issue with WandB logger discussed on slack.

On DDP, loggers work like this (see the sketch below):

  • the logger creates the experiment and its files on rank 0 only
  • the logger does nothing on rank > 0
  • user code can still access self.logger.experiment.some_log_method on every rank, but on rank > 0 the call is a no-op.
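
A minimal sketch of what the last bullet means in practice; the helper name pick_experiment and the DummyExperiment body here are illustrative, not necessarily the exact code added in this PR:

class DummyExperiment:
    """Stand-in experiment: every attribute resolves to a no-op callable."""

    def nop(self, *args, **kwargs):
        pass

    def __getattr__(self, _name):
        return self.nop


def pick_experiment(real_experiment, global_rank):
    """Rank 0 gets the real experiment; every other rank gets the no-op dummy."""
    return real_experiment if global_rank == 0 else DummyExperiment()


# User code (e.g. inside a LightningModule) can then call
#   self.logger.experiment.add_histogram(...)
# unconditionally on every rank; on rank > 0 the call does nothing.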

Before submitting

  • Was this discussed/approved via a GitHub issue? (not required for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli added the bug Something isn't working label Jun 30, 2020
@mergify mergify bot requested a review from a team June 30, 2020 01:06
@awaelchli awaelchli linked an issue Jun 30, 2020 that may be closed by this pull request
@awaelchli awaelchli changed the title remove loggers on rank > 0 fix logging on rank 0 only Jun 30, 2020
@codecov

codecov bot commented Jun 30, 2020

Codecov Report

Merging #2425 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #2425   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          69      69           
  Lines        5505    5527   +22     
======================================
+ Hits         4866    4888   +22     
  Misses        639     639           

@williamFalcon
Contributor

There is no way to know... that's why we initialize the loggers in init and then set the rank in train.
Then the loggers that are not rank 0 do nothing.

@awaelchli
Contributor Author

awaelchli commented Jun 30, 2020

In my current approach I added a decorator on top of Logger.experiment that returns a DummyExperiment on rank > 0.
The problem we have is that logger.name and logger.version are None on rank > 0 for loggers whose name/version properties depend on the experiment, and I think that's why the Horovod tests fail: those tests have code that explicitly constructs paths outside the Trainer/Logger.
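
A rough, self-contained sketch of that approach (the names rank_zero_experiment and _NoOpExperiment are illustrative; the actual code in this PR may differ):

from functools import wraps


class _NoOpExperiment:
    """Stand-in for DummyExperiment: every method call silently does nothing."""

    def __getattr__(self, _name):
        return lambda *args, **kwargs: None


def rank_zero_experiment(get_experiment):
    """Wrap a logger's `experiment` getter so only rank 0 creates real files."""

    @wraps(get_experiment)
    def wrapper(self):
        if getattr(self, "rank", 0) == 0:
            return get_experiment(self)  # rank 0: build/return the real experiment
        return _NoOpExperiment()         # rank > 0: no files, no network calls
    return wrapper

The downside described above follows from this: any logger whose name/version is derived from the real experiment returns None on rank > 0, which is what path-building code outside the Trainer/Logger trips over.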

@awaelchli awaelchli added the priority: 0 High priority task label Jun 30, 2020
@awaelchli awaelchli marked this pull request as ready for review June 30, 2020 14:42
@pep8speaks

pep8speaks commented Jun 30, 2020

Hello @awaelchli! Thanks for updating this PR.

Line 151:5: E265 block comment should start with '# '

Comment last updated at 2020-06-30 21:47:43 UTC

@williamFalcon williamFalcon force-pushed the bugfix/rank-zero-loggers branch from 2fa87f6 to 34f90c7 Compare June 30, 2020 20:18
@williamFalcon
Contributor

@tgaddair

trainer_options = {'default_root_dir': '/tmp/pytest-of-runner/pytest-0/test_horovod_cpu_implicit0', 'deterministic': True, 'gradient_clip_val': 1.0, 'limit_train_batches': 0.4, ...}
on_gpu = False

    def _run_horovod(trainer_options, on_gpu=False):
        """Execute the training script across multiple workers in parallel."""
        tutils.reset_seed()
        cmdline = [
            'horovodrun',
            '-np', '2',
            sys.executable, TEST_SCRIPT,
            '--trainer-options', shlex.quote(json.dumps(trainer_options))
        ]
        if on_gpu:
            cmdline += ['--on-gpu']
        exit_code = subprocess.call(' '.join(cmdline), shell=True, env=os.environ.copy())
>       assert exit_code == 0
E       assert 1 == 0
E         +1
E         -0

tests/models/test_horovod.py:53: AssertionError

any thoughts?

@tgaddair
Contributor

@williamFalcon looks like this error is coming from rank 1 (from the stdout capture in the CI logs):

Tue Jun 30 20:51:58 2020[1]<stderr>:Traceback (most recent call last):
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 67, in <module>
Tue Jun 30 20:51:58 2020[1]<stderr>:    run_test_from_config(json.loads(args.trainer_options))
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 54, in run_test_from_config
Tue Jun 30 20:51:58 2020[1]<stderr>:    run_model_test(trainer_options, model, on_gpu=args.on_gpu, version=0, with_hpc=False)
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/develop_pipelines.py", line 61, in run_model_test
Tue Jun 30 20:51:58 2020[1]<stderr>:    pretrained_model = load_model_from_checkpoint(logger, trainer.checkpoint_callback.dirpath)
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/develop_utils.py", line 61, in load_model_from_checkpoint
Tue Jun 30 20:51:58 2020[1]<stderr>:    hparams_path = os.path.join(path_expt_dir, TensorBoardLogger.NAME_HPARAMS_FILE)
Tue Jun 30 20:51:58 2020[1]<stderr>:  File "/opt/hostedtoolcache/Python/3.6.10/x64/lib/python3.6/posixpath.py", line 80, in join
Tue Jun 30 20:51:58 2020[1]<stderr>:    a = os.fspath(a)
Tue Jun 30 20:51:58 2020[1]<stderr>:TypeError: expected str, bytes or os.PathLike object, not NoneType

I'll need to dig into the code of this PR a bit to see what's the cause.

@williamFalcon
Contributor

Ok, awesome. Maybe there's a property that we should be assigning on all ranks but currently only assign on rank 0?

@awaelchli
Contributor Author

@williamFalcon if you look further down in the errors you will find this


Tue Jun 30 20:49:23 2020[0]<stderr>:  warnings.warn(*args, **kwargs)
Tue Jun 30 20:49:24 2020[1]<stderr>:Traceback (most recent call last):
Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/models/data/horovod/train_default_model.py", line 67, in <module>
Tue Jun 30 20:49:24 2020[1]<stderr>:    run_test_from_config(json.loads(args.trainer_options))
Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/models/data/horovod/train_default_model.py", line 54, in run_test_from_config
Tue Jun 30 20:49:24 2020[1]<stderr>:    run_model_test(trainer_options, model, on_gpu=args.on_gpu, version=0, with_hpc=False)
Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/base/develop_pipelines.py", line 61, in run_model_test
Tue Jun 30 20:49:24 2020[1]<stderr>:    pretrained_model = load_model_from_checkpoint(logger, trainer.checkpoint_callback.dirpath)
Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/drone/src/tests/base/develop_utils.py", line 61, in load_model_from_checkpoint
Tue Jun 30 20:49:24 2020[1]<stderr>:    hparams_path = os.path.join(path_expt_dir, TensorBoardLogger.NAME_HPARAMS_FILE)
Tue Jun 30 20:49:24 2020[1]<stderr>:  File "/opt/conda/lib/python3.7/posixpath.py", line 80, in join
Tue Jun 30 20:49:24 2020[1]<stderr>:    a = os.fspath(a)
Tue Jun 30 20:49:24 2020[1]<stderr>:TypeError: expected str, bytes or os.PathLike object, not NoneType


It's trying to os.path.join(...) path parts that are None:
on rank > 0 the logger dir is not defined because logger.version / logger.name are None. I think that is the core of the problem (minimal reproduction below).
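
A minimal reproduction of that failure (the directory and file names are illustrative; the real tests join the experiment dir with TensorBoardLogger.NAME_HPARAMS_FILE):

import os


def hparams_path(experiment_dir):
    # the tests build a path like this from logger.name / logger.version
    return os.path.join(experiment_dir, "hparams.yaml")


print(hparams_path("lightning_logs/default/version_0"))  # rank 0: fine

try:
    hparams_path(None)  # rank > 0: the experiment dir was never resolved
except TypeError as err:
    # "expected str, bytes or os.PathLike object, not NoneType" -- as in the traceback
    print(err)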

@williamFalcon williamFalcon force-pushed the bugfix/rank-zero-loggers branch from 7d7088b to b08cdf6 Compare June 30, 2020 21:22
@williamFalcon
Contributor

williamFalcon commented Jun 30, 2020

Ummm, I'm now not sure this is our bug, since only global_rank=0 should be doing anything with disk writes.
@tgaddair it sounds like Horovod assumes every process does?

Ok, let's put a TODO to fix these tests in a different PR, @tgaddair @awaelchli? Right now this is blocking our minor release, but we need to take a look at it ASAP.

In the meantime, merging to master so we can test the overall changes to loggers before releasing.

Thank you both for looking into this! Excited to finally add the long-awaited tests to move toward a stable v1.0.0.

@awaelchli
Contributor Author

awaelchli commented Jun 30, 2020

@williamFalcon I think you need to move the pytest.skip to test_horovod_cpu and test_horovod_cpu_implicit
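
For reference, a hedged sketch of that suggestion (test names taken from the comment above; the skip reason is illustrative):

import pytest


@pytest.mark.skip(reason="TODO: logger name/version are None on rank > 0 under Horovod")
def test_horovod_cpu():
    ...


@pytest.mark.skip(reason="TODO: logger name/version are None on rank > 0 under Horovod")
def test_horovod_cpu_implicit():
    ...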

@tgaddair
Contributor

@williamFalcon Horovod works differently than DDP, in that everything is run in parallel with Horovod (as opposed to only certain portions of the code). So in this case, likely something in this test that is only being executed on rank 0 for DDP is being executed on rank 1 for Horovod. Feel free to assign a ticket to me, and I can dig into this test further.
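
A small sketch of that difference in practice (assumes Horovod's PyTorch bindings are installed; the guarded bodies are placeholders):

import horovod.torch as hvd

hvd.init()

# Under Horovod every process executes the entire script, so anything that
# relies on rank-0-only side effects (log directories, checkpoint paths built
# from logger.name / logger.version) has to be guarded explicitly.
if hvd.rank() == 0:
    pass  # safe to create/read logger files and experiment paths here
else:
    pass  # other ranks must not assume those paths were ever created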

@williamFalcon williamFalcon merged commit 145670f into master Jun 30, 2020
@Borda Borda added this to the 0.8.x milestone Jun 30, 2020
@williamFalcon williamFalcon deleted the bugfix/rank-zero-loggers branch July 5, 2020 00:26

Successfully merging this pull request may close these issues.

wandLogger().name is None in DDP mode