
New logger connector code #7882

Merged: 26 commits merged into master on Jun 8, 2021
Conversation

@carmocca (Contributor) commented Jun 8, 2021

What does this PR do?

Introduces the new code for #7631

Next PRs:

  • Update the loops to use this
  • Delete previous code

cc: @SkafteNicki please review our heavy use of Metric in result_new.py

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@carmocca added the feature and refactor labels Jun 8, 2021
@carmocca added this to the v1.4 milestone Jun 8, 2021
@carmocca self-assigned this Jun 8, 2021
@pep8speaks commented Jun 8, 2021

Hello @carmocca! Thanks for updating this PR.

Line 192:13: W503 line break before binary operator

Comment last updated at 2021-06-08 19:35:08 UTC

@codecov bot commented Jun 8, 2021

Codecov Report

Merging #7882 (e2c8c64) into master (b74f8ac) will decrease coverage by 6%.
The diff coverage is 56%.

@@           Coverage Diff           @@
##           master   #7882    +/-   ##
=======================================
- Coverage      92%     86%    -6%     
=======================================
  Files         202     204     +2     
  Lines       13159   13630   +471     
=======================================
- Hits        12155   11788   -367     
- Misses       1004    1842   +838     

# reset result collection for next epoch
self.trainer.result_collection.reset(metrics=True)

def teardown(self):
Contributor:

is that something we should do in the loops maybe?
just saying, no change requested

@carmocca (Contributor Author) Jun 8, 2021

The trainer will call this:

    def _post_dispatch(self):
        self.accelerator.post_dispatch(self)
        self.accelerator.teardown()
        self.logger_connector.teardown()

@awaelchli (Contributor)
very nice!

@tchaton (Contributor) left a comment

LGTM! Great job @carmocca

Evaluation metric updates
"""

def prepare_eval_loop_results(self, metrics: Mapping[str, _METRIC]) -> None:
@tchaton (Contributor) Jun 8, 2021

Could be moved to the utils section.

@carmocca (Contributor Author)
Why? It's tied to evaluation

@mergify bot added the has conflicts label Jun 8, 2021
Comment on lines +308 to +310
self.trainer.train_loop.results.cpu()
self.trainer.evaluation_loop._val_results.cpu()
self.trainer.evaluation_loop._test_results.cpu()
@carmocca (Contributor Author)
@awaelchli do you think each loop should run this in its own teardown, instead of the logger connector doing it?

Contributor:
These results are the ResultCollections, right? I think yes; if the loops own these result collections, they should handle the teardown.

@carmocca (Contributor Author)
Okay. Do you think I should make that change here, or will you do it once it's all merged?

Contributor:
Yeah we shouldn't hold this back. Added a TODO to our POC.
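A minimal sketch of what moving this into the loops could look like (hypothetical teardown hooks; the class shapes and constructor arguments below are illustrative, not the actual Lightning loop API), assuming each loop owns its result collections:

    class TrainLoop:
        def __init__(self, results):
            # the loop owns its ResultCollection instead of the logger connector
            self.results = results

        def teardown(self) -> None:
            # move any stored tensors back to CPU when the loop finishes,
            # rather than having the logger connector do it centrally
            self.results.cpu()


    class EvaluationLoop:
        def __init__(self, val_results, test_results):
            self._val_results = val_results
            self._test_results = test_results

        def teardown(self) -> None:
            self._val_results.cpu()
            self._test_results.cpu()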

elif step is None:
    # added metrics by Lightning for convenience
    scalar_metrics['epoch'] = self.trainer.current_epoch
    step = self.trainer.global_step
Contributor:
There are a lot of Codecov warnings.

@carmocca (Contributor Author)
This is expected, considering the loops haven't been integrated with these new files yet.

You can take a peek at #7631 for the full integration

@carmocca (Contributor Author)

This PR includes a few unit tests for the code in Result, but LoggerConnectorNew is completely unused.

@mergify bot added the has conflicts label Jun 8, 2021
@mergify bot removed the has conflicts label Jun 8, 2021
@ethanwharris (Member) left a comment
Nice! LGTM 😃 minor comment

@awaelchli mentioned this pull request Jun 8, 2021
@tchaton enabled auto-merge (squash) June 8, 2021 19:33
@SkafteNicki (Member) left a comment
LGTM, the integration with torchmetrics seems nice and definitely clears up some of the older logging code :]



# TODO(@carmocca): Remove `New` suffix
class LoggerConnectorNew:
Contributor:
super nit: for renamings I've always been a fan of _v2 instead of New, for the worst-case circumstance where there are 3 or more versions around at the same time

@carmocca (Contributor Author)
Thanks for the tip! Will keep it in mind for future refactors

Comment on lines +107 to +109
# added metrics by Lightning for convenience
scalar_metrics['epoch'] = self.trainer.current_epoch
step = self.trainer.global_step
Contributor:
n00b questions:

  • why does step indicate we should add epoch to scalar metrics?
  • what happens if epoch was already in scalar metrics as another value? if the user does self.log("epoch", ...) ?

@carmocca (Contributor Author)

why does step indicate we should add epoch to scalar metrics?

Didn't get this, can you rephrase?

what happens if epoch was already in scalar metrics as another value?

Well, as you see here, it gets overwritten.

It wouldn't be a bad idea to check if it's there already. Will do in a future PR
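As a rough illustration of that future check (hypothetical helper, not part of this PR), the convenience key could be added only when the user hasn't logged one themselves:

    def add_epoch_metric(scalar_metrics: dict, current_epoch: int) -> None:
        # only set the convenience 'epoch' key if the user hasn't already logged one,
        # e.g. via self.log("epoch", ...); otherwise keep the user's value
        scalar_metrics.setdefault('epoch', current_epoch)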

Comment on lines +35 to +38
class MetricSource(LightningEnum):
CALLBACK = "callback"
PBAR = "pbar"
LOG = "log"
Contributor:
are these metric sources, or are they destinations?

In the future, I can imagine we'll want to configure multiple loggers but log metrics to only some of them. How could this be extended for that use case?

@carmocca (Contributor Author)

are these metric sources, or are they destinations?

Mmm depends on how you think about it.

Do you self.log(prog_bar=True) into the progress bar metrics (destination) or are the progress bar metrics gathered from the prog_bar=True logged values (source)?

If you have an idea for a less confusing name, we can rename without issues 👍

In the future, I can imagine we'll want to configure multiple loggers but log metrics to only some of them. How could this be extended for that use case?

In that case, I think it's easier for us to filter at the logger level. _Metadata could have a field for the target logger
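A rough sketch of that idea (purely illustrative: _Metadata is the dataclass introduced in this PR, but the fields shown here are a simplified stand-in for the real one, and the target_loggers field and helper are hypothetical):

    from dataclasses import dataclass
    from typing import Optional, Sequence


    @dataclass
    class _Metadata:
        name: str
        prog_bar: bool = False
        logger: bool = True
        # hypothetical extension: restrict this value to specific logger names;
        # None means "send to all configured loggers"
        target_loggers: Optional[Sequence[str]] = None


    def should_log_to(meta: _Metadata, logger_name: str) -> bool:
        # filtering happens at the logger level, as suggested above
        if not meta.logger:
            return False
        return meta.target_loggers is None or logger_name in meta.target_loggers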

@tchaton merged commit b214442 into master Jun 8, 2021
@tchaton deleted the refactor/new-logger-connector-code branch June 8, 2021 20:20

def check_fn(v):
    if v.grad_fn is not None:
        raise MisconfigurationException(f'You returned a tensor with `grad_fn`. The extra values are {extra}')
Contributor:
Could you explain what this checks for? Is it a misconfiguration for users to return tensors with gradients enabled?

@carmocca (Contributor Author)
IIRC the previous implementation silently detached extras internally; however, I feel it's better for users to know they are passing tensors with graphs attached. This makes it easier to catch potential errors on the user side, and it's not much to ask the user to detach when necessary.

Do you think this constraint should be relaxed?

Contributor:
Not entirely sure - just noticed that quite a few of our users were relying on this (silent) behavior and will see exceptions with the latest version of the code here.

Could you share an example of what a user should do now? Is the idea they want to call detach on non-loss tensors?

@carmocca (Contributor Author)
Is the idea they want to call detach on non-loss tensors?

Exactly:

def training_step(...):
    ...
    # `something_else` is considered an "extra"
    return {'loss': loss, 'something_else': something_else.detach()}

facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Jun 16, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
## OSS
Note these issues are being solved in OSS here: https://github.com/PyTorchLightning/pytorch-lightning/pull/7994/files#

## Manual
- `speed_monitor.py` - `Result.unpack_batch_size` has been removed, moved to new implementation.
- `fully_sharded.py` - There was a refactor for plugins, so updated corresponding function to keep reduced memory usage.
- `hive_writing_classy.py`, `hive_writing_faim.py`, `hive_writing_xrayvideo.py` - Same as `speed_monitor.py`.
- [Temporary] Uncommented misconfiguration exception. See Lightning-AI/pytorch-lightning#7882 (review).
- Update `TestModel` to detach appropriately.
- Manually `detach` metrics stored in ResultStore.

## Automatic
### New commit log messages
  f7459f53 DeepSpeed Infinity Update (#7234)
  03e7bdf8 Improve `LightningModule` hook tests (#7944)
  3a0ed02b Properly handle parent modules w/ parameters in `BaseFinetuning` callback (#7931)
  ce93d8bc Handle errors due to uninitailized parameters (#7642)
  cca0e753 remove parsing comments (#7958)
  898fb56b added on_test_start() documentation (#7962)
  22d82661 Seed all workers when using DDP (#7942)
  436fc53c Improve `LightningDataModule` hook test and fix `dataloader_idx` argument (#7941)
  6b7b4047 deprecate hpc_load() and integrate it with restore() (#7955)
  20a5e09e fix myst-parser warning blocking docs ci (#7967)
  f15ea601 update chlog + legacy chpt (#7954)
  59d0c656 Add dataclass support to `apply_to_collection` (#7935)
  cdd01f32 LightningCLI support for argument links applied on instantiation (#7895)
  6856cced Remove rank_zero_only on DataModule prepare_data (#7945)
  96433d03 IPU Integration 5/5 (#7867)
  42c7f272 refactor checkpoint loading for training type plugins (#7928)
  ac4eb0a0 `is_overridden` improvements (#7918)
  9e932f4d Delete `on_after_backward` unused argument (#7925)
  8b738693 Deprecate the default `EarlyStopping` callback monitor value (#7907)
  c1eac483 split `restore_training_state` into logical parts [2 / 2] (#7900)
  d209b689 split `restore_training_state` into logical parts [1 / 2] (#7901)
  111287b4 add pre-commit hooks (#7906)
  839019a3 Remove legacy teardown check in train loop (#7917)
  b45a89a2 Clean-up after logger connector redesign 2/2 (#7631)
  07b69231 Remove fn check for ipu output (#7915)
  580a3b5e Remove dead code (#7910)
  df812398 Clean-up after logger connector redesign 1/2 (#7909)
  ec4f8856 Enable logger connector re-design (#7891)
  15be9865 add logger to __all__ (#6854)
  6fee9262 Deprecate `LightningDataModule` lifecycle properties (#7657)
  764d2c77 refactor CheckpointConnector.restore_weights  (#7862)
  7f4ef6d1 Fix logs overwriting issue for remote fs (#7889)
  c310ce66 Logger connector re-design `_Metadata.reduce_fx` fixes. (#7890)
  b214442e New logger connector code (#7882)

Reviewed By: yifuwang

Differential Revision: D29105294

fbshipit-source-id: 990b2a4a7333908d676de193f5ec930cb50b8a19
Labels: feature, refactor
Projects: None yet
10 participants