
1/n Generalize internal checks for Accelerator in Trainer - remove trainer._device_type #11001

Closed
wants to merge 4 commits into from

Conversation

four4fish
Contributor

@four4fish four4fish commented Dec 9, 2021

What does this PR do?

#11002 could be a follow-up to this PR.

Part of #10821
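The change generalizes internal device checks: instead of comparing the private `trainer._device_type` enum, the Trainer asks the accelerator instance directly. A minimal illustrative sketch of that pattern (not code from this PR), assuming the public accelerator classes from `pytorch_lightning.accelerators` and the Trainer's `accelerator` attribute:

```python
# Illustrative sketch only; GPUAccelerator/TPUAccelerator are the public
# accelerator classes and `trainer.accelerator` is the Trainer's accelerator.
from pytorch_lightning.accelerators import GPUAccelerator, TPUAccelerator


def uses_gpu(trainer) -> bool:
    # Before: trainer._device_type == _AcceleratorType.GPU
    return isinstance(trainer.accelerator, GPUAccelerator)


def uses_tpu(trainer) -> bool:
    # Before: trainer._device_type == _AcceleratorType.TPU
    return isinstance(trainer.accelerator, TPUAccelerator)
```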

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Comment on lines 1653 to 1652

      rank_zero_info(
          f"GPU available: {torch.cuda.is_available()}, used: {isinstance(self.accelerator, GPUAccelerator)}"
      )

      num_tpu_cores = (
-         self.tpu_cores if self.tpu_cores is not None and self._device_type == _AcceleratorType.TPU else 0
+         self.tpu_cores if self.tpu_cores is not None and isinstance(self.accelerator, TPUAccelerator) else 0
      )
      rank_zero_info(f"TPU available: {_TPU_AVAILABLE}, using: {num_tpu_cores} TPU cores")

      num_ipus = self.ipus if self.ipus is not None else 0
      rank_zero_info(f"IPU available: {_IPU_AVAILABLE}, using: {num_ipus} IPUs")

-     if torch.cuda.is_available() and self._device_type != _AcceleratorType.GPU:
+     if torch.cuda.is_available() and isinstance(self.accelerator, GPUAccelerator):
          rank_zero_warn(
              "GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`.",
              category=PossibleUserWarning,
          )

-     if _TPU_AVAILABLE and self._device_type != _AcceleratorType.TPU:
+     if _TPU_AVAILABLE and isinstance(self.accelerator, TPUAccelerator):
          rank_zero_warn(
              "TPU available but not used. Set the `tpu_cores` flag in your trainer"
              " `Trainer(tpu_cores=8)` or script `--tpu_cores=8`."
          )

-     if (
-         _IPU_AVAILABLE
-         and self._device_type != _AcceleratorType.IPU
-         and not isinstance(self.accelerator, IPUAccelerator)
-     ):
+     if _IPU_AVAILABLE and not isinstance(self.accelerator, IPUAccelerator):
Contributor

I find this logging pretty verbose. For instance, I don't want to see "TPU available: False, using: 0 TPU cores" on every run. This is only useful if there are TPUs available and I'm not using them. I assume that if there are no TPUs available but I have specified them, then an exception would be thrown before this is called.

Contributor

Fair point. What about when users pass (accelerator="auto", devices="x")?
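For illustration, the scenario referred to above, where Lightning picks the accelerator and the user only sets the device count (a hedged sketch; exact flag behaviour depends on the Lightning version):

```python
from pytorch_lightning import Trainer

# With accelerator="auto" the user may not know up front whether a GPU/TPU was
# selected, so the device summary log is arguably still useful in this case.
trainer = Trainer(accelerator="auto", devices=2)
```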

Member

Can we simply lower the logging level here, maybe to debug? I also find it too verbose most of the time.

@kaushikb11 same goes here. Usually people know what kind of machine they submitted their job to.
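A sketch of the suggestion above, assuming a rank-zero debug helper is available next to `rank_zero_info`/`rank_zero_warn` (if not, logging the same message at DEBUG level would achieve the same effect):

```python
import torch
from pytorch_lightning.utilities import rank_zero_debug, rank_zero_warn

gpu_available = torch.cuda.is_available()
gpu_in_use = False  # stand-in for isinstance(self.accelerator, GPUAccelerator)

# Plain status lines demoted to debug so they no longer print on every run ...
rank_zero_debug(f"GPU available: {gpu_available}")

# ... while actionable misconfiguration warnings stay visible.
if gpu_available and not gpu_in_use:
    rank_zero_warn("GPU available but not used. Set the gpus flag, e.g. `Trainer(gpus=1)`.")
```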

Contributor Author

Totally agree it's too verbose. Also, why do we have _log_device_info() with all these warnings in the Trainer at all? Should we move _log_device_info to the end of the accelerator_connector init, or relocate the warnings into the relevant selection functions in the accelerator connector?

Contributor Author

Created issue #11014 to discuss this and will address it in a different PR.

      )
      rank_zero_info(f"TPU available: {_TPU_AVAILABLE}, using: {num_tpu_cores} TPU cores")

      num_ipus = self.ipus if self.ipus is not None else 0
      rank_zero_info(f"IPU available: {_IPU_AVAILABLE}, using: {num_ipus} IPUs")

-     if torch.cuda.is_available() and self._device_type != _AcceleratorType.GPU:
+     if torch.cuda.is_available() and isinstance(self.accelerator, GPUAccelerator):
Contributor

This is incorrect: the prior code was checking that the GPU accelerator was not being used. But I don't think we need to check the accelerator instance at all. Why can't we check the gpus flag directly?

Same as above, I don't get why we have to check TPUAccelerator instead of tpu_cores.
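To make the inversion concrete, a small sketch (names borrowed from the diff above; `accelerator` stands in for `self.accelerator`):

```python
import torch
from pytorch_lightning.accelerators import GPUAccelerator


def should_warn_gpu_unused(accelerator) -> bool:
    # Prior behaviour: warn when a GPU is present but the GPU accelerator is
    # *not* in use. The isinstance check in the PR drops the `not`, which
    # inverts the condition.
    return torch.cuda.is_available() and not isinstance(accelerator, GPUAccelerator)
```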

Contributor Author

Once people reach agreement on the gpus/tpu_cores-style flags vs. devices in issue #10410 (Future of gpus/ipus/tpu_cores with respect to devices), I think checking the accelerator type and device count will be better than checking tpu_cores/gpus.

@four4fish four4fish changed the title from "Generalize internal checks for Accelerator in Trainer" to "1/n Generalize internal checks for Accelerator in Trainer - remove trainer._device_type" Dec 9, 2021
pytorch_lightning/trainer/trainer.py (outdated review threads, resolved)
@stale

stale bot commented Dec 25, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the "won't fix" label (This will not be worked on) Dec 25, 2021
@stale

stale bot commented Jan 3, 2022

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Jan 3, 2022
@awaelchli awaelchli reopened this Jan 4, 2022
@awaelchli awaelchli added this to the 1.6 milestone Jan 4, 2022
@stale

stale bot commented Jan 9, 2022

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Jan 9, 2022
Labels
accelerator, won't fix (This will not be worked on)