
Add auto_device_count and device name support #13423

Merged (94 commits) on Jul 22, 2022
c0bc264
Add auto_device_count and device name support
jerome-habana Jun 28, 2022
24cbfc1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 28, 2022
c103cff
Update change log
jerome-habana Jun 28, 2022
ca009a4
return default device count on failure
jerome-habana Jun 29, 2022
3c3fb0b
Update src/pytorch_lightning/CHANGELOG.md
jerome-habana Jun 29, 2022
b456564
Apply suggestions from code review
Borda Jun 29, 2022
19e4601
Remove unnecessary endpoint logic, rename `collaborative` to `hivemin…
Jun 28, 2022
fc6c27c
Update neptune-client requirement from <0.16.3,>=0.10.0 to >=0.10.0,<…
dependabot[bot] Jun 28, 2022
7638bb7
Update numpy requirement from <1.22.5,>=1.17.2 to >=1.17.2,<1.23.1 in…
dependabot[bot] Jun 28, 2022
fa7e854
[CLI] Support custom trainers without callbacks (#13138)
carmocca Jun 28, 2022
053b0df
Better errors for logging corner cases (#13164)
carmocca Jun 28, 2022
a3d7674
Rename old references to training type plugin in tests (#13421)
awaelchli Jun 28, 2022
c1ad298
CI: fix requirements freeze (#13441)
Borda Jun 29, 2022
2954714
Add model summary when using DeepSpeed Stage 3 (#13427)
Jun 29, 2022
26469d9
Remove support for DDP2 strategy (#12705)
awaelchli Jun 29, 2022
2c18643
[Docs] Fix README.md in lightning/examples/pl_basics (#13380)
Keiku Jun 29, 2022
6338ad5
Update gather_all_tensors to handle tensors of different sizes (#12630)
ananthsub Jun 29, 2022
2e7cff7
Modified python version check to accommodate for legacy version style…
martinosorb Jun 29, 2022
177d3b4
fix PL release docker (#13439)
Borda Jun 29, 2022
bb7d825
Fix docstring typo (#13447)
WrRan Jun 29, 2022
b6a666a
Unpin `protobuf` version and update `tensorboard` version (#13259)
akihironitta Jun 29, 2022
891f8f2
Remove remaining old-style AcceleratorConnector properties (#13412)
awaelchli Jun 29, 2022
d5671ca
Call `set_epoch` for distributed batch samplers (#13396)
awaelchli Jun 29, 2022
58b62df
Update comet-ml requirement from <=3.28.2,>=3.1.12 to >=3.1.12,<3.31.…
dependabot[bot] Jun 29, 2022
42c371d
Convert validation loop config warnings to `PossibleUserWarning` (#13…
CompRhys Jun 29, 2022
657099d
Fix validation when accelerator is a string (#13417)
awaelchli Jun 29, 2022
fd8afbe
Set timeout for DDPSpawnStrategy (#13383)
lsy643 Jun 30, 2022
6149abb
Remove unused docstring parameter `device` (#13448)
WrRan Jun 30, 2022
7fa962d
Update wandb requirement from <0.12.19,>=0.8.21 to >=0.8.21,<0.12.20 …
dependabot[bot] Jun 30, 2022
fc91c72
Add flash[image] dependency in Active learning example (#13442)
ekagra-ranjan Jun 30, 2022
10757bd
fixed doc of timer (#13393)
s-kumano Jun 30, 2022
1e245a9
Simplify list extension (#13435)
carmocca Jun 30, 2022
284d95c
Add BaseModelCheckpoint class to inherit from (#13024)
otaj Jun 30, 2022
f9a3055
CI: abstract and make full pkg check (#13460)
Borda Jun 30, 2022
153ffca
Fix typo in `_block_parallel_sync_behavior` docstring (#13451)
WrRan Jun 30, 2022
c3c450f
Fix typo in `Loop.replace` docstring (#13452)
WrRan Jun 30, 2022
8874983
Typo in tuner/lr_finder.py (#13453)
WrRan Jun 30, 2022
63c611f
Remove unused argument `model` (#13454)
WrRan Jun 30, 2022
57d5659
Typo in trainer/supporters.py (#13455)
WrRan Jun 30, 2022
09b2d51
More clear docs for `LightningDataModule` (#13464)
WrRan Jun 30, 2022
7fecd51
fix mypy typing errors in lightning/trainer/optimizers.py (#13470)
gautierdag Jun 30, 2022
6c9d490
adding LAI test (#13321)
boat-builder Jun 30, 2022
3275fba
Add lightning app examples (#13456)
manskx Jun 30, 2022
03d3654
ci: drop false download artifact (#13473)
Borda Jul 1, 2022
a7f41c0
Remove redundant shebang from source files (#13479)
awaelchli Jul 1, 2022
9baa7d1
Move deepspeed summary test to correct folder (#13478)
awaelchli Jul 1, 2022
10cae3d
fix mypy typing errors in pytorch_lightning.__setup__.py (#13472)
CyprienRicque Jul 1, 2022
61283a7
lightning entry point (#13490)
Borda Jul 1, 2022
e20f6a8
Add CI for python lightning app Python unit tests (#13491)
manskx Jul 1, 2022
5e78491
Remove redundant progress bar refresh (#13462)
awaelchli Jul 2, 2022
81b7874
Add CI for app examples (#13495)
manskx Jul 2, 2022
3e1725f
code-owners for App (#13497)
Borda Jul 2, 2022
8a634db
fix mypy typing errors in pytorch_lightning/strategies/single_device.…
CyprienRicque Jul 4, 2022
89766ab
fix mypy typing errors in pytorch_lightning/strategies/ddp2.py (#13535)
CyprienRicque Jul 4, 2022
c0874f3
Fix type hints of callbacks/finetuning.py (#13516)
ar90n Jul 4, 2022
eb059b4
fix mypy typing errors in pytorch_lightning/strategies/single_tpu.py …
CyprienRicque Jul 5, 2022
5dc9538
Remove deprecated `on_keyboard_interrupt` (#13438)
nitinramvelraj Jul 5, 2022
9dfc712
fix mypy typing errors in pytorch_lightning/tuner/lr_finder.py (#13513)
donlapark Jul 8, 2022
58abfda
Fix mypy errors attributed to `pytorch_lightning.loggers.logger.py` (…
jxtngx Jul 11, 2022
6a8e537
CI: Define reusable workflow - check schema (#13562)
akihironitta Jul 11, 2022
dfe5c83
Fix TPU circleci tests (#13432)
kaushikb11 Jul 11, 2022
3548628
cd: releasing packages (#13489)
Borda Jul 11, 2022
75e50c5
CI: Add PR labeler (#13475)
akihironitta Jul 11, 2022
ab67ec9
CI: Update mypy workflow (#13574)
akihironitta Jul 11, 2022
93b2e80
setup: set default metadata (#13571)
Borda Jul 11, 2022
424fb0e
Remove deprecated `pytorch_lightning.core.decorators.parameter_valida…
shantam-8 Jul 11, 2022
08c08cb
Adds is last batch (#13550)
bibhabasumohapatra Jul 12, 2022
d6ce697
Restore log step during restart (#13467)
rohitgr7 Jul 12, 2022
ee4b04f
CI/CD: Refactor building docker images (#13576)
akihironitta Jul 12, 2022
c8d2312
Fix mypy errors attributed to `pytorch_lightning.loggers.csv_logs.py`…
jxtngx Jul 12, 2022
c1ec00e
Fix mypy errors attributed to `pytorch_lightning.loggers.base.py` (#1…
jxtngx Jul 12, 2022
b10f5a5
CI: Enable dependabot for GitHub Actions (#13589)
akihironitta Jul 12, 2022
3aee345
Fix default value for `enable_progress_bar` in docs (#13584)
JiahaoYao Jul 12, 2022
cd09761
CI: Update labeler bot (#13624)
akihironitta Jul 12, 2022
047d8ee
Remove redundant GPU test (#13623)
rohitgr7 Jul 12, 2022
8e56a52
CI: hotfix gatekeeper (#13606)
Borda Jul 12, 2022
7174d7e
Remove `add_to_queue` and `remove_from_queue` from LightningModule (#…
shenoynikhil Jul 12, 2022
6924b41
Bump codecov/codecov-action from 1 to 3 (#13620)
dependabot[bot] Jul 12, 2022
27a0ac9
Bump actions/upload-artifact from 2 to 3 (#13622)
dependabot[bot] Jul 12, 2022
6b953c7
Bump docker/setup-buildx-action from 1 to 2 (#13618)
dependabot[bot] Jul 12, 2022
4239564
Removed deprecated `pytorch_lightning.overrides.distributed.IndexBatc…
samz5320 Jul 13, 2022
d83a423
fix mypy typing errors in pytorch_lightning/strategies/dp.py (#13564)
CyprienRicque Jul 13, 2022
604f7ca
Remove deprecated `Trainer.slurm_job_id` (#13459)
awaelchli Jul 13, 2022
692da6a
Remove deprecated `LightningModule.on_post_move_to_device` (#13548)
akihironitta Jul 13, 2022
8d8211e
Remove deprecated ClustertEnvironment methods (#13458)
awaelchli Jul 13, 2022
db1e4d2
Update CHANGELOG after the 1.6.5 release (#13641)
carmocca Jul 13, 2022
d825ada
Remove deprecated `LightningDistributed` (#13549)
akihironitta Jul 13, 2022
278f59a
Merge branch 'master' into hpu_pkg
jerome-habana Jul 14, 2022
8c8f94f
Merge branch 'master' into hpu_pkg
jerome-habana Jul 20, 2022
d04a792
Merge branch 'master' into hpu_pkg
kaushikb11 Jul 20, 2022
abdd890
Handle nameerror exceptions
jerome-habana Jul 20, 2022
fc8ab9c
remove unused variable
jerome-habana Jul 21, 2022
366f0b8
Merge branch 'master' into hpu_pkg
jerome-habana Jul 21, 2022
ef634ff
Merge branch 'master' into hpu_pkg
jerome-habana Jul 22, 2022
6 changes: 6 additions & 0 deletions src/pytorch_lightning/CHANGELOG.md
@@ -147,6 +147,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Updated `val_check_interval`(int) to consider total train batches processed instead of `_batches_that_stepped` for validation check during training ([#12832](https://github.com/Lightning-AI/lightning/pull/12832)


+- Updated Habana Accelerator's `auto_device_count`, `is_available` & `get_device_name` methods based on the latest torch habana package ([#13423](https://github.com/PyTorchLightning/pytorch-lightning/pull/13423))
+
+
+-


 ### Deprecated

 - Deprecated `pytorch_lightning.loggers.base.LightningLoggerBase` in favor of `pytorch_lightning.loggers.logger.Logger`, and deprecated `pytorch_lightning.loggers.base` in favor of `pytorch_lightning.loggers.logger` ([#120148](https://github.com/PyTorchLightning/pytorch-lightning/pull/12014))
26 changes: 22 additions & 4 deletions src/pytorch_lightning/accelerators/hpu.py
@@ -21,6 +21,9 @@
 from pytorch_lightning.utilities.exceptions import MisconfigurationException
 from pytorch_lightning.utilities.rank_zero import rank_zero_debug

+if _HPU_AVAILABLE:
+    import habana_frameworks.torch.hpu as torch_hpu
+

 class HPUAccelerator(Accelerator):
     """Accelerator for HPU devices."""
@@ -52,13 +55,28 @@ def get_parallel_devices(devices: int) -> List[torch.device]:

     @staticmethod
     def auto_device_count() -> int:
-        """Get the devices when set to auto."""
-        # TODO(@kaushikb11): Update this when api is exposed by the Habana team
-        return 8
+        """Returns the number of HPU devices when the devices is set to auto."""
+        try:
+            return torch_hpu.device_count()
+        except (AttributeError, NameError):
+            rank_zero_debug("HPU `auto_device_count` failed, returning default count of 8.")
+            return 8

     @staticmethod
     def is_available() -> bool:
-        return _HPU_AVAILABLE
+        """Returns a bool indicating if HPU is currently available."""
+        try:
+            return torch_hpu.is_available()
+        except (AttributeError, NameError):
+            return False
+
+    @staticmethod
+    def get_device_name() -> str:
+        """Returns the name of the HPU device."""
+        try:
+            return torch_hpu.get_device_name()
+        except (AttributeError, NameError):
+            return ""

     @classmethod
     def register_accelerators(cls, accelerator_registry: Dict) -> None:
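The guard-plus-fallback pattern in the hunk above can be sketched outside Lightning. This is a minimal standalone version, not the PR's actual class: the module import and the default count of 8 follow the diff, and on a machine without the Habana runtime every call takes the fallback branch.

```python
# Minimal sketch of the PR's fallback pattern: try the vendor API,
# fall back to a safe default when the module or attribute is missing.
try:
    import habana_frameworks.torch.hpu as torch_hpu  # real runtime, if installed
except ImportError:
    torch_hpu = None  # not installed: every query below hits the except branch


def auto_device_count(default: int = 8) -> int:
    """Return the HPU device count, or `default` when the API is unavailable."""
    try:
        return torch_hpu.device_count()
    except (AttributeError, NameError):
        return default


def is_available() -> bool:
    """Return True only when the Habana runtime reports an available HPU."""
    try:
        return torch_hpu.is_available()
    except (AttributeError, NameError):
        return False


def get_device_name() -> str:
    """Return the HPU device name, or an empty string without the runtime."""
    try:
        return torch_hpu.get_device_name()
    except (AttributeError, NameError):
        return ""
```

Catching `AttributeError`/`NameError` rather than checking `torch_hpu is None` mirrors the diff's choice: the same except clause covers a missing module, a stubbed module, and an older package that lacks the queried function.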
2 changes: 1 addition & 1 deletion src/pytorch_lightning/strategies/hpu_parallel.py
@@ -32,7 +32,7 @@

 if _HPU_AVAILABLE:
     import habana_frameworks.torch.core as htcore
-    import habana_frameworks.torch.core.hccl  # noqa: F401
+    import habana_frameworks.torch.distributed.hccl  # noqa: F401

 log = logging.getLogger(__name__)

1 change: 0 additions & 1 deletion src/pytorch_lightning/strategies/single_hpu.py
@@ -24,7 +24,6 @@

 if _HPU_AVAILABLE:
     import habana_frameworks.torch.core as htcore
-    import habana_frameworks.torch.core.hccl  # noqa: F401


 class SingleHPUStrategy(SingleDeviceStrategy):
6 changes: 6 additions & 0 deletions tests/tests_pytorch/accelerators/test_hpu.py
@@ -40,6 +40,11 @@ def test_availability():
     assert HPUAccelerator.is_available()


+@RunIf(hpu=True)
+def test_device_name():
+    assert HPUAccelerator.get_device_name() == "GAUDI"
+
+
 @pytest.mark.skipif(_HPU_AVAILABLE, reason="test requires non-HPU machine")
 def test_fail_if_no_hpus():
     with pytest.raises(MisconfigurationException, match="HPUAccelerator can not run on your system"):
@@ -239,6 +244,7 @@ def test_inference_only(tmpdir, hpus):
     trainer.predict(model)


+@RunIf(hpu=True)
 def test_hpu_auto_device_count():
     assert HPUAccelerator.auto_device_count() == 8