Commit e40d7b8

Borda, pre-commit-ci[bot], SkafteNicki, bhimrazy, deependujha
authored and committed
tests: fix skipif condition for deepspeed (#21195)
* tests: fix skipif condition for `deepspeed`
* split test_trainer_compiled_model
* test_trainer_compiled_model_deepspeed
* cuda-toolkit
* Apply suggestion from @bhimrazy
* add huggingface_hub as a dependency

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nicki Skafte Detlefsen <skaftenicki@gmail.com>
Co-authored-by: Bhimraj Yadav <bhimrajyadav977@gmail.com>
Co-authored-by: Deependu <deependujha21@gmail.com>
1 parent 276cc8c commit e40d7b8

11 files changed: +83 -38 lines changed

.azure/gpu-tests-fabric.yml

Lines changed: 7 additions & 2 deletions
@@ -85,6 +85,7 @@ jobs:
         displayName: "extend env. vars 4 future"

       - bash: |
+          set -ex
           echo $(DEVICES)
           echo $CUDA_VISIBLE_DEVICES
           echo $CUDA_VERSION_MM
@@ -96,6 +97,10 @@ jobs:
           python --version
           pip --version
           pip list
+          # todo: rather use devel base image
+          apt-get update -qq --fix-missing
+          apt-get install -y cuda-toolkit
+          nvcc --version
         displayName: "Image info & NVIDIA"

       - bash: |
@@ -156,7 +161,7 @@ jobs:
       - bash: python -m coverage run --source ${COVERAGE_SOURCE} -m pytest tests_fabric/ -v --durations=50
         workingDirectory: tests/
         displayName: "Testing: fabric standard"
-        timeoutInMinutes: "10"
+        timeoutInMinutes: "15"

       - bash: |
           wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/run_standalone_tests.sh
@@ -165,7 +170,7 @@ jobs:
         env:
           PL_RUN_STANDALONE_TESTS: "1"
         displayName: "Testing: fabric standalone"
-        timeoutInMinutes: "10"
+        timeoutInMinutes: "15"

       - bash: |
           python -m coverage report
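
The `nvcc --version` step added to the "Image info & NVIDIA" job works as a smoke test: with `set -ex` in place, the step fails early if the CUDA toolkit is not installed. A minimal Python sketch of the same kind of check is shown here, using only the standard library. The helper name `nvcc_version` is hypothetical and is not part of the repository.

```python
import shutil
import subprocess


def nvcc_version(executable="nvcc"):
    """Return the `--version` output of the CUDA compiler, or None if it is not on PATH.

    `executable` is a parameter only so the lookup can be exercised with other names.
    """
    path = shutil.which(executable)
    if path is None:
        # No compiler found: the toolkit is probably not installed in this image.
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=False)
    return result.stdout.strip() or None


# A name that should not exist on any machine resolves to None.
missing = nvcc_version("definitely-not-a-real-compiler-binary")
```

Calling `nvcc_version()` with no arguments on a machine with the toolkit installed would return the compiler banner; on a runtime-only image it would return `None`.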

.azure/gpu-tests-pytorch.yml

Lines changed: 6 additions & 1 deletion
@@ -89,6 +89,7 @@ jobs:
         displayName: "extend env. vars 4 future"

       - bash: |
+          set -ex
           echo $(DEVICES)
           echo $CUDA_VISIBLE_DEVICES
           echo $CUDA_VERSION_MM
@@ -100,6 +101,10 @@ jobs:
           python --version
           pip --version
           pip list
+          # todo: rather use devel base image
+          apt-get update -qq --fix-missing
+          apt-get install -y cuda-toolkit
+          nvcc --version
         displayName: "Image info & NVIDIA"

       - bash: |
@@ -194,7 +199,7 @@ jobs:
         env:
           PL_USE_MOCKED_MNIST: "1"
         displayName: "Testing: PyTorch standalone tasks"
-        timeoutInMinutes: "10"
+        timeoutInMinutes: "15"

       - bash: |
           python -m coverage report

.github/checkgroup.yml

Lines changed: 2 additions & 2 deletions
@@ -47,7 +47,7 @@ subprojects:
       - "!*.md"
       - "!**/*.md"
     checks:
-      - "pytorch.yml / Lit Job (nvidia/cuda:12.1.1-runtime-ubuntu22.04, pytorch, 3.10)"
+      - "pytorch.yml / Lit Job (nvidia/cuda:12.1.1-devel-ubuntu22.04, pytorch, 3.10)"
       - "pytorch.yml / Lit Job (lightning, 3.12)"
       - "pytorch.yml / Lit Job (pytorch, 3.12)"

@@ -175,7 +175,7 @@ subprojects:
       - "!*.md"
       - "!**/*.md"
     checks:
-      - "fabric.yml / Lit Job (nvidia/cuda:12.1.1-runtime-ubuntu22.04, fabric, 3.10)"
+      - "fabric.yml / Lit Job (nvidia/cuda:12.1.1-devel-ubuntu22.04, fabric, 3.10)"
       - "fabric.yml / Lit Job (fabric, 3.12)"
       - "fabric.yml / Lit Job (lightning, 3.12)"

.lightning/workflows/fabric.yml

Lines changed: 5 additions & 5 deletions
@@ -6,18 +6,18 @@ trigger:

 timeout: "60" # minutes
 machine: "L4_X_2"
-image: "nvidia/cuda:12.6.3-runtime-ubuntu22.04"
+image: "nvidia/cuda:12.6.3-devel-ubuntu22.04"
 parametrize:
   matrix: {}
   include:
-    # note that this is setting also all oldest requirements which is linked to Torch == 2.1
-    - image: "pytorchlightning/pytorch_lightning:base-cuda12.1.1-py3.10-torch2.1"
+    # note that this is setting also all oldest requirements which is linked to python == 3.10
+    - image: "nvidia/cuda:12.1.1-devel-ubuntu22.04"
       PACKAGE_NAME: "fabric"
       python_version: "3.10"
     - PACKAGE_NAME: "fabric"
       python_version: "3.12"
-    # - image: "nvidia/cuda:12.6-runtime-ubuntu22.04"
-    # PACKAGE_NAME: "fabric"
+    #- image: "nvidia/cuda:12.6-runtime-ubuntu22.04"
+    # PACKAGE_NAME: "fabric"
     - PACKAGE_NAME: "lightning"
       python_version: "3.12"
   exclude: []

.lightning/workflows/pytorch.yml

Lines changed: 5 additions & 5 deletions
@@ -6,18 +6,18 @@ trigger:

 timeout: "60" # minutes
 machine: "L4_X_2"
-image: "nvidia/cuda:12.6.3-runtime-ubuntu22.04"
+image: "nvidia/cuda:12.6.3-devel-ubuntu22.04"
 parametrize:
   matrix: {}
   include:
-    # note that this is setting also all oldest requirements which is linked to Torch == 2.1
-    - image: "pytorchlightning/pytorch_lightning:base-cuda12.1.1-py3.10-torch2.1"
+    # note that this also sets oldest requirements which are linked to Python == 3.10
+    - image: "nvidia/cuda:12.1.1-devel-ubuntu22.04"
       PACKAGE_NAME: "pytorch"
       python_version: "3.10"
     - PACKAGE_NAME: "pytorch"
       python_version: "3.12"
-    # - image: "nvidia/cuda:12.6.3-runtime-ubuntu22.04"
-    # PACKAGE_NAME: "pytorch"
+    #- image: "nvidia/cuda:12.6.3-devel-ubuntu22.04"
+    # PACKAGE_NAME: "pytorch"
     - PACKAGE_NAME: "lightning"
       python_version: "3.12"
   exclude: []

requirements/fabric/test.txt

Lines changed: 1 addition & 0 deletions
@@ -8,3 +8,4 @@ pytest-random-order ==1.2.0
 click ==8.1.8; python_version < "3.11"
 click ==8.2.1; python_version > "3.10"
 tensorboardX >=2.6, <2.7.0 # todo: relax it back to `>=2.2` after fixing tests
+huggingface-hub

src/lightning/fabric/utilities/testing/_runif.py

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ def _runif_reasons(
         reasons.append("Standalone execution")
         kwargs["standalone"] = True

-    if deepspeed and not (_DEEPSPEED_AVAILABLE and not _TORCH_GREATER_EQUAL_2_4):
+    if deepspeed and not (_DEEPSPEED_AVAILABLE and _TORCH_GREATER_EQUAL_2_4):
         reasons.append("Deepspeed")

     if dynamo:
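
The one-token change in `_runif_reasons` inverts which environments run the DeepSpeed tests. The sketch below restates both versions of the condition as a plain function so the difference can be checked directly. The function name `deepspeed_skip_reason` and its boolean parameters are illustrative stand-ins for `_DEEPSPEED_AVAILABLE` and `_TORCH_GREATER_EQUAL_2_4`; they are not part of the library.

```python
def deepspeed_skip_reason(requested, deepspeed_available, torch_ge_2_4, fixed=True):
    """Return "Deepspeed" when a test marked with deepspeed=True should be skipped, else None.

    `fixed=False` reproduces the condition before this commit, which only allowed
    the tests to run when torch was older than 2.4.
    """
    if fixed:
        runnable = deepspeed_available and torch_ge_2_4
    else:
        runnable = deepspeed_available and not torch_ge_2_4
    if requested and not runnable:
        return "Deepspeed"
    return None


# DeepSpeed installed and torch >= 2.4: skipped before the fix, runs after it.
before = deepspeed_skip_reason(True, True, True, fixed=False)
after = deepspeed_skip_reason(True, True, True, fixed=True)
```

With the old condition, DeepSpeed tests were skipped in exactly the environments the new condition targets, so they were effectively not exercised on recent torch versions.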

tests/tests_fabric/strategies/test_deepspeed_integration.py

Lines changed: 3 additions & 0 deletions
@@ -312,6 +312,9 @@ def _assert_saved_model_is_equal(fabric, model, checkpoint_path):
             single_ckpt_path = checkpoint_path / "single_model.pt"
             # the tag is hardcoded in DeepSpeedStrategy
             convert_zero_checkpoint_to_fp32_state_dict(checkpoint_path, single_ckpt_path, tag="checkpoint")
+
+            is_ckpt_path_a_file = os.path.isfile(single_ckpt_path)
+            single_ckpt_path = single_ckpt_path if is_ckpt_path_a_file else single_ckpt_path / "pytorch_model.bin"
             state_dict = torch.load(single_ckpt_path, weights_only=False)
     else:
         # 'checkpoint' is the tag, hardcoded in DeepSpeedStrategy
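
The added lines handle both outcomes of the conversion step: the target path may end up as a single file, or as a directory that contains `pytorch_model.bin`. The diff does not say which DeepSpeed versions produce which layout, so the sketch below only demonstrates the path-resolution logic itself on temporary files. The helper name `resolve_fp32_checkpoint` is hypothetical.

```python
import os
import tempfile


def resolve_fp32_checkpoint(path, filename="pytorch_model.bin"):
    """Return `path` if it is a regular file, otherwise the weights file inside it."""
    return path if os.path.isfile(path) else os.path.join(path, filename)


with tempfile.TemporaryDirectory() as tmp:
    # Case 1: the converter wrote a single file.
    as_file = os.path.join(tmp, "single_model.pt")
    open(as_file, "wb").close()

    # Case 2: the converter created a directory under the requested name.
    as_dir = os.path.join(tmp, "single_model_dir.pt")
    os.makedirs(as_dir)

    file_result = resolve_fp32_checkpoint(as_file)
    dir_result = resolve_fp32_checkpoint(as_dir)
```

In the first case the path is returned unchanged; in the second, `pytorch_model.bin` is appended, which matches what both updated test helpers do before calling `torch.load`.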

tests/tests_pytorch/strategies/test_deepspeed.py

Lines changed: 4 additions & 1 deletion
@@ -313,7 +313,7 @@ def on_train_start(self, trainer, pl_module) -> None:
     trainer.fit(model)
     trainer.test(model)
     assert list(lr_monitor.lrs) == ["lr-SGD"]
-    assert len(set(lr_monitor.lrs["lr-SGD"])) == 8
+    assert len(lr_monitor.lrs["lr-SGD"]) == 8


 @RunIf(min_cuda_gpus=1, standalone=True, deepspeed=True)
@@ -1029,6 +1029,9 @@ def _assert_save_model_is_equal(model, tmp_path, trainer):
     if trainer.is_global_zero:
         single_ckpt_path = os.path.join(tmp_path, "single_model.pt")
         convert_zero_checkpoint_to_fp32_state_dict(checkpoint_path, single_ckpt_path)
+
+        if not os.path.isfile(single_ckpt_path):
+            single_ckpt_path = os.path.join(single_ckpt_path, "pytorch_model.bin")
         state_dict = torch.load(single_ckpt_path, weights_only=False)

         model = model.cpu()
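
The first hunk in this file changes what the learning-rate assertion measures. `len(set(...))` counts distinct logged values, while `len(...)` counts logged entries, so the two only agree when no value repeats. A small sketch with made-up numbers (the values below are hypothetical, not taken from the test run):

```python
# Hypothetical learning rates logged over eight steps, where each value repeats once.
recorded = [0.1, 0.1, 0.05, 0.05, 0.025, 0.025, 0.0125, 0.0125]

unique_count = len(set(recorded))  # number of distinct values
total_count = len(recorded)  # number of logged entries
```

Here the old form of the assertion would see 4 and the new form would see 8, which shows why the check now depends only on how many times the monitor logged, not on whether the scheduler produced a new value each time.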

tests/tests_pytorch/utilities/test_compile.py

Lines changed: 48 additions & 20 deletions
@@ -13,14 +13,12 @@
 # limitations under the License.
 import os
 import sys
-from contextlib import nullcontext
 from unittest import mock

 import pytest
 import torch
-from lightning_utilities.core.imports import RequirementCache

-from lightning.fabric.utilities.imports import _TORCH_GREATER_EQUAL_2_2, _TORCH_GREATER_EQUAL_2_4
+from lightning.fabric.utilities.imports import _TORCH_GREATER_EQUAL_2_2
 from lightning.pytorch import LightningModule, Trainer
 from lightning.pytorch.demos.boring_classes import BoringModel
 from lightning.pytorch.utilities.compile import from_compiled, to_uncompiled
@@ -34,7 +32,7 @@
 @pytest.mark.skipif(sys.platform == "darwin", reason="fatal error: 'omp.h' file not found")
 @RunIf(dynamo=True, deepspeed=True)
 @mock.patch("lightning.pytorch.trainer.call._call_and_handle_interrupt")
-def test_trainer_compiled_model(_, tmp_path, monkeypatch, mps_count_0):
+def test_trainer_compiled_model_deepspeed(_, tmp_path, monkeypatch, mps_count_0):
     trainer_kwargs = {
         "default_root_dir": tmp_path,
         "fast_dev_run": True,
@@ -69,22 +67,52 @@ def test_trainer_compiled_model(_, tmp_path, monkeypatch, mps_count_0):
     assert trainer.model._compiler_ctx is None

     # some strategies do not support it
-    if RequirementCache("deepspeed"):
-        compiled_model = torch.compile(model)
-        mock_cuda_count(monkeypatch, 2)
-
-        # TODO: Update deepspeed to avoid deprecation warning for `torch.cuda.amp.custom_fwd` on import
-        warn_context = (
-            pytest.warns(FutureWarning, match="torch.cuda.amp.*is deprecated")
-            if _TORCH_GREATER_EQUAL_2_4
-            else nullcontext()
-        )
-
-        with warn_context:
-            trainer = Trainer(strategy="deepspeed", accelerator="cuda", **trainer_kwargs)
-
-        with pytest.raises(RuntimeError, match="Using a compiled model is incompatible with the current strategy.*"):
-            trainer.fit(compiled_model)
+    compiled_model = torch.compile(model)
+    mock_cuda_count(monkeypatch, 2)
+
+    trainer = Trainer(strategy="deepspeed", accelerator="cuda", **trainer_kwargs)
+
+    with pytest.raises(RuntimeError, match="Using a compiled model is incompatible with the current strategy.*"):
+        trainer.fit(compiled_model)
+
+
+# https://github.com/pytorch/pytorch/issues/95708
+@pytest.mark.skipif(sys.platform == "darwin", reason="fatal error: 'omp.h' file not found")
+@RunIf(dynamo=True)
+@mock.patch("lightning.pytorch.trainer.call._call_and_handle_interrupt")
+def test_trainer_compiled_model_ddp(_, tmp_path, monkeypatch, mps_count_0):
+    trainer_kwargs = {
+        "default_root_dir": tmp_path,
+        "fast_dev_run": True,
+        "logger": False,
+        "enable_checkpointing": False,
+        "enable_model_summary": False,
+        "enable_progress_bar": False,
+    }
+
+    model = BoringModel()
+    compiled_model = torch.compile(model)
+    assert model._compiler_ctx is compiled_model._compiler_ctx  # shared reference
+
+    # can train with compiled model
+    trainer = Trainer(**trainer_kwargs)
+    trainer.fit(compiled_model)
+    assert trainer.model._compiler_ctx["compiler"] == "dynamo"
+
+    # the compiled model can be uncompiled
+    to_uncompiled_model = to_uncompiled(compiled_model)
+    assert model._compiler_ctx is None
+    assert compiled_model._compiler_ctx is None
+    assert to_uncompiled_model._compiler_ctx is None
+
+    # the compiled model needs to be passed
+    with pytest.raises(ValueError, match="required to be a compiled LightningModule"):
+        to_uncompiled(to_uncompiled_model)
+
+    # the uncompiled model can be fitted
+    trainer = Trainer(**trainer_kwargs)
+    trainer.fit(model)
+    assert trainer.model._compiler_ctx is None

     # ddp does
     trainer = Trainer(strategy="ddp", **trainer_kwargs)
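
The split in this file moves the DeepSpeed requirement from an in-body `if RequirementCache("deepspeed")` branch to the test's decorator, so the rest of the checks no longer depend on an optional package being installed. The same pattern can be sketched with the standard library alone. The sketch uses `unittest` rather than pytest and `RunIf`, and the module name `hypothetical_optional_backend` is deliberately one that should not exist.

```python
import importlib.util
import unittest

# True only if the (hypothetical) optional package can be imported.
HAS_OPTIONAL = importlib.util.find_spec("hypothetical_optional_backend") is not None


class CompiledModelTests(unittest.TestCase):
    def test_core_path(self):
        # Always runs: it has no optional dependency.
        self.assertEqual(sum([1, 2, 3]), 6)

    @unittest.skipUnless(HAS_OPTIONAL, "optional backend not installed")
    def test_optional_backend_path(self):
        # Runs only when the optional package is present, and is reported as skipped otherwise.
        self.assertTrue(HAS_OPTIONAL)


def run_suite():
    """Load and run the two tests, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CompiledModelTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

Compared with a conditional inside one large test, the decorator form makes the skip visible in the test report and keeps the unconditional checks running in every environment.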
