Commit 3c79bc9

Merge branch 'master' into bugfix/cuda-oom-detection-and-handling
2 parents 8f60e88 + 745aed0 commit 3c79bc9

105 files changed: +9848 -737 lines changed

.github/workflows/ci_test-mnodes.yml

Lines changed: 0 additions & 3 deletions
@@ -78,9 +78,6 @@ jobs:
       - name: Install dependencies
         run: |
           pip install awscli coverage
-          # todo
-          pip install git+https://${{ secrets.PL_GHOST_TOKEN }}@github.com/PyTorchLightning/lightning-dtrun.git@v0.0.3 -q --no-cache-dir
-          #pip install git+https://${{ secrets.PL_GHOST_TOKEN }}@github.com/PyTorchLightning/lightning-dtrun.git@mnodes -q --no-cache-dir
 
       - name: Configure AWS Credentials
         uses: aws-actions/configure-aws-credentials@v1

CHANGELOG.md

Lines changed: 39 additions & 0 deletions
@@ -113,6 +113,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Changed profilers to save separate report files per state and rank ([#6621](https://github.com/PyTorchLightning/pytorch-lightning/pull/6621))
 
 
+- The trainer no longer tries to save a checkpoint on exception or run callback's `on_train_end` functions ([#6864](https://github.com/PyTorchLightning/pytorch-lightning/pull/6864))
+
+
 - Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](https://github.com/PyTorchLightning/pytorch-lightning/pull/6349))
 
 
@@ -153,6 +156,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 ### Removed
 
+- Removed evaluation loop legacy returns for `*_epoch_end` hooks ([#6973](https://github.com/PyTorchLightning/pytorch-lightning/pull/6973))
+
+
 - Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](https://github.com/PyTorchLightning/pytorch-lightning/pull/6164))
 
 
@@ -237,6 +243,36 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed `--gpus` default for parser returned by `Trainer.add_argparse_args` ([#6898](https://github.com/PyTorchLightning/pytorch-lightning/pull/6898))
 
 
+- Fixed pickle error checker to now check for `pickle.PickleError` to catch all pickle errors ([#6917](https://github.com/PyTorchLightning/pytorch-lightning/pull/6917))
+
+
+- Fixed `AttributeError` for `require_backward_grad_sync` when running manual optimization with sharded plugin ([#6915](https://github.com/PyTorchLightning/pytorch-lightning/pull/6915))
+
+
+- Fixed multi-gpu join for Horovod ([#6954](https://github.com/PyTorchLightning/pytorch-lightning/pull/6954))
+
+
+- Fixed a bug where `LightningModule.training_epoch_end` was called after the `on_train_end_epoch` hook ([#6969](https://github.com/PyTorchLightning/pytorch-lightning/pull/6969))
+
+
+- Fixed a bug where the outputs object passed to `LightningModule.training_epoch_end` was different from the object passed to the `on_train_end_epoch` hook ([#6969](https://github.com/PyTorchLightning/pytorch-lightning/pull/6969))
+
+
+- Fixed a bug where the outputs passed to `train_batch_end` would be lists even when using a single optimizer and no truncated backprop through time steps ([#6969](https://github.com/PyTorchLightning/pytorch-lightning/pull/6969))
+
+
+- Fixed `sync_dist` for tpus ([#6950](https://github.com/PyTorchLightning/pytorch-lightning/pull/6950))
+
+
+- Fixed bug for trainer error handling which would cause hang for distributed training ([#6864](https://github.com/PyTorchLightning/pytorch-lightning/pull/6864))
+
+
+- Fixed `self.device` not returning the correct device in replicas of data-parallel ([#6414](https://github.com/PyTorchLightning/pytorch-lightning/pull/6414))
+
+
+- Fixed process rank not being available right away after `Trainer` instantiation ([#6941](https://github.com/PyTorchLightning/pytorch-lightning/pull/6941))
+
+
 ## [1.2.7] - 2021-04-06
 
 ### Fixed
@@ -249,6 +285,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](https://github.com/PyTorchLightning/pytorch-lightning/pull/6730))
 
 
+- Fixed bug where `predict` could not be used when `progress_bar_refresh_rate=0` ([#6884](https://github.com/PyTorchLightning/pytorch-lightning/pull/6884))
+
+
 ## [1.2.6] - 2021-03-30
 
 ### Changed
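
Note on the `pickle.PickleError` entry (#6917) in the CHANGELOG.md hunk: in the standard library, `pickle.PicklingError` and `pickle.UnpicklingError` both derive from `pickle.PickleError`, so catching the base class covers both. The following is a minimal sketch of that idea, not Lightning's actual checker; `is_picklable` and `Unpicklable` are hypothetical names chosen for illustration.

```python
import pickle


class Unpicklable:
    """Hypothetical object whose pickling always fails with a pickle error."""

    def __reduce__(self):
        raise pickle.PicklingError("cannot pickle this object")


def is_picklable(obj: object) -> bool:
    """Return True if `obj` can be serialized with pickle (illustrative helper)."""
    try:
        pickle.dumps(obj)
        return True
    except pickle.PickleError:
        # PickleError is the common base class of PicklingError and UnpicklingError.
        return False


print(is_picklable([1, 2, 3]))
print(is_picklable(Unpicklable()))
```

`pickle.dumps` can also raise exceptions outside this hierarchy (for example `TypeError` for some objects), which this sketch deliberately does not catch.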

README.md

Lines changed: 1 addition & 1 deletion
@@ -177,7 +177,7 @@ class LitAutoEncoder(pl.LightningModule):
         return embedding
 
     def training_step(self, batch, batch_idx):
-        # training_step defined the train loop. It is independent of forward
+        # training_step defines the train loop. It is independent of forward
         x, y = batch
         x = x.view(x.size(0), -1)
         z = self.encoder(x)

azure-pipelines.yml

Lines changed: 0 additions & 1 deletion
@@ -62,7 +62,6 @@ jobs:
     python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'fairscale' not in line] ; open(fname, 'w').writelines(lines)"
     python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
     pip install --requirement ./requirements/devel.txt --upgrade-strategy only-if-needed
-    pip install git+https://$(AUTH_TOKEN)@github.com/PyTorchLightning/lightning-dtrun.git@v0.0.2 --no-cache-dir
     pip list
   displayName: 'Install dependencies'
 
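
Note on the `python -c` one-liners in this hunk: they rewrite `requirements/extra.txt` in place, dropping every line that mentions a package anywhere in the line. The nvidia Dockerfile in this same commit uses `line.startswith('horovod')` instead, which only drops lines that begin with the package name. A small sketch of the difference, operating on a list of lines rather than a file; `drop_requirements` and the sample lines are hypothetical and only illustrate the two filters.

```python
from typing import Iterable, List


def drop_requirements(lines: Iterable[str], package: str, prefix_only: bool = False) -> List[str]:
    """Return `lines` without the entries for `package` (illustrative helper).

    prefix_only=False mirrors the `'pkg' not in line` filter,
    prefix_only=True mirrors the `not line.startswith('pkg')` filter.
    """
    if prefix_only:
        return [line for line in lines if not line.startswith(package)]
    return [line for line in lines if package not in line]


# Made-up requirement lines, for demonstration only.
sample = [
    "fairscale>=0.3\n",
    "horovod>=0.21\n",
    "# horovod is optional\n",
    "torch\n",
]

print(drop_requirements(sample, "horovod"))
print(drop_requirements(sample, "horovod", prefix_only=True))
```

The substring filter also removes comments or unrelated packages that merely contain the name, while the prefix filter leaves them in place.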

dockers/base-cuda/Dockerfile

Lines changed: 4 additions & 0 deletions
@@ -113,6 +113,10 @@ RUN \
     pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex && \
     rm -rf apex
 
+RUN \
+    # install DeepSpeed
+    pip install deepspeed>=0.3.14
+
 RUN \
     # Show what we have
     pip --version && \
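
Note on the added `pip install deepspeed>=0.3.14` line: in a POSIX shell an unquoted `>` is an output redirection, so the command as written would likely install an unpinned `deepspeed` and write pip's output to a file named `=0.3.14`, whereas quoting the specifier (`"deepspeed>=0.3.14"`) passes the version constraint to pip. The sketch below demonstrates the parsing with `echo` in place of `pip`, assuming a POSIX `sh` is available; `run_in_scratch_dir` is a hypothetical helper.

```python
import os
import subprocess
import tempfile


def run_in_scratch_dir(command: str):
    """Run `command` through the shell in an empty temporary directory.

    Returns (stdout, sorted list of files the command created).
    """
    with tempfile.TemporaryDirectory() as scratch:
        result = subprocess.run(
            command, shell=True, cwd=scratch, capture_output=True, text=True, check=True
        )
        return result.stdout, sorted(os.listdir(scratch))


# Unquoted: the shell treats `>=0.3.14` as "redirect stdout to the file =0.3.14".
unquoted = run_in_scratch_dir("echo deepspeed>=0.3.14")
# Quoted: the whole specifier reaches the program as one argument.
quoted = run_in_scratch_dir('echo "deepspeed>=0.3.14"')

print(unquoted)
print(quoted)
```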

dockers/nvidia/Dockerfile

Lines changed: 7 additions & 41 deletions
@@ -12,52 +12,17 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-FROM nvcr.io/nvidia/cuda:11.1.1-runtime-ubuntu20.04
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-03.html#rel_21-03
+FROM nvcr.io/nvidia/pytorch:20.12-py3
 
 MAINTAINER PyTorchLightning <https://github.com/PyTorchLightning>
 
 ARG LIGHTNING_VERSION=""
 
-SHELL ["/bin/bash", "-c"]
-# https://techoverflow.net/2019/05/18/how-to-fix-configuring-tzdata-interactive-input-when-building-docker-images/
-ENV \
-    DEBIAN_FRONTEND=noninteractive \
-    TZ=Europe/Prague \
-    PATH="$PATH:/root/.local/bin" \
-    CUDA_TOOLKIT_ROOT_DIR="/usr/local/cuda" \
-    MKL_THREADING_LAYER=GNU
-
-RUN apt-get update -qq && \
-    apt-get install -y --no-install-recommends \
-        build-essential \
-        python3 \
-        python3-distutils \
-        python3-dev \
-        pkg-config \
-        cmake \
-        git \
-        wget \
-        unzip \
-        ca-certificates \
-    && \
-
-# Cleaning
-    apt-get autoremove -y && \
-    apt-get clean && \
-    rm -rf /root/.cache && \
-    rm -rf /var/lib/apt/lists/* && \
-
-# Setup PIP
-    update-alternatives --install /usr/bin/python python /usr/bin/python3 1 && \
-    wget https://bootstrap.pypa.io/get-pip.py --progress=bar:force:noscroll --no-check-certificate && \
-    python get-pip.py && \
-    rm get-pip.py && \
-    pip --version
-
-COPY ./ /home/pytorch-lightning/
+COPY ./ /workspace/pytorch-lightning/
 
 RUN \
-    cd /home && \
+    cd /workspace && \
     mv pytorch-lightning/notebooks . && \
     mv pytorch-lightning/pl_examples . && \
     # replace by specific version if asked
@@ -71,9 +36,10 @@ RUN \
 
     # Installations
     python -c "fname = './pytorch-lightning/requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if not line.startswith('horovod')] ; open(fname, 'w').writelines(lines)" && \
-    pip install -r ./pytorch-lightning/requirements/extra.txt -U --no-cache-dir && \
-    pip install -r ./pytorch-lightning/requirements/examples.txt -U --no-cache-dir && \
+    pip install -r ./pytorch-lightning/requirements/extra.txt --no-cache-dir --upgrade-strategy only-if-needed && \
+    pip install -r ./pytorch-lightning/requirements/examples.txt --no-cache-dir --upgrade-strategy only-if-needed && \
     pip install ./pytorch-lightning --no-cache-dir && \
+    pip install "Pillow>=8.1" "torchtext>=0.9.0" ipython[all] --no-cache-dir --upgrade-strategy only-if-needed && \
     rm -rf pytorch-lightning
 
 RUN python --version && \
