This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fix model loading on GPU post training #5518

Merged
merged 2 commits into from
Dec 20, 2021

Conversation

vikigenius
Contributor

@vikigenius vikigenius commented Dec 18, 2021

Fixes #5511 .

Changes proposed in this pull request:

  • Load model on CPU after training is done to save GPU memory.
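The fix amounts to re-loading the best checkpoint onto the CPU rather than back onto the GPU once training has finished. A minimal sketch of the pattern, assuming PyTorch (the helper names here are illustrative, not AllenNLP's actual API):

```python
import os
import tempfile

import torch
import torch.nn as nn

def save_checkpoint(model: nn.Module, path: str) -> None:
    # Persist only the state dict -- the usual PyTorch checkpoint pattern.
    torch.save(model.state_dict(), path)

def load_best_weights_on_cpu(model: nn.Module, path: str) -> nn.Module:
    # map_location="cpu" keeps the restored tensors off the GPU, so the
    # post-training reload does not allocate any extra CUDA memory.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state)
    return model

model = nn.Linear(4, 2)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best.th")
    save_checkpoint(model, path)
    restored = load_best_weights_on_cpu(nn.Linear(4, 2), path)

# Every restored parameter lives on the CPU, regardless of where training ran.
assert all(p.device.type == "cpu" for p in restored.parameters())
```

Without `map_location`, `torch.load` restores each tensor to the device it was saved from, which is what kept the GPU allocation alive after training.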

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.
  • codecov/patch reports high test coverage (at least 90%).
    You can find this under the "Actions" tab of the pull request once the other checks have finished.

@vikigenius
Contributor Author

@epwalsh Here you go. Please review. Also, FYI, I noticed several tests failing in tests/training/trainer_test.py, both before and after my code change. Is that expected?

Member

@epwalsh epwalsh left a comment


Thanks @vikigenius! This LGTM.

FYI I noticed several tests failing in tests/training/trainer_test.py

I don't see any failures in CI, so did this only happen when you ran tests locally? Can you paste the error messages you saw?

@vikigenius
Contributor Author

@epwalsh a lot of tests (around 47) are failing, all with the same message:

E       RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
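For context, this RuntimeError is PyTorch's device-mismatch check: `index_select` (used under the hood by embedding lookups, among other things) requires the input tensor and the index tensor to be on the same device. A minimal illustration, assuming PyTorch:

```python
import torch

weights = torch.randn(10, 4)   # CPU tensor
idx = torch.tensor([0, 3, 7])  # also CPU: same device, so this works
rows = torch.index_select(weights, 0, idx)
assert rows.shape == (3, 4)

# Moving only one side to the GPU reproduces the failure seen in the tests:
if torch.cuda.is_available():
    try:
        torch.index_select(weights.cuda(), 0, idx)  # cuda weights, cpu index
    except RuntimeError as exc:
        assert "same device" in str(exc)
```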

@epwalsh
Member

epwalsh commented Dec 20, 2021

Oh, when you run tests on a machine that has GPUs, you need to make sure the non-GPU tests don't see the GPUs:

CUDA_VISIBLE_DEVICES='' pytest -v tests/

And to run GPU tests:

pytest -v -m 'gpu' tests/
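Setting `CUDA_VISIBLE_DEVICES` to the empty string hides all GPUs from the process, so `torch.cuda.is_available()` reports `False` and the non-GPU tests stay on the CPU. A quick way to verify this, assuming PyTorch is installed (the variable must be set before CUDA initializes, hence the subprocess):

```python
import os
import subprocess
import sys

# Launch a fresh interpreter with no visible CUDA devices.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
out = subprocess.run(
    [sys.executable, "-c", "import torch; print(torch.cuda.is_available())"],
    env=env,
    capture_output=True,
    text=True,
)
# With no visible devices, PyTorch cannot see a GPU.
assert out.stdout.strip() == "False"
```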

@vikigenius
Contributor Author

Oh cool, all tests pass then. LGTM.

Successfully merging this pull request may close these issues.

CUDA Out of memory error after training has stopped