This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fix model loading on GPU post training #5518

Merged
merged 2 commits into from
Dec 20, 2021

Conversation

vikigenius
Contributor

@vikigenius vikigenius commented Dec 18, 2021

Fixes #5511 .

Changes proposed in this pull request:

  • Load model on CPU after training is done to save GPU memory.
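The fix amounts to re-loading the best checkpoint onto the CPU rather than back onto the GPU once training has finished. A minimal sketch of the pattern, assuming PyTorch (the helper names here are illustrative, not AllenNLP's actual API):

```python
import os
import tempfile

import torch
import torch.nn as nn

def save_checkpoint(model: nn.Module, path: str) -> None:
    # Persist only the state dict -- the usual PyTorch checkpoint pattern.
    torch.save(model.state_dict(), path)

def load_best_weights_on_cpu(model: nn.Module, path: str) -> nn.Module:
    # map_location="cpu" keeps the restored tensors off the GPU, so the
    # post-training reload does not allocate any extra CUDA memory.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state)
    return model

model = nn.Linear(4, 2)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best.th")
    save_checkpoint(model, path)
    restored = load_best_weights_on_cpu(nn.Linear(4, 2), path)

# Every restored parameter lives on the CPU, regardless of where training ran.
assert all(p.device.type == "cpu" for p in restored.parameters())
```

Without `map_location`, `torch.load` restores each tensor to the device it was saved from, which is what kept the GPU allocation alive after training.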

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.
  • codecov/patch reports high test coverage (at least 90%).
    You can find this under the "Actions" tab of the pull request once the other checks have finished.

@vikigenius
Contributor Author

@epwalsh Here you go. Please review. Also, FYI, I noticed several tests failing in tests/training/trainer_test.py, both before and after my code change. Is that expected?

Member

@epwalsh epwalsh left a comment


Thanks @vikigenius! This LGTM.

FYI I noticed several tests failing in tests/training/trainer_test.py

I don't see any failures in CI, so did this only happen when you ran tests locally? Can you paste the error messages you saw?

@vikigenius
Contributor Author

@epwalsh a lot of tests (around 47) are failing, all with the same message:

E       RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
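For context, this RuntimeError is PyTorch's device-mismatch check: `index_select` (used under the hood by embedding lookups, among other things) requires the input tensor and the index tensor to be on the same device. A minimal illustration, assuming PyTorch:

```python
import torch

weights = torch.randn(10, 4)   # CPU tensor
idx = torch.tensor([0, 3, 7])  # also CPU: same device, so this works
rows = torch.index_select(weights, 0, idx)
assert rows.shape == (3, 4)

# Moving only one side to the GPU reproduces the failure seen in the tests:
if torch.cuda.is_available():
    try:
        torch.index_select(weights.cuda(), 0, idx)  # cuda weights, cpu index
    except RuntimeError as exc:
        assert "same device" in str(exc)
```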

@epwalsh
Member

epwalsh commented Dec 20, 2021

Oh, when you run tests on a machine that has GPUs, you need to make sure the non-GPU tests don't see the GPUs:

CUDA_VISIBLE_DEVICES='' pytest -v tests/

And to run GPU tests:

pytest -v -m 'gpu' tests/
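Setting `CUDA_VISIBLE_DEVICES` to the empty string hides all GPUs from the process, so `torch.cuda.is_available()` reports `False` and the non-GPU tests stay on the CPU. A quick way to verify this, assuming PyTorch is installed (the variable must be set before CUDA initializes, hence the subprocess):

```python
import os
import subprocess
import sys

# Launch a fresh interpreter with no visible CUDA devices.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
out = subprocess.run(
    [sys.executable, "-c", "import torch; print(torch.cuda.is_available())"],
    env=env,
    capture_output=True,
    text=True,
)
# With no visible devices, PyTorch cannot see a GPU.
assert out.stdout.strip() == "False"
```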

@vikigenius
Contributor Author

Oh cool, all tests pass then. LGTM.

Successfully merging this pull request may close these issues.

CUDA Out of memory error after training has stopped