[REQUEST] Moving a trainable model with an optimiser between GPU and CPU #5620

@kfertakis

Description

Is your feature request related to a problem? Please describe.
When a DeepSpeed model is initialised with an optimiser, the `torch.nn.Module.to()` functionality for moving the model between devices breaks: the optimiser holds references to the model parameters, so GPU memory is not freed when, for example, trying to move the model to the CPU.

Describe the solution you'd like
Functionality similar to `torch.nn.Module.to()` that moves both the model and the optimiser between devices and de-allocates the previously occupied memory.
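Outside of DeepSpeed's engine, a plain-PyTorch sketch of what such functionality could look like is below. The helper name `move_model_and_optimizer` is hypothetical; the key point is that `Module.to()` does not touch optimiser state tensors (e.g. Adam's `exp_avg`/`exp_avg_sq`), so those must be moved explicitly before the old device's memory can be released.

```python
import torch


def move_model_and_optimizer(model, optimizer, device):
    """Move a model and its optimiser state to `device` (hypothetical helper)."""
    model.to(device)
    # Optimiser state tensors are not moved by Module.to(), so move them here.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)
    # Ask the caching allocator to return freed blocks to the GPU.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

With ZeRO-partitioned optimiser state this would be more involved, but the sketch illustrates the requested `.to()`-like semantics.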

Describe alternatives you've considered
The alternative is to destroy the model instance and recreate it from a checkpoint, but this has a much higher time cost.
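The checkpoint round-trip alternative can be sketched as follows; the `model_factory`/`optim_factory` parameters are illustrative, not a DeepSpeed API:

```python
import os
import tempfile

import torch


def recreate_from_checkpoint(model_factory, optim_factory, model, optimizer, device):
    """Save state, drop the old instances, and rebuild fresh ones on `device`."""
    path = os.path.join(tempfile.gettempdir(), "offload_ckpt.pt")
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict()}, path)
    del model, optimizer          # drop the old references...
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # ...so cached GPU blocks can be released
    new_model = model_factory().to(device)
    new_optimizer = optim_factory(new_model.parameters())
    ckpt = torch.load(path, map_location=device)
    new_model.load_state_dict(ckpt["model"])
    new_optimizer.load_state_dict(ckpt["optim"])
    return new_model, new_optimizer
```

The serialise/deserialise round trip and re-running model construction are what make this path much slower than an in-place device move.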

Labels: enhancement (New feature or request)