memory leak in an environment like notebook #879

@stas00

In HF/DS tests I use a notebook-like environment for some of the tests, which means I don't fork a new process for each deepspeed run, but emulate the distributed env in the process and repeatedly run deepspeed. This makes it much easier/faster to test the values of the weights.

Either some global variable is holding onto memory or there is a circular reference somewhere, because the memory isn't released after deepspeed has done its work, even when I explicitly delete the engine/scheduler/optimizer variables.

I tested that if I remove deepspeed from the equation there is no leak.

I now have a few dozen of those tests, and each deepspeed invocation uses some 10GB of extra RAM, so the total quickly grows into the 100s of GBs.
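A rough way to put numbers on the per-invocation growth is to sample the process RSS around each run. This is only a sketch: `rss_bytes` and `rss_growth` are hypothetical helper names, `run_once` stands for whatever creates and trains the trainer, and reading `/proc/self/statm` assumes Linux.

```python
import gc
import os


def rss_bytes():
    """Current resident set size of this process (assumes Linux /proc)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def rss_growth(run_once, repeats=3):
    """Call run_once() repeatedly and return the RSS delta of each call in bytes."""
    deltas = []
    for _ in range(repeats):
        before = rss_bytes()
        run_once()
        gc.collect()  # rule out garbage that is merely waiting for the collector
        deltas.append(rss_bytes() - before)
    return deltas
```

If the deltas stay large and positive on every repeat, the memory is being retained rather than just collected late.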

This is with a test model that has just two single-value weights, y = wx + b, i.e. the model's memory footprint is ~0.

Any idea where the objects might be held and not destroyed? If I re-create the trainer repeatedly in a loop, the old one should get its memory freed.

So this is an example of a cell in a jupyter notebook:

if trainer.deepspeed:
    print("reloading")
    trainer.deepspeed = None
    trainer.optimizer = None
    trainer.lr_scheduler = None
trainer = get_regression_trainer(output_dir=output_dir, deepspeed=ds_config_dict, skip_memory_metrics=True)
trainer.train()

Of course, the explicit Nones shouldn't be needed, since those attributes get replaced along with the trainer itself; I was just doing a sanity check.

So there must be a circular reference where 2 or more internal variables refer to each other and thus the memory doesn't get freed.
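One way to tell a reference cycle apart from an object pinned by a global is to watch it through a weak reference: a cycle disappears after `gc.collect()`, while an externally held object survives and shows up in `gc.get_referrers`. The sketch below uses hypothetical stand-in classes (`Engine`, `Optimizer`) and a hypothetical `diagnose` helper, not real DeepSpeed objects.

```python
import gc
import weakref


class Engine:
    """Hypothetical stand-in for a heavyweight object such as an engine."""


class Optimizer:
    """Hypothetical stand-in that keeps a back-reference to its engine."""


def diagnose(make_obj):
    """Create an object, drop it, and report what (if anything) keeps it alive."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from interfering with the check
    try:
        obj = make_obj()
        ref = weakref.ref(obj)
        del obj
        if ref() is None:
            return "freed by refcounting", []
        gc.collect()
        if ref() is None:
            return "freed by gc (reference cycle)", []
        # Still alive: list the types of the objects that refer to it.
        target = ref()
        holders = [type(r).__name__ for r in gc.get_referrers(target)]
        return "still alive (held elsewhere)", holders
    finally:
        if was_enabled:
            gc.enable()


def make_cycle():
    engine = Engine()
    engine.optimizer = Optimizer()
    engine.optimizer.engine = engine  # engine <-> optimizer cycle
    return engine


_registry = []  # simulates a module-level global that pins objects


def make_registered():
    engine = Engine()
    _registry.append(engine)
    return engine
```

Applied to the real case, the same idea would be to take `weakref.ref(trainer.deepspeed)` before dropping the trainer and inspect the referrers if it is still alive afterwards.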

I emulate a dist env with just:

import os

# emulate a single-process, single-GPU distributed environment
dist_env_1_gpu = dict(
    MASTER_ADDR="localhost", MASTER_PORT="10999", RANK="0", LOCAL_RANK="0", WORLD_SIZE="1"
)
for k, v in dist_env_1_gpu.items():
    os.environ[k] = v
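When the same process runs many tests, the emulated environment can also be scoped to a single run so it doesn't linger between invocations. A minimal sketch using the standard library's `unittest.mock.patch.dict`; `run_with_dist_env` is a hypothetical helper name, and this only tidies the environment, it is not a fix for the leak.

```python
import os
from unittest import mock

dist_env_1_gpu = dict(
    MASTER_ADDR="localhost", MASTER_PORT="10999", RANK="0", LOCAL_RANK="0", WORLD_SIZE="1"
)


def run_with_dist_env(fn, env=dist_env_1_gpu):
    """Call fn() with the emulated distributed env set, then restore os.environ."""
    with mock.patch.dict(os.environ, env):
        return fn()
```

For example, `run_with_dist_env(trainer.train)` would see the emulated variables only for the duration of that call.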

I'd be happy to try to investigate this on my own if you could help me with pointers at potential suspects.

Thank you!

@jeffra, @samyam
