
get_model_size_mb (LightningModule.model_size) shouldn't create temporary files in the current directory #10074

Closed
RuRo opened this issue Oct 21, 2021 · 6 comments · Fixed by #10123
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

RuRo (Contributor) commented Oct 21, 2021

Currently, the get_model_size_mb function is implemented like this:

import os
import uuid

import torch
from torch.nn import Module


def get_model_size_mb(model: Module) -> float:
    """Calculates the size of a Module in megabytes by saving the model to a temporary file and reading its size.

    The computation includes everything in the :meth:`~torch.nn.Module.state_dict`,
    i.e., by default the parameters and buffers.

    Returns:
        Number of megabytes in the parameters of the input module.
    """
    # TODO: Implement a method without needing to download the model
    tmp_name = f"{uuid.uuid4().hex}.pt"
    torch.save(model.state_dict(), tmp_name)
    size_mb = os.path.getsize(tmp_name) / 1e6
    os.remove(tmp_name)
    return size_mb

This writes the model to disk and also introduces a race condition where the temporary file may be left behind if the user presses Ctrl+C while the torch.save and os.path.getsize calls are running. I originally found this bug while investigating what was creating the mysterious cb0e0c0d62ee4c20986c67462c1105d3.pt files in my project directory.

You can achieve the same result without touching the disk:

import io

target = io.BytesIO()
torch.save(model.state_dict(), target)
size_mb = target.getbuffer().nbytes / 1e6
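
For context, here is a minimal sketch of what the full function could look like with this change; it keeps the existing signature, and the docstring wording is only illustrative:

import io

import torch
from torch.nn import Module


def get_model_size_mb(model: Module) -> float:
    """Calculates the size of a Module in megabytes by serializing the state_dict to an in-memory buffer.

    Returns:
        Number of megabytes in the parameters of the input module.
    """
    # Serialize the state_dict into an in-memory buffer instead of a temporary file on disk.
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    # nbytes of the underlying buffer is the total serialized size in bytes.
    return buffer.getbuffer().nbytes / 1e6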
RuRo added the bug and help wanted labels on Oct 21, 2021
awaelchli (Contributor) commented
@RuRo Thanks, this seems totally reasonable to me. Would you be interested in contributing this change?

awaelchli (Contributor) commented Oct 22, 2021

@calebrob6 came up with a similar idea: #8343 (comment)

In a code comment there, he notes that the disadvantage is the memory cost.

If this is a concern, I would suggest adding a bool argument to toggle the behavior. That's suboptimal, since Lightning would have to choose a default, but it may be worth considering.
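
For illustration only, such a toggle might look roughly like the sketch below. The in_memory parameter name and its default are hypothetical, not an agreed-upon API:

import io
import os
import tempfile

import torch
from torch.nn import Module


def get_model_size_mb(model: Module, in_memory: bool = True) -> float:
    """Sketch of a size calculation with a toggle between in-memory and on-disk serialization."""
    if in_memory:
        # Serialize into RAM: leaves no files behind, at the cost of extra memory.
        buffer = io.BytesIO()
        torch.save(model.state_dict(), buffer)
        return buffer.getbuffer().nbytes / 1e6
    # Serialize to a real temporary file that is cleaned up automatically.
    with tempfile.NamedTemporaryFile(suffix=".pt") as tmp:
        torch.save(model.state_dict(), tmp)
        tmp.flush()
        return os.path.getsize(tmp.name) / 1e6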

RuRo (Contributor, Author) commented Oct 22, 2021

I am "interested in contributing" in the sense that you are free to use the code I provided. Unfortunately, I don't currently have the time to create a proper PR with tests and stuff. Maybe, I'll have some time next month, but given the size of the proposed fix, I don't think that waiting until next month makes sense.


Regarding the memory cost, I'll admit I hadn't thought about that. Here's a slightly more clever version that doesn't store the whole model in memory:

class ByteCounter:
    """A write-only file-like object that only counts the bytes written to it."""

    def __init__(self):
        self.nbytes = 0

    def write(self, data):
        self.nbytes += len(data)

    def flush(self):
        pass

target = ByteCounter()
torch.save(model.state_dict(), target)
size_mb = target.nbytes / 1e6

This still allocates some memory, but the peak memory consumption should be much lower, since torch.save writes the model in chunks. For example, calling torch.save on a resnet50 model produces a total of ~97 MB split over 1614 calls to write, with a peak usage of ~9 MB.

Also, this version should have about the same memory footprint as the "write the file to disk" version.
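
One rough way to inspect the write pattern yourself (the exact numbers will vary with the torch and model versions) is to extend the ByteCounter above to also record the number of calls and the largest single chunk, which approximates the per-chunk peak:

class InstrumentedByteCounter(ByteCounter):
    """Also tracks how many times write() was called and the largest single chunk."""

    def __init__(self):
        super().__init__()
        self.ncalls = 0
        self.max_chunk = 0

    def write(self, data):
        super().write(data)
        self.ncalls += 1
        self.max_chunk = max(self.max_chunk, len(data))

counter = InstrumentedByteCounter()
torch.save(model.state_dict(), counter)
print(counter.nbytes / 1e6, counter.ncalls, counter.max_chunk / 1e6)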

calebrob6 (Contributor) commented
That's really cool!

tchaton (Contributor) commented Oct 22, 2021

Dear @RuRo, would you mind contributing ByteCounter? I believe we could also have a version for sharded models, where a summation is performed at the end.
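
As an illustrative sketch only (not an agreed design), the sharded variant could sum the per-rank byte counts, assuming torch.distributed is initialized with a backend that supports CPU tensors (e.g. gloo) and that each rank only holds its own shard; the function name here is hypothetical:

import torch
import torch.distributed as dist


def get_sharded_model_size_mb(model: torch.nn.Module) -> float:
    # Each rank serializes only its local shard and counts the bytes...
    counter = ByteCounter()
    torch.save(model.state_dict(), counter)
    local_size = torch.tensor(float(counter.nbytes))
    # ...then the per-rank byte counts are summed across all processes.
    dist.all_reduce(local_size, op=dist.ReduceOp.SUM)
    return local_size.item() / 1e6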

calebrob6 (Contributor) commented
@tchaton in case you didn't see, RuRo said, "I am "interested in contributing" in the sense that you are free to use the code I provided. Unfortunately, I don't currently have the time to create a proper PR with tests and stuff."
