`get_model_size_mb` (`LightningModule.model_size`) shouldn't create temporary files in the current directory #10074
Comments
@RuRo Thanks, this seems totally reasonable to me. Would you be interested in contributing this change?
@calebrob6 came up with a similar idea: #8343 (comment). He remarks there in a code comment that the disadvantage is the cost of memory. If this is a concern, I could suggest adding a bool argument to toggle the behavior, as in the sketch below. It's suboptimal, as Lightning would have to choose a default, but maybe worth considering.
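A hypothetical shape for such a toggle (the `in_memory` flag and its default are illustrative only, not part of Lightning's API):

```python
import io
import os
import uuid

import torch


def get_model_size_mb(model, in_memory: bool = True):
    # "in_memory" is a hypothetical flag, not an actual Lightning argument.
    # True avoids disk I/O at the cost of buffering the serialized weights in RAM.
    if in_memory:
        buffer = io.BytesIO()
        torch.save(model.state_dict(), buffer)
        return buffer.getbuffer().nbytes / 1e6
    tmp_name = f"{uuid.uuid4().hex}.pt"
    torch.save(model.state_dict(), tmp_name)
    try:
        return os.path.getsize(tmp_name) / 1e6
    finally:
        os.remove(tmp_name)
```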
I am "interested in contributing" in the sense that you are free to use the code I provided. Unfortunately, I don't currently have the time to create a proper PR with tests and stuff. Maybe, I'll have some time next month, but given the size of the proposed fix, I don't think that waiting until next month makes sense. Regarding the memory cost, I'll admit, I haven't thought about that. Here's a slightly more clever version which doesn't store the whole model in memory: class ByteCounter:
def __init__(self):
self.nbytes = 0
def write(self, data):
self.nbytes += len(data)
def flush(self):
pass
target = ByteCounter()
torch.save(model.state_dict(), target)
size_mb = target.nbytes / 1e6
return size_mb This still allocates some memory, but the peak memory consumption should be much lower since Also, this version should have about the same memory footprint as the "write the file to disk" version. |
That's really cool!
Dear @RuRo, would you mind contributing this fix?
@tchaton in case you didn't see, RuRo said, "I am 'interested in contributing' in the sense that you are free to use the code I provided. Unfortunately, I don't currently have the time to create a proper PR with tests and stuff."
Currently, the `get_model_size_mb` function is implemented like this:
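The snippet below is a sketch of that implementation, reconstructed from the behaviour described in this report (a `uuid`-named temporary `.pt` file written to the current directory); the exact Lightning code may differ slightly:

```python
import os
import uuid

import torch


def get_model_size_mb(model):
    # Serialize the state dict to a temporary file in the current working
    # directory, read the file size, then delete the file.
    tmp_name = f"{uuid.uuid4().hex}.pt"
    torch.save(model.state_dict(), tmp_name)
    size_mb = os.path.getsize(tmp_name) / 1e6
    os.remove(tmp_name)
    return size_mb
```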
This writes the model to disk and also introduces a race condition where the temporary file may be left behind if the user "Ctrl+C"s the program while the `torch.save` and `os.path.getsize` calls are running. I originally found this bug by investigating what was creating the mysterious `cb0e0c0d62ee4c20986c67462c1105d3.pt` files in my project directory.

You can achieve the same result without touching the disk by doing:
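A minimal sketch of such an in-memory approach, assuming `io.BytesIO` as the buffer:

```python
import io

import torch


def get_model_size_mb(model):
    # Serialize into an in-memory buffer; nothing is written to disk.
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6
```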