Improve models dump #3321
Comments
Hi, can you expand the motivation section? What would be the advantage of your suggestion, and what's the problem?
The savings would have to be significant to justify adding extra lines of pre-save/post-load special handling. For the given example, replacing the full `random_state` with just its seed saves only about 2 KB. But the models are often hundreds of megabytes or gigabytes in size, making this a negligible saving – while the cost is a few lines of extra code and, as shown, a slight behavior change. In my quick test, the savings were only a couple of kilobytes. There might be other optimizations possible in model saving, but this example shows trivial gains for non-zero risks. More valuable work would be looking into whether advances in the underlying serialization could yield larger savings.
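For a sense of scale, the size in question can be checked directly. This is a rough sketch, not part of the thread; exact byte counts vary by NumPy and pickle version:

```python
import pickle

import numpy as np

# A RandomState's full Mersenne Twister state includes a 624-word
# key array (624 x 4 bytes), so pickling it costs roughly 2.5 KB.
rs = np.random.RandomState(seed=42)
size = len(pickle.dumps(rs))
print(size)
```

Next to a model that is hundreds of megabytes on disk, that is a rounding error.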
Yeah, sounds like a non-issue – expanded motivation would be great @dariocurr. What prompted this ticket?
We want to reduce the disk footprint as much as possible when storing a model, and we implemented the few lines above in our own code. The problem is the disk footprint. I agree that it's a very small improvement, but it's something, and this is just one case I found digging through your code. I don't know whether other, more compelling improvements around random-number generation could be made. About the following:
In my simple tests, everything works just fine:

```python
import numpy as np

r1 = np.random.RandomState(seed=42)
r1_array = r1.get_state()[1]

r2 = np.random.RandomState(r1_array[0])
r2_array = r2.get_state()[1]

assert np.array_equal(r1_array, r2_array)
```

Moreover,
I understand the motivation to lower the disk footprint. But 2 KB (when typical models are >> 100 MB) is not worth it, sorry. If you find a way to save a substantial amount of space, please open a new ticket and we can discuss. Closing here.
Got it. Thanks for your attention.
If you try an actual save-load vs. re-seed-from-a-running-RandomState instead...

```python
import pickle

import numpy as np

rs0 = np.random.RandomState(seed=42)
rs0.randint(2**16)  # advance the state, like a used model

rs1 = pickle.loads(pickle.dumps(rs0))  # simulate a full save/reload
rs2 = np.random.RandomState(rs0.get_state()[1][0])  # shortcut re-seed

print(rs0.randint(2**16), rs1.randint(2**16), rs2.randint(2**16))
```

...you'll see the problem. A real save/reload continues the PRNG series exactly as it was; your re-seeding misses something and thus diverges. You can probably figure out a v.2 – some effective way to actually save and restore the full state, still somewhat smaller than the default serialization. But the unexpected problem with your v.1 is a nice example of how even the simplest change might introduce bugs and behavior changes – so it's best to keep things simple unless there are compelling benefits.
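A sketch of what such a "v.2" might look like, capturing the complete generator state with `get_state()` and restoring it with `set_state()`, which does continue the series exactly (whether it ends up meaningfully smaller than the default serialization is not shown here):

```python
import numpy as np

rs0 = np.random.RandomState(seed=42)
rs0.randint(2**16)  # advance the state, as in a used model

# Capture the complete MT19937 state tuple:
# (name, 624-word key array, pos, has_gauss, cached_gaussian).
state = rs0.get_state()

# Restore it into a fresh RandomState. Unlike re-seeding from
# key[0] alone, this continues the PRNG series exactly.
rs_restored = np.random.RandomState()
rs_restored.set_state(state)

assert rs0.randint(2**16) == rs_restored.randint(2**16)
```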
Wow, that's something I didn't expect. Thank you for your time – you just saved me some hours of debugging.
I see your point.
Problem description
I think you could improve the disk footprint of the save and load mechanism.
Steps/code/corpus to reproduce
In order to save space, you could do something like the following: avoid saving the whole
`random_state`
and save just its seed, and then re-create the `random_state` when loading.
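A minimal sketch of the idea as described, with hypothetical variable names, using the first word of the generator's key array as the "seed":

```python
import numpy as np

# On save: keep only the first word of the MT19937 key array
# instead of the full state tuple (hypothetical sketch).
rs = np.random.RandomState(seed=42)
saved_seed = int(rs.get_state()[1][0])

# On load: re-create the RandomState from that single word.
restored = np.random.RandomState(saved_seed)

# For a freshly seeded generator the key arrays match, but as shown
# in the discussion above, this breaks once the generator has been
# used, since the position and the rest of the key are lost.
assert np.array_equal(rs.get_state()[1], restored.get_state()[1])
```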
Versions
Please provide the output of: