
Improve models dump #3321

Closed

dariocurr opened this issue Apr 10, 2022 · 8 comments

Comments

@dariocurr

Problem description

I think you could improve the disk footprint of the save and load mechanism.

Steps/code/corpus to reproduce

In order to save space you could do something like the following, avoiding saving the whole random_state and keeping just its seed:

model.random_seed = model.random.get_state()[1][0]  # keep only the seed
model.random = np.nan  # drop the full RandomState before saving

and then when loading:

model.random = np.random.RandomState(seed=model.random_seed)
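
Put together as a self-contained sketch (with a hypothetical stand-in object, since the exact model attribute varies across gensim model classes):

import numpy as np

class Model:  # hypothetical stand-in for a gensim model
    pass

model = Model()
model.random = np.random.RandomState(seed=7)

# Before saving: keep only the seed, drop the heavy RandomState.
model.random_seed = model.random.get_state()[1][0]
model.random = np.nan

# After loading: rebuild the RandomState from the saved seed.
model.random = np.random.RandomState(seed=model.random_seed)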

Versions


Linux-5.13.0-39-generic-x86_64-with-glibc2.31
Python 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:22:55) 
[GCC 10.3.0]
Bits 64
NumPy 1.22.3
SciPy 1.8.0
gensim 4.1.2
FAST_VERSION 0
@piskvorky
Owner

Hi, can you expand the motivation section? What would be the advantage of your suggestion? What's the problem?

@gojomo
Collaborator

gojomo commented Apr 10, 2022

The savings would have to be significant to justify adding extra lines of pre-save/post-load special handling.

For the given example, replacing a RandomState with just a compact seed, it looks like the savings might be 2-3KB:

import pickle
import numpy as np

len(pickle.dumps(np.random.RandomState()))  # 2788 in my Python 3.9.5 test

But the models are often hundreds of megabytes or even gigabytes in size, making this savings negligible.

But the cost is a few lines of extra code – and, as it turns out, a slight behavior change. In my quick test, a RandomState re-created from just the saved seed will not give the same stream of future random numbers as a pickled-then-restored RandomState. That risks breaking user expectations.

There might be other optimizations possible in model saving, but this example offers trivial gains for non-zero risk.

More valuable work would be looking into whether advances in the pickle protocol, or some other serialization not custom to Gensim, could replace Gensim's custom steps entirely – a possibility raised by #2848.
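
For a rough sense of what protocol choice alone can change, a minimal sketch (my own illustration, not gensim's actual save path) that sizes a pickled RandomState at each available pickle protocol:

import pickle
import numpy as np

# Compare the pickled size of a RandomState across pickle protocols.
rs = np.random.RandomState()
for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
    print(f"protocol {proto}: {len(pickle.dumps(rs, protocol=proto))} bytes")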

@piskvorky
Owner

piskvorky commented Apr 11, 2022

Yeah, sounds like a non-issue – an expanded motivation would be great @dariocurr. What prompted this ticket?

@dariocurr
Author

dariocurr commented Apr 11, 2022

We want to reduce the disk footprint as much as possible when storing a model, so in our code we added the few lines above.

The problem is the disk footprint.

I agree that it's a very small improvement, but it's something, and this is just one case I found digging through your code. I don't know whether other, more compelling improvements around random generation could be made.

About the following:

But the cost is a few lines of extra code – and, as it turns out, a slight behavior change. In my quick test, a RandomState re-created from just the saved seed will not give the same stream of future random numbers as a pickled-then-restored RandomState. That risks breaking user expectations.

In my simple tests, everything works just fine:

import numpy as np

r1 = np.random.RandomState(seed=42)
r1_array = r1.get_state()[1]  # the MT19937 key array; element 0 holds the scalar seed
r2 = np.random.RandomState(r1_array[0])  # re-seed from that element
r2_array = r2.get_state()[1]
assert np.array_equal(r1_array, r2_array)

Moreover, numpy's documentation seems to confirm that the same seed will produce the same output.

@piskvorky
Owner

piskvorky commented Apr 11, 2022

I understand the motivation to lower the disk footprint. But 2 KB (when typical models are >> 100 MB) is not worth it, sorry.

If you find a way to save a substantial amount of space, please open a new ticket, we can discuss. Closing here.

@dariocurr
Author

Got it. Thanks for the attention.

@gojomo
Collaborator

gojomo commented Apr 11, 2022

In my simple tests, everything works just fine:

import numpy as np

r1 = np.random.RandomState(seed=42)
r1_array = r1.get_state()[1]  # the MT19937 key array; element 0 holds the scalar seed
r2 = np.random.RandomState(r1_array[0])  # re-seed from that element
r2_array = r2.get_state()[1]
assert np.array_equal(r1_array, r2_array)

Moreover, numpy's documentation seems to confirm that the same seed will produce the same output.

If you try an actual save/load versus re-seeding from a running RandomState instead...

import pickle
import numpy as np

rs0 = np.random.RandomState(seed=42)
rs0.randint(2**16)  # advance the state, as a used model would
rs1 = pickle.loads(pickle.dumps(rs0))  # simulate a full save/reload
rs2 = np.random.RandomState(rs0.get_state()[1][0])  # the shortcut re-seed
print(rs0.randint(2**16), rs1.randint(2**16), rs2.randint(2**16))

...you'll see the problem. A real save/reload continues the PRNG stream exactly where it left off; your re-seeding discards the advanced state and thus diverges.

You can probably figure out a v.2 – some effective way to actually save and restore the full state that's still somewhat smaller than the default serialization.
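
A minimal sketch of one possible v.2 (an illustration only, not tested against any real gensim model): pickle the full get_state() tuple rather than the RandomState object itself, and restore it with set_state(), which continues the exact stream:

import pickle
import numpy as np

rs = np.random.RandomState(seed=42)
rs.randint(2**16)  # advance the state, as a used model would

# Save just the state tuple: ('MT19937', keys, pos, has_gauss, cached_gaussian).
blob = pickle.dumps(rs.get_state())

# Restore into a fresh RandomState; the stream continues where it left off.
rs2 = np.random.RandomState()
rs2.set_state(pickle.loads(blob))
assert rs.randint(2**16) == rs2.randint(2**16)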

But the unexpected problem with your v.1 is a nice example of how even the simplest change can introduce bugs & behavioral changes – so it's best to keep things simple absent compelling benefits.

@dariocurr
Author

dariocurr commented Apr 11, 2022

Wow, that's something I didn't expect. Thank you for your time, you just saved me some hours of debugging.
BTW it's really weird.

But the unexpected problem with your v.1 is a nice example of how even the simplest change can introduce bugs & behavioral changes – so it's best to keep things simple absent compelling benefits.

I see your point.
Have a nice day.
