
Improve models dump #3321

Closed

dariocurr opened this issue Apr 10, 2022 · 8 comments

Comments

@dariocurr

Problem description

I think you could improve the disk footprint of the save and load mechanism.

Steps/code/corpus to reproduce

In order to save space you could do something like the following, avoiding saving the whole random_state and keeping just its seed:

model.random_seed = model.random.get_state()[1][0]  # keep only the seed
model.random = np.nan  # drop the full RandomState before saving

and then when loading:

model.random = np.random.RandomState(seed=model.random_seed)
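
Put together as a self-contained sketch (with a hypothetical stand-in object, since the exact model attribute varies across gensim model classes):

import numpy as np

class Model:  # hypothetical stand-in for a gensim model
    pass

model = Model()
model.random = np.random.RandomState(seed=7)

# Before saving: keep only the seed, drop the heavy RandomState.
model.random_seed = model.random.get_state()[1][0]
model.random = np.nan

# After loading: rebuild the RandomState from the saved seed.
model.random = np.random.RandomState(seed=model.random_seed)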

Versions


Linux-5.13.0-39-generic-x86_64-with-glibc2.31
Python 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:22:55) 
[GCC 10.3.0]
Bits 64
NumPy 1.22.3
SciPy 1.8.0
gensim 4.1.2
FAST_VERSION 0
@piskvorky
Owner

Hi, can you expand the motivation section? What would be the advantage of your suggestion? What's the problem?

@gojomo
Collaborator

gojomo commented Apr 10, 2022

The savings would have to be significant to justify adding extra lines of pre-save/post-load special handling.

For the given example, replacing a RandomState with just a compact seed, it looks like the savings might be 2-3KB:

import pickle
import numpy as np

len(pickle.dumps(np.random.RandomState()))  # 2788 in my Python 3.9.5 test

But the models are often hundreds of megabytes or even gigabytes in size, making this savings negligible.

But the cost is a few lines of extra code – and, as it turns out, a slight behavior change. In my quick test, a RandomState re-created from just the saved seed will not give the same stream of future random numbers as a pickled-then-restored RandomState. That risks breaking user expectations.

There might be other optimizations possible in model saving, but this example offers trivial gains for non-zero risk.

More valuable work would be looking into whether advances in the pickle protocol, or some other serialization not custom to Gensim, could replace Gensim's custom steps entirely – a possibility raised by #2848.
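
For a rough sense of what protocol choice alone can change, a minimal sketch (my own illustration, not gensim's actual save path) that sizes a pickled RandomState at each available pickle protocol:

import pickle
import numpy as np

# Compare the pickled size of a RandomState across pickle protocols.
rs = np.random.RandomState()
for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
    print(f"protocol {proto}: {len(pickle.dumps(rs, protocol=proto))} bytes")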

@piskvorky
Owner

piskvorky commented Apr 11, 2022

Yeah, sounds like a non-issue – an expanded motivation would be great @dariocurr. What prompted this ticket?

@dariocurr
Author

dariocurr commented Apr 11, 2022

We want to reduce the disk footprint as much as possible when storing a model, so in our code we added the few lines above.

The problem is the disk footprint.

I agree that it's a very small improvement, but it's something, and this is just one case I found digging through your code. I don't know whether other, more compelling improvements around random generation could be made.

About the following:

But the cost is a few lines of extra code – and, as it turns out, a slight behavior change. In my quick test, a RandomState re-created from just the saved seed will not give the same stream of future random numbers as a pickled-then-restored RandomState. That risks breaking user expectations.

In my simple tests, everything works just fine:

import numpy as np

r1 = np.random.RandomState(seed=42)
r1_array = r1.get_state()[1]  # the MT19937 key array; element 0 holds the scalar seed
r2 = np.random.RandomState(r1_array[0])  # re-seed from that element
r2_array = r2.get_state()[1]
assert np.array_equal(r1_array, r2_array)

Moreover, numpy's documentation seems to confirm that the same seed will produce the same output.

@piskvorky
Owner

piskvorky commented Apr 11, 2022

I understand the motivation to lower the disk footprint. But 2 KB (when typical models are >> 100 MB) is not worth it, sorry.

If you find a way to save a substantial amount of space, please open a new ticket, we can discuss. Closing here.

@dariocurr
Author

Got it. Thanks for the attention.

@gojomo
Collaborator

gojomo commented Apr 11, 2022

In my simple tests, everything works just fine:

import numpy as np

r1 = np.random.RandomState(seed=42)
r1_array = r1.get_state()[1]  # the MT19937 key array; element 0 holds the scalar seed
r2 = np.random.RandomState(r1_array[0])  # re-seed from that element
r2_array = r2.get_state()[1]
assert np.array_equal(r1_array, r2_array)

Moreover, numpy's documentation seems to confirm that the same seed will produce the same output.

If you try an actual save/load versus re-seeding from a running RandomState instead...

import pickle
import numpy as np

rs0 = np.random.RandomState(seed=42)
rs0.randint(2**16)  # advance the state, as a used model would
rs1 = pickle.loads(pickle.dumps(rs0))  # simulate a full save/reload
rs2 = np.random.RandomState(rs0.get_state()[1][0])  # the shortcut re-seed
print(rs0.randint(2**16), rs1.randint(2**16), rs2.randint(2**16))

...you'll see the problem. A real save/reload continues the PRNG stream exactly where it left off; your re-seeding discards the advanced state and thus diverges.

You can probably figure out a v.2 – some effective way to actually save and restore the full state that's still somewhat smaller than the default serialization.
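
A minimal sketch of one possible v.2 (an illustration only, not tested against any real gensim model): pickle the full get_state() tuple rather than the RandomState object itself, and restore it with set_state(), which continues the exact stream:

import pickle
import numpy as np

rs = np.random.RandomState(seed=42)
rs.randint(2**16)  # advance the state, as a used model would

# Save just the state tuple: ('MT19937', keys, pos, has_gauss, cached_gaussian).
blob = pickle.dumps(rs.get_state())

# Restore into a fresh RandomState; the stream continues where it left off.
rs2 = np.random.RandomState()
rs2.set_state(pickle.loads(blob))
assert rs.randint(2**16) == rs2.randint(2**16)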

But the unexpected problem with your v.1 is a nice example of how even the simplest change can introduce bugs & behavioral changes – so it's best to keep things simple absent compelling benefits.

@dariocurr
Author

dariocurr commented Apr 11, 2022

Wow, that's something I didn't expect. Thank you for your time, you just saved me some hours of debugging.
BTW it's really weird.

But the unexpected problem with your v.1 is a nice example of how even the simplest change can introduce bugs & behavioral changes – so it's best to keep things simple absent compelling benefits.

I see your point.
Have a nice day.
