Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default to pickle protocol 4 when saving models #3065

Merged
merged 4 commits into from
Mar 9, 2021
Merged

Conversation

piskvorky
Copy link
Owner

@piskvorky piskvorky commented Mar 7, 2021

Gensim has saved models with protocol=2 for years. This protocol has a limitation on the maximum object size, and is not as efficient as later protocols.

This PR changes the default protocol to 4. It's available from py3.4+; Gensim currently supports 3.5+, so that's fine.

Gensim release 4.0 seems the right place for this. Old environments (python < 3.4) won't be able to load models created with Gensim 4.0+, but because we made so many changes in 4.0, this is the place to cut the support anyway.

Fixes #1851.

@piskvorky piskvorky added this to the 4.0.0 milestone Mar 7, 2021
@piskvorky piskvorky requested review from gojomo and mpenkov March 7, 2021 20:34
@gojomo
Copy link
Collaborator

gojomo commented Mar 8, 2021

Simple enough, but per some ideas floated in #2848...

  • could consider jumping to version 5 (& use pickle5 backport when in Python <3.8)
  • could declare Gensim's preferred version in a central named variable that's then reused - other than just a default-parameter literal in a variety of places
  • there seem to be many refs to pickle or import pickle in other files - should all be checked to ensure using now preferred-version. (in particular: the utility pickle() function also declared in utils.py around line 1400 should also use now-preferred version.)
  • if saving-separately is no longer oft-needed to work around pickle limitations, it could be default-off unless people need separate files for other reasons (like memory-mapping). this is a bigger change in expected-behavior, that would need a prominent release-note bullet, but I've seen enough people lose/break models when they don't realize they're spread among multiple files, to think that the proper default is one-file, and only experts who need/explicitly-specify separate numpy files should get them.

@piskvorky
Copy link
Owner Author

piskvorky commented Mar 8, 2021

Yes, perhaps it's time to revisit why we needed save/load in the first place. Why not just pickle?

  1. To work around pickle size limitations. No longer relevant.
  2. To allow memory-mapping large arrays, saving RAM. May be resolved with protocol=5? I haven't really checked.
  3. To apply class-specific pre-save and post-load operations (e.g. upgrade of model attributes on load). Could be bypassed by having a dedicated upgrade script, as per Adopting a (narrow) backward-compatibility standard; implications for 4.0.0 #2967 (comment).
  4. Persisting models to non-local storages (e.g. S3). Can be done by pickle.dump(file-like object opened by smart_open).

So, all reasons for SaveLoad's existence may be gone now. Python has finally caught up.

We have to keep SaveLoad anyway for compatibility of course, but it might become a near no-op.

I'll look into protocol=5 and other places it's used, good point.

@piskvorky
Copy link
Owner Author

piskvorky commented Mar 8, 2021

Relying on an external library like pickle5 seems a hassle. Unless we actually take advantage of protocol=5, I'd prefer to keep the default at protocol=4.

@gojomo
Copy link
Collaborator

gojomo commented Mar 8, 2021

My understanding is that pickle v5 has the hooks needed for the sort of custom-separate-serialization of things like numpy arrays, but it's not yet automatic/standard. So it's not a simple matter of "use v5 and we're done". But, if we thought extending it soon in that manner, perhaps after seeing what other recent v5 users have done with numpy arrays, was likely, then doing a one-hop to v5 now, rather than hop-to-v4 now then hop-to-v5 soon after, might be a way to minimize interim states and catch any potential gotchas sooner rather than later. (AFAICT, 'pickle5' is an official backport by the same Python core contributor who wrote the v5 PEP and Python3.8+ implementation, so it should have fewer risks than relying on other arbitrary external libs.)

Even without 'upgrade-scripts', old-object-cleanup hooks that are analogous to SaveLoad 'specials'/etc seem possible via pickle - and if we relied on those instead, we might stay closer to other community Python practices. If we think that's the eventual future direction, we needn't break anything that works and is familiar right away - but we would want to discourage further growth/complexity in SaveLoad - so something like the recent lifecycle logging could move to a different superclass/mix-in.

@piskvorky piskvorky self-assigned this Mar 8, 2021
@piskvorky
Copy link
Owner Author

piskvorky commented Mar 8, 2021

Reasonable points. But this will need someone to code it up – at least as a proof of concept. It's not trivial work.

I fear I won't have the bandwidth for this, certainly not in time for 4.0.0. And frankly, if you do have some extra bandwidth for Gensim now @gojomo, I'd rather you revisited the blocking tickets for 4.0.0 Milestone.

@mpenkov mpenkov changed the title [MRG] Save with pickle protocol 4 by default Default to pickle protocol 4 when saving models Mar 9, 2021
@mpenkov mpenkov merged commit 9eb52d1 into develop Mar 9, 2021
@mpenkov mpenkov deleted the pickle_protocol branch March 9, 2021 00:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support streaming models split into multiple files from S3 / GCS
3 participants