Default to pickle protocol 4 when saving models #3065
Simple enough, but per some ideas floated in #2848...
Yes, perhaps it's time to revisit why we needed
So, all reasons for We have to keep I'll look into protocol=5 and other places it's used, good point. |
Relying on an external library like pickle5 seems a hassle. Unless we actually take advantage of |
My understanding is that pickle v5 has the hooks needed for the sort of custom-separate-serialization of things like numpy arrays, but it's not yet automatic/standard. So it's not a simple matter of "use v5 and we're done". But, if we thought extending it soon in that manner, perhaps after seeing what other recent v5 users have done with numpy arrays, was likely, then doing a one-hop to v5 now, rather than hop-to-v4 now then hop-to-v5 soon after, might be a way to minimize interim states and catch any potential gotchas sooner rather than later. (AFAICT, 'pickle5' is an official backport by the same Python core contributor who wrote the v5 PEP and Python3.8+ implementation, so it should have fewer risks than relying on other arbitrary external libs.) Even without 'upgrade-scripts', old-object-cleanup hooks that are analogous to |
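For illustration, the protocol 5 hooks mentioned above can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3.8+ and using a plain `bytearray` as a stand-in for a large numpy array; the names `big`, `payload`, and `buffers` are hypothetical, and nothing here is taken from Gensim's code.

```python
import pickle

# Stand-in for a large array; a real model would hold numpy data instead.
big = bytearray(b"x" * 1_000_000)

# Wrapping the data in PickleBuffer marks it as eligible for out-of-band transfer.
payload = {"name": "weights", "data": pickle.PickleBuffer(big)}

# buffer_callback receives each PickleBuffer; because list.append returns None
# (a false value), the buffer is kept out of the pickle stream itself.
buffers = []
stream = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)

# The caller decides how to store `buffers` (e.g. separate files); they must be
# supplied again, in the same order, when loading.
restored = pickle.loads(stream, buffers=buffers)
restored_bytes = bytes(memoryview(restored["data"]))
```

The stream stays small because the megabyte of data travels in `buffers`, which is the "custom separate serialization" part that still has to be written by hand.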
Reasonable points. But this will need someone to code it up, at least as a proof of concept. It's not trivial work. I fear I won't have the bandwidth for this, certainly not in time for 4.0.0. And frankly, if you do have some extra bandwidth for Gensim now @gojomo, I'd rather you revisited the blocking tickets for the 4.0.0 milestone.
Gensim has saved models with pickle protocol 2 for years. Protocol 2 limits the maximum size of the objects it can serialize, and it is less efficient than later protocols.
This PR changes the default protocol to 4. Protocol 4 is available from Python 3.4 onward; Gensim currently supports Python 3.5+, so that's fine.
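A minimal sketch of what a protocol default looks like at the `pickle` level, using only the standard library. The constant `PICKLE_PROTOCOL` and the helper `save_object` are hypothetical names for illustration, not necessarily what this PR touches.

```python
import io
import pickle

# Hypothetical module-level default, mirroring the change described above.
PICKLE_PROTOCOL = 4


def save_object(obj, handle, protocol=PICKLE_PROTOCOL):
    """Pickle `obj` into an open binary file-like `handle`."""
    pickle.dump(obj, handle, protocol=protocol)


model_like = {"vectors": list(range(5))}

# Default: protocol 4.
new_buf = io.BytesIO()
save_object(model_like, new_buf)
new_raw = new_buf.getvalue()

# Explicit override for readers stuck on an older protocol.
old_buf = io.BytesIO()
save_object(model_like, old_buf, protocol=2)
old_raw = old_buf.getvalue()

# Streams written with protocol 2 or later start with the PROTO opcode (0x80)
# followed by the protocol number, so the version can be checked directly.
new_version = new_raw[1]
old_version = old_raw[1]
```

Keeping the protocol as an overridable argument lets callers who still need to read models from an older interpreter opt back into protocol 2.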
Gensim 4.0 seems the right release for this. Old environments (Python < 3.4) won't be able to load models saved by Gensim 4.0+, but 4.0 already makes so many changes that it is the natural place to drop that support anyway.
Fixes #1851.