Discussion: cPickle #485
Comments
It is doable. You ship a precompiled binary for every OS/Python version pair that you want to support. There aren't that many combinations, just 20, plus a noarch build as a fallback. The reason I didn't bother thinking about it is that it would require some cleverness to build the C extension around the Python package in a way that reduces code duplication but still gives fast performance. This causes an issue for people who want to use pickle5 in older versions of Python. The pickle5 package has a C library clone of pickle as well, like you described, but it is not as trivial to get it working with a hypothetical C-dill compared to the current Python implementation. If you would like to pursue this, I would say make a clone of cPickle that has all of its functions exposed, and we can extend from it if the C extension is present. The other option is a complete rewrite of dill to use dispatch_table instead of dispatch like cloudpickle did, but considering that people expect the dill internal API to be relatively stable (huggingface/datasets#4379), this is not a good idea and rebuilding our own cPickle is the way to go. We could also optimize dill functions while leaving the Python code in place, using Cython to get just a little bit more. This only needs to be done for CPython. PyPy's pickle module is precompiled but runs within Python. @mmckerns Do you have stats on how many IronPython or Jython users dill has? I do not even know if there is support for them or if they are still maintained. |
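For readers unfamiliar with the distinction, here is a minimal sketch of the dispatch_table route using only the standard library. It is not dill or cloudpickle code: the `reduce_code` helper and `CodePickler` class are hypothetical names, and code objects are used only because the default pickler cannot handle them. The point is that a private dispatch_table is honored by the C-accelerated `pickle.Pickler`, whereas the `dispatch` dict exists only on the pure-Python class.

```python
import copyreg
import io
import marshal
import pickle
import types


def reduce_code(code):
    # Hypothetical reducer: rebuild a code object through marshal.
    # marshal.loads is a built-in function, so pickle can store it by name.
    return marshal.loads, (marshal.dumps(code),)


class CodePickler(pickle.Pickler):
    # A private dispatch_table, as described in the pickle documentation.
    # Starting from a copy keeps any globally registered copyreg reducers.
    dispatch_table = copyreg.dispatch_table.copy()
    dispatch_table[types.CodeType] = reduce_code


def dumps(obj):
    buf = io.BytesIO()
    CodePickler(buf).dump(obj)
    return buf.getvalue()


code = compile("1 + 2", "<sketch>", "eval")
restored = pickle.loads(dumps(code))
```

The marshal format is specific to a Python version, so this illustrates the hook, not a portable way to serialize code.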
As far as I know, they are both supported. The most detailed stats I track are here: https://pypistats.org/packages/dill, which doesn't give details on Jython/IronPython. (EDIT: I should also mention these stats, which track total downloads by release version of |
I just noticed that they have no stable Python 3 builds, so they are probably implicitly unsupported now that Python 2 is no longer supported. The only binaries that need to be built are for CPython, which makes this problem much easier. |
Just saw the cloudpickle source. I was wrong, the default saving for functions and classes can be changed with the |
Well, |
I remember looking at the cloudpickle code for |
Oh, you are mentioning the "deterministic" feature, right? Would this be an optional feature? Because since Python 3.7, the insertion order for dictionaries and sets is preserved, so it should also be preserved in pickling. If sorting the keys or items in dicts/sets is only used for standalone objects, it could be implemented as a special case in
This would be just for I'm considering that if it is possible to use standard |
@mmckerns, there seems to be a lot going on in dill's development plans. I, personally, was hooked by the challenges posed (I was in need of a new puzzle and love to learn more about Python 😉) and I'm committed to working on these if you feel my contributions are valid. I've sent some PRs, mostly drafts, after the session bugfixes, but there's no urgency about them. However, what do you think of opening a board in GitHub "Projects", a wiki page or even an issue with a to-do list to track all these parallel efforts? I'm beginning to get lost between so many issues and PRs (some open, others closed...). There is currently:
|
I've managed large software projects, and found that the only/best way to track issues/effort is to have a tool that is integrated with the tickets/PRs. So, a project view in GitHub could make sense. As a lightweight step, I've found that adding a milestone to the ticket/PR can make it easy to track what is intended for the next pending release. There is always a lot of development to be done -- just not always the time to do it. :) I'd really like to get |
Yes, it is an optional feature, but very popular packages like pulumi and huggingface datasets need it, so we can't just ignore it. Insertion order is only preserved for user-created dictionaries. Global variable dictionaries are still non-deterministic, so the feature is still needed. Insertion order preservation is not guaranteed for sets, never was, and isn't true in CPython. It would be cool and probably possible if we could just expose the
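To make the set case concrete, here is a minimal sketch of sorting set items before pickling. It is an illustration, not dill's implementation: `SortedSetPickler` is a hypothetical name, and it subclasses the private pure-Python `pickle._Pickler` because that class exposes the `dispatch` dict. It also assumes the elements are mutually orderable, which does not hold in general.

```python
import io
import pickle


class SortedSetPickler(pickle._Pickler):
    # Copy the class-level dispatch dict so the stock pickler is untouched.
    dispatch = pickle._Pickler.dispatch.copy()

    def save_set(self, obj):
        # Emit the set as set(sorted_items) so the byte stream does not
        # depend on hash-based iteration order.
        # Assumption: all elements can be compared with each other.
        self.save_reduce(set, (sorted(obj),), obj=obj)

    dispatch[set] = save_set


def dumps(obj):
    buf = io.BytesIO()
    SortedSetPickler(buf).dump(obj)
    return buf.getvalue()


payload = dumps({"gamma", "alpha", "beta"})
```

The result loads with the ordinary `pickle.loads`, since only the writing side is customized.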
persistent_id is used here: https://github.com/python/cpython/blob/3.7/Lib/pickle.py#L489-L492 Only one When optimizing remember the most important rule: safety first. Do not mix adding new code (adding |
Some thoughts about
|
A solution like that would be hairy and awkward. The hash idea avoids collisions in the sense that it is possible to know if the object belongs to dill or not, but it doesn't avoid collisions in the way that I described in torch where it would be impossible to distinguish if the persistent id belonged to dill or was a tensor that was just missing from the model and should result in an error, so they would need to add an if statement to their code to delegate to dill's persistent id if dill is installed and the id belongs to dill. And everyone using persistent id would have to do that. Although dill is a popular package, I don't think we can beat the entire Python community into using our custom conventions for pickling without scaring most people off. |
You can't do that. It is possible however, to provide something super useful and easy to integrate so that developers choose to use it to get super nice additional functionality. I'd add an if statement to my hashing code if it meant that I could get a 10x speed up on serialization... |
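The "if statement" being discussed can be sketched as follows. Everything here is hypothetical: the namespace tags, the `Tensor` and `Handle` stand-ins, and the `HostPickler`/`HostUnpickler` classes are made up, and none of it is an existing dill or torch convention. It only shows that a pickler has a single persistent_id hook, so a host library that already uses it would have to recognize and delegate ids belonging to another package.

```python
import io
import pickle

DILL_NS = "hypothetical-dill"  # made-up tag, not a real dill convention
TENSOR_NS = "tensor"           # stand-in for a host library's own ids


class Tensor:
    def __init__(self, key):
        self.key = key


class Handle:
    def __init__(self, name):
        self.name = name


class HostPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # One hook serves both parties, so each id carries a namespace.
        if isinstance(obj, Tensor):
            return (TENSOR_NS, obj.key)
        if isinstance(obj, Handle):
            return (DILL_NS, obj.name)
        return None  # pickle everything else normally


class HostUnpickler(pickle.Unpickler):
    def __init__(self, file, tensors, handles):
        super().__init__(file)
        self.tensors = tensors
        self.handles = handles

    def persistent_load(self, pid):
        namespace, key = pid
        if namespace == DILL_NS:  # the extra branch the host would need
            return self.handles[key]
        if namespace == TENSOR_NS and key in self.tensors:
            return self.tensors[key]
        raise pickle.UnpicklingError(f"unknown persistent id: {pid!r}")


def roundtrip(obj, tensors, handles):
    buf = io.BytesIO()
    HostPickler(buf).dump(obj)
    buf.seek(0)
    return HostUnpickler(buf, tensors, handles).load()
```

With the namespace tag, a missing tensor still raises an error instead of being mistaken for an id owned by the other package.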
Since I'm not a professional developer, but rather an academic who happens to develop library code occasionally, I'll guide myself 99% by your experience. I'm just throwing stuff at the wall here before dedicating any time to prototyping solutions. (By the way, the "portable mode" prototype is working! I'll create a draft PR by tomorrow.) Another wild idea I had: Is it possible to override the |
Why not use Of course, we should add a copy to dill in case CPython gets rid of it because it is private. It would be less efficient, but we could create an optimizer that smooths over it. In either case, I don't think there is an easy way to force |
Hey, I've been working on a prototype of a "portable pickle" feature and started wondering whether it could eventually work with cPickle (or _pickle), the accelerator module written in C. I've also seen some comments in the code referencing the possibility that dill in general could use it.
However, after doing some investigation I noted that, even if the save_<type> functions are updated to use state setters (Py 3.8+) and don't rely on internal Pickler methods like save and save_reduce, there remains a big problem: cPickle will always save types and functions as globals (there isn't an internal dispatch table that we can modify as with the Python implementation).
I thought of creating a C extension module to play with cPickle internals, but all the C functions are private. The only remaining alternative I see is to have a complete clone of the Pickler code with some minor tweaks to allow overriding these types' saving.
Have you also wondered about these issues? Do you think it is feasible to have a C extension compatible with multiple Python 3 versions? (I have no idea how building and distributing these works.)