migrate from dill to cloudpickle for advanced serialization #7870
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Is it worth making the pickle module a config setting, or just always using cloudpickle instead?
so I'd generally advocate for allowing fewer dependencies though, so it could be nice to make
either way it would probably also necessitate reworking some of the tests, since (for example) cloudpickle could conceivably serialize
Hi @jrwalk, can I work on this?
Assigned you. The issue is pretty old, and we also have ExternalPythonOperator using the same approach now. I think just focusing on tests for those might be enough.
#35529 -- maybe using cloudpickle could cure this problem?
If you would like to take a stab at it, feel free to verify that hypothesis @Felix-neko. PRs are always most welcome.
Heh, it looks like you're overestimating my power (and my current knowledge of cloudpickle and Airflow).
It might help. cloudpickle also has limitations:
That won't be a big problem. Especially if simple
@sumeshpremraj
With #39270 completed, argument
Description
Usage of dill for optional serialization in PythonVirtualenvOperator may be replaced with cloudpickle as its serialization library. This should be a mostly drop-in replacement.

Use case / motivation
Currently, the PythonVirtualenvOperator optionally uses dill in place of stock pickle to serialize advanced types. However, most major distributed compute frameworks have opted to shift to cloudpickle, meaning using dill for Airflow can introduce redundant dependencies for calling out to other distributed compute (e.g., farming compute-heavy tasks out to a remote dask cluster), and can interfere with serialization of tasks for those tools.

Since both dill and cloudpickle are largely drop-in replacements for pickle, the migration should be fairly minor.

Related Issues
kubeflow/pipelines#1387
dask/distributed#3606
piskvorky/gensim#558 (comment)
uqfoundation/multiprocess#22 (comment)
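
To illustrate what "drop-in replacement" means in the description, here is a minimal sketch. It relies only on the fact that pickle, dill, and cloudpickle all expose dumps and loads. The names SUPPORTED_SERIALIZERS, load_serializer, round_trip, and can_round_trip_lambda are hypothetical helpers made up for this example and are not part of Airflow. Falling back to stock pickle when the requested library is missing is also an assumption of the sketch, not a proposal for how the operator should behave.

```python
import importlib
import pickle

# Hypothetical allow-list for this sketch; not an Airflow config option.
SUPPORTED_SERIALIZERS = ("pickle", "cloudpickle", "dill")


def load_serializer(name="pickle"):
    """Import the serializer module called `name`.

    Falls back to stock pickle when the optional library is not installed
    (an assumption of this sketch).
    """
    if name not in SUPPORTED_SERIALIZERS:
        raise ValueError(f"unsupported serializer: {name!r}")
    try:
        return importlib.import_module(name)
    except ImportError:
        return pickle


def round_trip(obj, serializer_name="pickle"):
    """Serialize and deserialize `obj` with the chosen module."""
    serializer = load_serializer(serializer_name)
    # All three modules share the same dumps/loads interface.
    return serializer.loads(serializer.dumps(obj))


def can_round_trip_lambda(serializer_name):
    """Return True if a locally defined lambda survives a round trip."""
    serializer = load_serializer(serializer_name)
    try:
        func = serializer.loads(serializer.dumps(lambda x: x + 1))
    except (pickle.PicklingError, AttributeError):
        # Stock pickle serializes functions by reference and cannot
        # look up a lambda defined inside another function.
        return False
    return func(1) == 2


if __name__ == "__main__":
    for name in SUPPORTED_SERIALIZERS:
        print(name, can_round_trip_lambda(name))
```

Because the call sites only touch dumps and loads, swapping one library for another is a one-line change in this sketch. The differences show up in which objects each library can handle, which is why the tests would likely need rework.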