Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace dill package to use cloudpickle #38531

Merged
merged 7 commits into from
Apr 18, 2024
Merged

Conversation

VladaZakharova
Copy link
Contributor

Related issue: #35307


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:core-operators Operators, Sensors and hooks within Core Airflow area:providers area:system-tests provider:cncf-kubernetes Kubernetes provider related issues provider:docker provider:google Google (including GCP) related issues labels Mar 27, 2024
@VladaZakharova
Copy link
Contributor Author

Hi @potiuk @eladkal !
I have found pretty old issue related to upgrade of dill package. As I understood, replacement of dill with cloudpickle here makes more sense since we will not break beam dependencies in this case.
Can you please take a look on changes?
Thanks!

@potiuk
Copy link
Member

potiuk commented Mar 27, 2024

This is cool :). But:

Certainy we will need to keep backwards-compatibility option (and I do not think dill should be replaced really - they should be two equivalent options).

I think this change needs a bit more:

  1. separate use_cloud_pickle from use_dill (and failure when both are true)
  2. turning dill into an optional extra (Similar to virtualenv in hatch_build.py
  3. similarly adding cloudpickle into an optional extra rather than mandatory requirement
  4. Everywhere where dill and cloudpickle are used we need to raise an exception (on ImportError) explaining that this is an optional feature and how to enable it (by using [dill], [cloudpickle] extras respectively or installing the dill/cloudpickle dependencies
  5. significant newsfragment explaining that dill is no longer automatically installed and that [dill] extra is needed to use dill
  6. Some more documentation explaining differences between the two (somewhere here : https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator)

This way we will:

  • stop depending on dill as mandatory requirement
  • will not start depending on cloudpickle as mandatory requirement
  • have an easy way for users to continue using dill (add [dill] extra - without them having to adapt touse cloudpickle that might introduce various incompatibilities

airflow/operators/python.py Outdated Show resolved Hide resolved
airflow/models/dagpickle.py Outdated Show resolved Hide resolved
airflow/models/taskinstance.py Outdated Show resolved Hide resolved
@VladaZakharova
Copy link
Contributor Author

Thank you @potiuk and @hussein-awala for your comments!
I have updated the code to follow all suggestions.
Right now thinking about this migration script. Right now both dill and cloudpickle will be optional, so the best idea here from my point of view will be to use just pickle, if none of the optional modules are provided. WDYT?

hatch_build.py Outdated Show resolved Hide resolved
@VladaZakharova VladaZakharova requested a review from potiuk April 8, 2024 08:25
@moiseenkov moiseenkov force-pushed the dill-upgrade branch 2 times, most recently from ae1f8be to 48ce74f Compare April 11, 2024 12:31
@potiuk
Copy link
Member

potiuk commented Apr 12, 2024

Unlike dill and cloudpickle, serde's serialize() method loses an object's type information similar to what json.dumps() does:

It shoudl not . @bolkedebruin -> i believe serde should be good for it and we should be able to do round-trip serialization of most types

@VladaZakharova
Copy link
Contributor Author

Hi @bolkedebruin @potiuk !
Friendly reminder on this PR, we are actually waiting for some response to start implementation, since as I understand there will be a lot of changes to make until we actually will be able to close this ticket :)
Maybe someone else can also be involved in this conversation here?

@potiuk
Copy link
Member

potiuk commented Apr 16, 2024

I think if you have no confirmation from @bolkedebruin on the proposed path, the way to go is to implement POC and see if it works with the current executor configs @VladaZakharova. There is no better way to confirm the approach.

@VladaZakharova
Copy link
Contributor Author

VladaZakharova commented Apr 16, 2024

@potiuk If we are talking about migration and changing the way we serialize, should we consider changing dill to use json? Are there some limitations here? By doing this, we can try to avoid the same problem in the future for other Python versions. WDYT?

Also regarding the original issue with incorrect serialization for Python 3.11, is it a problem only with serialization or deserialization too? If we can use dill for Python 3.11 only for deserialization, and cloudpickle for serialization, is there will be a problem?

@hussein-awala @Taragolis Can you give us some details here? Thanks!

@potiuk
Copy link
Member

potiuk commented Apr 16, 2024

Are there some limitations here?

As mentioned before - we need to be able to handle different serializers - to handle the K8S configuration problem described above in #38531 (comment). This is the reason we have dill in the first place. If serde will not solve the problem (seems not) - then the solution with storing pickler together with serialized value seems to be good direction - providing that migration scenarios will be part of the solution of course.

@potiuk
Copy link
Member

potiuk commented Apr 16, 2024

Also regarding the original issue with incorrect serialization for Python 3.11, is it a problem only with serialization or deserialization too? If we can use dill for Python 3.11 only for deserialization, and cloudpickle for serialization, is there will be a problem?

The Python 3.11 issue is really a "test" issue - I am not really sure if this has the effect in production.

BTW. I think there is a little misunderstanding here. If the only reason for this change is supporting cloudpickle in Python Virtualenv Operato and not getting rid of dill, then we can likely leave dill as mandatory dependency of Airflow and do not worry about the configuration for executor - and only get cloudpickle as optional dependency for Python virtualenv operator.

Maybe I was assuming too much of a reason for that change when I was commending on dill used in the executor. And we can leave that part altogether - concentrating back on just Python Virtualenv Operator.

@moiseenkov moiseenkov force-pushed the dill-upgrade branch 3 times, most recently from f7c1d8b to f22d1dd Compare April 18, 2024 07:49
@moiseenkov
Copy link
Contributor

moiseenkov commented Apr 18, 2024

@potiuk ,
Thank you for helping with this PR. We see that at this point modifying Airflow core has really high cost and the data migration brings additional risks for customers.
If we come back to the original issue, as you mentioned, the problem is failing a test for a particular operator. And apparently, it can be solved without core modification, because the PythonVirtualenvOperator's logic (virtual environment preparation) has nothing to do with Airflow models serialization. Therefore I updated this PR and left only changes in the operator level. So the Airflow core remains unchanged. It should resolve the initial problem. Please take a look.
And thank you once again!

@moiseenkov moiseenkov force-pushed the dill-upgrade branch 3 times, most recently from f102d7b to 54eee08 Compare April 18, 2024 11:53
airflow/operators/python.py Outdated Show resolved Hide resolved
hatch_build.py Outdated Show resolved Hide resolved
@potiuk
Copy link
Member

potiuk commented Apr 18, 2024

@potiuk ,
Thank you for helping with this PR. We see that at this point modifying Airflow core has really high cost and the data migration brings additional risks for customers.
If we come back to the original issue, as you mentioned, the problem is failing a test for a particular operator. And apparently, it can be solved without core modification, because the PythonVirtualenvOperator's logic (virtual environment preparation) has nothing to do with Airflow models serialization. Therefore I updated this PR and left only changes in the operator level. So the Airflow core remains unchanged. It should resolve the initial problem. Please take a look.
And thank you once again!

Yes, sorry for the confusion - I think indeed we should limit that change to only that - all the complexity that replacing core executor config might still remain. It might not help with some dependency issues (I.e. dill will still be a core depenency) - but it will give the users a way to handle their pickling for PVO / External Python Operator better.

@potiuk potiuk merged commit 7b654b4 into apache:main Apr 18, 2024
93 checks passed
@moiseenkov
Copy link
Contributor

@potiuk , hi
Thanks for reviewing and merging this. We have only one short question here if you don't mind.
Because these changes are not contained by any provider package, does it mean that they will be released with the Airflow itself?

@potiuk
Copy link
Member

potiuk commented Apr 19, 2024

yes. Python Operator and related are part of the Airflow core.

@moiseenkov
Copy link
Contributor

Thank you!

@Taragolis Taragolis added this to the Airflow 2.10.0 milestone Apr 24, 2024
@utkarsharma2 utkarsharma2 added the type:misc/internal Changelog: Misc changes that should appear in change log label Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core-operators Operators, Sensors and hooks within Core Airflow area:providers area:system-tests provider:cncf-kubernetes Kubernetes provider related issues provider:docker provider:google Google (including GCP) related issues type:misc/internal Changelog: Misc changes that should appear in change log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants