Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make pandas dependency optional for Amazon Provider #28505

Merged

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Dec 20, 2022

Pandas dependency is only used for a few features in Amazon providers amd it is a huge "drag" on a package in general - we use the existing Airflow 2.3+ feature of Optional Provider Feature to make pandas an optional dependency (with "[pandas]" extra).

This is not a breaking change, since those who already installed previous provider version have pandas installed already. Only people who install provider manually will have to install "pandas" extra either for Airflow core package or for the provider.

Fixes: #28468


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk potiuk force-pushed the make-pandas-optional-for-amazon-provider branch 2 times, most recently from e83c95c to b4017e1 Compare December 20, 2022 20:58
Copy link
Contributor

@Taragolis Taragolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ 🚀

@eladkal
Copy link
Contributor

eladkal commented Dec 20, 2022

Only people who install provider manually will have to install "pandas" extra either for Airflow core package or for the provider.

This means it is a breaking change.
Yes, on upgrade everything looks fine because as you said pandas lib is already there but if users are using install scripts that install airflow and providers seperatly then once workers restart and reinstall from scratch it will cause broken dags because now pandas is not present. This require manual change to add the pandas dependency thus this is not backward compatible.

@potiuk
Copy link
Member Author

potiuk commented Dec 20, 2022

Yes, on upgrade everything looks fine because as you said pandas lib is already there but if users are using install scripts that install airflow and providers seperatly then once workers restart and reinstall from scratch it will cause broken dags because now pandas is not present. This require manual change to add the pandas dependency thus this is not backward compatible.

Not that it matters in this case (we are releasing breaking version anyway), but let me venture into academic discussion (just for the sake of discussion). And more for "food for thoughts". I do not want to argue vehemotly on this, I am happy to move it now to breaking, I would like more to present a different view on that.

It really depends what WE consider as breaking change. It's never 0/1 (breaking/not breaking). As I mentioned multiple time and as Hyrum's law nicely states: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody".

Every single change in our case we do is breaking someone's workflow. We should accept this. And whether we classify something as breaking or not depends purely on our assesment of the situation on how breaking it is. We earlier agreed that just dependency change is not breaking on its own (and we released a number of providers with dependency changes without increasing major.

What you described is a possibility of course, but more likely than not pandas will already be installed as dependency anyway there for mutliple reasons. And even if not it is just the matter of adding "[pandas]" in their scripts. No DAG modificaitons needed, no changes in the code just "environment" change. We are just dicussing what is what "Airflow Public Interface" - #28300 and I have not seen anyone asking to add "list of dependencies to install automatically" as part of the public API. Once we merge this change, this kind of change is not going to be considered as "breaking" on the list.

So - super easy way to recover and error message will be very clear.

Also there are other things we should consider - do you think the users who install airflow automatically in the way do not have a test harness to verify if their new install is working ? There might be many more problems than missing pandas if every time they rebuild it from scratch, so in your scenario - they for sure have a test harness in place. They have a staging system to test it.

And even if they don't have test harness - do we realistically think they will limit their installation and not install 7.0.0 version of the provider gets released in the same automated way? Do you think they have a separate test harness that they will test against such breaking relase of every single provider they use? Will that make a difference for them whether that description is put in "breaking" or "feature" changes" (because that will be the only difference in this case) ? I am not sure, but I think this is quite unrealistic.

And of course we can be wrong in our assesment. Happens. We can always produce a bugfix for that if many of our users will start complaining. And we should of course try to avoid this, but ultimately, we will break some workflows of some people - this is unavoidable. And there is no way to 100% protect against it.

And finally - yes, I think this is a border-line for me this is really the case where things aren't clearly 0/1. And we can either err on a side of caution or be a bit more bold

This is all purely academic discussion of course - I have 0 problem with putting it into breaking section - it does not really matter in this case of course. Just wanted to explain that putting everything that potentially changes some particular workflow is not necessarily breaking. It's a very simplistic view IMHO and we should not only asses if the change "actually" breaks something but also how disruptive it really is and whether it is simple to recover for the user at early stage, because it is realy harmful.

image

Pandas dependency is only used for a few features in Amazon providers
amd it is a huge "drag" on a package in general - we use the
existing Airflow 2.3+ feature of Optional Provider Feature to
make pandas an optional dependency (with "[pandas]" extra).

This is not a breaking change, since those who already installed
previous provider version have pandas installed already. Only people
who install provider manually will have to install "pandas" extra
either for Airflow core package or for the provider.

Fixes: apache#28468
@potiuk potiuk force-pushed the make-pandas-optional-for-amazon-provider branch from b4017e1 to 026150c Compare December 20, 2022 23:29
@potiuk
Copy link
Member Author

potiuk commented Dec 20, 2022

Moved the description up to breaking changes.

@eladkal
Copy link
Contributor

eladkal commented Dec 21, 2022

@potiuk I actually agree with you on all fronts. I just think that whenever we are not 100% sure we should lean more to the safest option. If this also means no hassle for us then even more reason!

Copy link
Contributor

@eladkal eladkal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @potiuk

@uranusjr uranusjr merged commit d9ae90f into apache:main Dec 21, 2022
@potiuk potiuk deleted the make-pandas-optional-for-amazon-provider branch April 4, 2023 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make pandas an optional dependency for amazon provider
5 participants