Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory footprint of s3 key trigger #40473

Merged

Conversation

dstandish
Copy link
Contributor

Also had to make the provide_bucket_name decorator support async generators, and in the process i realized we didn't need a separate "async" version of the decorator.

We can return True on the first positive.  And we don't need to keep track of the files.
@dstandish dstandish requested a review from uranusjr June 28, 2024 00:36
@dstandish
Copy link
Contributor Author

dstandish commented Jun 28, 2024

@eladkal fyi this would change get_file_metadata_async to be an iterator instead of occumulate everything in a list.

this lets _check_key_async return early if it finds a single match. we have a customer who reported high memory usage with s3 trigger so i looked into it and found this optimization.

but it would be technically breaking so i guess require a major, which is why i mention to you.

also the removal of the _async form of the decorator would also be technically "breaking"

@eladkal
Copy link
Contributor

eladkal commented Jun 28, 2024

but it would be technically breaking so i guess require a major,

Please add 9.0.0 to versions list

versions:
- 8.25.0

also please short description at the top of the change log
https://github.com/apache/airflow/blob/main/airflow/providers/amazon/CHANGELOG.rst
explaining to users what has been changed and what it means for them (how to migrate from 8.x to 9.0), similar to what we do with newsfragment for Airflow core)

@potiuk
Copy link
Member

potiuk commented Jun 28, 2024

@eladkal fyi this would change get_file_metadata_async to be an iterator instead of occumulate everything in a list.

this lets _check_key_async return early if it finds a single match. we have a customer who reported high memory usage with s3 trigger so i looked into it and found this optimization.

but it would be technically breaking so i guess require a major, which is why i mention to you.

also the removal of the _async form of the decorator would also be technically "breaking"

I am not sure whether it's breaking. It changes the behaviour, for sure (like any other change) but the likelihood it actually breaks someone's workflow is very low. If you are iterating over the results (which is the intended usage of that method), things will continue to work. So the INTENTION of the API has not changed - some of the internal details of it and characteristics (less memory used, lazy evaluation) changed. Also - the decorators are by definition not part of the API - they are part of the method implementation and again - impact the implementation detail, but - unless you inspect details of the decorator - from the user perspective, nothing changed.

IMHO this is literally just an optimization, and I would not qualify it as a breaking change.

This is a great example of something that I advocate for quite some time - not every "method signature" change - even if technically breaking, is breaking. What reallly counts is the intentions of the one who wrote the method on how it should be used, matched with assesment of probabllity the changed thing was used properly - according to the intention.

In this case, the intention has not changed, it should be used in the same way before and after. And IMHO probability that it was used not according to the intention and this change will break it, is very low.

@potiuk
Copy link
Member

potiuk commented Jun 28, 2024

BTW. Yes it shoudl be mentioned in the changelog, but not as a breaking change IMHO.

@eladkal
Copy link
Contributor

eladkal commented Jun 28, 2024

In that case all we need is just the info note I mentioned.
I will classify this as bug fix but the note in the change log will make sure that users are aware of the impact

dstandish and others added 3 commits June 28, 2024 09:50
@dstandish
Copy link
Contributor Author

ok @eladkal i went with the "it's a bugfix" approach and added note in changelog. PTAL

@eladkal eladkal merged commit bbfeee4 into apache:main Jun 28, 2024
6 checks passed
@o-nikolas
Copy link
Contributor

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

@dstandish
Copy link
Contributor Author

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

weird, maybe this https://github.com/apache/airflow/actions/runs/9716729092

@o-nikolas
Copy link
Contributor

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

weird, maybe this https://github.com/apache/airflow/actions/runs/9716729092

Yeah, that's all I found as well, but that can't be the full set of tests since there is almost nothing there. Or maybe it was and that explains how it got through, I'm not sure.

@potiuk
Copy link
Member

potiuk commented Jul 2, 2024

Is there a way to tell the build steps that were done as part of PR that's already been merged? Doc builds are failing in main for the amazon provider, and I'm curious how they didn't fail in this PR.

Click "View details" button at the last push :

Screenshot 2024-07-02 at 22 30 40

@jedcunningham jedcunningham deleted the reduce-memory-footprint-of-s3-key-trigger branch July 2, 2024 21:07
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
* Use generator to reduce memory footprint

We can return True on the first positive.  And we don't need to keep track of the files.

* add tests

* fixup

* Update airflow/providers/amazon/aws/hooks/s3.py

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>

* Revert "Update airflow/providers/amazon/aws/hooks/s3.py"

This reverts commit cb2f31a.

* reapply vincent's suggestion

* add check for param

* add changelog

---------

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants