Skip to content

Conversation

@jroachgolf84
Copy link
Collaborator

This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering the get_file_metadata method in the S3Hook to yield records in the paginated response, rather than loading them into a single list. The return type for the get_file_metadata method is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated.

closes: #55039

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Aug 29, 2025
@jroachgolf84 jroachgolf84 requested review from dstandish and removed request for eladkal and o-nikolas August 29, 2025 13:53
@jroachgolf84
Copy link
Collaborator Author

cc: @dstandish

@jroachgolf84
Copy link
Collaborator Author

@eladkal, since we're changing the return type of get_file_metadata, would this need to be part of a major release?

@dstandish
Copy link
Contributor

mentioned it on slack but just adding here

should check with @eladkal re backcompat issues. changing get_file_metadata to iterator likely means major release. and given that maybe we should update the whole check key funciton logic so that it never has to hold everything in memory, i.e. by making it so that check_fn takes a key and not a list

but thinking about that..... the problem is it's somewhat ambiguous what the behavior should be

for best performance, you would want check_fn to return on first "pass"

but, the behavior that would be most similar to current, would be to return all the files for which check_fn evaluates to true

@jroachgolf84
Copy link
Collaborator Author

cc: @eladkal

@eladkal
Copy link
Contributor

eladkal commented Sep 2, 2025

Lets consult first with @o-nikolas and @vincbeck.

@vincbeck
Copy link
Contributor

vincbeck commented Sep 2, 2025

To keep it backward compatible, could we make get_file_metadata return list | Iterator? We would need to handle both cases in providers/amazon/src/airflow/providers/amazon/aws/sensors/s3.py which I do not really like but that's okay. What do you think?

@jroachgolf84
Copy link
Collaborator Author

I think that's feasible - how would we determine whether to return a list or an Iterator?

@vincbeck
Copy link
Contributor

vincbeck commented Sep 3, 2025

In the current implementation we would always return an iterator but having list | Iterator as return type annotation would keep it backward compatible

@jroachgolf84
Copy link
Collaborator Author

In the current implementation we would always return an iterator but having list | Iterator as return type annotation would keep it backward compatible

Making the type annotation change now. Both cases are handled in providers/amazon/src/airflow/providers/amazon/aws/sensors/s3.py.

Copy link
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually sorry but I realized I provided wrong instructions. This does not solve back compat issues ... What we should do instead is deprecate get_file_metadata and create a new one using Iterator and use this one in the sensor. That way we do not need to create a major release because of this.

@jroachgolf84
Copy link
Collaborator Author

jroachgolf84 commented Sep 4, 2025

Got it - so the plan would be to:

  1. Deprecate get_file_metadata with a deprecation warning?
  2. Create a new get_file_metadata method that returns an Iterator, and use that in the Sensor.

What is the best way to deprecate a method and create one with the same name? Or should I create a new method with a new name, and use this one in the Sensor?

@vincbeck
Copy link
Contributor

vincbeck commented Sep 4, 2025

Got it - so the plan would be to:

Deprecate get_file_metadata with a deprecation warning?
Create a new get_file_metadata method that returns an Iterator, and use that in the Sensor.

Correct

What is the best way to deprecate a method and create one with the same name?

To deprecate a method, please emit deprecation warning at the beginning of the method. Example below:

warnings.warn(
            "The property `async_conn` is deprecated. Accessing it in an async context will cause the event loop to block. "
            "Use the async method `get_async_conn` instead.",
            AirflowProviderDeprecationWarning,
            stacklevel=2,
        )

Or should I create a new method with a new name, and use this one in the Sensor?

Yes, you should create a new one with a new name

@jroachgolf84
Copy link
Collaborator Author

@vincbeck, I've updated the PR accordingly!

Copy link
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for the quick turn around!

@jroachgolf84
Copy link
Collaborator Author

@dstandish, can you take a look at the updated PR?

@jroachgolf84
Copy link
Collaborator Author

@eladkal, can you take a look at this for me?

@jroachgolf84
Copy link
Collaborator Author

@ashb should be all set here!

@ashb ashb merged commit 6a10bc6 into apache:main Sep 18, 2025
78 checks passed
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Sep 30, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 1, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 2, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 3, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 4, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 5, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 5, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 7, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 8, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 9, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 10, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 11, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 12, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 14, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 15, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 17, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 19, 2025
This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering
the get_file_metadata method in the S3Hook to yield records in the paginated response,
rather than loading them into a single list. The return type for the get_file_metadata method
is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

s3 key sensor (sync version) has bad memory footprint

6 participants