fix(amazon): flush file buffer in S3Hook.download_file() before returning path #62078

Merged
vincbeck merged 3 commits into apache:main from pippo995:fix/s3hook-download-file-flush
Feb 17, 2026

Conversation

@pippo995
Contributor

Summary

  • S3Hook.download_file() writes S3 object content to a file via download_fileobj() but never calls flush() before returning the file path
  • When the caller immediately reads the returned path, the file may contain 0 bytes because data is still in Python's write buffer
  • Added file.flush() after download_fileobj() to ensure buffered content is written to disk

Details

The original implementation opened the file in a with block, which closes (and therefore flushes) the file on exit. When preserve_file_name support was added, the with block was removed, so the file is now left open and unflushed when its path is returned.

This particularly affects small files (< ~8KB) that fit entirely in the buffer. The bug is latent in all environments but was exposed by apache-airflow-providers-common-compat==1.13.1 (PR #61157), which changed the execution timing of get_hook_lineage_collector() between download_fileobj() and return file.name.
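
The buffering behaviour described above can be reproduced without S3 or Airflow. The following is a minimal sketch using only the standard library, not the provider's code; the helper name size_visible_to_reader is hypothetical, and write() stands in for download_fileobj(). It writes a small payload to a NamedTemporaryFile and checks the size on disk while the file is still open, with and without flush().

```python
import os
import tempfile


def size_visible_to_reader(payload: bytes, flush: bool) -> int:
    """Return the on-disk size another reader would see while the file is still open.

    Hypothetical helper for illustration; file.write() stands in for
    the download_fileobj() call discussed in this PR.
    """
    # delete=False so the path can outlive the handle, as a returned path would.
    file = tempfile.NamedTemporaryFile(delete=False)
    try:
        file.write(payload)  # a small payload stays in Python's write buffer
        if flush:
            file.flush()  # push the buffered bytes to the operating system
        return os.path.getsize(file.name)
    finally:
        file.close()
        os.unlink(file.name)


if __name__ == "__main__":
    data = b"x" * 16  # far smaller than the default write buffer
    print("without flush:", size_visible_to_reader(data, flush=False))
    print("with flush:   ", size_visible_to_reader(data, flush=True))
```

On CPython the unflushed call is expected to report a size smaller than the payload, which matches the empty-file symptom in the summary. Payloads larger than the buffer are written through as the buffer fills, which is consistent with the bug mostly affecting small files.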

…ning path

S3Hook.download_file() writes S3 object content to a file via
download_fileobj() but never flushes the write buffer before returning
the file path. When the caller immediately opens the returned path,
the file may contain 0 bytes because the data is still in Python's
write buffer.

This particularly affects small files (< ~8KB) that fit entirely in
the buffer, and was exposed by apache-airflow-providers-common-compat
1.13.1 which changed execution timing of get_hook_lineage_collector().

See also: boto/boto3#1304
@pippo995 pippo995 requested a review from o-nikolas as a code owner February 17, 2026 15:23
@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Feb 17, 2026
@boring-cyborg

boring-cyborg bot commented Feb 17, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@vincbeck
Contributor

Tests are failing

The tests for download_file() mocked NamedTemporaryFile to return a
PosixPath instead of a file-like object. PosixPath lacks flush() and
its .name property returns just the filename, not the full path like
a real NamedTemporaryFile. Use the default MagicMock return value
which properly supports file-like operations.
@pippo995
Contributor Author

The tests were mocking NamedTemporaryFile to return a PosixPath instead of a file-like object. PosixPath has no .flush() method, and its .name returns just the filename (not the full path, as a real NamedTemporaryFile does). I'm going to update the tests to use the default MagicMock return value.
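
The mismatch between a path object and a real temporary file can be checked directly. This is a rough sketch rather than the actual test change in this PR: the function names and the /tmp/example_download path are hypothetical, and PurePosixPath is used in place of PosixPath so the snippet also runs on non-POSIX systems (both share the same .name behaviour).

```python
import os
import tempfile
from pathlib import PurePosixPath
from unittest import mock


def compare_path_and_tempfile() -> dict:
    """Contrast a path object with a real NamedTemporaryFile."""
    # Hypothetical path, used only for illustration.
    path_like = PurePosixPath("/tmp/example_download")
    with tempfile.NamedTemporaryFile() as real_file:
        return {
            "path_has_flush": hasattr(path_like, "flush"),
            "path_name": path_like.name,  # final component only
            "file_has_flush": hasattr(real_file, "flush"),
            "file_name_is_absolute": os.path.isabs(real_file.name),
        }


def flush_is_called_on_default_mock() -> bool:
    """Patch NamedTemporaryFile and rely on its default MagicMock return value."""
    with mock.patch("tempfile.NamedTemporaryFile") as mock_ntf:
        handle = tempfile.NamedTemporaryFile()  # returns mock_ntf.return_value
        handle.flush()  # MagicMock accepts any file-like method call
        mock_ntf.return_value.flush.assert_called_once_with()
        return handle is mock_ntf.return_value
```

With the default return value left in place, a test can also assert that flush() was called, which would presumably catch a regression of this bug.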

@pippo995 pippo995 requested a review from vincbeck February 17, 2026 18:05
@vincbeck vincbeck merged commit 78dccfd into apache:main Feb 17, 2026
90 checks passed
@boring-cyborg

boring-cyborg bot commented Feb 17, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.
