@josuegen
Contributor


This PR introduces a new operator, HttpToGCSOperator, which facilitates the transfer of data from an HTTP endpoint to a Google Cloud Storage (GCS) bucket.

Key Features:

  • HTTP Request Flexibility: Supports various HTTP methods (GET, POST, etc.), data passing, custom headers, and extra request options.
  • GCS Integration: Enables seamless uploading of data to GCS buckets with options for specifying the bucket name, object name, MIME type, gzip compression, encoding, and other GCS upload parameters.
  • Connection Management: Leverages Airflow's connection management for both HTTP and GCP connections, promoting secure and configurable access.
  • Hook Reusability: Utilizes HttpHook and GCSHook for efficient HTTP requests and GCS interactions, with @cached_property for optimized hook instantiation.
  • Templating: Supports Jinja templating for key parameters like endpoint, data, headers, bucket_name, and object_name, allowing for dynamic value injection.
  • Comprehensive Parameters: Offers a wide range of parameters to customize both the HTTP request and the GCS upload process, including:
    • HTTP: http_conn_id, endpoint, method, data, headers, extra_options, log_response, auth_type, tcp_keep_alive related parameters.
    • GCS: gcp_conn_id, impersonation_chain, bucket_name, object_name, mime_type, gzip, encoding, chunk_size, timeout, num_max_attempts, metadata, cache_control, user_project.
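
The hook-reuse point above can be illustrated with a minimal sketch. It assumes nothing about the merged code: `FakeHttpHook` and `TransferSketch` are hypothetical stand-ins, and only `functools.cached_property` is a real library call. The idea is that the hook is built lazily on first access and then reused.

```python
from functools import cached_property


class FakeHttpHook:
    """Hypothetical stand-in for an HTTP hook; counts how often it is built."""

    instances = 0

    def __init__(self, method: str, http_conn_id: str) -> None:
        FakeHttpHook.instances += 1
        self.method = method
        self.http_conn_id = http_conn_id

    def run(self, endpoint: str) -> bytes:
        # A real hook would perform the request; this returns canned bytes.
        return f"{self.method} {endpoint}".encode()


class TransferSketch:
    """Illustrative operator-like class, not the code from this PR."""

    def __init__(self, endpoint: str, method: str = "GET", http_conn_id: str = "http_default") -> None:
        self.endpoint = endpoint
        self.method = method
        self.http_conn_id = http_conn_id

    @cached_property
    def http_hook(self) -> FakeHttpHook:
        # Built on first access only; later accesses reuse the same object.
        return FakeHttpHook(self.method, self.http_conn_id)

    def execute(self) -> bytes:
        return self.http_hook.run(self.endpoint)


task = TransferSketch(endpoint="/v1/report")
payload = task.execute()
same_hook = task.http_hook is task.http_hook
```

Constructing `TransferSketch` does not create a hook; the first `execute()` call does, and every later access returns that same instance.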

Purpose:

This operator simplifies data ingestion from HTTP sources into GCS, which is a common requirement for data pipelines. It eliminates the need for writing custom code to handle HTTP requests and GCS uploads, promoting code reusability and reducing development time.

Example Use Case:

  • Fetching data from a REST API and storing it in GCS for further processing.
  • Downloading files from an HTTP server and archiving them in GCS.
  • Integrating with web services that provide data that needs to be persisted in GCS.
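
For the first use case, the keyword arguments might be collected as below. This is a sketch only: the parameter names are taken from the feature list in this description and have not been checked against the merged signature, and every value (connection IDs, bucket, paths) is a made-up placeholder. `{{ ds }}` is Airflow's standard template variable for the logical date.

```python
# Keyword arguments one might pass to HttpToGCSOperator.
# Names come from the PR description; all values are hypothetical.
http_to_gcs_kwargs = {
    "task_id": "http_to_gcs_example",
    "http_conn_id": "my_http_conn",
    "endpoint": "/v1/reports/{{ ds }}",       # templated field
    "method": "GET",
    "headers": {"Accept": "application/json"},
    "gcp_conn_id": "my_gcp_conn",
    "bucket_name": "my-example-bucket",
    "object_name": "reports/{{ ds }}.json",   # templated field
    "mime_type": "application/json",
    "gzip": False,
}

# Fields the description lists as supporting Jinja templating.
TEMPLATED = ("endpoint", "data", "headers", "bucket_name", "object_name")

# Which of those fields actually carry a template in this configuration.
templated_in_use = [
    name for name in TEMPLATED if "{{" in str(http_to_gcs_kwargs.get(name, ""))
]
```

Here only `endpoint` and `object_name` use a template, so each run would fetch and store the report for its own logical date.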

Testing:

The operator is covered by unit tests that exercise its functionality.

Documentation:

The operator is fully documented with parameter descriptions and usage examples within the code itself.

@boring-cyborg boring-cyborg bot added the area:providers and provider:google labels Apr 23, 2025
@boring-cyborg

boring-cyborg bot commented Apr 23, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@josuegen josuegen changed the title from "fAdd HttpToGCSOperator for transferring data from HTTP to GCS" to "Add HttpToGCSOperator for transferring data from HTTP to GCS" Apr 23, 2025
@josuegen josuegen requested review from ashb and potiuk as code owners May 7, 2025 15:47
Member

@potiuk potiuk left a comment

Looks good. Let's make the CI green and merge.

@potiuk
Member

potiuk commented May 22, 2025

Yeah - a breeze test to update as I thought.

@josuegen
Contributor Author

Thanks @potiuk! I'll wait until the rest of the tests complete and then I'll push the fix to the breeze test.

@josuegen
Contributor Author

@potiuk Added fix for the Breeze test

@molcay
Contributor

molcay commented May 27, 2025

Hi @josuegen,
One last thing from my side; did you have a chance to run the system test (providers/google/tests/system/google/cloud/gcs/example_http_to_gcs.py) locally?

@josuegen
Contributor Author

@molcay I did! See attached screenshot

[Image: Screenshot 2025-05-28 at 12 22 07 a m]

@josuegen
Contributor Author

@molcay In case you're wondering what the warning is about.

[Image: Screenshot 2025-05-28 at 12 23 41 a m]

@molcay
Contributor

molcay commented May 28, 2025

Hi @josuegen,

Thank you for the answer and the screenshots.
It looks like the warning is coming from open_lineage and it is a deprecation warning. We can ignore it (I guess).

From the provider perspective; we are OK to merge this PR.

However, someone from the community (who has the permission to merge) needs to merge it.

A small note; they might ask for squashing the commits :)

@potiuk potiuk merged commit ac3578c into apache:main May 29, 2025
93 checks passed
@boring-cyborg

boring-cyborg bot commented May 29, 2025

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

sanederchik pushed a commit to sanederchik/airflow that referenced this pull request Jun 7, 2025
…#49625)

* Created HTTP to GCS Operator

* Created Unit Test for HTTP to GCS Operator

* Precommit fixes

* Added documentation and test as per pre-commit checks

* Removed commits on files made by pre-commit

* Fixerd unit test for HTTP to GCS operator

* Fixed Unit testing for HTTP to GCS

* Fixed unit testing

* Fixed unit testing

* Updated cross-dependency specification for breeze checks

* Fixed provider documentation for HTTP to GCS

* Fixed breeze unit testing for HTTP to GCS

* Fixed unit testing for HTTP to GCS

* Fixed order in selected-providers-list-as-string for breeze test

* Fixed unit test for HTTP Hook in HTTPToGCS

* Fixed unit test for HTTP Hook in HTTPToGCS

* Fixed order in selected-providers-list-as-string for breeze test

* Fixed order in selected-providers-list-as-string for breeze test

* Removed ORM calls when managing connections in system tests

* Added fix for breeze unit test individual-providers-test-types-list-as-strings-in-json

* Typo fix for breeze unit test individual-providers-test-types-list-as-strings-in-json

---------

Co-authored-by: Josue Velazquez Gen <josuegen@Josues-MacBook-Air.local>
@nathadfield
Collaborator

@josuegen Thanks for submitting this operator! Just a quick thing I noticed: the docstring suggests that there is a response_check parameter, however there is no such provision for this in its implementation. Is this an oversight?

@josuegen
Contributor Author

@nathadfield you're right! I added these two parameters initially: response_check and response_filter. But I removed them later, due to the functionality of the operator.

I'd need to remove them from the docstring, working on the PR

@nathadfield
Collaborator

Ok. I would love to actually see these as features for this operator as it would then serve as a drop in replacement for a custom operator we developed a long time ago that does the same thing.
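
As a rough illustration of what the two requested hooks could look like, here is a minimal sketch. It assumes the same semantics as the parameters of the same name on Airflow's HttpOperator (`response_check` returns a boolean, `response_filter` transforms the response). `FakeResponse` and `process_response` are hypothetical names and not part of this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    status_code: int
    content: bytes


def process_response(
    response: FakeResponse,
    response_check: Optional[Callable[[FakeResponse], bool]] = None,
    response_filter: Optional[Callable[[FakeResponse], bytes]] = None,
) -> bytes:
    """Validate the response, then return the bytes that would be uploaded."""
    # Fail before any upload if the caller's check rejects the response.
    if response_check is not None and not response_check(response):
        raise ValueError("Response check returned False")
    # Let the caller reshape the payload; otherwise upload the body as-is.
    if response_filter is not None:
        return response_filter(response)
    return response.content


resp = FakeResponse(status_code=200, content=b'  {"rows": 3}\n')
body = process_response(
    resp,
    response_check=lambda r: r.status_code == 200,
    response_filter=lambda r: r.content.strip(),
)
```

Running the check before the filter keeps a bad response from ever reaching the upload step.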

@potiuk
Member

potiuk commented Jun 20, 2025

> Ok. I would love to actually see these as features for this operator as it would then serve as a drop in replacement for a custom operator we developed a long time ago that does the same thing.

Why don't you contribute it :)?

@nathadfield
Collaborator

@potiuk Yes, I might.

@potiuk
Member

potiuk commented Jun 23, 2025

> @potiuk Yes, I might.

cool :)

@josuegen
Contributor Author

josuegen commented Jul 2, 2025

@nathadfield I got some free cycles today and I want to start working on this, did you start already? If not, I'll go ahead and start the work

@nathadfield
Collaborator

@josuegen No, I haven't managed to get around to it, so feel free.

@nathadfield
Collaborator

@josuegen The other thing I was considering was the option to unpack compressed files before uploading to GCS. The reason is that BigQuery only supports loading gzip, not .zip.
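
One way the unpacking idea might work, sketched with the standard library only: read each member of a downloaded .zip in memory and re-compress it as gzip before upload. `zip_members_to_gzip` is a hypothetical helper, not something in the operator today, and a real implementation would probably stream large files rather than hold them in memory.

```python
import gzip
import io
import zipfile


def zip_members_to_gzip(zip_bytes: bytes) -> dict[str, bytes]:
    """Return {"<member>.gz": gzip-compressed bytes} for each file in the archive."""
    converted: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no data to upload
            converted[f"{info.filename}.gz"] = gzip.compress(archive.read(info))
    return converted


# Build a small in-memory .zip to stand in for a downloaded HTTP payload.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data.csv", "id,name\n1,a\n")

converted = zip_members_to_gzip(buffer.getvalue())
```

Each resulting entry could then be uploaded as its own GCS object.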

jose-lehmkuhl pushed a commit to jose-lehmkuhl/airflow that referenced this pull request Jul 11, 2025
…#49625)

Labels

area:providers provider:google Google (including GCP) related issues

4 participants