Return files destination URIs in GoogleDriveToGCSOperator and SheetsToGCSOperator #61347

Open

Ajay9704 wants to merge 6 commits into apache:main from Ajay9704:gdrive_sheets_to_gcs

Conversation


@Ajay9704 Ajay9704 commented Feb 2, 2026

Description

This PR implements consistent GCS destination URI return behavior for GoogleDriveToGCSOperator and GoogleSheetsToGCSOperator.

Changes

GoogleDriveToGCSOperator (transfers/gdrive_to_gcs.py)

Added unwrap_single parameter to control return format (default: True)
Returns full GCS URIs in gs://bucket/object format instead of None
Added deprecation warning for future default behavior change
Maintains backward compatibility with existing XCom functionality

GoogleSheetsToGCSOperator (transfers/sheets_to_gcs.py)

Added unwrap_single parameter to control return format (default: True)
Returns full GCS URIs in gs://bucket/object format instead of object names only
Added deprecation warning for future default behavior change
Preserves existing XCom push behavior for destination objects

Implementation Details

Both operators now follow a consistent return convention (a short sketch follows this list):
When unwrap_single=True (default): Returns single string for one file, list for multiple files
When unwrap_single=False: Always returns list regardless of file count
Full GCS URI format: All return values include gs:// prefix
Backward compatibility: Existing code continues to work unchanged
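
For illustration, a minimal sketch of this convention. The helper name _resolve_return_value is hypothetical; the PR inlines the equivalent logic in each operator's execute():

```python
from __future__ import annotations


def _resolve_return_value(uris: list[str], unwrap_single: bool) -> str | list[str]:
    """Hypothetical helper mirroring the return convention described above."""
    # unwrap_single=True: a single-element list collapses to a bare URI string.
    if unwrap_single and len(uris) == 1:
        return uris[0]
    # unwrap_single=False: always a list, regardless of file count.
    return uris


assert _resolve_return_value(["gs://bucket/file.txt"], unwrap_single=True) == "gs://bucket/file.txt"
assert _resolve_return_value(["gs://bucket/file.txt"], unwrap_single=False) == ["gs://bucket/file.txt"]
assert _resolve_return_value(
    ["gs://bucket/a.txt", "gs://bucket/b.txt"], unwrap_single=True
) == ["gs://bucket/a.txt", "gs://bucket/b.txt"]
```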

Tests

Added comprehensive test coverage for both operators (a sketch of one such test follows this list):
test_execute_with_unwrap_single_true - Single file return behavior
test_execute_with_unwrap_single_false - List return behavior
test_execute_with_unwrap_single_default - Default behavior verification
GoogleSheetsToGCSOperator: Additional tests for single vs multiple file scenarios
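
For illustration, roughly what the single-file test could look like. The mock targets, operator arguments, and expected URI here are assumptions for this sketch, not the PR's actual test code:

```python
from unittest import mock

from airflow.providers.google.cloud.transfers.gdrive_to_gcs import GoogleDriveToGCSOperator


@mock.patch("airflow.providers.google.cloud.transfers.gdrive_to_gcs.GCSHook")
@mock.patch("airflow.providers.google.cloud.transfers.gdrive_to_gcs.GoogleDriveHook")
def test_execute_with_unwrap_single_true(mock_gdrive_hook, mock_gcs_hook):
    # Hypothetical arguments; the real test's parameters may differ.
    op = GoogleDriveToGCSOperator(
        task_id="copy_file",
        folder_id="folder-id",
        file_name="file.txt",
        bucket_name="bucket",
        object_name="file.txt",
        unwrap_single=True,
    )
    result = op.execute(context={"ti": mock.MagicMock()})
    # One file with unwrap_single=True should yield a bare URI string.
    assert result == "gs://bucket/file.txt"
```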

Related Issues

Part of issue #11323 - Return files destination URIs in all ToGCS operators

Migration Notes

Current behavior (backward compatible):

```python
# Returns a single URI string for one file, a list for multiple files
op = GoogleDriveToGCSOperator(...)
result = op.execute(context)  # e.g., "gs://bucket/file.txt" or ["gs://bucket/file1.txt", "gs://bucket/file2.txt"]
```

Future behavior (prepare now):

```python
# Explicitly set unwrap_single to avoid the future breaking change
op = GoogleDriveToGCSOperator(unwrap_single=False, ...)
result = op.execute(context)  # Always returns a list: ["gs://bucket/file.txt"]
```

Checklist

Added unwrap_single parameter with proper type hints
Implemented deprecation warning for future default change
Updated return values to full GCS URI format (gs://bucket/object)
Maintained backward compatibility with existing XCom behavior
Added comprehensive unit tests for all scenarios
Updated docstrings with new parameter documentation
Verified system test examples remain compatible

Related: #11323

Used AI for resolving conflicts

@Ajay9704 Ajay9704 requested a review from shahar1 as a code owner February 2, 2026 11:33
@boring-cyborg boring-cyborg bot added the labels area:providers and provider:google (Google (including GCP) related issues) Feb 2, 2026
@Ajay9704 Ajay9704 marked this pull request as draft February 2, 2026 11:57
@Ajay9704 Ajay9704 force-pushed the gdrive_sheets_to_gcs branch from 21413d9 to 6a42f67 February 2, 2026 16:13
@Ajay9704 Ajay9704 force-pushed the gdrive_sheets_to_gcs branch from 667cc9c to 388d3a0 February 2, 2026 18:10
@Ajay9704 Ajay9704 marked this pull request as ready for review February 3, 2026 08:30

Ajay9704 commented Feb 3, 2026

@shahar1 please review when you are free, and if there is anything left from my end, please let me know.

@shahar1 shahar1 changed the title from "fix to gcs" to "Return files destination URIs in GoogleDriveToGCSOperator and SheetsToGCSOperator" Feb 6, 2026

@shahar1 shahar1 left a comment


Good job!
Please handle my comments.
Also, please note that in order to merge it we'll need to run the system tests (the same goes for the other related PRs you've made). If you're able to run them yourself and attach screenshots, that would be best; otherwise we'll need to wait for the Google team or someone else to do it.

Comment on lines 146 to 153
gcs_uri = f"gs://{self.destination_bucket}/{gcs_path_to_file}"
destination_array.append(gcs_uri)

context["ti"].xcom_push(key="destination_objects", value=destination_array)

if self.unwrap_single:
return destination_array[0] if len(destination_array) == 1 else destination_array
return destination_array

We should handle this operator differently - it already returns a list, but the entries are not in the gs:// URI format. Therefore:

  1. There's no need for the `unwrap_single` flag - the idea is that after aligning the operators to return `list[str]` by default, we could then deprecate `unwrap_single` where used, and users will access single results via `lst[0]` only.
  2. Instead, we should introduce another flag called `return_gcs_uris` (see comment here: Return list of GCS URIs from Azure*ToGCS operators #61048 (comment)) - so we won't break backward compatibility with the current behavior. The default should stay as it is today (`False`), and the warning should say that it will later be changed to `True` - you could reuse the wording from Return list of GCS URIs from Azure*ToGCS operators #61048. A rough sketch of this idea follows.
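
For illustration only, a rough sketch of how such a `return_gcs_uris` flag could behave; the helper name, warning class, and warning wording are assumptions, not an agreed design:

```python
import warnings


def _destination_objects(bucket: str, object_names: list[str], return_gcs_uris: bool) -> list[str]:
    # Hypothetical helper: return_gcs_uris=False keeps today's behavior
    # (bare object names) but warns about the future default change.
    if not return_gcs_uris:
        warnings.warn(
            "Returning bare object names is deprecated; this operator will "
            "return full gs:// URIs by default in a future release. Set "
            "return_gcs_uris=True to opt in now.",
            DeprecationWarning,
            stacklevel=2,
        )
        return object_names
    return [f"gs://{bucket}/{name}" for name in object_names]
```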

Comment on lines 118 to 123
gcs_uri = f"gs://{self.bucket_name}/{self.object_name}"
result = [gcs_uri]

if self.unwrap_single:
return result[0]
return result

@shahar1 shahar1 Feb 6, 2026


See my comment for the other operator - we could avoid adding the `unwrap_single` parameter at all, and just return `[gcs_uri]`.


Ajay9704 commented Feb 6, 2026


@shahar1 Thanks for the review! I'm already working on the changes you mentioned.
I'm also trying to run the system tests locally; earlier, Breeze was giving me some connectivity issues, but I'm setting it up again and will update you soon with the results.
If everything runs fine on my side, I'll attach screenshots too.

@Ajay9704
Copy link
Contributor Author

Ajay9704 commented Feb 6, 2026

@shahar1
I tried multiple approaches to run the system tests locally, but ran into environment issues each time. The system test DAGs seem to rely on infrastructure that only exists in the CI environment (like the tests_common module, the pytest setup, and GCP credentials), so I couldn't get them working when triggering the DAGs directly from the Airflow UI.
I attempted:
Importing the system test DAGs directly in the Airflow UI → failed due to missing tests_common.
Setting up Breeze → faced connectivity and configuration issues.
Trying Docker Compose → the same missing-module problems.
Setting up the full GCP environment → required extensive configuration and credentials, and I kept running into blockers.
I've been trying until now, but unfortunately I wasn't able to run the system tests successfully.
If possible, could you please guide me through the correct setup or help with running them?
I'll be happy to follow any steps you suggest.

Screencast.From.2026-02-06.17-05-02.mp4


shahar1 commented Feb 6, 2026


Yes - try to do it using Breeze and Airflow's UI, it should be the easiest - please follow these instructions***:

  1. Make sure that you have `gcloud` installed locally on your machine, and you're authenticated to a GCP project.
  2. Create the file `files/airflow-breeze-config/init.sh` or use the existing one, and add the following:

```shell
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://'
export GOOGLE_CLOUD_PROJECT='your-gcp-project'     # Fill in with GCP project ID
export SYSTEM_TESTS_GCP_PROJECT='your-gcp-project' # Fill in with GCP project ID
export SYSTEM_TESTS_ENV_ID='default'
```

  3. Copy and paste the system test file to `files/dags`.
  4. Run the test.
  5. When you're done, copy the system test back to the original path - make sure not to commit any unwanted changes (like sensitive data).

*** - There are probably better ways to achieve it, but it works for me (and hopefully we can simplify it even further).
If there are issues reading the SYSTEM_TESTS_ variables, just hardcode them in the copied DAG (make sure not to commit them though).


Ajay9704 commented Feb 8, 2026


@shahar1 Thanks for reviewing the changes and for the clear guidance.

I attempted to run the system tests using Breeze as instructed, but I am consistently running into an issue where the Airflow services do not start. PostgreSQL initializes successfully, but the webserver, scheduler, triggerer, and API server never come up. I reinstalled Breeze, restarted multiple times, and re-checked my environment, but the problem persists and appears to be related to my local Docker/Breeze setup.

Because of this, I haven't been able to run the system tests locally. Could you advise on how to resolve this Breeze startup issue, or let me know if running system tests in CI would be acceptable for this PR? If unit tests demonstrating the behavior are sufficient, I can proceed accordingly as well.

Any guidance you can provide would be very helpful.


shahar1 commented Feb 8, 2026


What error message/logs do you get? Do you run it using `breeze start-airflow --backend postgres`? (Maybe you could try the default sqlite backend as well.)
Please try sending a message about it in Airflow's Slack channel (#airflow-breeze); it will be easier for me to help you debug it there.

