In S3ToGcsOperator return the list of copied files even in deferrable mode #49768
AlexisBRENON wants to merge 2 commits into apache:main
Conversation
kiran2706 left a comment:
fix: make execute_complete files parameter optional to maintain compatibility
The execute_complete method signature was changed to require a 'files' argument,
which caused TypeError in deferrable mode and broke provider compatibility tests.
This commit updates execute_complete to accept 'files' as an optional parameter
(defaulting to None) to ensure backward compatibility and fix the test failures.
Also updates relevant tests to handle the optional parameter properly.
(force-pushed from c9f8fa5 to 30c74c7)
I think that it's fine to just use:

```python
hook = S3Hook(aws_conn_id='aws_default')
s3_objects = hook.list_keys(
    bucket_name='s3-bucket-name',
    prefix='some-prefix/',
)
```

I'm suggesting that because the behaviour won't be consistent if the triggerer encounters an interruption that causes a restart, i.e. files will be copied but the return statement ends up being …
Your suggested solution does not cover my use case, where files are always copied to the same prefix.
I don't think checking the creation time is necessary. The list of objects to be copied is determined by …
It's not restarted during deferral, but it's designed to be stateless and resilient to restarts. To preserve that statelessness with your proposed solution, you'd need to serialize the list of objects, which might not be ideal, as it could consume significant space in the metadata database.
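To illustrate the space concern: an Airflow trigger's `serialize()` method returns the classpath and constructor kwargs that get persisted in the metadata database for every deferral. The sketch below is a hypothetical stand-in (a real trigger subclasses `airflow.triggers.base.BaseTrigger`; the class and field names here are illustrative only) showing how a long `files` list ends up inside that persisted payload:

```python
import json

# Hypothetical stand-in for Airflow's trigger serialization contract: a real
# trigger subclasses airflow.triggers.base.BaseTrigger, and serialize() must
# return (classpath, kwargs) that Airflow stores in the metadata database.
class CreateJobsTriggerSketch:
    def __init__(self, project_id, job_names, files=None):
        self.project_id = project_id
        self.job_names = job_names
        self.files = list(files or [])

    def serialize(self):
        # Everything returned here is persisted per deferral, so a long
        # `files` list inflates every serialized trigger row.
        return (
            "example.CreateJobsTriggerSketch",
            {
                "project_id": self.project_id,
                "job_names": self.job_names,
                "files": self.files,
            },
        )

trigger = CreateJobsTriggerSketch(
    "my-project", ["job-1"], files=[f"s3://bucket/obj-{i}.csv" for i in range(1000)]
)
classpath, kwargs = trigger.serialize()
print(classpath, len(json.dumps(kwargs)))  # payload grows with the file list
```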
I'm sorry, I don't get your point. So on day 1, there is …. On day 2, the client pushed …. If I use any of the …
I understand that storing a long list of copied files may not be ideal (however, in the non-deferred setup, the list of copied files is stored anyway as the XCom value).
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
ping @dejii
Sorry for the late response, this PR went under the radar (I'm now going over all open GCP-related PRs).
Overall LGTM, but could you please add a unit test? Also, it would be great if you could test it on a real GCP instance; please let me know if you can't do that for any reason.
```python
        poll_interval=self.poll_interval,
    ),
    method_name="execute_complete",
    kwargs=dict(files=files),
```
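For context on what this line does: `kwargs` passed to `self.defer()` are persisted with the deferral and forwarded to the method named by `method_name` when the task resumes on a worker. The stand-in below (no Airflow imports; `OperatorSketch` and its `defer` are illustrative stubs, not the real `BaseOperator`) sketches that flow:

```python
# Hypothetical stand-in for BaseOperator.defer: in real Airflow, `kwargs`
# given to defer() are persisted with the deferral and forwarded to the
# method named by `method_name` when the task resumes on a worker.
class OperatorSketch:
    def defer(self, *, trigger, method_name, kwargs=None):
        # Record what Airflow would forward on resume.
        self._resume = (method_name, kwargs or {})

    def execute(self):
        files = ["data/a.csv", "data/b.csv"]  # computed before deferring
        self.defer(trigger=object(), method_name="execute_complete", kwargs=dict(files=files))

    def execute_complete(self, event=None, files=None):
        return files  # `files` arrives via the recorded kwargs

op = OperatorSketch()
op.execute()
method_name, kwargs = op._resume
result = getattr(op, method_name)(event={"status": "success"}, **kwargs)
print(result)  # → ['data/a.csv', 'data/b.csv']
```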
@AlexisBRENON could you please clarify what behavior you want to achieve with this line?
If you want to pass files to the trigger or return it from execute_complete, then it is better to add it as an attribute to CloudStorageTransferServiceCreateJobsTrigger.
```python
@overload
def execute_complete(self, context: Context, event: dict[str, Any], files: None) -> None: ...
@overload
def execute_complete(
    self, context: Context, event: dict[str, Any], files: Iterable[str]
) -> list[str]: ...
def execute_complete(
    self, context: Context, event: dict[str, Any], files: Iterable[str] | None = None
) -> list[str] | None:
```
@AlexisBRENON as I understand it, you want to overload execute_complete so that the files variable passed through self.defer can be used; please correct me if I am wrong.
I think it is better to make files an attribute of CloudStorageTransferServiceCreateJobsTrigger and then return it with the success event here. In that case, the operator returns file names only when the transfer operation is successful. What do you think about this idea?
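The suggestion above can be sketched as follows. This is a hypothetical, Airflow-free stand-in (the `TriggerEvent` stub mimics `airflow.triggers.base.TriggerEvent`, and the class names are illustrative): the trigger carries `files`, emits them in the success event payload, and the operator-side callback returns them only on success:

```python
import asyncio

# Hypothetical sketch: carry `files` on the trigger and hand it back inside
# the success event payload. TriggerEvent here is a stub standing in for
# airflow.triggers.base.TriggerEvent.
class TriggerEvent:
    def __init__(self, payload):
        self.payload = payload

class CreateJobsTriggerSketch:
    def __init__(self, job_names, files):
        self.job_names = job_names
        self.files = files

    async def run(self):
        # ... await the transfer jobs here, then emit the final event ...
        yield TriggerEvent({"status": "success", "files": self.files})

def execute_complete(context, event):
    # Operator-side callback: files are returned only on success.
    if event["status"] != "success":
        raise RuntimeError(event.get("message", "transfer failed"))
    return event["files"]

async def main():
    trigger = CreateJobsTriggerSketch(["job-1"], files=["a.csv", "b.csv"])
    event = [e async for e in trigger.run()][0]
    return execute_complete({}, event.payload)

print(asyncio.run(main()))  # → ['a.csv', 'b.csv']
```

With this shape, execute_complete keeps its original signature and the file list rides along in the event dict instead of defer kwargs.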
You got it right. I tried to make the files list available in execute_complete so that I could return it and make it available to subsequent tasks.
Passing it through the trigger and then the event dict may be a solution, and it would avoid changing the signature of execute_complete. But can you explain why you think it's a better solution (given that it is not typed) than using regular kwargs?
@AlexisBRENON yes, sure.
self.defer() is a method of BaseOperator, and its main purpose is to tell the operator that execution will continue on the triggerer process using the trigger's code. When execution on the triggerer is finished, control returns to the method specified in the method_name attribute, and execution continues on the worker process. The attributes of the defer method are there to start and configure deferrable execution; business-related attributes are better passed to the trigger. Also, as far as I can see, it is not common practice in Airflow to use kwargs in defer for business-related attributes; I found only one operator that does it.
Because it is not common practice, and the attributes of defer are not intended for business logic, I think passing this attribute to the trigger is the better solution.
P.S. We could look for one more solution to this issue: make the s3_objects variable a private attribute of the S3ToGCSOperator class and return it in the execute_complete method. But it should be checked that this attribute will not be None after execution resumes on the worker.
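The caveat in the P.S. is the crux of that alternative: after a deferral, Airflow resumes on a fresh operator instance, so plain instance state set in execute() before self.defer() does not survive. The hypothetical simulation below (no Airflow; class and attribute names are illustrative) models this by re-instantiating the operator between the two phases:

```python
# Hypothetical simulation of why a plain instance attribute is fragile
# across deferral: the resumed operator is a fresh instance, so state set in
# execute() before deferring is gone. Simulated here by re-instantiating
# the operator between the two phases.
class S3ToGCSOperatorSketch:
    def __init__(self):
        self._s3_objects = None  # the proposed private attribute

    def execute(self):
        self._s3_objects = ["a.csv", "b.csv"]  # computed before deferring

    def execute_complete(self):
        # The None check the suggestion calls for: guard against lost state.
        if self._s3_objects is None:
            raise ValueError("s3_objects was lost across deferral")
        return self._s3_objects

first = S3ToGCSOperatorSketch()
first.execute()

resumed = S3ToGCSOperatorSketch()  # fresh instance on the worker after resume
try:
    resumed.execute_complete()
except ValueError as err:
    print(err)  # → s3_objects was lost across deferral
```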
This operator exhibited different behavior between deferrable and non-deferrable mode.
In non-deferrable mode, it returned the list of the copied files, while in deferrable mode it just returned None, making it barely usable for subsequent tasks.
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in airflow-core/newsfragments.