Update S3PrefixSensor to support checking multiple prefixes within a bucket #18807
@uranusjr thank you for reviewing the PR. A few points from the summary that I wanted to clarify:
Force-pushed from a0fd5ed to 691b492.
The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
-        return self.get_hook().check_for_prefix(
-            prefix=self.prefix, delimiter=self.delimiter, bucket_name=self.bucket_name
-        )
+        return all(self._check_for_prefix(prefix) for prefix in self.prefix)
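A minimal, self-contained sketch of the changed `poke` logic. `FakeS3Hook` and `MultiPrefixSensorSketch` are hypothetical stand-ins (not the real `S3Hook` or `S3PrefixSensor`), and normalising a single string into a list is an assumption made so the snippet runs without AWS access:

```python
from typing import List, Set, Union


class FakeS3Hook:
    """Hypothetical stand-in for S3Hook; answers from an in-memory set of keys."""

    def __init__(self, keys: Set[str]) -> None:
        self.keys = keys

    def check_for_prefix(self, prefix: str, delimiter: str, bucket_name: str) -> bool:
        # Same keyword arguments as in the diff; delimiter and bucket are ignored here.
        return any(key.startswith(prefix) for key in self.keys)


class MultiPrefixSensorSketch:
    """Simplified illustration only; not the actual Airflow sensor."""

    def __init__(
        self,
        bucket_name: str,
        prefix: Union[str, List[str]],
        hook: FakeS3Hook,
        delimiter: str = "/",
    ) -> None:
        self.bucket_name = bucket_name
        # Assumed normalisation: accept one prefix or a list of prefixes.
        self.prefix = [prefix] if isinstance(prefix, str) else list(prefix)
        self.delimiter = delimiter
        self._hook = hook

    def _check_for_prefix(self, prefix: str) -> bool:
        return self._hook.check_for_prefix(
            prefix=prefix, delimiter=self.delimiter, bucket_name=self.bucket_name
        )

    def poke(self) -> bool:
        # True only when every prefix is present in the bucket.
        return all(self._check_for_prefix(prefix) for prefix in self.prefix)


hook = FakeS3Hook({"data/2021/a.csv", "logs/2021/x.log"})
both_present = MultiPrefixSensorSketch("my-bucket", ["data/", "logs/"], hook).poke()
one_missing = MultiPrefixSensorSketch("my-bucket", ["data/", "tmp/"], hook).poke()
single_prefix = MultiPrefixSensorSketch("my-bucket", "data/", hook).poke()
```

Here `both_present` and `single_prefix` evaluate to `True`, while `one_missing` is `False` because nothing in the fake bucket starts with `tmp/`.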
I just realised this would cause some `_check_for_prefix` calls to be skipped if any of the prefixes fails the check, because `all` returns eagerly. Not sure if that would be an issue in practice.
Also we should've changed the attribute from `prefix` to `prefixes` 🙂
I can raise another PR to change the name to `prefixes`. Had the same thought going on in my mind 😄. Had called out the same as point 1 in the description. I'm curious how we handle such backward incompatible changes. Do we update UPDATING.md?
To support finer control over `all` or `any`, I was also suggesting passing a callable which lets the user decide on a per-key basis. The default implementation could continue to be `all`-based. I can raise another PR to do both these changes if it makes sense.
The change will lead to an extra `__init__` parameter:
    def __init__(
        self,
        ...,
        callback: Callable[[Dict[str, bool]], bool] = lambda prefix_available: all(prefix_available.values()),
        **kwargs,
    ):

    def poke(self, context: Dict[str, Any]):
        self.log.info('Poking for prefix : %s in bucket s3://%s', self.prefix, self.bucket_name)
        # callback can choose to return true even if any/subset/all of the keys are present
        return self.callback({prefix: self._check_for_prefix(prefix) for prefix in self.prefix})
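The callback proposal above can be tried out in isolation with a runnable sketch. `CallbackSensorSketch`, `all_available`, `any_available` and the injected `check` function are hypothetical names for illustration, and the existence check is faked with an in-memory set rather than a real S3 call:

```python
from typing import Callable, Dict, List


def all_available(prefix_available: Dict[str, bool]) -> bool:
    """Default policy: every prefix must be present."""
    return all(prefix_available.values())


def any_available(prefix_available: Dict[str, bool]) -> bool:
    """Alternative policy: one present prefix is enough."""
    return any(prefix_available.values())


class CallbackSensorSketch:
    """Simplified illustration only; not the actual Airflow sensor."""

    def __init__(
        self,
        prefix: List[str],
        check: Callable[[str], bool],
        callback: Callable[[Dict[str, bool]], bool] = all_available,
    ) -> None:
        self.prefix = prefix
        self._check_for_prefix = check  # stand-in for the hook call
        self.callback = callback

    def poke(self) -> bool:
        # The dict comprehension checks every prefix before the callback decides.
        return self.callback({prefix: self._check_for_prefix(prefix) for prefix in self.prefix})


existing = {"data/", "logs/"}


def check(prefix: str) -> bool:
    return prefix in existing


default_result = CallbackSensorSketch(["data/", "tmp/"], check).poke()
any_result = CallbackSensorSketch(["data/", "tmp/"], check, callback=any_available).poke()
custom_result = CallbackSensorSketch(
    ["data/", "tmp/"], check, callback=lambda available: available["data/"]
).poke()
```

With this fake data, `default_result` is `False`, while `any_result` and `custom_result` are `True`. One design consequence worth noting: because the results are collected into a dict first, every prefix is always checked, so this variant never skips calls the way a generator passed to `all` can.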
> I'm curious how we handle such backward incompatible changes.

Do you mean changing `prefix` to `prefixes`? As the change stands right now, it's already backwards incompatible, so changing the name is not really a concern. It's extremely unlikely (and not idiomatic) to access attributes in a hook, so it shouldn't really matter much. We could even change the attribute to `_prefixes` to discourage access even further.
As for the `all` issue, you can actually fix it quite easily with `all([...])` (explicitly collect the results of all calls into a list first).
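To make the difference concrete, here is a small standalone demonstration of `all` over a generator versus `all` over a list. The `check` function and the prefix names are made up for illustration; `check` records each call so the two behaviours can be compared:

```python
from typing import List

calls: List[str] = []


def check(prefix: str) -> bool:
    """Hypothetical check that records every invocation."""
    calls.append(prefix)
    return prefix != "missing/"


prefixes = ["missing/", "data/", "logs/"]

# Generator expression: all() stops at the first falsy result.
lazy_result = all(check(p) for p in prefixes)
lazy_calls = list(calls)

calls.clear()

# List comprehension: every check runs before all() sees the results.
eager_result = all([check(p) for p in prefixes])
eager_calls = list(calls)
```

Both results are `False`, but `lazy_calls` contains only `"missing/"`, whereas `eager_calls` contains all three prefixes. Which form is preferable depends on whether the skipped checks matter or are just extra S3 requests.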
Agree that with the removal of `full_url` the backward compatibility has gone away. For the name change, the worry was more about people passing keyword arguments; I wasn't worried about access to attributes. I can create a small PR to rename to `_prefixes` if that isn't an issue. Let me know if that makes sense.
I now understand what you were referring to with the eager return of `all`. Personally I think it helps avoid a few extra calls to S3 and so should be fine.
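For the keyword-argument concern, one common pattern is to keep accepting the old name for a while and emit a deprecation warning. This is not something proposed in the thread, only a sketch under that assumption; `RenamedArgSensorSketch` and the `prefixes` keyword are hypothetical:

```python
import warnings
from typing import List, Optional, Union

PrefixArg = Union[str, List[str]]


class RenamedArgSensorSketch:
    """Illustration only: accepts the new 'prefixes' and the old 'prefix' keyword."""

    def __init__(
        self,
        *,
        bucket_name: str,
        prefixes: Optional[PrefixArg] = None,
        prefix: Optional[PrefixArg] = None,
    ) -> None:
        if prefix is not None:
            if prefixes is not None:
                raise ValueError("Pass either 'prefixes' or the deprecated 'prefix', not both.")
            warnings.warn(
                "'prefix' is deprecated; use 'prefixes' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            prefixes = prefix
        if prefixes is None:
            raise ValueError("'prefixes' is required.")
        self.bucket_name = bucket_name
        # Stored under a private name to discourage direct attribute access.
        self._prefixes = [prefixes] if isinstance(prefixes, str) else list(prefixes)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = RenamedArgSensorSketch(bucket_name="my-bucket", prefix="data/")

new_style = RenamedArgSensorSketch(bucket_name="my-bucket", prefixes=["data/", "logs/"])
```

Callers using the old keyword keep working but see a `DeprecationWarning`, while the value ends up in `_prefixes` either way.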
Used the alternate approach as mentioned in #15001 (comment)

Summary of changes:
- Removed `self.full_url`, since we can't build one for multiple prefixes. Not sure if this needs to be mentioned in UPDATING.md.

Other candidates (haven't picked yet, can do in same PR):
- `@cached_property` for the hook

closes: #15001
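The `@cached_property` candidate could look roughly like the sketch below. It uses `functools.cached_property` from the standard library (Python 3.8+); `FakeHook` and `CachedHookSensorSketch` are hypothetical stand-ins, and the import an Airflow provider would actually use may differ:

```python
from functools import cached_property


class FakeHook:
    """Hypothetical stand-in for S3Hook that counts how often it is constructed."""

    instances_created = 0

    def __init__(self, aws_conn_id: str) -> None:
        FakeHook.instances_created += 1
        self.aws_conn_id = aws_conn_id


class CachedHookSensorSketch:
    """Illustration only: builds the hook lazily and reuses it across pokes."""

    def __init__(self, aws_conn_id: str = "aws_default") -> None:
        self.aws_conn_id = aws_conn_id

    @cached_property
    def hook(self) -> FakeHook:
        # Runs once per sensor instance; later accesses return the cached object.
        return FakeHook(self.aws_conn_id)


sensor = CachedHookSensorSketch()
first = sensor.hook
second = sensor.hook
```

Both accesses return the same object, so the hook is constructed only once per sensor instance.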
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.