Update S3PrefixSensor to support checking multiple prefixes within a bucket #18807

Merged
merged 8 commits into from
Oct 11, 2021

Contributor

@anaynayak anaynayak commented Oct 7, 2021

Used the alternate approach as mentioned in #15001 (comment)

Summary of changes:

  1. Support prefix as a str or list. Haven't changed the name to avoid changes to existing consumers.
  2. Removed self.full_url since we can't build one for multiple prefixes. Not sure if this needs to be mentioned in UPDATING.md.
  3. Updated existing test to pytest
  4. Added tests for multiple prefixes
  5. Added context type as Dict[str, Any]. There are other variations across the codebase.
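
As a rough illustration of point 1, accepting either a str or a list usually comes down to normalising the argument once so that later code can always iterate. This is only a sketch: `normalize_prefixes` is a hypothetical helper name, not something taken from the PR.

```python
from typing import List, Union


def normalize_prefixes(prefix: Union[str, List[str]]) -> List[str]:
    """Return a list of prefixes whether one string or a list was given."""
    # A bare string is wrapped so that later code can always iterate.
    return [prefix] if isinstance(prefix, str) else list(prefix)
```

Existing callers that pass a single string keep working, while new callers can pass several prefixes under the same parameter name.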

Other candidates (haven't picked yet, can do in same PR):

  1. Use @cached_property for the hook
  2. Support a callable parameter which lets the user get finer control on whether all prefixes or a subset of them should be considered as required.
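
For the first candidate, a minimal sketch of what a cached hook property might look like is shown here. It assumes Python 3.8+ `functools.cached_property`; `FakeHook` and `SensorSketch` are hypothetical stand-ins so the example runs without AWS or Airflow installed.

```python
from functools import cached_property


class FakeHook:
    """Stand-in for an S3 hook; counts how many times it is constructed."""

    instances = 0

    def __init__(self) -> None:
        FakeHook.instances += 1


class SensorSketch:
    @cached_property
    def hook(self) -> FakeHook:
        # Built on first access, then stored on the instance and reused.
        return FakeHook()


sensor = SensorSketch()
first = sensor.hook
second = sensor.hook
```

Both accesses return the same object, so the hook is constructed only once per sensor instance.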

closes: #15001


@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Oct 7, 2021
@anaynayak
Contributor Author

@uranusjr thank you for reviewing the PR.

A few points from the summary that I wanted to clarify:

  1. Removal of self.full_url from the class hopefully doesn't require a change to UPDATING.md.
  2. Usage of @cached_property for the hook. Noticed this in other places. Can add it here for consistency.
  3. Haven't made a similar change across the other S3 sensors.

@github-actions

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Oct 11, 2021
@potiuk potiuk merged commit 176165d into apache:main Oct 11, 2021

    return self.get_hook().check_for_prefix(
        prefix=self.prefix, delimiter=self.delimiter, bucket_name=self.bucket_name
    )
    return all(self._check_for_prefix(prefix) for prefix in self.prefix)

Member

I just realised this would cause some _check_for_prefix calls to be skipped if any of the prefixes fails the check, because all short-circuits on the first falsy value. Not sure if that would be an issue in practice.

Also we should've changed the attribute from prefix to prefixes 🙂

Contributor Author

I can raise another PR to change the name to prefixes. Had the same thought 😄. I had called this out as point 1 in the description. I'm curious how we handle such backward-incompatible changes. Do we update UPDATING.md?

To support finer control over all or any, I was also suggesting passing a callable which lets the user decide on a per-prefix basis. The default implementation could continue to be all-based. I can raise another PR to do both these changes if it makes sense.

The change will lead to an extra __init__ parameter:

    def __init__(
        self,
        ...,  # existing parameters elided
        callback: Callable[[Dict[str, bool]], bool] = lambda prefix_available: all(prefix_available.values()),
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.callback = callback

    def poke(self, context: Dict[str, Any]):
        self.log.info('Poking for prefix : %s in bucket s3://%s', self.prefix, self.bucket_name)
        # the callback decides whether any, a subset, or all of the prefixes must be present
        return self.callback({prefix: self._check_for_prefix(prefix) for prefix in self.prefix})
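
To make the proposal concrete, here is a standalone sketch of the same callback idea outside the sensor class. The names `evaluate_prefixes`, `existing` and `check` are hypothetical and exist only for this illustration; a set membership test stands in for the real S3 lookup.

```python
from typing import Callable, Dict, List


def evaluate_prefixes(
    prefixes: List[str],
    check: Callable[[str], bool],
    callback: Callable[[Dict[str, bool]], bool] = lambda available: all(available.values()),
) -> bool:
    # Every prefix is checked first; the callback then sees the full picture.
    return callback({prefix: check(prefix) for prefix in prefixes})


# Hypothetical bucket contents used only for this illustration.
existing = {"logs/", "data/"}
check = lambda prefix: prefix in existing

needs_all = evaluate_prefixes(["logs/", "tmp/"], check)
needs_any = evaluate_prefixes(
    ["logs/", "tmp/"], check, callback=lambda available: any(available.values())
)
```

With the default callback the result is False because "tmp/" is missing; with the any-based callback it is True because "logs/" is present. Note that the dict comprehension always checks every prefix, so this variant never skips a call.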

Member

I'm curious how we handle such backward incompatible changes.

Do you mean changing prefix to prefixes? As the change stands right now, it's already backwards incompatible, so changing the name is not really a concern. It's extremely unlikely (and not idiomatic) to access attributes in a hook, so it shouldn't really matter much. We could even change the attribute to _prefixes to discourage access even further.

As for the all issue, you can actually fix it quite easily with all([...]): building a list first runs every check before all evaluates the results.
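
The difference between the two forms can be seen in a small self-contained sketch. The `check` function and the prefix names are made up for illustration; `check` records each call in place of a real S3 request.

```python
from typing import List

calls: List[str] = []


def check(prefix: str) -> bool:
    # Record each call; only "a/" counts as present.
    calls.append(prefix)
    return prefix == "a/"


prefixes = ["missing/", "a/", "b/"]

# Generator expression: all() stops at the first False result.
lazy_result = all(check(p) for p in prefixes)
lazy_calls = list(calls)

calls.clear()

# List comprehension: every check runs before all() looks at the results.
eager_result = all([check(p) for p in prefixes])
eager_calls = list(calls)
```

Both forms return the same boolean; they differ only in how many checks are executed, which matters if each check has a side effect or a cost.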

Contributor Author

Agree that with the removal of full_url, backward compatibility has already gone away. For the name change, the worry was more about people passing keyword arguments; I wasn't worried about attribute access. I can create a small PR to rename to _prefixes if that isn't an issue. Let me know if that makes sense.

I now understand what you were referring to with the eager return of all. Personally I think it helps avoid a few extra calls to S3 and so should be fine.

Labels
area:providers okay to merge It's ok to merge this PR as it does not require more tests provider:amazon-aws AWS/Amazon - related issues
Development

Successfully merging this pull request may close these issues.

S3MultipleKeysSensor operator
3 participants