Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds an s3 list prefixes operator #17145

Merged
merged 8 commits into from
Oct 5, 2021

Conversation

jarfgit
Copy link
Contributor

@jarfgit jarfgit commented Jul 21, 2021

  • Adds an operator to return a list of prefixes from an S3 bucket
  • Updates list_prefixes() unit test to assert on a nested dir with a prefix variable
  • Removes duplicate calls to list_keys() that were in the test_list_prefixes() unit test (likely a copy/paste boo boo?)

There are two suggestion from this conversation that I have not included here:

  1. Combine or otherwise simplify s3_list_keys() and s3_list_prefixes() into one - this makes sense to me but I don't quite know how people tend to use these operators or if there is a valid argument for keeping them separate.

  2. Combining all the s3 operators into one file like gcs.py - this also makes sense to me, but it's not consistent with the other AWS operators. Might be worth opening a new issue to refactor them all if we want to go in this direction?

Issue Link: #8448


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Jul 21, 2021
@boring-cyborg
Copy link

boring-cyborg bot commented Jul 21, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@jarfgit jarfgit force-pushed the add-s3-list-prefixes-operator branch from 03f158c to 26a6b4d Compare July 21, 2021 20:37
Copy link
Contributor

@o-nikolas o-nikolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

I'm seeing a lot of duplicate code here. As called out previously in the issue, I'd like to see this as at least one operator. The hooks are a different story, which I think is outside the scope of your contribution here.

To me it makes sense to keep the current operator for backwards compatibility and also since it's name is nicely generic (i.e. it's S3ListOperator, not S3ListKeysOperator). I propose we add a new optional boolean parameter to the existing S3ListOperator, perhaps something like include_subfolders or include_common_prefixes. This should be false by default to maintain the current behaviour, but when true it includes common prefixes in the result payload.

The only issue with this that I see is that if folks want only the common prefixes, the new parameter will have to be more complex than just a boolean (your conditions in that case would be keys, common_prefixes, all). Note: if we consider that usecase uncommon then another option the users could do is call the operator twice; once with just keys and a second time with keys and subfolders and then do a set difference to get just the subfolders.

@jarfgit
Copy link
Contributor Author

jarfgit commented Jul 21, 2021

@o-nikolas I like what you're thinking. Thanks for giving me more context. I'll see what I can do.

@potiuk
Copy link
Member

potiuk commented Jul 25, 2021

Hey @jarfgit - any news :) ?

@iostreamdoth
Copy link
Contributor

iostreamdoth commented Jul 26, 2021

I propose if we can rename parameter to recursive in place of (include_subfolders or include_common_prefixes) in s3listoperator to be in concordance with aws s3 ls command line tool options. Default recursive is true to retain the current behavior of operator.
thoughts @o-nikolas @jarfgit ?

@jarfgit
Copy link
Contributor Author

jarfgit commented Jul 26, 2021

@potiuk still working on it. I'm actually on the airflow team at astronomer and got pulled into something else. I'll circle back on this shortly. Sorry for the delay!

@o-nikolas
Copy link
Contributor

I propose if we can rename parameter to recursive in place of (include_subfolders or include_common_prefixes) in s3listoperator to be in concordance with aws s3 ls command line tool options. Default recursive is true to retain the current behavior of operator.
thoughts @o-nikolas @jarfgit ?

I think that's a perfectly fine name, but I don't like the reasoning of tying it to the aws s3 ls command. If anything I think we should be tying to choose naming that closely follows the boto api that this operator wraps (list_objects_v2) and that api uses the common S3 language of prefixes. But again, I think recursive is expressive enough and I'm happy to agree to that :)

@jarfgit
Copy link
Contributor Author

jarfgit commented Aug 5, 2021

@o-nikolas @iostreamdoth @potiuk Ok, at long last I've updated this pull request:

  • Refactored S3ListOperator to take a recursive parameter
  • Refactored the s3_list_keys hook to return both a list of keys and list of common prefixes (I know this is out of scope for the issue, but making unnecessary API calls bothers me and it was easy enough to do)
  • Experimented with deleting the s3_list_prefixes hook altogether, but it's used elsewhere and constituted some serious scope creep 😛
  • If recursive == True then we add the prefixes to the list of keys returned by the operator
  • Ensured recursive defaults to False and shouldn't present a breaking change
  • Added unit tests to both the s3_list_keys and s3_list_prefixes hooks to make sure I (and presumably future contributors) fully understood what the delimiter and prefix args were doing and how various combinations affected what was returned when recursive == True

The above assumes the following:

  • The user wants to retrieve both keys and prefixes using the same optional params (i.e. delimiter and prefix).
  • The user can distinguish keys from prefixes from the list returned by the operator. I would assume that keys have a file extension and prefixes would include the delimiter... but it seems possible that keys may not always have a file extension and I'm basing my understanding of the prefixes containing the delimiter on the unit tests.

So I have this question:

If we want to refactor the operator to return both prefixes and keys but a user might want to use different optional params between keys and prefixes, I don't see an alternative other than requiring the user to use the operator twice with different params. With this in mind, is there a valid argument to have a dedicated S3ListPrefixes operator after all?

@o-nikolas
Copy link
Contributor

  • The user can distinguish keys from prefixes from the list returned by the operator. I would assume that keys have a file extension and prefixes would include the delimiter... but it seems possible that keys may not always have a file extension and I'm basing my understanding of the prefixes containing the delimiter on the unit tests.

This doesn't sound overly confident 😆

So I have this question:

If we want to refactor the operator to return both prefixes and keys but a user might want to use different optional params between keys and prefixes, I don't see an alternative other than requiring the user to use the operator twice with different params. With this in mind, is there a valid argument to have a dedicated S3ListPrefixes operator after all?

It does sound like you're finding significant friction here. Two paths forward I see are:

  1. Change the type of the new parameter to something more advanced that will allow the user to specify exactly what they want out of the command (something like a string or an enum, that can be keys, prefixes, or all). This doesn't get around the issue of having two calls for different optional params though. But, at that point, the user really should have two separate calls if they are querying for a completely distinct set of results IMHO.
  2. Or, just bail on a single operator and go back to having two as you suggest.

What do others think?

@jarfgit jarfgit force-pushed the add-s3-list-prefixes-operator branch from e2f4e9a to 628b59d Compare September 1, 2021 21:45
@jarfgit
Copy link
Contributor Author

jarfgit commented Sep 1, 2021

@o-nikolas I went ahead and reverted back to having two distinct operators - if for no other reason than to keep this PR / convo alive. If I had a better understanding of s3 list keys / prefixes the use cases I would feel more confident exploring the first path you outlined above.

That being said, still open to workshopping this :)

Comment on lines +127 to +134
bucket.put_object(Key='dir/sub_dir/c', Body=b'c')

assert [] == hook.list_prefixes(s3_bucket, prefix='non-existent/')
assert [] == hook.list_prefixes(s3_bucket)
assert ['dir/'] == hook.list_prefixes(s3_bucket, delimiter='/')
assert ['a'] == hook.list_keys(s3_bucket, delimiter='/')
assert ['dir/b'] == hook.list_keys(s3_bucket, prefix='dir/')
assert [] == hook.list_prefixes(s3_bucket, prefix='dir/')
assert ['dir/sub_dir/'] == hook.list_prefixes(s3_bucket, delimiter='/', prefix='dir/')
assert [] == hook.list_prefixes(s3_bucket, prefix='dir/sub_dir/')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes related to the operator addition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calls to list_keys were in the test_list_prefixes unit test. I didn't immediately see why these calls were there and assumed it might have been a copy / past error. Am I missing something?

@uranusjr uranusjr closed this Sep 3, 2021
@uranusjr uranusjr reopened this Sep 3, 2021
@o-nikolas
Copy link
Contributor

@o-nikolas I went ahead and reverted back to having two distinct operators - if for no other reason than to keep this PR / convo alive. If I had a better understanding of s3 list keys / prefixes the use cases I would feel more confident exploring the first path you outlined above.

That being said, still open to workshopping this :)

Sorry I've been on vacation for the past three weeks:

Sounds good to me, I think that's perfectly valid :) I'll try have a review sometime this week.

Comment on lines 74 to 75
prefix: str = '',
delimiter: str = '',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really make sense to allow the user to not provide these values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't :)

I'll admit this was a lazy copy / paste from S3List, so I updated the prefix and delimiter there as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@o-nikolas github is not letting me re-request a review from you for some reason, so pinging here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry I missed this message. Code looks good, and I see that it's merged. Congrats :)

@jarfgit jarfgit requested a review from uranusjr October 1, 2021 22:18
@github-actions
Copy link

github-actions bot commented Oct 3, 2021

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Oct 3, 2021
jarfgit and others added 6 commits October 5, 2021 22:18
Also:
- Updates list_prefixes() unit tests to assert on a nested dir with a prefix variable
- Removes duplicate calls to list_keys() that were in the list_prefixes() unit test (likely a copy/paste boo boo?)
	- Returns common prefixes (i.e. subfolders) in addition to files in the npayload when set to True
	- Avoids breaking change by preserving file-only functionality by defaulting to False

Delete s3_list_prefixes operator and test

Refactor list keys s3  hook to return both files and common prefixes

Refactor calls to list_keys hook, unit tests

Refine param documentation in the S3ListOperator

Add more unit test cases to test_list_keys and test_list_prefixes

Clarify a comment
This reverts commit c995e9a and returns to the original notion of having a separate list_s3_prefixes operator.
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
…s and S3List operators, remove unnecessary sort() in test, fix typo
@kaxil kaxil force-pushed the add-s3-list-prefixes-operator branch from 68e0ab8 to 4506561 Compare October 5, 2021 16:50
…Prefixes and S3List operators, remove unnecessary sort() in test, fix typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers okay to merge It's ok to merge this PR as it does not require more tests provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants