Adds an s3 list prefixes operator #17145
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
03f158c
to
26a6b4d
Compare
Thanks for the contribution!
I'm seeing a lot of duplicate code here. As called out previously in the issue, I'd like to see this as at least one operator. The hooks are a different story, which I think is outside the scope of your contribution here.
To me it makes sense to keep the current operator for backwards compatibility and also since its name is nicely generic (i.e. it's S3ListOperator, not S3ListKeysOperator). I propose we add a new optional boolean parameter to the existing S3ListOperator, perhaps something like include_subfolders or include_common_prefixes. This should be false by default to maintain the current behaviour, but when true it includes common prefixes in the result payload.
The only issue with this that I see is that if folks want only the common prefixes, the new parameter will have to be more complex than just a boolean (your conditions in that case would be keys, common_prefixes, all). Note: if we consider that use case uncommon then another option is for users to call the operator twice; once with just keys and a second time with keys and subfolders, and then do a set difference to get just the subfolders.
@o-nikolas I like what you're thinking. Thanks for giving me more context. I'll see what I can do.
Hey @jarfgit - any news :) ?
I propose if we can rename parameter to |
@potiuk still working on it. I'm actually on the airflow team at astronomer and got pulled into something else. I'll circle back on this shortly. Sorry for the delay!
I think that's a perfectly fine name, but I don't like the reasoning of tying it to the |
26a6b4d
to
4afc084
Compare
@o-nikolas @iostreamdoth @potiuk Ok, at long last I've updated this pull request:
The above assumes the following:
So I have this question: If we want to refactor the operator to return both prefixes and keys but a user might want to use different optional params between keys and prefixes, I don't see an alternative other than requiring the user to use the operator twice with different params. With this in mind, is there a valid argument to have a dedicated |
This doesn't sound overly confident 😆
It does sound like you're finding significant friction here. Two paths forward I see are:
What do others think?
e2f4e9a
to
628b59d
Compare
@o-nikolas I went ahead and reverted back to having two distinct operators - if for no other reason than to keep this PR / convo alive. If I had a better understanding of the use cases for listing s3 keys / prefixes I would feel more confident exploring the first path you outlined above. That being said, still open to workshopping this :)
bucket.put_object(Key='dir/sub_dir/c', Body=b'c')

assert [] == hook.list_prefixes(s3_bucket, prefix='non-existent/')
assert [] == hook.list_prefixes(s3_bucket)
assert ['dir/'] == hook.list_prefixes(s3_bucket, delimiter='/')
assert ['a'] == hook.list_keys(s3_bucket, delimiter='/')
assert ['dir/b'] == hook.list_keys(s3_bucket, prefix='dir/')
assert [] == hook.list_prefixes(s3_bucket, prefix='dir/')
assert ['dir/sub_dir/'] == hook.list_prefixes(s3_bucket, delimiter='/', prefix='dir/')
assert [] == hook.list_prefixes(s3_bucket, prefix='dir/sub_dir/')
Are these changes related to the operator addition?
The calls to list_keys were in the test_list_prefixes unit test. I didn't immediately see why these calls were there and assumed it might have been a copy / paste error. Am I missing something?
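For readers following the list_prefixes assertions in the snippet under review, the expected values come from how S3 rolls keys up into common prefixes when a delimiter is given. Below is a rough pure-Python sketch of that grouping rule. It is not the hook's implementation: list_common_prefixes is a hypothetical helper, and the keys 'a' and 'dir/b' are assumed from the assertions, since only the 'dir/sub_dir/c' upload is visible in the snippet.

```python
def list_common_prefixes(keys, prefix="", delimiter=""):
    """Hypothetical helper mimicking S3's common-prefix roll-up.

    Without a delimiter S3 reports no common prefixes at all. With one,
    every key under `prefix` that contains the delimiter after the prefix
    is grouped under prefix + <text up to and including the delimiter>.
    """
    if not delimiter:
        return []
    found = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # the delimiter occurs after the prefix, so roll the key up
            common = prefix + head + delimiter
            if common not in found:
                found.append(common)
    return found


# Bucket contents assumed from the test under review.
bucket_keys = ["a", "dir/b", "dir/sub_dir/c"]

top_level = list_common_prefixes(bucket_keys, delimiter="/")
nested = list_common_prefixes(bucket_keys, prefix="dir/", delimiter="/")
no_delimiter = list_common_prefixes(bucket_keys, prefix="dir/")
```

Under these assumptions the sketch gives the same results as the list_prefixes assertions: an empty list whenever the delimiter is omitted or the prefix matches nothing, 'dir/' at the top level, and 'dir/sub_dir/' under the 'dir/' prefix.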
Sorry, I've been on vacation for the past three weeks. Sounds good to me, I think that's perfectly valid :) I'll try to have a review sometime this week.
prefix: str = '',
delimiter: str = '',
Does it really make sense to allow the user to not provide these values?
It doesn't :)
I'll admit this was a lazy copy / paste from S3List, so I updated the prefix and delimiter there as well.
@o-nikolas github is not letting me re-request a review from you for some reason, so pinging here.
Hey, sorry I missed this message. Code looks good, and I see that it's merged. Congrats :)
The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
Also:
- Updates list_prefixes() unit tests to assert on a nested dir with a prefix variable
- Removes duplicate calls to list_keys() that were in the list_prefixes() unit test (likely a copy/paste boo boo?)
- Returns common prefixes (i.e. subfolders) in addition to files in the payload when set to True
- Avoids breaking change by preserving file-only functionality by defaulting to False

Delete s3_list_prefixes operator and test
Refactor list keys s3 hook to return both files and common prefixes
Refactor calls to list_keys hook, unit tests
Refine param documentation in the S3ListOperator
Add more unit test cases to test_list_keys and test_list_prefixes
Clarify a comment
This reverts commit c995e9a and returns to the original notion of having a separate list_s3_prefixes operator.
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
…s and S3List operators, remove unnecessary sort() in test, fix typo
68e0ab8
to
4506561
Compare
…Prefixes and S3List operators, remove unnecessary sort() in test, fix typo
- Updates the list_prefixes() unit test to assert on a nested dir with a prefix variable
- Removes duplicate calls to list_keys() that were in the test_list_prefixes() unit test (likely a copy/paste boo boo?)

There are two suggestions from this conversation that I have not included here:
- Combine or otherwise simplify s3_list_keys() and s3_list_prefixes() into one - this makes sense to me but I don't quite know how people tend to use these operators or if there is a valid argument for keeping them separate.
- Combining all the s3 operators into one file like gcs.py - this also makes sense to me, but it's not consistent with the other AWS operators. Might be worth opening a new issue to refactor them all if we want to go in this direction?
Issue Link: #8448