Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempts to "clean up" s3 in case of Iceberg tables, which is redundant #285

Closed
brabster opened this issue May 5, 2023 · 8 comments · Fixed by #292
Closed

Attempts to "clean up" s3 in case of Iceberg tables, which is redundant #285

brabster opened this issue May 5, 2023 · 8 comments · Fixed by #292
Labels
bug Something isn't working documentation Improvements or additions to documentation iceberg This issue concerns Iceberg integration

Comments

@brabster
Copy link
Contributor

brabster commented May 5, 2023

More new behaviour since 1.3.5

For CTAS dbt-athena attempts to delete objects at location in the case of iceberg table. It did not do that before, so this is another breaking change thus requiring new S3 permissions (listing and deleting) if you're using iceberg (which I am, and I don't have the necessary permissions).

This activity seems to be redundant given AWS docs

Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data in the table.

Problematic line appears to be here. I'll PR a fix to remove this call in case of iceberg.

@nicor88
Copy link
Contributor

nicor88 commented May 5, 2023

@brabster why would you want to remove this
Can't you get the right permissions in place?

@nicor88
Copy link
Contributor

nicor88 commented May 5, 2023

Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data in the table.

the idea is that we DO not drop the table via drop, but via glue api plus s3 location. The issue is that drop sometimes do not clean the path correctly when you have many objects. We had already few iteration on this so far, and I won't recommend to change this behavior ;)

@brabster
Copy link
Contributor Author

brabster commented May 6, 2023

(I do appreciate some of the challenges with OSS maintenance, I've done a bit of that myself. I'll take a few minutes to try and explain my context and why I'm having problems, but I think it's really more a question for how dbt-athena decides what and how to implement, and how breaking changes are defined and handled. I'll try and find a bit more time to drop my broader thoughts into the governance for the maintainers)

Can't you get the right permissions in place?

I think this might be the crux of the problem - dbt-athena makes the assumption that any user will have access not just to Athena, but also fairly comprehensive access to Glue and S3. I'm sure that's the case for many or even most teams, but it is possible to use Athena in a much more restricted mode, using the CalledVia key to permit access to S3 only when called via Athena, as documented by AWS here.

The idea of restricting IAM permissions as far as possible is indicated by the presence of that documentation specifically in the context of Athena, and the general AWS "best practices" guidance for IAM issued here, specifically:

When you set permissions with IAM policies, grant only the permissions required to perform a task.

I am a consultant working with highly sensitive data for a client, and aside from is being a best practice, I was explicitly asked to minimise the permissions required for teams - with a specific callout for no direct access to S3. Up to dbt-athena 1.3.5, this was not a problem. Almost every release since, the set of required permissions has been extended (I'm sure you're aware of the battery of issues I've raised this week - this is as a result of me trying to get update our dependency on dbt-athena to latest and every time I fix one thing, I just surface the next broken thing.

To reiterate - everything works fine on 1.3.5. These are all breaking changes, if your consumer is operating in a least privilege environment.

Why don't I just update the permissions?

  • I have to go and argue a case with infosec and compliance folk to make this kind of change. What's my argument? It's clearly possible to operate without these permissions - as 1.3.5 shows. "The maintainers of dbt-athena will change the required permissions at any time, for any reason, without notice, and will refuse to consider it a breaking change when they break you"... not the best look
  • That dialogue takes time. It's also expensive, occupying a bunch of people's time. This kind of a change might even result in a need to do a bunch of security threat modelling, which is not a quick process.
  • I'm likely to hit a wall anyway. The S3 access is likely a dealbreaker (some teams are allowed this, others are not)
  • Finally - what permissions are safe to assume stable? dbt-athena does not document the permissions it needs, and if I figure out what's needed in line with AWS best practice and go with that it seems inevitable that by release 1.4.8 you will break me again.

The issue is that drop sometimes do not clean the path correctly when you have many objects. We had already

Have you raised this issue with AWS? If the advertised functionality doesn't work as expected, I'm sure AWS would appreciate getting that info and whatever evidence you have of the problem (as would everyone who uses Athena Iceberg and expects it to work properly like me!)

@nicor88
Copy link
Contributor

nicor88 commented May 8, 2023

Thanks for your detailed answer.

When we build and maintain new features for the adapter we don't have all the usage clear in mind, therefore giving such context in each issue helps us to understand what lead to it. Often we give our first answer based on common patterns/issues raised in the past - see this case: I assumed that you could simply add a missing IAM action.

Your setup is a rather special one, thanks for making us aware. I apply the "at least privilege rule" in all of my projects, but as often I have admin rights, I can add more policies, this is not the case for you - and for sure other users.

We are aware of changes in the IAM policies, and often we prefer to let the final user add new policies in order to fix a specific issue, generally, this approach had lead us to make most of the users happy, and I'm aware that we won't be able to to make everybody happy.

Have you raised this issue with AWS? If the advertised functionality doesn't work as expected, I'm sure AWS would appreciate getting that info and whatever evidence you have of the problem (as would everyone who uses Athena Iceberg and expects it to work properly like me!)

This implementation is a fruit of a suggestion made by AWS support. I guess that you are aware of the long fixing time of some bugs, hence the proposed fix.

Also, I would like to remind you that this project is a community-based effort, we do maintain in free time and compromise with work/private life. Said so, when we do a change we put "maintainability first", which means we avoid "backward compatibility changes", and convoluted code, as simply we don't have enough resources (time/people) to do so - happy to add you as code maintainer if you are interested.

In this specific case, I believe that we could consider an extra global/model specific flag native_iceberg_drop, if set to True the native athena drop statement will be executed and cleanup of s3 will happen again using athena iceberg capability - by default such flag will be set to False.

WDYT?

cc @Jrmyy

@nicor88 nicor88 added bug Something isn't working documentation Improvements or additions to documentation enhancement iceberg This issue concerns Iceberg integration labels May 8, 2023
@brabster
Copy link
Contributor Author

brabster commented May 8, 2023

Thanks @nicor88 for taking the time, that sounds like a perfectly good solution - I'll be happy to submit a PR to make that change as per your spec. I am also mid-writing a bit of a timeline and suggestions in the Governance ticket as there's a bit of a multi-issue story - will finish that up and then do that PR 🙇

This was referenced May 8, 2023
@brabster
Copy link
Contributor Author

brabster commented May 9, 2023

@nicor88 I have started work on a PR, but I'm wondering whether a better way of approaching the whole problem I've been trying to explain would be to have a single flag like "sql_only" that switches behaviour to avoid other API calls instead of a potentially several flags that toggle bits to be done in SQL rather than API calls.

I think that could be understood as a feature to support the restricted access case which should help avoid confusion. That flag should also be easier to understand and use as well as easier to implement than multiple small independent fixes (although support can be implemented in small PRs not one mega-PR) - I get the feeling the piecemeal approach will end up being more difficult to maintain?

@Jrmyy
Copy link
Contributor

Jrmyy commented May 9, 2023

Hello

I overall agree with the discussion and I feel indeed that we can both support an "enhanced" version of the adapter, leveraging S3, Glue and Athena in order to have the maximum capacities of the adapter.

I also think that having a sql_only or athena_only could be a better way to handle things.

This config could be in the profiles file.

I also warn that this might be a noticeable amount of work because we would have to check everything to be sure they are compliant with both approaches. My 2 cents is really to have a shared discussion of what we want to achieve and what complexity it will bring to support several ways to do things. I don't want to be opinionated or impose anything but on the other hand I don't want to over-complexify the adapter if we don't have an audience for that.

What do you think about having an open discussion about it on #db-athena in DBT Slack ? Maybe it will help having more insights.

@brabster
Copy link
Contributor Author

brabster commented May 9, 2023

@Jrmyy I didn't know there was a Slack channel! Could pop that in the README. Would love to have more of a dialogue about this, it's not a trivial thing I'll see you over on Slack 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation iceberg This issue concerns Iceberg integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants