Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data.all should not be able to delete all Glue databases and tables in Environment account #531

Closed
jeffshep opened this issue Jun 22, 2023 · 5 comments
Labels

Comments

@jeffshep
Copy link

jeffshep commented Jun 22, 2023

Describe the bug

A fundamental issue with data.all

Deleting a dataset, that was imported from existing resources, results in the Glue database being deleted in the AWS environment. The Glue resources are not owned by data.all, they are owned by a data pipeline in the account already, and should exist independently of whether the data is shared with data.all or not.

How to Reproduce

In an AWS account, create a Glue database and table, pointing to data in S3.
In data.all, Import a new dataset with these resources.
Wait for data.all to create the CloudFormation stack to import the dataset successfully.
Delete the dataset in data.all.
Observe the dataall-gluedb-lf-handler-UUID resource delete the Glue database and table

Expected behavior

In an AWS account, create a Glue database and table, pointing to data in S3.
In data.all, Import a new dataset with these resources.
Wait for data.all to create the CloudFormation stack to import the dataset successfully.
Delete the dataset in data.all.
The database and table should remain in the AWS account, data.all permissions should be removed.

I would expect data.all to respect existing resources and delineate between an imported dataset and a created dataset, so that when the dataset is deleted, it doesn't overstep its boundaries and delete the resources that existed before data.all was involved.

As for a technical solution, given that Glue databases created by data.all have a prefix with the environment name dataall_DBNAME_UUID, can logic be added to the gluedatabasecustomresource (in backend/dataall/cdkproxy/assets/) to check if the name has this format, and skip deleting the database if it does not?

Your project

No response

Screenshots

No response

OS

n/a

Python version

n/a

AWS data.all version

1.5.4

Additional context

No response

@dlpzx dlpzx added type: enhancement Feature enhacement status: in-progress This issue has been picked and is being implemented labels Jun 23, 2023
@dlpzx
Copy link
Contributor

dlpzx commented Jun 23, 2023

Hi @jeffshep, thanks for raising the issue. Yes we are aware of this and were discussing internally how to tackle it best. We have 2 alternatives:

  • Do not delete databases in any case - do not delete imported databases or data.all created databases. The reason being that Glue databases should be considered differently and only be deleted manually (as it happens with s3 buckets for example)
  • Do not delete imported databases. We can clean-up data.all created ones but the imported ones should always be kept and as you said above, data.all should respect existing resources

I would like to know your opinion on it so that we can start working on it and include it in V1.6 release (mid July)

@jeffshep
Copy link
Author

jeffshep commented Jun 23, 2023

@dlpzx Thanks for the quick response, I would strongly advocate option 2. If I was using data.all to create resources I would expect it to delete them when I delete a dataset, my expectation would be that any data retention requirements would be my concern not data.all's.

@dlpzx
Copy link
Contributor

dlpzx commented Jun 23, 2023

Thank you @jeffshep, we will try to include is as optional for v1.6 and at that state it will keep imported databases only.

@dlpzx
Copy link
Contributor

dlpzx commented Jul 11, 2023

It was implemented in #524, it is now under validation

This was referenced Jul 11, 2023
dlpzx added a commit that referenced this issue Jul 12, 2023
### Feature or Bugfix
- Bugfix

### Detail
- When updating the dataset database handler custom resource to use a
different PhysicalId to distinguish imported datasets CloudFormation
forces a replacement of the resource which results in the deletion of
the glue database. With this PR we add a condition to only delete
databases when the custom resource physical id is prefixed by
"CREATED-".
- Force that Glue crawlers ccreated by data.all use the dataset IAM role
as execution role. Previously a boto3 call updated this to use the
pivotRole (contrary to what it was defined in CloudFormation). With this
PR we make sure that the correct role is used in any existing crawler.

I tested in a real AWS deployment that had previously created
environments and datasets, with a crawler with the pivotRole assigned:

- [X] Crawler execution role gets updated to dataset role
- [X] Imported glue databases do not get deleted when the custom
resource updates the physicalID

### Relates
- #531 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx added a commit that referenced this issue Jul 19, 2023
### Feature or Bugfix
Release PR with the following list of features. Refer to each PR for the
details

### Detail
- #498 
- #482 
- #543
- #524 (which also solves #531)
- #532 
- #535 
- #497 
- #515
- #529 
- #562 
- #455 
- #572 
- #567 
- #573 
- #579 
- #578 
- #582 

### Breaking changes - release notes
- ⚠️ IMPORTANT: upgrade to a version >V1.5.0 before upgrading to V1.6 to
avoid deletion of resources in custom resource deletion
- ⚠️ IMPORTANT: requires an update of environments and then datasets
after upgrading. Either using cdk.json parameter
`enable_update_dataall_stacks_in_cicd_pipeline`, waiting for overnight
update stack task, or manually updating first environments and then
datasets
- CloudFront distribution replace for #529 
- Additional EC2 permissions in CDK Synth CodeBuild stage for #543 -->
this can be avoided by upgrading to v1.5.6 before upgrading to v1.6.0
- local development affected by more restrictive pivotRole trust policy


### Relates
V1.6.0 release

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Gezim Musliaj <102723839+gmuslia@users.noreply.github.com>
Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com>
Co-authored-by: nikpodsh <124577300+nikpodsh@users.noreply.github.com>
Co-authored-by: chamcca <40579012+chamcca@users.noreply.github.com>
Co-authored-by: Nikita Podshivalov <nikpodsh@amazon.com>
Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
Co-authored-by: mourya-33 <134511711+mourya-33@users.noreply.github.com>
@noah-paige
Copy link
Contributor

Closing issue - implemented in v1.6

@noah-paige noah-paige removed the status: in-progress This issue has been picked and is being implemented label Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants