data.all should not be able to delete all Glue databases and tables in Environment account #531

jeffshep · 2023-06-22T16:09:12Z

Describe the bug

A fundamental issue with data.all

Deleting a dataset that was imported from existing resources results in the Glue database being deleted in the AWS environment account. The Glue resources are not owned by data.all; they are owned by a data pipeline that already exists in the account, and they should exist independently of whether the data is shared with data.all or not.

How to Reproduce

In an AWS account, create a Glue database and table, pointing to data in S3.
In data.all, Import a new dataset with these resources.
Wait for data.all to create the CloudFormation stack to import the dataset successfully.
Delete the dataset in data.all.
Observe the dataall-gluedb-lf-handler-UUID resource deleting the Glue database and table.

Expected behavior

In an AWS account, create a Glue database and table, pointing to data in S3.
In data.all, Import a new dataset with these resources.
Wait for data.all to create the CloudFormation stack to import the dataset successfully.
Delete the dataset in data.all.
The database and table should remain in the AWS account, data.all permissions should be removed.

I would expect data.all to respect existing resources and delineate between an imported dataset and a created dataset, so that when the dataset is deleted, it doesn't overstep its boundaries and delete the resources that existed before data.all was involved.

As for a technical solution, given that Glue databases created by data.all have a prefix with the environment name dataall_DBNAME_UUID, can logic be added to the gluedatabasecustomresource (in backend/dataall/cdkproxy/assets/) to check if the name has this format, and skip deleting the database if it does not?
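The name check proposed above could look roughly like the following. This is only a sketch: the function names are hypothetical, and the regular expression assumes the UUID suffix is a lowercase alphanumeric token, which this issue does not specify. It is not the code of the actual custom resource.

```python
import re

# Assumption: data.all-created databases are named dataall_<DBNAME>_<UUID>.
# The exact shape of the UUID suffix is not given in this issue, so a
# lowercase alphanumeric token is assumed here purely for illustration.
DATAALL_DB_NAME = re.compile(r"^dataall_(?P<name>.+)_(?P<uid>[0-9a-z]+)$")


def is_dataall_created(database_name: str) -> bool:
    """Return True if the name follows the assumed data.all naming pattern."""
    return DATAALL_DB_NAME.match(database_name) is not None


def should_delete_database(database_name: str) -> bool:
    """Only databases that look like they were created by data.all are deleted."""
    return is_dataall_created(database_name)


if __name__ == "__main__":
    for name in ("dataall_sales_ab12cd34", "legacy_sales_db"):
        action = "delete" if should_delete_database(name) else "keep"
        print(f"{name}: {action}")
```

A name-based check is fragile if a pre-existing database happens to match the pattern, so an explicit imported/created flag would be a more robust signal.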

Your project

No response

Screenshots

No response

OS

n/a

Python version

n/a

AWS data.all version

1.5.4

Additional context

No response


dlpzx · 2023-06-23T07:50:30Z

Hi @jeffshep, thanks for raising the issue. Yes, we are aware of this and have been discussing internally how best to tackle it. We have 2 alternatives:

1. Do not delete databases in any case: neither imported databases nor data.all-created databases. The reasoning is that Glue databases should be treated differently and only be deleted manually (as happens with S3 buckets, for example).
2. Do not delete imported databases. We can clean up data.all-created ones, but the imported ones should always be kept and, as you said above, data.all should respect existing resources.

I would like to know your opinion on it so that we can start working on it and include it in the V1.6 release (mid July).

jeffshep · 2023-06-23T10:39:24Z

@dlpzx Thanks for the quick response. I would strongly advocate option 2. If I were using data.all to create resources, I would expect it to delete them when I delete a dataset; any data retention requirements would be my concern, not data.all's.

dlpzx · 2023-06-23T14:00:46Z

Thank you @jeffshep, we will try to include it as optional for v1.6, and at that stage it will keep imported databases only.

dlpzx · 2023-07-11T16:44:41Z

It was implemented in #524 and is now under validation.

### Feature or Bugfix
- Bugfix

### Detail
- When updating the dataset database handler custom resource to use a different PhysicalId to distinguish imported datasets, CloudFormation forces a replacement of the resource, which results in the deletion of the Glue database. With this PR we add a condition to only delete databases when the custom resource physical id is prefixed by "CREATED-".
- Force Glue crawlers created by data.all to use the dataset IAM role as execution role. Previously a boto3 call updated this to use the pivotRole (contrary to what was defined in CloudFormation). With this PR we make sure that the correct role is used in any existing crawler.

I tested in a real AWS deployment that had previously created environments and datasets, with a crawler with the pivotRole assigned:
- [X] Crawler execution role gets updated to dataset role
- [X] Imported glue databases do not get deleted when the custom resource updates the physicalID

### Relates
- #531

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
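As an illustration of the condition described in this PR, a delete handler for the custom resource might be shaped like the sketch below. Only the "CREATED-" prefix comes from the PR text. The function name `on_delete`, the `DatabaseName` property, and the returned dictionary are assumptions made for the example; the Glue client is passed in so the logic can be exercised without AWS access. The only AWS call assumed is the boto3 Glue client's `delete_database(Name=...)`.

```python
CREATED_PREFIX = "CREATED-"  # prefix named in the PR description


def on_delete(event: dict, glue_client) -> dict:
    """Handle a CloudFormation custom resource Delete request.

    The Glue database is deleted only when the physical resource id marks it
    as created by data.all; imported databases are left untouched.
    """
    physical_id = event["PhysicalResourceId"]
    # Hypothetical property name; the real resource properties may differ.
    database_name = event["ResourceProperties"]["DatabaseName"]

    deleted = False
    if physical_id.startswith(CREATED_PREFIX):
        glue_client.delete_database(Name=database_name)
        deleted = True

    return {"PhysicalResourceId": physical_id, "Deleted": deleted}
```

With this shape, a replacement triggered by a PhysicalId change still sends a Delete request for the old resource, but that request becomes a no-op unless the old id carries the "CREATED-" prefix.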

### Feature or Bugfix
Release PR with the following list of features. Refer to each PR for the details.

### Detail
- #498
- #482
- #543
- #524 (which also solves #531)
- #532
- #535
- #497
- #515
- #529
- #562
- #455
- #572
- #567
- #573
- #579
- #578
- #582

### Breaking changes - release notes
- ⚠️ IMPORTANT: upgrade to a version >V1.5.0 before upgrading to V1.6 to avoid deletion of resources in custom resource deletion
- ⚠️ IMPORTANT: requires an update of environments and then datasets after upgrading. Either using cdk.json parameter `enable_update_dataall_stacks_in_cicd_pipeline`, waiting for overnight update stack task, or manually updating first environments and then datasets
- CloudFront distribution replace for #529
- Additional EC2 permissions in CDK Synth CodeBuild stage for #543 --> this can be avoided by upgrading to v1.5.6 before upgrading to v1.6.0
- local development affected by more restrictive pivotRole trust policy

### Relates
V1.6.0 release

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Co-authored-by: Gezim Musliaj <102723839+gmuslia@users.noreply.github.com>
Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com>
Co-authored-by: nikpodsh <124577300+nikpodsh@users.noreply.github.com>
Co-authored-by: chamcca <40579012+chamcca@users.noreply.github.com>
Co-authored-by: Nikita Podshivalov <nikpodsh@amazon.com>
Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
Co-authored-by: mourya-33 <134511711+mourya-33@users.noreply.github.com>

noah-paige · 2023-07-19T15:59:37Z

Closing issue - implemented in v1.6

dlpzx added the priority: high label Jun 23, 2023

dlpzx added the type: enhancement and status: in-progress labels Jun 23, 2023

This was referenced Jul 11, 2023

V1.6.0 features #565

Merged

v1.6.0 backwards compatibility changes #567

Merged

noah-paige closed this as completed Jul 19, 2023

noah-paige removed the status: in-progress label Jul 19, 2023
