
[Gh 904] Central Catalog Support #1021

Merged

Conversation

Contributor

@TejasRGitHub TejasRGitHub commented Jan 30, 2024

Feature or Bugfix

  • Feature

Detail

This PR contains all the code from PR #905, plus unit tests and changes addressing the comments raised on that PR. Details copied from that PR:

  • Detect if the source database is a resource link.
  • If it is a resource link, check that the catalog account has been onboarded to data.all.
  • Check for the presence of the owner_account_id tag on the database: the tag must exist and its value must match the account ID of the share approver.

Credits - @blitzmohit

Testing

Running Unit tests - ✅
Testing on AWS Deployed data.all instance with the Original PR - ✅
Sanity testing after addressing comments - ✅ (testing done)

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)? No
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization? No
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features? No
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users? Yes
    • Have you used the least-privilege principle? How? Yes

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@TejasRGitHub
Contributor Author

TejasRGitHub commented Jan 30, 2024

Adding the same instructions that @blitzmohit posted on the previous PR, for easy access:

Expected flow is as follows:

Data.all Onboarding:

Producer_team is onboarded to data.all with their account/env set up.
Consumer_team is onboarded to data.all with their account/env set up.
Data_catalog_team is onboarded to data.all with their account/env set up. (Note: No additional privileges are required on the catalog account)

Glue DB Creation:

  1. Producer_team requests data_catalog_team to create a Glue database (e.g., producer_team_db1).
  2. Data_catalog_team creates the Glue DB and tables in their account and shares them back to producer_team with write/create-table permissions. (For sharing, grant "Describe" on the DB to the producer account ID and to the IAM role that will import the dataset into data.all; for the tables, grant "Select", "Describe", and optionally "Alter" to the same principals.)
  3. Data_catalog_team adds a tag on producer_team_db1 with the key "owner_account_id" and the value set to producer_team's AWS account ID (account_id: 11111). For example: `aws glue tag-resource --region us-east-1 --resource-arn arn:aws:glue:us-east-1:<CATALOG_ACCOUNT>:database/<DATABASE_NAME> --tags-to-add owner_account_id=<PRODUCER_ACCOUNT>`
  4. Producer_team creates a resource link (producer_team_db1_res_link) in the producer account pointing to the catalog account's database, using the IAM role that onboards the dataset.
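Step 4 can be sketched with the Glue `create_database` API, passing a `TargetDatabase` so the new database is a resource link rather than a real database. The function names are illustrative; `glue_client` stands for a boto3 Glue client created under the IAM role that will import the dataset.

```python
# Sketch of creating the resource link in the producer account.
# `glue_client` is any boto3-style Glue client, e.g.
# boto3.client('glue', region_name='us-east-1'); names are illustrative.

def resource_link_input(catalog_account_id: str, source_db: str, link_name: str) -> dict:
    """Build the DatabaseInput for a Glue resource link targeting the
    catalog account's database."""
    return {
        'Name': link_name,
        'TargetDatabase': {
            'CatalogId': catalog_account_id,
            'DatabaseName': source_db,
        },
    }

def create_resource_link(glue_client, catalog_account_id: str, source_db: str, link_name: str):
    # Called in the producer account; the resource link then shows the
    # catalog account's tables and can be imported into data.all.
    return glue_client.create_database(
        DatabaseInput=resource_link_input(catalog_account_id, source_db, link_name)
    )
```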

Data.all Dataset & Share Creation:

Producer_team imports a new dataset (dataset1) with the S3 bucket and the producer_team_db1 Glue DB (producer_team_db1_res_link in this case).
Consumer_team creates a share request for specific Glue tables & S3 bucket access in the producer_team dataset.

Data.all Share Provisioning:

Producer_team approves the share request.
Share is provisioned if the following conditions are met:

  • data.all checks the Glue DB metadata to get the catalog account ID.
  • The catalog account is onboarded to data.all (i.e., data.all can assume the pivotRole in the catalog account).
  • The "owner_account_id" tag exists on the producer_team_db1 Glue DB (the DB in the catalog account).
  • The tag value matches the producer_team account ID (i.e., owner_account_id=11111).

Note: this is intended to be used together with the S3 bucket share, since sharing the table from the catalog account alone would not provide S3 access. In a future version, tables could be shared on their own without the S3 bucket share.

Contributor

@dlpzx dlpzx left a comment


Some changes needed, left in comments.

trajopadhye and others added 3 commits February 15, 2024 17:02
# Conflicts:
#	backend/dataall/modules/dataset_sharing/aws/glue_client.py
#	backend/dataall/modules/dataset_sharing/services/data_sharing_service.py
#	backend/dataall/modules/dataset_sharing/services/share_managers/lf_share_manager.py
#	backend/dataall/modules/dataset_sharing/services/share_processors/lf_process_cross_account_share.py
#	tests/modules/datasets/tasks/test_lf_share_manager.py
@TejasRGitHub
Contributor Author

TejasRGitHub commented Feb 15, 2024

Hi @dlpzx, I have updated the PR by merging in the changes you made for simplifying Lake Formation. Thanks for the refactoring; the code looks very clean and is easy to understand.

Pending tasks:

  • Raising another PR to update the readme with more detail on how the catalog model works (TODO, after approval)
  • Resolving the failing unit test and adding tests for the newly introduced function

I have performed the following tests

Testing

  1. Existing normal share (with a producer and a consumer)
  • Revoking share ✅
  • Approving share ✅ (by revoking one table from the share, then approving that table again and checking that the existing shared DB is used)
  2. Creating a new share on a dataset
  • Approving share ✅ (able to see a shared DB with the "_shared" suffix)
  • Revoking share ✅ (the "_shared" DB is removed and no table is visible when checking from Athena for that role)
  3. Existing share on a catalog database
  • Revoking a share (TODO)
  • Approving a share (with additional tables or the one which was revoked) (TODO)
  4. Catalog account share
  • Checking share with a consumer account (which is neither the producer nor the catalog account)

    • Adding 1 Table for share ✅ ( Able to see the "_shared" db created in consumer account )
    • Adding another Table for share ✅ ( The same "_shared" db is used and share is successful)
    • Revoking 1 Table ✅ ( "_shared" db is present and now able to see the revoked table )
    • Revoking all other tables ✅ ( "_shared" is not present )
  • Checking share with the producer account

    • Creating the share of table with a Team ( Role ) which is present in producer env ✅
    • Revoking the share of table with that same team ✅
  • Checking share with the catalog account itself

    • Creating the share of table with a Team ( Role ) which is present in producer env ✅
    • Revoking the share of table with that same team ✅

Contributor

@dlpzx dlpzx left a comment


Left some comments. Going back to the root issue, I am thinking that maybe we could add the check_catalog_account_exists_and_update_processor logic as part of the initialization of the LFShareManager.

With this PR, when we approve a share request:

  1. We initialize the ProcessLakeFormationShare(LFShareManager) with "source" variables of the source_environment
  2. Then we call process_approved_shares
  3. Where we check_catalog_account and then update_processor variable (re-writing the "source" values defined in the init)
  4. If check_catalog_account raises exception we handle_share_failure_for_all_tables

As an alternative we could:

  1. We initialize the ProcessLakeFormationShare(LFShareManager). In the init we check_catalog_account and we directly define the "source" variables with their final values. Similar to how we define the build_db_shared.

I think the main issue is the exception handling for the catalog when the account is not onboarded to data.all and it is not properly tagged. We could store that the catalog is invalid for example setting the source_account_id as None and then manage the alarms in process_approved_shares

We can meet over a call to talk about it, but what do you think? I just thought that we could make it more human-logical and avoid re-writing init values

@noah-paige
Contributor

> Left some comments. Going back to the root issue, I am thinking that maybe we could add the check_catalog_account_exists_and_update_processor logic as part of the initialization of the LFShareManager. [...] I just thought that we could make it more human-logical and avoid re-writing init values

Agreed with this approach. I think it's best if we avoid setting and re-setting instance variables; it would be cleaner to just handle it in init().

trajopadhye and others added 3 commits February 20, 2024 14:30
@TejasRGitHub
Contributor Author

Updated the PR after refactoring as per the review comments. @dlpzx, @noah-paige, please take a look and review. Thanks!

Testing

  1. Share with normal producer consumer dataset with table and S3 bucket ✅
  2. Share Revoke with normal producer and consumer of dataset with Table and S3Bucket ✅
  3. Share with Central Catalog dataset with the Consumer ✅
  4. Share Revoke Central Catalog dataset with the Consumer ✅
  5. Share with Central Catalog dataset with the Producer ✅
  6. Share Revoke with Central Catalog dataset with the Producer ✅
  7. Share with Central Catalog dataset with the Central Catalog Account’s role ✅
  8. Share Revoke with Central Catalog dataset with the Central Catalog Account’s role ✅

For all testing scenarios, I checked that the tables show up in Athena for the role they are shared with. I also confirmed that when the share is revoked, the _shared table is removed and the permissions are no longer present.

Contributor

@dlpzx dlpzx left a comment


Small cosmetic changes. I like the new way of initializing the clients

@dlpzx
Contributor

dlpzx commented Feb 22, 2024

Testing locally

  • Testing datasets with catalog_account_id = dataset environment account
    • Sharing 2 tables + 1 folder --> succeeds and table can be queried from Athena
    • Revoking 2 tables + 1 folder --> succeeds
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id NOT linked as environment
    • Sharing 2 tables + 1 folder --> raises error: Pivot role ... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + 1 folder (first share then remove Environment/pivot role) --> raise error: Pivot Role.. All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id NOT linked as environment
    • Sharing 2 tables + 1 folder --> raises error: Pivot role... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + 1 folder (first share then remove Environment/pivot role) --> raise error: Pivot role... All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id linked as environment
    • Sharing 2 tables + 1 folder --> raises error: Tags ... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + 1 folder (first share then remove tags) --> raise error: Tags... All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id linked as environment
    • Sharing 2 tables + 1 folder --> succeeds and table can be queried from Athena
    • Revoking 2 tables + 1 folder --> succeeds

@noah-paige
Contributor

Reviewing the code from a high level, it looks good with the latest edits, but I will defer to @dlpzx's more formal review and testing for this PR.

@SofiaSazonova
Contributor

@dlpzx locally it looks fine. It's ready to be tested in AWS

@dlpzx
Contributor

dlpzx commented Feb 23, 2024

Testing in AWS. To avoid repeating the exact same tests, I will try some additional scenarios:

  • CICD pipeline succeeds
  • Testing datasets with catalog_account_id = dataset environment account (SAME ACCOUNT sharing and KMS imported Dataset)
    • Sharing 2 tables + bucket --> succeeds and table can be queried from Athena, objects from bucket can be downloaded
    • Revoking 2 tables + bucket --> succeeds
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id NOT linked as environment. For this test, I created one S3 bucket in the data producer account, then I registered the S3 location in Lake formation in another account (catalog account) with an IAM role with permissions in the data producer bucket policy. I created the database and tables in the catalog account, I created the schema of the tables and I was able to query the data in Athena in the catalog account. Then I granted permissions to this database and tables to the data producer account with Lake Formation. In the data producer account I accepted the RAM invitations and created a resource link database targeting the shared database. I can see the tables of the resource link database (the same ones as in the original database). Then I import the dataset into data.all using the original S3 bucket and the resource link database.
    • Sharing 2 tables + bucket --> raises error: Failed to share table books from source account DATAPRODUCER_ACCOUNT/eu-west-1 with target account TARGET_ACCOUNT/eu-west-1 due to: Source account details not initialized properly. Please check if the catalog account is properly onboarded on data.all. All tables appear as failed. Bucket is shared.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id NOT linked as environment --> Added tags with `aws glue tag-resource --region eu-west-1 --resource-arn arn:aws:glue:eu-west-1:CATALOG:database/imported_external_sse_2 --tags-to-add owner_account_id=DATAPRODUCER`. I tested the case where there is a pivot role but it is not assumable.
    • Sharing 2 tables + bucket --> raises error: same as previous scenario. Bucket shared
    • Revoking 2 tables + bucket (first share, then remove Environment/pivot role) --> raises error about the account not being properly onboarded. Bucket revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id linked as environment (use untag-resource API call and then checked with get-tags)
    • Sharing 2 tables + bucket--> raises error: Tags ... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + bucket (first share then remove tags) --> raise error: same error as before but for revoking. All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id linked as environment (I did not link it as an environment; I just created a pivotRole-cdk with trust to the data.all central account)
    • Sharing 2 tables + bucket --> succeeds and table can be queried from Athena
    • Revoking 2 tables + bucket--> succeeds

Contributor

@dlpzx dlpzx left a comment


lgtm. I left a minor suggestion but it is optional

Contributor

@noah-paige noah-paige left a comment


looks good!

@TejasRGitHub
Contributor Author

lgtm. I left a minor suggestion but it is optional

I updated the code for the minor comment. I think it will help with readability and also goes along with the comments. Thanks @dlpzx

@noah-paige noah-paige merged commit 34fea4f into data-dot-all:main Feb 23, 2024
8 checks passed
@noah-paige noah-paige linked an issue Feb 23, 2024 that may be closed by this pull request
noah-paige pushed a commit that referenced this pull request Mar 4, 2024
### Feature or Bugfix
- Bugfix


### Detail

When using a worksheet with a share made with a catalog account (following
the steps described in this PR,
#1021), the worksheet drop-down
list doesn't display the correct DB name. This is because the DB name is
picked from the producer account (where the S3 bucket is present and the
actual DB is not), which holds only the resource-linked DB. Thus, the
autogenerated query doesn't work.
Please refer to the screenshot
<img width="1482" alt="image"
src="https://github.com/data-dot-all/dataall/assets/71188245/fbc28286-0ca7-47de-a6ae-3020b1188dcb">

Also, in the share view, the DB name mentioned in the query (in the
"Data Consumption details") is the resource-linked DB name and not the
correct DB name.

### Relates
- #904

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)? No
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization? No
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features? No
  - Do you use standard, proven implementations?
- Are the used keys controlled by the customer? Where are they stored?
No
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: trajopadhye <tejas.rajopadhye@yahooinc.com>
Successfully merging this pull request may close these issues.

Support for table sharing when a catalog account is being used
4 participants