
[Gh 904] Central Catalog Support #1021

Merged

Conversation

Contributor

@TejasRGitHub TejasRGitHub commented Jan 30, 2024

Feature or Bugfix

  • Feature

Detail

This PR contains all the code from PR #905, plus unit tests and changes addressing the comments raised on that PR. Details copied from that PR:

  • Detect if the source database is a resource link.
  • If it is a resource link, check that the catalog account has been onboarded to data.all.
  • Check for the presence of the owner_account_id tag on the database: the tag must exist and its value must match the account ID of the share approver.

Credits - @blitzmohit

Testing

Running Unit tests - ✅
Testing on AWS Deployed data.all instance with the Original PR - ✅
Sanity testing after addressing comments - ✅ (testing done)

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)? No
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization? No
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features? No
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users? Yes
    • Have you used the least-privilege principle? How? Yes

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@TejasRGitHub
Contributor Author

TejasRGitHub commented Jan 30, 2024

Adding the same instructions that @blitzmohit posted on the previous PR, for easy access:

Expected flow is as follows:

Data.all Onboarding:

Producer_team is onboarded to data.all with their account/env set up.
Consumer_team is onboarded to data.all with their account/env set up.
Data_catalog_team is onboarded to data.all with their account/env set up. (Note: No additional privileges are required on the catalog account)

Glue DB Creation:

  1. Producer_team requests data_catalog_team to create a Glue database (e.g., producer_team_db1).
  2. Data_catalog_team creates the Glue DB and tables in their account and shares them back to producer_team with write/create-table permissions. (For sharing, grant "Describe" on the DB to the producer account ID and to the IAM role that will import the dataset into data.all; for the tables, grant "Select", "Describe", and optionally "Alter" to the same principals.)
  3. Data_catalog_team adds a tag on producer_team_db1 with the key "owner_account_id" and the value set to producer_team's AWS account ID (account_id: 11111). For example: `aws glue tag-resource --region us-east-1 --resource-arn arn:aws:glue:us-east-1:<CATALOG_ACCOUNT>:database/<DATABASE_NAME> --tags-to-add owner_account_id=<PRODUCER_ACCOUNT>`
  4. Producer_team creates a resource link (producer_team_db1_res_link) in the producer account pointing to the catalog account's database, using the IAM role that onboards the dataset.
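Step 4 can be sketched with the Glue `create_database` API, passing a `TargetDatabase` so the new database is a resource link rather than a real database. The function names are illustrative; `glue_client` stands for a boto3 Glue client created under the IAM role that will import the dataset.

```python
# Sketch of creating the resource link in the producer account.
# `glue_client` is any boto3-style Glue client, e.g.
# boto3.client('glue', region_name='us-east-1'); names are illustrative.

def resource_link_input(catalog_account_id: str, source_db: str, link_name: str) -> dict:
    """Build the DatabaseInput for a Glue resource link targeting the
    catalog account's database."""
    return {
        'Name': link_name,
        'TargetDatabase': {
            'CatalogId': catalog_account_id,
            'DatabaseName': source_db,
        },
    }

def create_resource_link(glue_client, catalog_account_id: str, source_db: str, link_name: str):
    # Called in the producer account; the resource link then shows the
    # catalog account's tables and can be imported into data.all.
    return glue_client.create_database(
        DatabaseInput=resource_link_input(catalog_account_id, source_db, link_name)
    )
```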

Data.all Dataset & Share Creation:

Producer_team imports a new dataset (dataset1) with the S3 bucket and the producer_team_db1 Glue DB (producer_team_db1_res_link in this case).
Consumer_team creates a share request for specific Glue tables & S3 bucket access in the producer_team dataset.

Data.all Share Provisioning:

Producer_team approves the share request.
Share is provisioned if the following conditions are met:

  • data.all checks the Glue DB metadata to get the catalog account ID.
  • The catalog account is onboarded to data.all (i.e., data.all can assume the pivotRole in the catalog account).
  • The "owner_account_id" tag exists on the producer_team_db1 Glue DB (the DB in the catalog account).
  • The tag value matches the producer_team account ID (i.e., owner_account_id=11111).

Note: this is intended to be used together with the S3 bucket share, since sharing the table from the catalog account alone would not provide S3 access. In a future version, tables could be shared on their own without the S3 bucket share.

Contributor

@dlpzx dlpzx left a comment


Some changes needed, left in comments.

trajopadhye and others added 3 commits February 15, 2024 17:02
# Conflicts:
#	backend/dataall/modules/dataset_sharing/aws/glue_client.py
#	backend/dataall/modules/dataset_sharing/services/data_sharing_service.py
#	backend/dataall/modules/dataset_sharing/services/share_managers/lf_share_manager.py
#	backend/dataall/modules/dataset_sharing/services/share_processors/lf_process_cross_account_share.py
#	tests/modules/datasets/tasks/test_lf_share_manager.py
@TejasRGitHub
Contributor Author

TejasRGitHub commented Feb 15, 2024

Hi @dlpzx, I have updated the PR by merging in the changes you made for simplifying Lake Formation. Thanks for the refactoring; the code looks very clean and is easy to understand.

Pending tasks:

  • Raising another PR to update the readme with more detail on how the catalog model works (TODO, after approval)
  • Resolving the failing unit test and adding tests for the newly introduced function

I have performed the following tests

Testing

  1. Existing normal share (with a producer and a consumer)
  • Revoking share ✅
  • Approving share ✅ (by revoking one table from the share, then approving that table again and checking that the existing shared DB is used)
  2. Creating a new share on a dataset
  • Approving share ✅ (able to see a shared DB with the "_shared" suffix)
  • Revoking share ✅ (the "_shared" DB is removed and no table is visible when checking from Athena for that role)
  3. Existing share on a catalog database
  • Revoking a share (TODO)
  • Approving a share (with additional tables or the one which was revoked) (TODO)
  4. Catalog account share
  • Checking share with a consumer account (which is neither the producer nor the catalog account)

    • Adding 1 Table for share ✅ ( Able to see the "_shared" db created in consumer account )
    • Adding another Table for share ✅ ( The same "_shared" db is used and share is successful)
    • Revoking 1 Table ✅ ( "_shared" db is present and now able to see the revoked table )
    • Revoking all other tables ✅ ( "_shared" is not present )
  • Checking share with the producer account

    • Creating the share of table with a Team ( Role ) which is present in producer env ✅
    • Revoking the share of table with that same team ✅
  • Checking share with the catalog account itself

    • Creating the share of table with a Team ( Role ) which is present in producer env ✅
    • Revoking the share of table with that same team ✅

Contributor

@dlpzx dlpzx left a comment


Left some comments. Going back to the root issue, I am thinking that maybe we could add the check_catalog_account_exists_and_update_processor logic as part of the initialization of the LFShareManager.

With this PR, when we approve a share request:

  1. We initialize the ProcessLakeFormationShare(LFShareManager) with "source" variables of the source_environment
  2. Then we call process_approved_shares
  3. Where we check_catalog_account and then update_processor variable (re-writing the "source" values defined in the init)
  4. If check_catalog_account raises exception we handle_share_failure_for_all_tables

As an alternative we could:

  1. We initialize the ProcessLakeFormationShare(LFShareManager). In the init we check_catalog_account and we directly define the "source" variables with their final values. Similar to how we define the build_db_shared.

I think the main issue is the exception handling for the catalog when the account is not onboarded to data.all and it is not properly tagged. We could store that the catalog is invalid for example setting the source_account_id as None and then manage the alarms in process_approved_shares

We can meet over a call to talk about it, but what do you think? I just thought that we could make it more human-logical and avoid re-writing init values

@noah-paige
Contributor

> Left some comments. Going back to the root issue, I am thinking that maybe we could add the check_catalog_account_exists_and_update_processor logic as part of the initialization of the LFShareManager. [...] I just thought that we could make it more human-logical and avoid re-writing init values

Agreed with this approach. I think it's best if we avoid setting and re-setting instance variables; it would be cleaner to just handle it in init().

trajopadhye and others added 3 commits February 20, 2024 14:30
@TejasRGitHub
Contributor Author

Updated the PR after refactoring as per the review comments. @dlpzx, @noah-paige, please take a look and review. Thanks!

Testing

  1. Share with normal producer consumer dataset with table and S3 bucket ✅
  2. Share Revoke with normal producer and consumer of dataset with Table and S3Bucket ✅
  3. Share with Central Catalog dataset with the Consumer ✅
  4. Share Revoke Central Catalog dataset with the Consumer ✅
  5. Share with Central Catalog dataset with the Producer ✅
  6. Share Revoke with Central Catalog dataset with the Producer ✅
  7. Share with Central Catalog dataset with the Central Catalog Account’s role ✅
  8. Share Revoke with Central Catalog dataset with the Central Catalog Account’s role ✅

For all testing scenarios, I checked that the tables show up in Athena for the role they are shared with. I also confirmed that when the share is revoked, the _shared table is removed and the permissions are no longer present.

Contributor

@dlpzx dlpzx left a comment


Small cosmetic changes. I like the new way of initializing the clients

@dlpzx
Contributor

dlpzx commented Feb 22, 2024

Testing locally

  • Testing datasets with catalog_account_id = dataset environment account
    • Sharing 2 tables + 1 folder --> succeeds and table can be queried from Athena
    • Revoking 2 tables + 1 folder --> succeeds
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id NOT linked as environment
    • Sharing 2 tables + 1 folder --> raises error: Pivot role ... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + 1 folder (first share then remove Environment/pivot role) --> raise error: Pivot Role.. All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id NOT linked as environment
    • Sharing 2 tables + 1 folder --> raises error: Pivot role... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + 1 folder (first share then remove Environment/pivot role) --> raise error: Pivot role... All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id linked as environment
    • Sharing 2 tables + 1 folder --> raises error: Tags ... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + 1 folder (first share then remove tags) --> raise error: Tags... All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id linked as environment
    • Sharing 2 tables + 1 folder --> succeeds and table can be queried from Athena
    • Revoking 2 tables + 1 folder --> succeeds

@noah-paige
Contributor

Reviewing the code from a high level, it looks good with the latest edits, but I will defer to @dlpzx's more formal review and testing for this PR.

@SofiaSazonova
Contributor

@dlpzx locally it looks fine. It's ready to be tested in AWS

@dlpzx
Contributor

dlpzx commented Feb 23, 2024

Testing in AWS. To avoid repeating the exact same tests, I will try some additional scenarios:

  • CICD pipeline succeeds
  • Testing datasets with catalog_account_id = dataset environment account (SAME ACCOUNT sharing and KMS imported Dataset)
    • Sharing 2 tables + bucket --> succeeds and table can be queried from Athena, objects from bucket can be downloaded
    • Revoking 2 tables + bucket --> succeeds
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id NOT linked as environment. For this test, I created one S3 bucket in the data producer account, then I registered the S3 location in Lake formation in another account (catalog account) with an IAM role with permissions in the data producer bucket policy. I created the database and tables in the catalog account, I created the schema of the tables and I was able to query the data in Athena in the catalog account. Then I granted permissions to this database and tables to the data producer account with Lake Formation. In the data producer account I accepted the RAM invitations and created a resource link database targeting the shared database. I can see the tables of the resource link database (the same ones as in the original database). Then I import the dataset into data.all using the original S3 bucket and the resource link database.
    • Sharing 2 tables + bucket --> raises error: Failed to share table books from source account DATAPRODUCER_ACCOUNT/eu-west-1 with target account TARGET_ACCOUNT/eu-west-1 due to: Source account details not initialized properly. Please check if the catalog account is properly onboarded on data.all. All tables appear as failed. Bucket is shared.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id NOT linked as environment --> Added tags with `aws glue tag-resource --region eu-west-1 --resource-arn arn:aws:glue:eu-west-1:CATALOG:database/imported_external_sse_2 --tags-to-add owner_account_id=DATAPRODUCER`. I tested the case where there is a pivot role but it is not assumable.
    • Sharing 2 tables + bucket --> raises error: same as previous scenario. Bucket shared
    • Revoking 2 tables + bucket (first share, then remove Environment/pivot role) --> raises error about the account not being properly onboarded. Bucket revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITHOUT proper tags + catalog_account_id linked as environment (use untag-resource API call and then checked with get-tags)
    • Sharing 2 tables + bucket--> raises error: Tags ... All tables appear as failed. Folders are shared.
    • Revoking 2 tables + bucket (first share then remove tags) --> raise error: same error as before but for revoking. All tables appear as failed. Folders are revoked.
  • Testing datasets with catalog_account_id != dataset environment account + WITH proper tags + catalog_account_id linked as environment (I did not link it as an environment; I just created a pivotRole-cdk with trust to the data.all central account)
    • Sharing 2 tables + bucket --> succeeds and table can be queried from Athena
    • Revoking 2 tables + bucket--> succeeds

Contributor

@dlpzx dlpzx left a comment


lgtm. I left a minor suggestion but it is optional

Contributor

@noah-paige noah-paige left a comment


looks good!

@TejasRGitHub
Contributor Author

lgtm. I left a minor suggestion but it is optional

I updated the code for the minor comment. I think it will help with readability and also goes along with the comments. Thanks @dlpzx

@noah-paige noah-paige merged commit 34fea4f into data-dot-all:main Feb 23, 2024
8 checks passed
@noah-paige noah-paige linked an issue Feb 23, 2024 that may be closed by this pull request
noah-paige pushed a commit that referenced this pull request Mar 4, 2024
### Feature or Bugfix
- Bugfix


### Detail

When using a worksheet with a share made with a catalog account (following
the steps described in this PR,
#1021), the worksheet drop-down
list doesn't display the correct DB name. This is because the DB name is
picked from the producer account (where the S3 bucket is present and the
actual DB is not), which holds only the resource-linked DB. Thus, the
autogenerated query doesn't work.
Please refer to the screenshot
<img width="1482" alt="image"
src="https://github.com/data-dot-all/dataall/assets/71188245/fbc28286-0ca7-47de-a6ae-3020b1188dcb">

Also, in the share view, the DB name mentioned in the query (in the
"Data Consumption details") is the resource-linked DB name and not the
correct DB name.

### Relates
- #904

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)? No
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization? No
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features? No
  - Do you use standard, proven implementations?
- Are the used keys controlled by the customer? Where are they stored?
No
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: trajopadhye <tejas.rajopadhye@yahooinc.com>
Successfully merging this pull request may close these issues.

Support for table sharing when a catalog account is being used
4 participants