[Gh 904] Central Catalog Support #1021
Conversation
Adding the same instructions as mentioned by @blitzmohit on the previous PR, for easy access. The expected flow is as follows:

- Data.all Onboarding: Producer_team is onboarded to data.all with their account/env set up.
- Glue DB Creation:
- Data.all Dataset & Share Creation: Producer_team imports a new dataset (dataset1) with the S3 bucket and the producer_team_db1 Glue DB (producer_team_db1_res_link in this case).
- Data.all Share Provisioning: Producer_team approves the share request.

Note: this is intended to be used together with the S3 bucket access, as sharing the table from the catalog account alone would not provide that. In a future version, tables could be shared without the S3 bucket share.
Some changes needed; left in comments.
# Conflicts:
#	backend/dataall/modules/dataset_sharing/aws/glue_client.py
#	backend/dataall/modules/dataset_sharing/services/data_sharing_service.py
#	backend/dataall/modules/dataset_sharing/services/share_managers/lf_share_manager.py
#	backend/dataall/modules/dataset_sharing/services/share_processors/lf_process_cross_account_share.py
#	tests/modules/datasets/tasks/test_lf_share_manager.py
Hi @dlpzx, I have updated the PR by merging in the changes you made for simplifying Lake Formation. Thanks for the refactoring; the code looks very clean and is easy to understand. Pending task -
I have performed the following tests:
Left some comments. Going back to the root issue, I am thinking that maybe we could add the `check_catalog_account_exists_and_update_processor` logic as part of the initialization of the `LFShareManager`.

With this PR, when we approve a share request:
- We initialize the `ProcessLakeFormationShare(LFShareManager)` with "source" variables of the source_environment.
- Then we call `process_approved_shares`, where we `check_catalog_account` and then `update_processor` variables (re-writing the "source" values defined in the init).
- If `check_catalog_account` raises an exception, we `handle_share_failure_for_all_tables`.

As an alternative we could:
- Initialize the `ProcessLakeFormationShare(LFShareManager)`, and in the init `check_catalog_account` and directly define the "source" variables with their final values, similar to how we define `build_db_shared`.

I think the main issue is the exception handling for the catalog when the account is not onboarded to data.all or is not properly tagged. We could store that the catalog is invalid, for example by setting `source_account_id` to None, and then manage the alarms in `process_approved_shares`.

We can meet over a call to talk about it, but what do you think? I just thought we could make it more human-logical and avoid re-writing init values.
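A minimal sketch of the alternative discussed above, assuming illustrative names (`LFShareManagerSketch`, `catalog_lookup`, and `catalog_invalid` are hypothetical; only `source_account_id` and `process_approved_shares` come from the discussion):

```python
# Illustrative sketch: run the catalog check once in __init__ so the
# "source" variables are defined with their final values and never rewritten.
# `catalog_lookup` and `catalog_invalid` are hypothetical names, not data.all code.

class LFShareManagerSketch:
    def __init__(self, source_env_account, database, catalog_lookup):
        # catalog_lookup(database) returns the catalog account id for a
        # resource link, None for a plain database, or raises if the catalog
        # account is not onboarded / not properly tagged.
        try:
            catalog_account = catalog_lookup(database)
            self.catalog_invalid = False
        except Exception:
            catalog_account = None  # record "invalid" instead of failing in init
            self.catalog_invalid = True
        # "source" values are final from the start: the catalog account if the
        # database is a valid resource link, the environment account otherwise.
        self.source_account_id = catalog_account or source_env_account

    def process_approved_shares(self):
        # Alarm / share-failure handling stays here, where the tables are known.
        if self.catalog_invalid:
            raise RuntimeError('catalog account check failed')
        return self.source_account_id
```

The design point is that the exception is caught once in `__init__` and surfaced later, so no "source" attribute is ever set and then re-set.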
Agreed with this approach. I think it is best if we avoid setting and re-setting instance variables, and it would be cleaner to just handle it at
# Conflicts:
#	backend/dataall/modules/dataset_sharing/services/share_processors/lakeformation_process_share.py
Updated the PR after refactoring as per the review comments. @dlpzx, @noah-paige, please take a look and review. Thanks!

Testing

For all testing scenarios, I checked that the tables show up in Athena for the role they are shared with. I also confirmed that when the share is revoked, the _shared table is removed and the permissions are no longer present.
Small cosmetic changes. I like the new way of initializing the clients.

Testing locally
Reviewing the code from a high level, it looks good with the latest edits, but I will defer to @dlpzx for a more formal review and testing of this PR.
@dlpzx locally it looks fine. It's ready to be tested in AWS.
Testing in AWS. To avoid repeating the exact same tests, I will try some additional scenarios.
LGTM. I left a minor suggestion, but it is optional.
looks good!
I updated the code for the minor comment. I think it will help with readability and also goes along with the comments. Thanks @dlpzx
### Feature or Bugfix
- Bugfix

### Detail
When using a worksheet with a share made with a catalog account (by following the steps described in PR #1021), the worksheet drop-down list doesn't display the correct DB name. This is because the DB name is picked from the producer account (where the S3 bucket is present and where the actual DB is not present), which holds the resource-linked DB. Thus, the autogenerated query doesn't work. Please refer to the screenshot:

<img width="1482" alt="image" src="https://github.com/data-dot-all/dataall/assets/71188245/fbc28286-0ca7-47de-a6ae-3020b1188dcb">

Also, on the share view, the DB name mentioned in the query (in the "Data Consumption details") is the resource-linked DB name and not the correct DB name.

### Relates
- #904

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? No
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization? No
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features? No
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored? No
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Co-authored-by: trajopadhye <tejas.rajopadhye@yahooinc.com>
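The idea behind the fix can be sketched as follows. `display_database_name` is a hypothetical helper (not data.all's actual function), though `TargetDatabase` is the real field the Glue `GetDatabase` API returns for resource links:

```python
# Hedged sketch: for a resource link, show the linked (catalog) database
# name instead of the resource-link name, so autogenerated queries use a
# name that actually resolves. The client is injected to keep this testable.

def display_database_name(glue_client, imported_db_name):
    db = glue_client.get_database(Name=imported_db_name)['Database']
    target = db.get('TargetDatabase')  # set only when the db is a resource link
    return target['DatabaseName'] if target else db['Name']
```

With a real boto3 Glue client, `glue_client = boto3.client('glue')` in the producer account would be passed in the same way.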
### Feature or Bugfix

### Detail
PR containing all the code raised in PR #905, plus unit tests and changes addressing the comments raised on that PR. Copy-pasting the details from that PR:

- Detect if the source database is a resource link
- If it is a resource link, check that the catalog account has been onboarded to data.all
- Check for the presence of the `owner_account_id` tag on the database
  - The tag needs to exist and its value has to match the account id of the share approver

Credits - @blitzmohit
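The checks listed above can be sketched as below. The Glue calls (`get_database`, `get_tags`) are real Glue API operations, but the helper name, error type, and region handling are assumptions for illustration, not data.all's implementation:

```python
# Illustrative sketch of the resource-link checks. Assumes a Glue client with
# permission to read the database and its tags (cross-account reads would in
# practice need a client/credentials in the catalog account).

def resolve_catalog_account(glue_client, region, database_name, approver_account_id):
    """Return the catalog account id if `database_name` is a resource link
    whose owner_account_id tag matches the share approver, else None."""
    db = glue_client.get_database(Name=database_name)['Database']
    target = db.get('TargetDatabase')  # present only for resource links
    if target is None:
        return None  # plain database: nothing to resolve

    catalog_account_id = target['CatalogId']
    # The owner_account_id tag must exist on the linked database and match
    # the share approver's account id.
    arn = (f"arn:aws:glue:{region}:{catalog_account_id}:"
           f"database/{target['DatabaseName']}")
    tags = glue_client.get_tags(ResourceArn=arn).get('Tags', {})
    if tags.get('owner_account_id') != approver_account_id:
        raise PermissionError(
            f"database {target['DatabaseName']} in account {catalog_account_id} "
            f"has no owner_account_id tag matching {approver_account_id}"
        )
    return catalog_account_id
```

If the tag is missing or mismatched the share should fail rather than silently proceed, which is why the sketch raises instead of returning a fallback account id.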
### Testing
- Running unit tests - ✅
- Testing on an AWS-deployed data.all instance with the original PR - ✅
- Sanity testing after addressing comments - [EDIT] ✅ (testing done)
### Relates

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? No
- Have you ensured no `eval` or similar functions are used?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.