-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow multiple environments in the same account with cdk-pivot role #1064
Conversation
# Conflicts: # backend/dataall/base/aws/iam.py # backend/dataall/base/aws/sts.py # backend/dataall/core/environment/cdk/pivot_role_stack.py # backend/dataall/modules/dashboards/aws/dashboard_quicksight_client.py # backend/dataall/modules/datapipelines/services/datapipelines_service.py # backend/dataall/modules/dataset_sharing/aws/lakeformation_client.py # backend/dataall/modules/dataset_sharing/services/share_managers/lf_share_manager.py # backend/dataall/modules/dataset_sharing/services/share_managers/s3_access_point_share_manager.py # backend/dataall/modules/dataset_sharing/services/share_managers/s3_bucket_share_manager.py # backend/dataall/modules/datasets/aws/s3_profiler_client.py # backend/dataall/modules/datasets/aws/sns_dataset_client.py # backend/dataall/modules/datasets/services/dataset_service.py # tests/modules/datasets/tasks/test_lf_share_manager.py # tests/modules/datasets/tasks/test_s3_access_point_share_manager.py # tests/modules/datasets/tasks/test_s3_bucket_share_manager.py
Went through the PR and left some minor comments. Overall the PR makes sense to me, my comments for the Issues to Decide below:
Question: Ideas:
Additional:
|
backend/dataall/modules/dataset_sharing/services/share_managers/s3_bucket_share_manager.py
Show resolved
Hide resolved
backend/dataall/modules/dataset_sharing/services/share_managers/s3_bucket_share_manager.py
Show resolved
Hide resolved
### Feature or Bugfix - Feature - Bugfix ### Detail - Implements what is described in #1148 ### Relates - #1148 - #1064 ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
# Conflicts: # backend/dataall/modules/dataset_sharing/services/share_managers/lf_share_manager.py
Hello @dlpzx Thanks for prioritising and the details. Attached image is my understanding about the feature About TO DECIDE: Should we make this feature configurable? If so, what is the deprecation timeline for dataallPivotRole-cdk?
ISSUE TO FIX: Folder and bucket permissions get wiped out when the dataset stacks get updated with the new KMS statements (see the issues above). Users need to re-apply the shares because after the upgrade they will be broken.
ISSUE TO DECIDE: orphan permissions pointing at the old cdk role in LF: LakeFormation does not delete the permissions of deleted IAM roles, so the old pivot role keeps having LF permissions and is still the datalake admin after it has been deleted. It appears as invalid. Should we do anything about it? Users can just clean-up the permissions and remove the deleted role from their accounts, but it is true that it is a bit annoying. We can at least offer some sort of script or at least instructions
|
Hi @sandeephs1 thanks for the diagram :) Yes, that is exactly the expected behavior. We will consider your feedback, with the feature as is now I think your requirements for Q2-Q3 are satisfied |
Tested deploying this PR and ensuring: W/ Multi-Env Pivot Role Disabled:
W/ Pivot Role
Left one minor comment on IAM policy permissions - once addressed I think good to approve! |
### Feature or Bugfix - Feature ### Detail Implements #1150 - Add **on-demand** ECS task to reapply shares - It goes through ALL active shares, if there are unhealthy items, it reapplies the share As explained in the issue description, this task is a resource of data.all admins that can run it on-demand using ECS commands or directly in the AWS console. We could extend how it is used: - By triggering it in the `share-verifier` task daily. We can add a one-liner that triggers the share-reapplier task at the end of the share-verifier task. - By adding functionalities in the UI that allow users to trigger this task ### Testing Deploying in AWS: - [X] CICD pipeline runs successfully and creates new on-demand ECS task definition `share-reapplier` - [X] We can run the ECS task `share-reapplier` from the AWS console - [X] Checking the logs of the ECS task `share-reapplier` without any unhealthy share succeeds - [X] Checking the logs of the ECS task `share-reapplier` with unhealthy shares succeeds - reapplies the unhealthy shares ### Relates - #1150 - #1064 ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Approving
### Feature or Bugfix - Feature ### Detail - Add details about multi-region pivot roles related to #1064 - Add details about SSM parameter account settings related to #1154 - Add details on all ECS tasks and commands on how to execute on demand task such as the one in #1151 ### Relates - #1064 - #1154 - #1151 ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). `n/a` - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Feature or Bugfix
IMPORTANT
An error occurred (AWSResourceNotFound) when calling GET_STACK_LOGS operation: Logs could not be found for this stack
- 1. Update data.all to new version
- 2. Update all Environment and Dataset stacks
- 3. trigger the
share-verifier
task in ECS- 4. trigger the
share-reapplier
task in ECS introduced in PR Add share_reapply task - ON DEMAND for data.all admins #1151dataallPivotRole-cdk
because LakeFormation does not delete the permissions of deleted IAM roles. In addition, Lake Formation keeps the old pivot role as datalake admin flagged as invalid (see pic below). The permissions do not really harm anyone and can be deleted manually. If you are a user and are struggling cleaning-up or wish more support, please open a GitHub issue referencing this PR and we will address your challenges.Detail
Problem statement
data.all =<2.2.0 is not designed to allow the creation of multiple data.all Environments in the same AWS account. However, there are customers that store and work with data in the same AWS account but in different regions. This PR is meant to enable that use-case solving its 2 blockers:
Design
Our recommendation is to use cdk-created pivot roles because their policies can be more restrictive and manual errors of deploying CFN stacks are avoided. To solve #936 and use CDK pivot role in a multi-environment same account setup we have 2 options:
dataallPivotRole-cdk
role.Design decisions
Despite having disadvantages, this PR implements option 2 because it is more robust in the long-term. It also has advantages in terms of security as it "isolates" regions; we can take it further an define pivot role policies with the specified region. In addition, we reduce the impact of IAM size service quotas as each role is dedicated to one environment only.
To minimize the code changes and impact, I have used the
region
to differentiate between pivot roles in the same account. At first I added theenvironmentUri
to the pivot role name, but it made it very difficult to use in the boto3aws
clients. We would need to modify all the client calls to input theenvironmentUri
. That's why I re-used theregion
which was already a parameter of most clients.This PR does not handle multiple environments using the manual pivot role. Meaning that it does not expect that each environment has a dedicated
dataallPivotRole-region
. It works as usual with onedataallPivotRole
for all environments. We want to favor CDK-pivot roles and slowly move away from manual pivot roles.Backwards compatibility
As preventive measurement, this PR introduces a
config.json
parameter to configure the usage ofcdk_pivot_role_multiple_environments_same_account
. The default value isFalse
, thus the changes introduced in this PR are not mandatory and customers can migrate whenever they have the need. UPDATE FROM REVIEW: If we see that the backwards compatibility is smooth we will discard the parameter. This is a good enhancement and does not remove any functionality for customers.TESTING
On an existing deployment with 2 environments and datasets (imported and created) and multiple shares, update the codebase. All modules are active and enabled in the environments.
Part 1:
dataallPivotRole-cdk
continues workingcdk_pivot_role_multiple_environments_same_account
--> check it updates and keeps deploying thedataallPivotRole-cdk
cdk_pivot_role_multiple_environments_same_account
=False
--> check it keeps deploying thedataallPivotRole-cdk
dataallPivotRole-cdk
dataallPivotRole-cdk
Part 2:
dataallPivotRole-cdk-region
migration infrastructurecdk_pivot_role_multiple_environments_same_account
toTrue
and update environment. Environment updates the pivot role stack and creates NEWdataallPivotRole-cdk-region
- as expected, in CloudFormation we see the messageRequested update requires the creation of a new physical resource; hence creating one
dataallPivotRole-cdk-region
--> check that imported datasets in the environment are present in the policy statementsdataallPivotRole-cdk-region
is a LF admin - it happens right after environment updatedataallPivotRole-cdk-region
has LF permissions to the dataset. Check that KMS key policy includes the new pivot role. Finally check that the trust policy of the dataset IAM role includes the new pivot role:dataallPivotRole-cdk-region
has LF permissions to the dataset. Bucket policy and KMS policies are not managed by data.all in the creation of dataset. Finally check that the trust policy of the dataset IAM role includes the new pivot rolePart 3:
dataallPivotRole-cdk-region
migration data sharingFailed to get kms key id of alias/loc-new-uqo3m5oh: An error occurred (AccessDeniedException) when calling the DescribeKey operation: User: arn:aws:sts::796821004569:assumed-role/dataallPivotRole-cdk-eu-west-1/dataallPivotRole-cdk-eu-west-1 is not authorized to perform: kms:DescribeKey on resource: arn:aws:kms:eu-west-1:796821004569:key/d00a9763-7fd0-4845-81d6-863d20714888 because no resource-based policy allows the kms:DescribeKey action
Part4:
dataallPivotRole-cdk-region
in boto3 calls - especially IAM and Quicksight that required some additional development.key: profiling/code/jars/deequ-2.0.0-spark-3.1.jar.The specified key does not exist
Additional testing in AWS + support PRs (Update April 8th) - deploy #1151 as well
I triggered the share-verifier task manually and only the expected broken shares (created datasets - folder/bucket sharing) appeared as unhealthy
Other less important items - open issues if needed
ImportError: cannot import name 'ColumnProfilerRunner' from 'awsglue.transforms' (/opt/amazon/lib/python3.6/site-packages/awsglue/transforms/__init__.py)
--> open github issueOpen Issues, remarks and questions
ISSUE TO FIX: Folder and bucket permissions get wiped out when the dataset stacks get updated with the new KMS statements (see the issues above). Users need to re-apply the shares because after the upgrade they will be broken.⚠️ in order for this PR to be released we need to implement some ECS task or automation to re-apply all failed shares ---> Issue Using PR Add share_reapply task - ON DEMAND for data.all admins #1151 this issue can be remediated
TO DECIDE: Should we make this feature configurable? If so, what is the deprecation timeline for
dataallPivotRole-cdk
? ---> At the moment: Yes, this can be configured so that users can decide when to do the upgrade. We will use the maintenance window implementation to update users about this change.ISSUE TO FIX: Bucket sharing or folder sharing in testing part3 will fail for imported datasets because the statement in the KMS policy
DATAALL_KMS_PIVOT_ROLE_PERMISSIONS_SID
is skipped indataall/modules/dataset_sharing/services/share_managers/s3_access_point_share_manager.py:535
keeping the reference to the old pivot role. One quick fix is to rename the Sid, but that will leave orphan resources referencing the old pivot role. A better fix is to ensure that the role is that sid is the one of the pivot role. --> 🟢 fixedISSUE TO FIX: Missing
kms:DescribeKey
permissions for pivot role from KMS key policy making the share fail. I am not sure how this issue did not surface before. I think it has to due with the by default permissions in KMS vs the ones applied by data.all.Failed to get kms key id of alias/loc-new-2-eo336u1r: An error occurred (AccessDeniedException) when calling the DescribeKey operation: User: arn:aws:sts::796821004569:assumed-role/dataallPivotRole-cdk-eu-west-1/dataallPivotRole-cdk-eu-west-1 is not authorized to perform: kms:DescribeKey on resource: arn:aws:kms:eu-west-1:796821004569:key/bb37c05a-0588-45f2-a9e4-4075392ee5e1 because no resource-based policy allows the kms:DescribeKey action
--> permissions added in this commitISSUE TO FIX: Table health verification shows unhealthy after update:
Requestor arn:aws:iam::SOURCE_ACCOUNT:role/dataallPivotRole-cdk-eu-west-1 missing LF permissions: ALL for Glue DB Target: loc_imported_c2_kms_URI Requestor arn:aws:iam::TARGET_ACCOUNT:role/dataallPivotRole-cdk-eu-west-1 missing LF permissions: ALL for Glue DB Target: loc_imported_c2_kms_URI_shared
The issue is solved when we click onRe-Apply share
but it is a bit annoying that customers need to re-apply all their shares. I think we should introduce an ECS task to re-apply all shares. @anushka-singh already suggested this feature for other purposes --->based on @noah-paige's review, I think the best fix will be to ensure that the pivot role gets granted all the permissions in the verification task, that way the verification won't fail and the re-apply won't be necessary for cases in which the missing pivot role permissions are the source problem. ---> Fix in PR: Add grants to pivot role in verify tables functions #1149Less important
ISSUE UNRELATED TO THIS PR: Run profiling job --> it triggers the profiling job but there is an issue unrelated to this PR related to the profiling script in the S3 bucket of the environment:
key: profiling/code/jars/deequ-2.0.0-spark-3.1.jar.The specified key does not exist
---> confirm and create issueISSUE TO INVESTIGATE: the permissions of the pivot role after the shares are reapplied seem incomplete:
--> 🟢 It does not seem to have an impact on shares, all table sharing keeps on working as expected.
Relates
Security
Please answer the questions below briefly where applicable, or write
N/A
. Based onOWASP 10.
fetching data from storage outside the application (e.g. a database, an S3 bucket)?
No
n/a
n/a
eval
or similar functions are used?n/a
n/a
n/a
No
n/a
n/a
Yes
dataallPivotRole-cdk-*
to account for the multiple region roles. This change is still restrictive enoughBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.