
Limit Pivot Role S3 permissions #580

Closed
4 of 7 tasks
dlpzx opened this issue Jul 17, 2023 · 9 comments · Fixed by #780 or #830
Assignees
Labels
effort: medium · priority: high · priority: medium · status: not-picked-yet (At the moment we have not picked this item. Anyone can pick it up) · type: enhancement (Feature enhancement)
Milestone

Comments

dlpzx commented Jul 17, 2023

🆕 [UPDATED WITH THE FEEDBACK FROM COMMENTS]
Is your idea related to a problem? Please describe.
Currently, the data.all pivotRole requires permissions to all S3 buckets and KMS keys in the AWS account.

Describe the solution you'd like
A solution in which access to the data on S3 buckets is restricted to specific roles only. We would like to prevent any data access from other accounts, especially if the pivot role gets compromised.


Analysis of roles in data.all

In data.all central account

| role | used in |
| --- | --- |
| graphql-role | GraphQL Lambda function |
| worker-role | Worker Lambda function |
| esproxy-role | OpenSearch proxy Lambda function |
| ecs-role | ECS tasks |

In the environment accounts

In addition there is also the cdk execution role used for CDK deployments, but it is not relevant for this issue.

| role | creation | permissions | trust policies (assumable by) |
| --- | --- | --- | --- |
| Pivot role | If AUTO-CREATE = True it is created as part of the environment stack, otherwise it is manually created | Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies | graphql-role, worker-role and ecs-role from the central account |
| Environment teams' roles | As part of the environment stack, one for each team that is added to the environment | Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies and tags | Several module services + the pivot role in the account |
| Dataset role | As part of the dataset stack, for both imported and created datasets | Read and write permissions to the dataset bucket and to the Glue database and tables of the dataset + it can write profiling results in its corresponding folder in the environment bucket | Glue + the pivot role in the account |

In this diagram we can see all roles and the SDK calls that they need to perform. It includes the changes that we want to implement.

*(Diagram: IAMRoles.drawio)*


Implementation

To be able to remove the S3 and KMS permissions from the pivotRole policies, we first need to remove the need for these permissions and then modify the pivot role permissions. At the moment the pivot role is used to access data in the S3 buckets in the following features:

  1. As registration role in Lake Formation for the datasets
  2. In the Glue crawlers and profiling jobs as execution role
  3. In the upload functionality
  4. In the creation of Folders
  5. (for a particular customer) it is used to manage S3 bucket policies
  6. others that we are not aware of

Because imported dataset buckets can have any name, the S3 permissions granted apply to ALL buckets. We need to restrict the S3 permissions of the pivot role and the resources for those permissions.
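The target shape can be sketched as follows. This is a minimal illustration, not data.all's actual policy code: the Sid, action list, and helper name are hypothetical. The point is that the pivot role's S3 statement is built from the explicit list of data.all-managed bucket names instead of `Resource: '*'`.

```python
import json

def pivot_role_s3_statement(bucket_names):
    """Build an S3 policy statement scoped to data.all-managed buckets only.

    Hypothetical helper: actions and Sid are illustrative, not the exact
    data.all policy content.
    """
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # bucket-level actions
        resources.append(f"arn:aws:s3:::{name}/*")  # object-level actions
    return {
        "Sid": "ManagedBucketsOnly",
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetBucketPolicy", "s3:PutBucketPolicy"],
        "Resource": resources,  # note: no '*' wildcard resource
    }

statement = pivot_role_s3_statement(["created-dataset-bucket", "imported-bucket"])
print(json.dumps(statement, indent=2))
```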

1) As registration role in Lake Formation for the datasets

Because in V1.6 the dataset role was modified, we just need to:

  - Modify the Dataset stack and the role used in the registration [1h]
  - Modify the Dataset stack and grant permissions to the imported KMS key and S3 bucket to the dataset role [2h]
  - Test how previously created Datasets are registered and can be updated [8h]

Est. time ~ 2 days
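In practice, registering the location with the dataset role instead of the pivot role amounts to passing a different `RoleArn` to Lake Formation's `RegisterResource` API. A hedged sketch of the call parameters (the ARNs are placeholders, and the helper is illustrative rather than data.all code):

```python
def register_resource_params(s3_location_arn, dataset_role_arn):
    """Parameters for lakeformation.register_resource: the dataset IAM role
    (not the pivot role) becomes the registration role for the location."""
    return {
        "ResourceArn": s3_location_arn,   # e.g. arn:aws:s3:::my-dataset-bucket
        "UseServiceLinkedRole": False,    # use the explicit dataset role instead
        "RoleArn": dataset_role_arn,
    }

params = register_resource_params(
    "arn:aws:s3:::my-dataset-bucket",
    "arn:aws:iam::111122223333:role/dataset-role",
)
# A boto3 Lake Formation client would consume these:
# lakeformation.register_resource(**params)
```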

2) In the Glue crawlers and profiling jobs as execution role

  • Done in V1.6.0 🚀 - The execution role for the dataset crawler and profiling job is the dataset role. The pivot role calls the "startCrawler" API, Glue assumes the dataset role to crawl the data.

3) 4) In the upload and the create folder functionalities (see diagram above)

Option 1: Pivot role in all SDK calls

  • PROS: easy to implement. No additional trusts from the central account.
  • CONS: the pivot role keeps access to data (putObject permissions)

Option 2: Dataset role in SDK calls with access to S3

  • PROS: no putObject permissions needed for pivotRole
  • CONS: the pivot role can still assume the dataset role, leading to the same situation.

At first I was more inclined to go for option 1, as the pivot role has restricted access to managed buckets only, which is needed anyway to implement point 5). But after having a look at the code, implementing option 2 is quite simple and avoids 'PutObject' permissions for the pivotRole.
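The resulting split under option 2 can be illustrated as follows. The action-to-role mapping and names below are mine, not an exhaustive or official list: the pivot role keeps control-plane calls only, while the dataset role performs the data-plane writes for upload and create-folder.

```python
# Illustrative split of SDK calls between the two roles under option 2.
CONTROL_PLANE = {
    "glue:StartCrawler",
    "s3:PutBucketPolicy",
    "lakeformation:RegisterResource",
}
DATA_PLANE = {"s3:PutObject", "s3:GetObject"}

def executing_role(action):
    """Return which role should make the SDK call (role names are placeholders)."""
    if action in DATA_PLANE:
        return "dataset-role"   # assumed for upload / create-folder data writes
    return "pivot-role"         # keeps only control-plane permissions
```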

Est. time ~ 2 days

5) Manage bucket policies

In this case, it is the pivot role that needs to execute the updateBucketPolicy API calls. The tasks are to:

  • Limit the necessary permission to data.all-managed buckets only. This is done by updating the pivot role (if it is auto-created) when a new dataset is created/imported: we fetch the environment datasets and grant permissions to those buckets only
  • Ensure that the pivot role update is triggered with a dataset creation/import
  • Test the impact on pre-existing datasets (backwards compatibility)

Est. time ---> not included in the initial estimations
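The bucket-policy flow above can be sketched with a pure helper (the helper name and the statement content are hypothetical): the backend reads the current bucket policy, upserts the data.all statement by Sid, and writes it back, and it only holds this permission for buckets that belong to a data.all dataset.

```python
import json
from typing import Optional

def upsert_statement(policy_json: Optional[str], new_statement: dict) -> str:
    """Insert or replace a statement (matched by Sid) in a bucket policy
    document. Pure helper; in practice an S3 client would wrap it with
    get_bucket_policy / put_bucket_policy calls."""
    policy = json.loads(policy_json) if policy_json else {
        "Version": "2012-10-17",
        "Statement": [],
    }
    sid = new_statement["Sid"]
    # Drop any previous statement with the same Sid, then append the new one,
    # so repeated updates stay idempotent.
    policy["Statement"] = [s for s in policy["Statement"] if s.get("Sid") != sid]
    policy["Statement"].append(new_statement)
    return json.dumps(policy)

updated = upsert_statement(None, {
    "Sid": "DataAllAccess",  # illustrative Sid
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/team-role"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-dataset-bucket/*",
})
```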

Other functionality that we are not aware of

There might be other functionalities that access data from the data.all backend which in theory seem unaffected, but could break because this is a big change.
E.g. data preview in data.all

Est. time ~ 2 days

In total, adding the changes to the pivotRole policies [~2h] and additional testing time, the estimated time to implement this enhancement is ~8 days 🆕 + manage bucket policies

@dlpzx dlpzx added the type: enhancement Feature enhancement label Jul 17, 2023
@dlpzx dlpzx added status: not-picked-yet At the moment we have not picked this item. Anyone can pick it up priority: high priority: medium labels Jul 17, 2023
zsaltys commented Jul 17, 2023

@dlpzx my concerns:

  1. File upload (we won't use this feature anyway): using datasetAdmin doesn't sound great, because it means data.all would be allowed to assume this role directly or indirectly, so that's a concern.

  2. can you remind me how glue crawlers are being fixed in 1.6.0?

Can you document as well:

  1. How will the datasetAdmin role be created? What will create it and how? Are we granting any extra permissions to anything to create this role?

  2. What will be allowed to assume datasetAdmin? Where will it be assumed exactly, where in code, etc.?

dlpzx commented Jul 18, 2023

Hi @zsaltys, I added the missing info in the issue description, except for the last point, which I need some time for. I suspect that you also want to remove the pivot role from the trust policy of the dataset role, right?

manjulaK commented Aug 8, 2023

@dlpzx one thing that needs to be addressed is that the pivot role should not have putBucketPolicy permission, as it can potentially lead to other permissions. One way to address this would be to allow pre-registration of the buckets to be managed by data.all; this way, not all buckets will be controlled by data.all. Let me know your thoughts?

dlpzx commented Aug 9, 2023

Hi @manjulaK, thanks for the comment! I see your point, let's see. Our target state would be a pivot role with a policy allowing access, or any sensitive S3 action, only to the data.all-created buckets + imported buckets.

  • That is the goal, and the list of imported S3 buckets can be obtained
    • a) from the data.all backend directly = if a bucket is imported, we add it to the pivot role policy. Similar to what we do with an env-team role.
    • b) from an RDS table, DynamoDB, or any other place. I am less inclined to this option as it adds more complexity
  • I am assuming we use auto-created pivotRoles that can be updated by the data.all backend.
  • We need to verify whether we will face size limitations in the IAM policies for the pivot role. With modularization the policies for disabled modules are not added to the role, but still, if there are hundreds of imported S3 buckets in an environment it might be a bit too much for IAM.

What do you think about option a)?
Can you confirm that you will be using auto-created pivot roles?

@manjulaK

hi @dlpzx thank you very much for looking into this. I think your option a) looks good. can you kindly confirm the following assumptions:

  1. The cdkexec role will be used to import the bucket; at the time of bucket import, the pivot role policy will be tweaked to add the bucket-level permissions to the pivot role.
  2. The pivot role will mainly be used during sharing of folders, buckets or access points through the API, and will have s3:* sort of access only to the buckets imported into data.all.

Additional conditions:
  a) What happens if an imported bucket is deleted outside of data.all?
  b) What happens if an imported bucket is deleted through data.all?

dlpzx commented Aug 14, 2023

> cdkexec role will be used to import b

  1. The CDK exec role is assumed by CloudFormation to create the Environment and the Dataset stack resources. When a Dataset is imported, data.all will update the auto-created pivotRole policy (part of the Environment stack). So yes, the CDK exec role grants permissions via CloudFormation to the pivot role.
  2. Yes, the pivot role is a data access manager (putBucketPolicy, createAccessPoint...). But s3:* is something that we will work on limiting. For example, reading data should not be necessary (getObject, putObject).
  3. If an imported bucket is deleted outside of data.all, users should FIRST delete the imported Dataset from data.all. That will update the pivotRole policies in the environment stack.

@anmolsgandhi anmolsgandhi added this to the v2.1.0 milestone Sep 8, 2023
@anmolsgandhi anmolsgandhi removed this from the v2.1.0 milestone Oct 2, 2023
dlpzx added a commit that referenced this issue Oct 12, 2023
### Feature or Bugfix
- Feature

### Detail
The guiding principle is that:
1. dataset IAM role is the role accessing data
2. pivot role is the role used by the central account to perform SDK
calls in the environment account

In this PR we
- Replace pivot role by dataset role in dataset Lake Formation
registration
- Use the pivot role to trigger the upload files and create folder
features, but use the dataset IAM role to perform the putObject
operations -> removes the need for read and `putObject` permissions for
the pivot role
- Redefine pivot role CDK stack to manage S3 buckets (bucket policies)
for only the datasets S3 buckets that have been created or imported in
the environment.
- Implement IAM policy utils to handle the new dynamic policies. We need
to verify that the created policy statements do not exceed the maximum
policy size. In addition, we replace the previous "divide in chunks of 10
statements" approach with a function that divides into chunks based on the
size of the policy statements. This optimizes the policy size, which helps
us reduce the number of managed policies attached to the pivot
role. --> it can be re-used for other policy chunking
- We did not implement force update of environments (pivot role nested
stack) with new datasets added because it is already forced in
`backend/dataall/modules/datasets/services/dataset_service.py`
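The size-based chunking described in the detail above can be sketched as follows. This is a hedged illustration: the function name and the greedy packing are mine, not the exact data.all utility; 6,144 characters is IAM's documented managed-policy size quota (counted excluding whitespace).

```python
import json

POLICY_SIZE_LIMIT = 6144  # IAM managed policy character quota

def chunk_statements(statements, max_size=POLICY_SIZE_LIMIT):
    """Greedily pack policy statements into chunks whose serialized size
    stays under max_size, instead of fixed chunks of 10 statements.
    A single oversized statement still gets its own chunk."""
    chunks, current, current_size = [], [], 0
    for stmt in statements:
        size = len(json.dumps(stmt))
        if current and current_size + size > max_size:
            chunks.append(current)       # flush before exceeding the limit
            current, current_size = [], 0
        current.append(stmt)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Packing by serialized size rather than by statement count means fewer managed policies when statements are small, which matters because IAM also caps the number of managed policies attachable to a role.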

### Backwards compatibility Testing

Pre-update setup:
- 1 environment (auto-created pivot role)
- 2 datasets in environment, 1 created, 1 imported: with tables and
folders
- Run profiling jobs in tables

Update with the branch changes:
- [X] CICD pipeline runs successfully
- [X] Click update environment on environment -> successfully updated
policy of pivot role with imported datasets in policy. Reduction of
policies
- [X] Click update datasets --> registration in Lake formation updated
to dataset role
- [X] Upload files works
- [X] Create folder works
- [X] Crawler and profiling jobs work
 

### Relates
- #580 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Are you introducing any new policies/roles/users? `Yes`
- Have you used the least-privilege principle? How? `In this PR we
restrict the permissions of the pivot role, a super role that handles
SDK calls in the environment accounts. Instead of granting permissions
to all S3 buckets, we restrict it to data.all handled S3 buckets only`


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
zsaltys commented Oct 24, 2023

@dlpzx are you also updating this section in the pivotRole to make sure that data.all only has access to imported KMS keys?

```yaml
- Sid: KMS
  Action:
    - 'kms:Decrypt'
    - 'kms:Encrypt'
    - 'kms:GenerateDataKey*'
    - 'kms:PutKeyPolicy'
    - 'kms:ReEncrypt*'
    - 'kms:TagResource'
    - 'kms:UntagResource'
  Effect: Allow
  Resource:
    - '*'
```

dlpzx added a commit that referenced this issue Oct 27, 2023
### Feature or Bugfix
- Feature

### Detail
- read KMS keys with an alias prefixed by the environment resource
prefix
- read KMS keys imported in imported datasets
- restrict pivot role policies to the KMS keys created by data.all and
those imported in the imported datasets
- move kms client from data_sharing to base as it is used in
environments and datasets

### Relates
- #580

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

This PR restricts the IAM policies of the pivot role, following the
least privilege permissions principle

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@anmolsgandhi anmolsgandhi added this to the v2.1.0 milestone Oct 27, 2023
dlpzx commented Oct 30, 2023

#830 Also needed for this feature

This was linked to pull requests Oct 30, 2023
anushka-singh pushed a commit to anushka-singh/aws-dataall that referenced this issue Oct 31, 2023
dlpzx commented Nov 8, 2023

Merged and released with v2.1.0 🚀

@dlpzx dlpzx closed this as completed Nov 8, 2023