
Limit Pivot Role S3 permissions #580

Closed
4 of 7 tasks
dlpzx opened this issue Jul 17, 2023 · 9 comments · Fixed by #780 or #830
Assignees
Labels
effort: medium · priority: high · priority: medium · status: not-picked-yet (At the moment we have not picked this item. Anyone can pick it up) · type: enhancement (Feature enhancement)
Milestone

Comments

dlpzx commented Jul 17, 2023

🆕 [UPDATED WITH THE FEEDBACK FROM COMMENTS]
Is your idea related to a problem? Please describe.
Currently, the data.all pivotRole requires permissions to all S3 buckets and KMS keys in the AWS account.

Describe the solution you'd like
A solution in which access to the data on S3 buckets is restricted to specific roles only. We would like to prevent any data access from other accounts, especially if the pivot role gets compromised.


Analysis of roles in data.all

In data.all central account

| role | used in |
| --- | --- |
| graphql-role | GraphQL Lambda function |
| worker-role | Worker Lambda function |
| esproxy-role | OpenSearch proxy Lambda function |
| ecs-role | ECS tasks |

In the environment accounts

In addition there is also the cdk execution role used for CDK deployments, but it is not relevant for this issue.

| role | creation | permissions | trust policies (assumable by) |
| --- | --- | --- | --- |
| Pivot role | If AUTO-CREATE = True it is created as part of the environment stack, otherwise it is manually created | Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies | graphql-role, worker-role and ecs-role from the central account |
| Environment teams' roles | As part of the environment stack, one for each team that is added to the environment | Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies and tags | Several module services + the pivot role in the account |
| Dataset role | As part of the dataset stack, for both imported and created datasets | Read and write permissions to the dataset bucket and to the Glue database and tables of the dataset + it can write profiling results in its corresponding folder in the environment bucket | Glue + the pivot role in the account |

In this diagram we can see all roles and the SDK calls that they need to perform. It includes the changes that we want to implement.

*(Diagram: IAMRoles.drawio)*


Implementation

To be able to remove the S3 and KMS permissions from the pivotRole policies, we first need to remove the need for these permissions and then modify the pivot role permissions. At the moment the pivot role is used to access data in the S3 buckets in the following features:

  1. As registration role in Lake Formation for the datasets
  2. In the Glue crawlers and profiling jobs as execution role
  3. In the upload functionality
  4. In the creation of Folders
  5. (for a particular customer) it is used to manage S3 bucket policies
  6. others that we are not aware of

Because imported dataset buckets can have any name, the S3 permissions granted apply to ALL buckets. We need to restrict the S3 permissions of the pivot role and the resources for those permissions.
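The target shape can be sketched as follows. This is a minimal illustration, not data.all's actual policy code: the Sid, action list, and helper name are hypothetical. The point is that the pivot role's S3 statement is built from the explicit list of data.all-managed bucket names instead of `Resource: '*'`.

```python
import json

def pivot_role_s3_statement(bucket_names):
    """Build an S3 policy statement scoped to data.all-managed buckets only.

    Hypothetical helper: actions and Sid are illustrative, not the exact
    data.all policy content.
    """
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # bucket-level actions
        resources.append(f"arn:aws:s3:::{name}/*")  # object-level actions
    return {
        "Sid": "ManagedBucketsOnly",
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetBucketPolicy", "s3:PutBucketPolicy"],
        "Resource": resources,  # note: no '*' wildcard resource
    }

statement = pivot_role_s3_statement(["created-dataset-bucket", "imported-bucket"])
print(json.dumps(statement, indent=2))
```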

1) As registration role in Lake Formation for the datasets

Because in V1.6 the dataset role was modified, we just need to:

  - Modify the Dataset stack and the role used in the registration [1h]
  - Modify the Dataset stack and grant permissions to the imported KMS key and S3 bucket to the dataset role [2h]
  - Test how previously created Datasets are registered and can be updated [8h]

Est. time ~ 2 days
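In practice, registering the location with the dataset role instead of the pivot role amounts to passing a different `RoleArn` to Lake Formation's `RegisterResource` API. A hedged sketch of the call parameters (the ARNs are placeholders, and the helper is illustrative rather than data.all code):

```python
def register_resource_params(s3_location_arn, dataset_role_arn):
    """Parameters for lakeformation.register_resource: the dataset IAM role
    (not the pivot role) becomes the registration role for the location."""
    return {
        "ResourceArn": s3_location_arn,   # e.g. arn:aws:s3:::my-dataset-bucket
        "UseServiceLinkedRole": False,    # use the explicit dataset role instead
        "RoleArn": dataset_role_arn,
    }

params = register_resource_params(
    "arn:aws:s3:::my-dataset-bucket",
    "arn:aws:iam::111122223333:role/dataset-role",
)
# A boto3 Lake Formation client would consume these:
# lakeformation.register_resource(**params)
```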

2) In the Glue crawlers and profiling jobs as execution role

  • Done in V1.6.0 🚀 - The execution role for the dataset crawler and profiling job is the dataset role. The pivot role calls the "startCrawler" API, Glue assumes the dataset role to crawl the data.

3) 4) In the upload and the create folder functionalities (see diagram above)

Option 1: Pivot role in all SDK calls

  • PROS: easy to implement. No additional trusts from the central account.
  • CONS: the pivot role keeps access to data (putObject permissions)

Option 2: Dataset role in SDK calls with access to S3

  • PROS: no putObject permissions needed for pivotRole
  • CONS: the pivot role can still assume the dataset role, leading to the same situation.

At first I was more inclined to go for option 1, as the pivot role has restricted access to managed buckets only, which is needed anyway to implement point 5). But after having a look at the code, implementing option 2 is quite simple and avoids 'PutObject' permissions for the pivotRole.
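The resulting split under option 2 can be illustrated as follows. The action-to-role mapping and names below are mine, not an exhaustive or official list: the pivot role keeps control-plane calls only, while the dataset role performs the data-plane writes for upload and create-folder.

```python
# Illustrative split of SDK calls between the two roles under option 2.
CONTROL_PLANE = {
    "glue:StartCrawler",
    "s3:PutBucketPolicy",
    "lakeformation:RegisterResource",
}
DATA_PLANE = {"s3:PutObject", "s3:GetObject"}

def executing_role(action):
    """Return which role should make the SDK call (role names are placeholders)."""
    if action in DATA_PLANE:
        return "dataset-role"   # assumed for upload / create-folder data writes
    return "pivot-role"         # keeps only control-plane permissions
```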

Est. time ~ 2 days

5) Manage bucket policies

In this case, it is the pivot role that needs to execute the updateBucketPolicy API calls. The tasks are to:

  • Limit the necessary permission to data.all-managed buckets only. This is done by updating the pivot role (if it is auto-created) when a new dataset is created/imported: we fetch the environment datasets and grant permissions to those buckets only
  • Ensure that the pivot role update is triggered with a dataset creation/import
  • Test the impact on pre-existing datasets (backwards compatibility)

Est. time ---> not included in the initial estimations
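The bucket-policy flow above can be sketched with a pure helper (the helper name and the statement content are hypothetical): the backend reads the current bucket policy, upserts the data.all statement by Sid, and writes it back, and it only holds this permission for buckets that belong to a data.all dataset.

```python
import json
from typing import Optional

def upsert_statement(policy_json: Optional[str], new_statement: dict) -> str:
    """Insert or replace a statement (matched by Sid) in a bucket policy
    document. Pure helper; in practice an S3 client would wrap it with
    get_bucket_policy / put_bucket_policy calls."""
    policy = json.loads(policy_json) if policy_json else {
        "Version": "2012-10-17",
        "Statement": [],
    }
    sid = new_statement["Sid"]
    # Drop any previous statement with the same Sid, then append the new one,
    # so repeated updates stay idempotent.
    policy["Statement"] = [s for s in policy["Statement"] if s.get("Sid") != sid]
    policy["Statement"].append(new_statement)
    return json.dumps(policy)

updated = upsert_statement(None, {
    "Sid": "DataAllAccess",  # illustrative Sid
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/team-role"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-dataset-bucket/*",
})
```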

Other functionality that we are not aware of

There might be other functionalities that access data from the data.all backend which in theory seem unaffected, but could break because this is a big change.
E.g. data preview in data.all

Est. time ~ 2 days

In total, adding the changes to the pivotRole policies [~2h] and additional testing time, the estimated time to implement this enhancement is ~8 days 🆕 + manage bucket policies

@dlpzx dlpzx added the type: enhancement Feature enhancement label Jul 17, 2023
@dlpzx dlpzx added status: not-picked-yet At the moment we have not picked this item. Anyone can pick it up priority: high priority: medium labels Jul 17, 2023
zsaltys commented Jul 17, 2023

@dlpzx my concerns:

  1. File upload (we won't use this feature anyway): using datasetAdmin doesn't sound great, because it means data.all would be allowed to assume this role directly or indirectly, so that's a concern.

  2. can you remind me how glue crawlers are being fixed in 1.6.0?

Can you document as well:

  1. How will the datasetAdmin role be created? What will create it and how? Are we granting any extra permissions to anything to create this role?

  2. What will be allowed to assume datasetAdmin? Where will it be assumed exactly, where in code, etc.?

dlpzx commented Jul 18, 2023

Hi @zsaltys, I added the missing info in the issue description, except for the last point, which I need some time for. I suspect that you also want to remove the pivot role from the trust policy of the dataset role, right?

manjulaK commented Aug 8, 2023

@dlpzx one thing that needs to be addressed is that the pivot role should not have putBucketPolicy permission, as it can potentially lead to other permissions. One way to address this would be to allow pre-registration of the buckets to be managed by data.all; this way, not all buckets will be controlled by data.all. Let me know your thoughts?

dlpzx commented Aug 9, 2023

Hi @manjulaK, thanks for the comment! I see your point, let's see. Our target state would be a pivot role with a policy allowing access, or any sensitive S3 action, only to the data.all-created buckets + imported buckets.

  • That is the goal, and the list of imported S3 buckets can be obtained
    • a) from the data.all backend directly = if a bucket is imported, we add it to the pivot role policy. Similar to what we do with an env-team role.
    • b) from an RDS table, DynamoDB, or any other place. I am less inclined to this option as it adds more complexity
  • I am assuming we use auto-created pivotRoles that can be updated by the data.all backend.
  • We need to verify whether we will face size limitations in the IAM policies for the pivot role. With modularization the policies for disabled modules are not added to the role, but still, if there are hundreds of imported S3 buckets in an environment it might be a bit too much for IAM.

What do you think about option a)?
Can you confirm that you will be using auto-created pivot roles?

@manjulaK

hi @dlpzx thank you very much for looking into this. I think your option a) looks good. can you kindly confirm the following assumptions:

  1. The cdkexec role will be used to import the bucket; at the time of bucket import, the pivot role policy will be tweaked to add the bucket-level permissions to the pivot role.
  2. The pivot role will mainly be used during sharing of folders, buckets or access points through the API, and will have s3:* sort of access only to the buckets imported into data.all.

Additional conditions:
  a) What happens if an imported bucket is deleted outside of data.all?
  b) What happens if an imported bucket is deleted through data.all?

dlpzx commented Aug 14, 2023

> cdkexec role will be used to import b

  1. The CDK exec role is assumed by CloudFormation to create the Environment and the Dataset stack resources. When a Dataset is imported, data.all will update the auto-created pivotRole policy (part of the Environment stack). So yes, the CDK exec role grants permissions via CloudFormation to the pivot role.
  2. Yes, the pivot role is a data access manager (putBucketPolicy, createAccessPoint...). But s3:* is something that we will work on limiting. For example, reading data should not be necessary (getObject, putObject).
  3. If an imported bucket is deleted outside of data.all, users should FIRST delete the imported Dataset from data.all. That will update the pivotRole policies in the environment stack.

@anmolsgandhi anmolsgandhi added this to the v2.1.0 milestone Sep 8, 2023
@anmolsgandhi anmolsgandhi removed this from the v2.1.0 milestone Oct 2, 2023
dlpzx added a commit that referenced this issue Oct 12, 2023
### Feature or Bugfix
- Feature

### Detail
The guiding principle is that:
1. dataset IAM role is the role accessing data
2. pivot role is the role used by the central account to perform SDK
calls in the environment account

In this PR we
- Replace pivot role by dataset role in dataset Lake Formation
registration
- Use the pivot role to trigger the upload files and create folder
features, but use the dataset IAM role to perform the putObject
operations -> removes the need for read and `putObject` permissions for
the pivot role
- Redefine pivot role CDK stack to manage S3 buckets (bucket policies)
for only the datasets S3 buckets that have been created or imported in
the environment.
- Implement IAM policy utils to handle the new dynamic policies. We need
to verify that the created policy statements do not exceed the maximum
policy size. In addition, we replace the previous "divide in chunks of 10
statements" approach with a function that divides into chunks based on the
size of the policy statements. This optimizes the policy size, which helps
us reduce the number of managed policies attached to the pivot
role. --> it can be re-used for other policy chunking
- We did not implement force update of environments (pivot role nested
stack) with new datasets added because it is already forced in
`backend/dataall/modules/datasets/services/dataset_service.py`
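The size-based chunking described in the detail above can be sketched as follows. This is a hedged illustration: the function name and the greedy packing are mine, not the exact data.all utility; 6,144 characters is IAM's documented managed-policy size quota (counted excluding whitespace).

```python
import json

POLICY_SIZE_LIMIT = 6144  # IAM managed policy character quota

def chunk_statements(statements, max_size=POLICY_SIZE_LIMIT):
    """Greedily pack policy statements into chunks whose serialized size
    stays under max_size, instead of fixed chunks of 10 statements.
    A single oversized statement still gets its own chunk."""
    chunks, current, current_size = [], [], 0
    for stmt in statements:
        size = len(json.dumps(stmt))
        if current and current_size + size > max_size:
            chunks.append(current)       # flush before exceeding the limit
            current, current_size = [], 0
        current.append(stmt)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Packing by serialized size rather than by statement count means fewer managed policies when statements are small, which matters because IAM also caps the number of managed policies attachable to a role.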

### Backwards compatibility Testing

Pre-update setup:
- 1 environment (auto-created pivot role)
- 2 datasets in environment, 1 created, 1 imported: with tables and
folders
- Run profiling jobs in tables

Update with the branch changes:
- [X] CICD pipeline runs successfully
- [X] Click update environment on environment -> successfully updated
policy of pivot role with imported datasets in policy. Reduction of
policies
- [X] Click update datasets --> registration in Lake formation updated
to dataset role
- [X] Upload files works
- [X] Create folder works
- [X] Crawler and profiling jobs work
 

### Relates
- #580 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Are you introducing any new policies/roles/users? `Yes`
- Have you used the least-privilege principle? How? `In this PR we
restrict the permissions of the pivot role, a super role that handles
SDK calls in the environment accounts. Instead of granting permissions
to all S3 buckets, we restrict it to data.all handled S3 buckets only`


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
zsaltys commented Oct 24, 2023

@dlpzx are you also updating this section in the pivotRole to make sure that data.all only has access to imported KMS keys?

```yaml
- Sid: KMS
  Action:
    - 'kms:Decrypt'
    - 'kms:Encrypt'
    - 'kms:GenerateDataKey*'
    - 'kms:PutKeyPolicy'
    - 'kms:ReEncrypt*'
    - 'kms:TagResource'
    - 'kms:UntagResource'
  Effect: Allow
  Resource:
    - '*'
```

dlpzx added a commit that referenced this issue Oct 27, 2023
### Feature or Bugfix
- Feature

### Detail
- read KMS keys with an alias prefixed by the environment resource
prefix
- read KMS keys imported in imported datasets
- restrict pivot role policies to the KMS keys created by data.all and
those imported in the imported datasets
- move kms client from data_sharing to base as it is used in
environments and datasets

### Relates
- #580

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

This PR restricts the IAM policies of the pivot role, following the
least privilege permissions principle

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@anmolsgandhi anmolsgandhi added this to the v2.1.0 milestone Oct 27, 2023
dlpzx commented Oct 30, 2023

#830 Also needed for this feature

This was linked to pull requests Oct 30, 2023
anushka-singh pushed a commit to anushka-singh/aws-dataall that referenced this issue Oct 31, 2023
dlpzx commented Nov 8, 2023

Merged and released with v2.1.0 🚀

@dlpzx dlpzx closed this as completed Nov 8, 2023