Bugfix 956 #961

Merged
merged 77 commits into from
Jan 11, 2024

Conversation

anushka-singh
Contributor

@anushka-singh anushka-singh commented Jan 10, 2024

Feature or Bugfix

  • Bugfix

Detail

Error: Failed to attach policy to KMS key xxxxx on yyyy : An error occurred (MalformedPolicyDocumentException) when calling the PutKeyPolicy operation: The new key policy will not allow you to update the key policy in the future.
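KMS raises this `MalformedPolicyDocumentException` when the proposed key policy would lock everyone out of calling `PutKeyPolicy` later. A minimal sketch of the idea (names and statements assumed for illustration, not taken from this PR's actual fix) is to always retain an admin statement when building the policy:

```python
import json

def build_key_policy(account_id: str, consumer_arn: str) -> dict:
    """Sketch: a key policy that keeps the account root as key admin,
    so a later PutKeyPolicy call is always possible. KMS rejects
    policies that would remove this ability."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # retains kms:* (including kms:PutKeyPolicy) for the account
                "Sid": "KeyAdministration",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {
                # the grant this share actually needs
                "Sid": "ConsumerAccess",
                "Effect": "Allow",
                "Principal": {"AWS": consumer_arn},
                "Action": ["kms:Decrypt", "kms:DescribeKey"],
                "Resource": "*",
            },
        ],
    }

# the policy would then be applied with, e.g.:
# kms.put_key_policy(KeyId=key_id, PolicyName="default",
#                    Policy=json.dumps(build_key_policy(acct, role_arn)))
```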

Relates

Security

Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

noah-paige and others added 30 commits September 15, 2023 08:18
…ata-dot-all#748)

### Feature or Bugfix
<!-- please choose -->
- Feature Enhancement

### Detail
- Adding additional error messages for KMS key lookup when importing a new dataset
  - 1 error message to determine if the KMS key alias exists
  - 1 error message to determine if the PivotRole has permissions to describe the KMS key

### Relates
- data-dot-all#712 

### Security


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
<!-- please choose -->
- NA

### Detail
- Get latest code in `main` to `v2m1m0` branch to keep in sync

### Relates
- NA

### Security
NA

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jaidisido <jaidisido@gmail.com>
Co-authored-by: dlpzx <dlpzx@amazon.com>
Co-authored-by: mourya-33 <134511711+mourya-33@users.noreply.github.com>
### Feature or Bugfix
<!-- please choose -->
- Enhancement / Bugfix

### Detail
- When creating an environment and specifying default Env IAM Role we
assume it is of the structure `arn:aws:iam::ACCOUNT:role/NAME_SPECIFIED`
- This does not work when there is a service path in the role arn such
as with SSO: `arn:aws:iam::ACCOUNT:role/sso/NAME_SPECIFIED`
- Causes issues when importing an IAM Role for an invited Team in an
environment and/or with dataset sharing


- This PR takes in the full IAM role ARN when importing the IAM role in
order to correctly determine the role name
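The parsing step can be sketched as follows (the helper name is assumed, not the PR's actual function):

```python
def role_name_from_arn(role_arn: str) -> str:
    """Extract the role name from a full IAM role ARN, tolerating a
    service path such as /sso/ between 'role/' and the name."""
    # arn:aws:iam::123456789012:role/sso/MyRole -> "role/sso/MyRole"
    resource = role_arn.split(":", 5)[5]
    # the last path segment is always the role name
    return resource.split("/")[-1]
```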




### Relates
- data-dot-all#695

### Security


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
<!-- please choose -->
- Enhancement / Bugfix

### Detail
- Ensure the names passed for OpenSearch Domain and OpenSearch
Serverless Collection, Access Policy, Security Policy, and VPC Endpoint
all follow naming conventions required by the service, meaning

    - The name must start with a lowercase letter
    - Must be between 3 and 28 characters
    - Valid characters are a-z (lowercase only), 0-9, and - (hyphen).
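The three rules above compress into a single regular expression; a minimal validation sketch (function name assumed):

```python
import re

# start with a lowercase letter, then 2-27 more of a-z, 0-9, or '-'
# (3-28 characters in total)
OPENSEARCH_NAME_RE = re.compile(r"^[a-z][a-z0-9-]{2,27}$")

def is_valid_opensearch_name(name: str) -> bool:
    """Check a name against the OpenSearch Serverless naming rules
    listed above."""
    return bool(OPENSEARCH_NAME_RE.match(name))
```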

### Relates
- data-dot-all#540 

### Security


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <dlpzx@amazon.com>
# Conflicts:
#	deploy/app.py
### Feature or Bugfix
Update

### Detail


### Relates
See data-dot-all#655: 
> On Nov 27, 2023 the Lambda runtimes node14 and python3.7 will be deprecated!

Checked all lambdas that explicitly set the runtime engine: only cognito
httpheader redirection lambda used node14.
All lambdas use python3.8 and node16 or node18.

For CDK dependencies: upgraded to the newest `aws-cdk-lib` `v2.99.0`, just in case python3.7 is hardcoded somewhere inside 2.78.0 (it shouldn't be)

### Testing
- [x] uploaded the changes to my isengard account
- [x] deployment is green 
- [x] could access app page, userguide page, and userguide from the app
page.

### Security
`N/A` - upgraded to a newer version of node js

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Feature

### Detail
The guiding principle is that:
1. dataset IAM role is the role accessing data
2. pivot role is the role used by the central account to perform SDK
calls in the environment account

In this PR we
- Replace pivot role by dataset role in dataset Lake Formation
registration
- Use pivot role to trigger the upload files and create folder features, but use the dataset IAM role to perform the `putObject` operations -> removes the need for read and `putObject` permissions for the pivot role
- Redefine pivot role CDK stack to manage S3 buckets (bucket policies)
for only the datasets S3 buckets that have been created or imported in
the environment.
- Implement IAM policy utils to handle the new dynamic policies. We need to verify that the created policy statements do not exceed the maximum policy size. In addition, we replace the previous "divide in chunks of 10 statements" approach with a function that divides into chunks based on the size of the policy statements. This optimizes the policy size, which helps reduce the number of managed policies attached to the pivot role -> it can be re-used for other chunking of policies
- We did not implement force update of environments (pivot role nested
stack) with new datasets added because it is already forced in
`backend/dataall/modules/datasets/services/dataset_service.py`
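The size-based chunking described above can be sketched like this (constant and function names are assumed, not the PR's actual implementation):

```python
import json

# AWS caps managed policy documents at 6,144 characters
MAX_POLICY_SIZE = 6144

def split_statements_by_size(statements, max_size=MAX_POLICY_SIZE):
    """Greedy chunking: start a new policy document whenever adding the
    next statement would push the serialized size past max_size. The
    accounting ignores JSON separators, so it is approximate - in
    practice a small safety margin would be kept."""
    base = len(json.dumps({"Version": "2012-10-17", "Statement": []}))
    chunks, current, current_size = [], [], 0
    for statement in statements:
        statement_size = len(json.dumps(statement))
        if current and base + current_size + statement_size > max_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append(statement)
        current_size += statement_size
    if current:
        chunks.append(current)
    return chunks
```

Compared with fixed chunks of 10 statements, each resulting managed policy is filled close to the size limit, so fewer policies end up attached to the pivot role.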

### Backwards compatibility Testing

Pre-update setup:
- 1 environment (auto-created pivot role)
- 2 datasets in environment, 1 created, 1 imported: with tables and
folders
- Run profiling jobs in tables

Update with the branch changes:
- [X] CICD pipeline runs successfully
- [X] Click update environment on environment -> successfully updated
policy of pivot role with imported datasets in policy. Reduction of
policies
- [X] Click update datasets --> registration in Lake formation updated
to dataset role
- [X] Update files works
- [X] Create folder works
- [X] Crawler and profiling jobs work
 

### Relates
- data-dot-all#580 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Are you introducing any new policies/roles/users? `Yes`
- Have you used the least-privilege principle? How? `In this PR we
restrict the permissions of the pivot role, a super role that handles
SDK calls in the environment accounts. Instead of granting permissions
to all S3 buckets, we restrict it to data.all handled S3 buckets only`


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
…eation (data-dot-all#781)

### Feature or Bugfix
- Feature
- Bugfix

### Detail


The different alternatives considered are discussed in data-dot-all#556

This PR introduces a new query `listValidEnvironments` that replaces the query `listEnvironments` for certain operations.
`listEnvironments` - lists all environments, independently of their CloudFormation stack status, with a lot of additional details.
`listValidEnvironments` - lists only environments whose CloudFormation stack is stable and successful, and retrieves only basic info about the environment.

Operations such as opening a share request or creating a Dataset/Notebook/etc. require the selection of an environment. The environment options are now retrieved from `listValidEnvironments`, ensuring that only valid environments are selectable. Moreover, this query is lighter and does not need to query and obtain as many fields as the original `listEnvironments`, improving the efficiency of the code.

### Relates
- data-dot-all#556 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
<!-- please choose -->
- Feature

### Detail
Allows user to configure a session timeout . Today data.all by default
sets the refresh token to 30 days but with this change it becomes
configurable
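A sketch of the configurable default (the config key name here is an assumption for illustration, not necessarily the one this PR uses):

```python
DEFAULT_REFRESH_TOKEN_DAYS = 30  # data.all's current hardcoded default

def get_session_timeout_days(config: dict) -> int:
    """Read a configurable session timeout from the deployment config,
    falling back to the 30-day refresh token validity used today."""
    return int(config.get("session_timeout_days", DEFAULT_REFRESH_TOKEN_DAYS))

# the resulting value would then feed the Cognito user pool client's
# refresh token validity setting
```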

### Relates
data-dot-all#421

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

Co-authored-by: Manjula <manjula.kasturi@gmail.com>
### Feature or Bugfix
- Feature
- Bugfix

### Detail
As explained in the [semgrep
docs](https://semgrep.dev/docs/cheat-sheets/python-command-injection/#1b-shelltrue):
"Functions from the subprocess module have the shell argument for
specifying if the command should be executed through the shell. Using
shell=True is dangerous because it propagates current shell settings and
variables. This means that variables, glob patterns, and other special
shell features in the command string are processed before the command is
run, making it much easier for a malicious actor to execute commands.
The subprocess module allows you to start new processes, connect to
their input/output/error pipes, and obtain their return codes. Methods
such as Popen, run, call, check_call, check_output are intended for
running commands provided as an argument ('args'). Allowing user input
in a command that is passed as an argument to one of these methods can
create an opportunity for a command injection vulnerability."

In our case the risk is not exposed, as no user input is taken directly into the subprocess commands. Nevertheless, we should strive for the highest security standards, and this PR replaces all the `shell=True` executions in the data.all code.

In this PR:
- when possible we have set `shell=False`
- in cases where the command was too complex a `CommandSanitizer`
ensures that the input arguments are strings following the
regex=`[a-zA-Z0-9-_]`
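A minimal sketch of the idea (the actual `CommandSanitizer` in this PR may differ):

```python
import re

class CommandSanitizer:
    """Reject any argument that is not a plain token of letters, digits,
    hyphens, or underscores before it reaches subprocess."""
    PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")

    def __init__(self, args):
        for arg in args:
            if not isinstance(arg, str) or not self.PATTERN.match(arg):
                raise ValueError(f"Unsafe command argument: {arg!r}")
        self.arguments = list(args)

# vetted tokens can then be passed as an argument list with shell=False:
# subprocess.run(["cdk", "deploy", stack_name], shell=False, check=True)
```

Because glob patterns, `$(...)` substitutions, and separators like `;` all fail the regex, they can never reach the shell even if an upstream check is missed.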

Testing: 
- [X] local testing - deployment of any stack
(`backend/dataall/base/cdkproxy/cdk_cli_wrapper.py`)
- [X] local testing - deployment of cdk pipeline stack
(`backend/dataall/modules/datapipelines/cdk/datapipelines_cdk_pipeline.py`)
- [X] local testing - deployment of codepipeline pipeline stack
(`backend/dataall/modules/datapipelines/cdk/datapipelines_pipeline.py`)
- [ ] AWS testing - deployment of data.all
- [ ] AWS testing - deployment of any stack
(`backend/dataall/base/cdkproxy/cdk_cli_wrapper.py`)
- [ ] AWS testing - deployment of cdk pipeline stack
(`backend/dataall/modules/datapipelines/cdk/datapipelines_cdk_pipeline.py`)
- [ ] AWS testing - deployment of codepipeline pipeline stack
(`backend/dataall/modules/datapipelines/cdk/datapipelines_pipeline.py`)

### Relates
- data-dot-all#738 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized? ---> 🆗 This is exactly what this PR is trying to do


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
…uester (data-dot-all#793)

### Feature or Bugfix
- Bugfix

### Detail
- Allowing to submit a share when you are both an approver and a
requester

### Security

**DOES NOT APPLY**



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Zilvinas Saltys <zilvinas.saltys@yahooinc.com>
### Feature or Bugfix
- Feature

### Detail
Adding a redirect to the share UI once a share object is created.
Additionally updating the breadcrumb message to more clearly indicate
that a "Draft share request is created" rather than suggesting that the
share has actually been sent to the data owners team.

### Relates
N/A

### Security
N/A


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

Co-authored-by: Zilvinas Saltys <zilvinas.saltys@yahooinc.com>
### Feature or Bugfix
Fix data-dot-all#792: Fix: condition when there are no public subnets

---------
### Feature or Bugfix
- Feature

### Detail
- Removing unused variable in local graphql server pointing to a fixed
AWS region

### Relates
N/A

### Security
N/A


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

Co-authored-by: Zilvinas Saltys <zilvinas.saltys@yahooinc.com>
### Feature or Bugfix

- Feature

### Detail
- For a dataset to make sense, all the tables within it should have their location pointing to the same place as the dataset S3 bucket. However, a database can have tables which do not point to the same bucket, which is perfectly legal in Lake Formation. Therefore we propose that data.all automatically lists only tables that have the same S3 bucket location as the dataset. This solves a problem for Yahoo, where we want to import a database that contains many tables with different buckets. Additionally, the Catalog UI should also list only prefiltered tables.
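The prefiltering can be sketched as a location check over the shape of a Glue `GetTables` response (a sketch, not this PR's actual code):

```python
def filter_tables_by_dataset_bucket(tables: list, dataset_bucket: str) -> list:
    """Keep only Glue tables whose storage location lives inside the
    dataset's S3 bucket; tables pointing elsewhere are dropped."""
    prefix = f"s3://{dataset_bucket}/"
    return [
        table for table in tables
        if table.get("StorageDescriptor", {}).get("Location", "").startswith(prefix)
    ]
```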

### Testing
- Tested this in a local env. I was able to create and share datasets even after the pre-filtering process takes place.
- Will send a separate PR for unit testing.

### Security


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

Co-authored-by: Anushka Singh <anushka.singh@yahooinc.com>
### Feature or Bugfix
<!-- please choose -->
- Bugfix

### Detail
- Fix method to detect if other share objects exist on the environment
before cleaning up environment-level shared resources (i.e. RAM
invitation and PivotRole permissions)
- Originally, if TeamA in EnvA had 2 shares approved and succeeded on DatasetB and TeamA rejected 1 of the pre-existing shares, the method `other_approved_share_object_exists` was returning `False` and deleting necessary permissions for the other existing share
- This also disabled the other existing share's ability to revoke the still-existing share, since the pivot role no longer had permissions

- Also fixes the removal of the data.all QuickSight group permissions if there are still existing shares to EnvA

### Security
NA

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
### Feature or Bugfix
- Feature

### Detail

Whenever a share request is created and transitions between states (approved, revoked, etc.), a notification is created. This notification is displayed under the bell icon in the UI.

We want a similar notification to be sent to the dataset owner, requester, etc. via email.

Please take a look at GitHub issue data-dot-all#734 for more details.

### Relates
- data-dot-all#734


### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)? No
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization? No
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features? No
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users? Yes
- Have you used the least-privilege principle? How? --> **Permission granted for `ses:SendEmail` to Lambda on resources (SES identity and configuration set). Also created KMS and SNS resources for the SES setup to handle email bounces. Used least privilege and restricted access on both wherever required.**


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

Co-authored-by: trajopadhye <tejas.rajopadhye@yahooinc.com>
### Feature or Bugfix
- Feature

### Detail
- Adding frontend support for all feature flags defined in config.json
with a new util method isFeatureEnabled
- Adding a new flag **preview_data** in the datasets module to control
whether previewing data is allowed
- Adding a new flag **glue_crawler** in the datasets module to control
whether running glue crawler is allowed
- Updating environment features to be hidden or visible based on whether
the module is active. Adding a new util isAnyFeatureModuleEnabled to
check whether to render the entire feature box.

### Relates
N/A

### Security
Not relevant

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Zilvinas Saltys <zilvinas.saltys@yahooinc.com>
### Feature or Bugfix
- Refactoring

### Detail
As a rule of thumb, we encourage customization of `modules` while
changes in `core` should be avoided when possible. `notifications` is a
component initially in core which is only used by `dataset_sharing`. To
facilitate customization of the `notifications` module and also to
clearly see its dependencies we have:

- Moved `notifications` code from core to modules as it is a reusable
component that is not needed by any core component.
- Moved dataset_sharing references inside dataset_sharing module and
left `notifications` independent from any other module (done mostly in
data-dot-all#734, so credits to @TejasRGitHub)
- Added depends_on in the dataset_sharing module to load notifications
if the data_sharing module is imported.
- Modified frontend navigation bar to make it conditional of the
notifications module
- Added migration script to modify the notification type column
- Fix tests from data-dot-all#734, some references on the payload of the
notification tasks were wrong
- Small fixes to SES stack: added account in KMS policy and email_id as
input

### [WIP] Testing
Local testing
- [ ] loading of notifications with datasets enabled
- [ ] ...

AWS testing
- [ ] CICD pipeline succeeds

### Other remarks
Not for this PR, but as a general note, we should clean up deprecated
ECS tasks

### Relates
- data-dot-all#785 
- data-dot-all#734 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

`N/A` just refactoring


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
# Conflicts:
#	deploy/stacks/backend_stack.py
#	deploy/stacks/backend_stage.py
#	deploy/stacks/lambda_api.py
#	deploy/stacks/pipeline.py
#	template_cdk.json
### Feature or Bugfix
- Feature

### Detail
- read KMS keys with an alias prefixed by the environment resource
prefix
- read KMS keys imported in imported datasets
- restrict pivot role policies to the KMS keys created by data.all and
those imported in the imported datasets
- move kms client from data_sharing to base as it is used in
environments and datasets
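The alias-prefix restriction can be illustrated with a small filter over the shape of a KMS `ListAliases` response (a sketch; the function name is assumed):

```python
def environment_aliases(aliases: list, resource_prefix: str) -> list:
    """Keep only KMS aliases owned by the environment, i.e. those whose
    alias name starts with the environment resource prefix."""
    wanted = f"alias/{resource_prefix}"
    return [a for a in aliases if a.get("AliasName", "").startswith(wanted)]
```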

### Relates
- data-dot-all#580

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

This PR restricts the IAM policies of the pivot role, following the least-privilege principle.



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Bugfix

### Detail
- Make `hosted_zone_id` optional, code update

### Relates
- data-dot-all#797 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)? N/A
  - Is the input sanitized? N/A
- What precautions are you taking before deserializing the data you
consume? N/A
  - Is injection prevented by parametrizing queries? N/A
  - Have you ensured no `eval` or similar functions are used? N/A
- Does this PR introduce any functionality or component that requires
authorization? N/A
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms? N/A
  - Are you logging failed auth attempts? N/A
- Are you using or adding any cryptographic features? N/A
  - Do you use standard proven implementations? N/A
  - Are the used keys controlled by the customer? Where are they stored? N/A
- Are you introducing any new policies/roles/users? N/A
  - Have you used the least-privilege principle? How? N/A

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license. YES

### Description

Make `hosted_zone_id` optional and provide `HostedZoneId` and `DNSName`
in CloudFormation Stack Output, so users can create their own [Route53
AliasTarget](https://docs.aws.amazon.com/Route53/latest/APIReference/API_AliasTarget.html).

Following validation checks in
`ecs_patterns.ApplicationLoadBalancedFargateService` were considered:
* `frontend_alternate_domain` and `userguide_alternate_domain` have to
be `None` when the `hosted_zone` is `None`, see checks in
[multiple-target-groups-service-base.ts#L463](https://github.com/aws/aws-cdk/blob/c445b8cc6e20d17e4a536f17262646b291a0fe36/packages/aws-cdk-lib/aws-ecs-patterns/lib/base/network-multiple-target-groups-service-base.ts#L463),
or else a `A Route53 hosted domain zone name is required to configure
the specified domain name` error is raised
* for an HTTPS ALB listener, only the `certificate` is ultimately
required, and not the `domainName` or `domainZone`, as per evaluation
logic in
[application-load-balanced-service-base.ts#L509](https://github.com/aws/aws-cdk/blob/c445b8cc6e20d17e4a536f17262646b291a0fe36/packages/aws-cdk-lib/aws-ecs-patterns/lib/base/application-load-balanced-service-base.ts#L509)
### Feature or Bugfix
- Bugfix

### Detail
- Clean up prints and show better exception message when custom_domain
is not provided for SES

### Relates
- v2.1.0

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
# Conflicts:
#	deploy/stacks/backend_stack.py
#	deploy/stacks/backend_stage.py
#	deploy/stacks/lambda_api.py
#	deploy/stacks/pipeline.py
#	template_cdk.json
```diff
@@ -485,6 +499,8 @@ def delete_dataset_bucket_key_policy(
 principal_list.remove(f"{target_requester_arn}")
 if len(principal_list) == 0:
     statements.pop(DATAALL_ACCESS_POINT_KMS_DECRYPT_SID)
 if DATAALL_ACCESS_POINT_KMS_DECRYPT_SID not in statements.keys():
```
Contributor

Do we need this if statement? If there are no principals in principal_list, can we just remove DATAALL_ACCESS_POINT_KMS_DECRYPT_SID?

We may need to keep the DATAALL_ACCESS_POINT_ENABLE_PIVOT_ROLE_PERMISSIONS_SID statement to avoid a MalformedPolicyDocument error

Contributor Author

I am popping both statements if the principal list is empty in the next revision.
I don't think it will lead to MalformedPolicyDocument, but we can test it out.

Contributor

Testing the above while processing a bucket share worked with this PR, but I did receive a MalformedPolicyDocumentException when revoking a share.

Do we see any issue with checking whether the DATAALL_ACCESS_POINT_ENABLE_PIVOT_ROLE_PERMISSIONS_SID statement exists and, if not, just creating it (i.e. never removing the permissions from the KMS key policy)?

These are the required permissions we already implicitly give to the pivotRole in IAM with our conditional statements... I do not see an issue with doing so, but let me know your thoughts @anushka-singh @dlpzx
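For context, the "conditional statements" mentioned above can be sketched roughly like the following IAM statement. This is a hedged illustration, not data.all's actual pivot-role policy: the Sid, action list, and alias prefix are assumptions; `kms:ResourceAliases` is the real KMS condition key for restricting actions to keys by alias.

```python
# Hypothetical pivot-role IAM statement: allow KMS actions only on keys
# whose alias matches the environment resource prefix (shape is illustrative).
pivot_role_kms_statement = {
    "Sid": "KMSAlias",
    "Effect": "Allow",
    "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GetKeyPolicy", "kms:PutKeyPolicy"],
    "Resource": "*",
    "Condition": {
        "ForAnyValue:StringLike": {"kms:ResourceAliases": "alias/dataall*"}
    },
}
```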

Contributor

If we do want to keep the remove logic, can we provide a default value when we call `.pop()`, such as:

`statements.pop(DATAALL_BUCKET_ENABLE_PIVOT_ROLE_PERMISSIONS_SID, True)`

Otherwise we may run into KeyErrors for existing shares after they include the PR and try to run a revoke.
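As a minimal sketch of the default-value pattern being suggested (the statement dict below is hypothetical), `dict.pop` with a second argument returns that default instead of raising when the key is absent:

```python
DATAALL_BUCKET_ENABLE_PIVOT_ROLE_PERMISSIONS_SID = "DataAll-Bucket-Enable-Pivot-Role-Permissions"

# Hypothetical key-policy statements, keyed by Sid, for a share created
# before this PR (so the pivot-role Sid was never added):
statements = {
    "DataAll-Access-Point-KMS-Decrypt": {"Sid": "DataAll-Access-Point-KMS-Decrypt"},
}

# pop() with a default never raises KeyError, even when the Sid is missing:
removed = statements.pop(DATAALL_BUCKET_ENABLE_PIVOT_ROLE_PERMISSIONS_SID, None)
print(removed)  # None
```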

Contributor

In my opinion it makes sense not to remove it, because the pivot role's access to the key is similar to a configuration requirement of a dataset that we apply in a lazy mode.

If we keep it, we avoid issues with simultaneous revokes/shares that create and delete it, and with MalformedPolicy errors.

Contributor Author

Okay I agree!
Good find @noah-paige

In the remove functions, I will not pop DATAALL_ACCESS_POINT_ENABLE_PIVOT_ROLE_PERMISSIONS_SID

@noah-paige marked this pull request as ready for review January 10, 2024 19:39
```diff
@@ -23,6 +23,7 @@
 IAM_ACCESS_POINT_ROLE_POLICY = "targetDatasetAccessControlPolicy"
 DATAALL_ALLOW_OWNER_SID = "AllowAllToAdmin"
 DATAALL_ACCESS_POINT_KMS_DECRYPT_SID = "DataAll-Access-Point-KMS-Decrypt"
+DATAALL_ACCESS_POINT_ENABLE_PIVOT_ROLE_PERMISSIONS_SID = "DataAll-Access-Point-Enable-Pivot-Role-Permissions"
```
Contributor

One thing I am realizing is that if a dataset has both a folder share and a bucket share, there will be two statements enabling pivotRole permissions in the key policy.

I would be curious to see if there's a way to avoid the duplicate here, but I also understand that keeping them separate is far easier to manage in terms of checking whether a statement exists and removing it when there are no more folder and/or bucket shares.

Something to think about; deploying now to test as is and confirm the above suspicion.

Contributor Author (@anushka-singh, Jan 10, 2024)

I had the same thought.
Here is how I think about it: eventually we will move to refactored code where all these functions are consolidated in one file. Currently there is a lot of common code between the access point and bucket share managers that can be pulled out into a separate class. At that point, we can keep just one of these pivot-role auto-create permission statements. Let me know what you think!

Contributor

Something like a share_managers/kms_policy_utils similar to the share_manager_utils? We could have a single function that handles update_dataset_bucket_key_policy for both. I like that it reduces the places where we modify the policy to a single one

Contributor Author

@dlpzx I 100% agree we should consolidate the functions into a single utils file. However, I want to do that as part of a cleanup outside of this PR. If that's okay with you, I can keep update_dataset_bucket_key_policy in both the access point and bucket share managers for now, and we can extract the common functions as part of a separate issue I will create.

Contributor Author

In the new revision I have used KMSPivotRolePermissions as Noah advised in both the share managers.

@noah-paige (Contributor)

One other thing to note is that for created datasets we already create a KMS Key and explicitly give the pivotRole permission in the Key Policy (check out L140-161 in dataall/backend/dataall/modules/datasets/cdk/dataset_stack.py)

My suggestion would be the following to avoid duplicate permissions from different statements:

- Change DATAALL_BUCKET_ENABLE_PIVOT_ROLE_PERMISSIONS_SID and DATAALL_ACCESS_POINT_ENABLE_PIVOT_ROLE_PERMISSIONS_SID to be the same constant, KMSPivotRolePermissions (which is also the SID used in dataset_stack.py)
- Only keep the logic that checks if the SID exists and, if not, adds the explicit KMS permissions
- Remove the logic that removes the SID when there are no more principals

I think the above would be the cleanest and most straightforward way to resolve the bug but open to other ideas as well!
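The suggestion above can be sketched as follows. This is a hedged illustration, not the PR's actual code: the function name, action list, and policy shape are assumptions; only the KMSPivotRolePermissions Sid and the add-if-missing/never-remove behavior come from the discussion.

```python
KMS_PIVOT_ROLE_PERMISSIONS_SID = "KMSPivotRolePermissions"

def ensure_pivot_role_statement(key_policy: dict, pivot_role_arn: str) -> dict:
    """Idempotently grant the pivot role access in a KMS key policy.

    The statement is appended only when its Sid is missing, and nothing ever
    removes it, so a revoke cannot strip the role's own PutKeyPolicy access
    and trigger MalformedPolicyDocumentException.
    """
    statements = key_policy.setdefault("Statement", [])
    if not any(s.get("Sid") == KMS_PIVOT_ROLE_PERMISSIONS_SID for s in statements):
        statements.append(
            {
                "Sid": KMS_PIVOT_ROLE_PERMISSIONS_SID,
                "Effect": "Allow",
                "Principal": {"AWS": pivot_role_arn},
                # Hypothetical action list for illustration only:
                "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GetKeyPolicy", "kms:PutKeyPolicy"],
                "Resource": "*",
            }
        )
    return key_policy

policy = {"Version": "2012-10-17", "Statement": []}
ensure_pivot_role_statement(policy, "arn:aws:iam::111122223333:role/dataallPivotRole")
ensure_pivot_role_statement(policy, "arn:aws:iam::111122223333:role/dataallPivotRole")
print(len(policy["Statement"]))  # 1
```

Calling it from both the folder and bucket share managers leaves a single statement, which also addresses the duplicate-statement concern raised earlier.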

@noah-paige (Contributor)

Tested in AWS

- CICD Pipeline Succeeds

For created dataset:

- Revoke existing Bucket Share - successful
- Process Bucket Share (share succeeds, no additional kms statement for pivotRole)
- Revoke Bucket Share (share revoked)
- Process Folder Share (share succeeds, no additional kms statement for pivotRole)
- Revoke Folder Share (share revoked)

For imported dataset with imported KMS Key:

- Process Bucket Share (share succeeds, adds kms statement for pivotRole)
- Revoke Bucket Share (share revoked, pivotRole still has permissions)
- Process Folder Share (share succeeds, adds kms statement for pivotRole)
- Revoke Folder Share (share revoked, pivotRole still has permissions)

@noah-paige (Contributor)

I finished my testing and everything is working perfectly. My one minor comment is that I am not a huge fan of swapping kms_client.get_key_id(key_alias) for kms_client.get_key_id_using_list_aliases(key_alias).

It feels a little hacky, or just not as straightforward, to list the key aliases, iterate through to find the matching one, and then parse out the associated KMS key id from there.

I know @anushka-singh there were some issues earlier in testing with kms_client.get_key_id(key_alias), but I would be inclined to try the above code after reverting back to kms_client.get_key_id(key_alias) to see if it works. I plan to do so now.

If you look at some of the earlier functions called in the S3 bucket share processor, such as grant_s3_iam_access(), the methods use kms_client.get_key_id(key_alias) rather than the other way of getting the key id,

so I do not think it should be an issue to do the same when updating the key policy in update_dataset_bucket_key_policy() or the equivalent delete method.

All that being said, if @anushka-singh or @dlpzx you think the above is not a requirement, I can approve as is and treat the above as a later enhancement as required.

@anushka-singh (Contributor Author)


We did try out using kms_client.get_key_id(key_alias). The issue was a chicken-and-egg problem: to get the key id and then the existing policy, the pivot role needed access to KMS, but we were only adding that access later in the share manager. The error was: `Failed to get kms key id of alias/di2: An error occurred (AccessDeniedException) when calling the DescribeKey operation: User: arn:aws:sts::xxxx:assumed-role/dataallPivotRole/dataallPivotRole is not authorized to perform: kms:DescribeKey on resource: arn:aws:kms:us-east-1:xxxx:key/yyyy because no resource-based policy allows the kms:DescribeKey action`

DescribeKey is used in the get_key_id function and needs some extra privileges. ListAliases, however, did not need those extra permissions, which is why Adriana and I decided to use ListAliases, which we already use in a similar pattern in the check_key_exists function.
Having said that, please let me know if you are able to make it work with just get_key_id in your trial and we can implement that.
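For illustration, the list-aliases approach described above can be sketched roughly like this (the function shape and client wiring are assumptions; data.all's real implementation differs). The key point is that it relies on kms:ListAliases, which is not authorized per-key, rather than kms:DescribeKey:

```python
def get_key_id_using_list_aliases(kms_client, key_alias: str):
    """Resolve a KMS key id by paging through ListAliases.

    Unlike DescribeKey, ListAliases does not require a statement in the
    specific key's resource policy, so it works before the pivot role has
    been granted access in the key policy.
    """
    paginator = kms_client.get_paginator("list_aliases")
    for page in paginator.paginate():
        for alias in page.get("Aliases", []):
            if alias.get("AliasName") == key_alias:
                return alias.get("TargetKeyId")
    return None
```

With boto3, `kms_client` would be something like `session.client("kms", region_name=region)`.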

@noah-paige (Contributor) commented Jan 11, 2024

Understood the above, but then in a hypothetical bucket share how does it work, since the step grant_s3_iam_access() uses the KMSClient.get_key_id() function and is executed before calling grant_dataset_bucket_key_policy(), which updates the KMS key policy?

The pivotRole should have kms:DescribeKey permissions on all KMS keys (it is one of the few wildcard permissions we permit), so as long as the key has the default KMS key policy (i.e. delegating permission management to IAM), I would presume it should be okay. Will report back with testing findings though!

@noah-paige (Contributor) commented Jan 11, 2024

Replacing get_key_id_using_list_aliases() with get_key_id() I tested in AWS again:

Tested in AWS

- CICD Pipeline Succeeds

For created dataset:

- Process Bucket Share (share succeeds, no additional kms statement for pivotRole)
- Revoke Bucket Share (share revoked)
- Process Folder Share (share succeeds, no additional kms statement for pivotRole)
- Revoke Folder Share (share revoked)

For imported dataset with imported KMS Key:

- Process Bucket Share (share succeeds, adds kms statement for pivotRole)
- Revoke Bucket Share (share revoked, pivotRole still has permissions)
- Process Folder Share (share succeeds, adds kms statement for pivotRole)
- Revoke Folder Share (share revoked, pivotRole still has permissions)

@anushka-singh if you have some time to validate the same on your end, and if all works, I would encourage switching back to get_key_id() and removing the get_key_id_using_list_aliases() function; I am happy to approve from there.

@anushka-singh (Contributor Author)


Hey Noah!
Thanks for testing this theory out. I was able to replicate the same behavior on my end: using get_key_id() instead of the get_key_id_using_list_aliases() function, share and revoke operations were successful. I have pushed another revision with that version. Please approve and merge if you see fit.

@noah-paige (Contributor) left a comment

Tested the changes in AWS and all looks good, approving now!

@noah-paige merged commit 8b34dd5 into data-dot-all:main Jan 11, 2024
8 checks passed
Successfully merging this pull request may close these issues:

- Dataset sharing not working for imported dataset after enabling auto create pivot Role