
Add Dataset integration tests - Dataset CRUD + actions outside of data.all #1379

Merged · 20 commits · Jul 9, 2024

Conversation

@dlpzx (Contributor) commented Jul 2, 2024

Feature or Bugfix

  • Feature

Detail

It implements some tests for s3_datasets (check the full list in #1358).

For fresh deployments

  • Create Dataset
  • Import Dataset --> IMPORTANT: see the details below on AWS actions for testing
  • List Datasets
  • Get Dataset
  • Edit Dataset
  • Delete dataset - decision: I only added an explicit test for delete_unauthorized, since delete_dataset is already covered by the fixtures and deploying + deleting a dataset takes a long time. If needed we can introduce the test for better reporting
  • Access dataset assume role url
  • Generate dataset access token
  • Dataset upload data presigned url
  • Backwards compatibility - update dataset
  • Backwards compatibility - import dataset
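To illustrate the shape of these tests, here is a minimal hedged sketch of the delete_unauthorized case. The `delete_dataset` helper below is a stand-in for the real data.all GraphQL client wrapper, not the actual suite code:

```python
# Minimal sketch of an unauthorized-delete test. The helper simulates the
# data.all API wrapper, which in the real suite raises on authorization errors.
import pytest


def delete_dataset(client, dataset_uri):
    # Stand-in for the real GraphQL wrapper: a non-owner client triggers an
    # UnauthorizedOperation error instead of deleting the dataset.
    raise PermissionError('UnauthorizedOperation on DELETE_DATASET')


def test_delete_dataset_unauthorized():
    # The second client's group does not own the dataset, so the call must fail.
    with pytest.raises(PermissionError, match='UnauthorizedOperation'):
        delete_dataset(client='client2', dataset_uri='some-dataset-uri')
```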

🔦 AWS actions outside of data.all
There are some actions that in real life are performed outside of data.all. To run the tests we need to either perform these actions manually before the tests are executed, or use the AWS SDK to automate them. The most important actions performed outside of data.all are:

  • Creation of consumption roles
  • Creation of imported dataset bucket, kms key and glue database *IN THIS PR
  • Create VPCs for Notebooks
  • Validate shares - we assume the share request role for this

To create resources we need to assume a role in the environment account. We could assume the pivot role, but then we would need to ensure that it has CreateBucket... permissions, which is not the case. I have opted to create a separate isolated role dataall-integration-tests-role as part of the environment stack ONLY when we are creating environments during integration testing. As part of the global config of environments, users can use the boto3 session of this role to perform direct AWS calls in the environment account.
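A hedged sketch of how the tests could obtain a boto3 session for that role. The role name dataall-integration-tests-role comes from this PR; the session name and helper layout are illustrative assumptions:

```python
def integration_role_arn(environment_account_id: str) -> str:
    # Role name as described in this PR; standard IAM ARN layout.
    return f'arn:aws:iam::{environment_account_id}:role/dataall-integration-tests-role'


def get_environment_session(environment_account_id: str, region: str):
    """Assume the integration-tests role and return a boto3 Session for it."""
    import boto3  # imported lazily so the pure helper above works without boto3 installed

    sts = boto3.client('sts')
    creds = sts.assume_role(
        RoleArn=integration_role_arn(environment_account_id),
        RoleSessionName='dataall-integration-tests',
    )['Credentials']
    return boto3.Session(
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
        region_name=region,
    )
```

With such a session the fixtures can make direct AWS calls in the environment account, e.g. `session.client('s3').create_bucket(...)` for the imported dataset bucket.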

In https://github.com/data-dot-all/dataall/pull/1382/files we discussed some alternatives. In this PR we use the environmentType variable of the environment model, which was previously unused (it always defaulted to Data environments): the create environment API call takes environmentType = IntegrationTesting as input, and in the environment stack we check the environment type and deploy the integration test role accordingly.

Then we use an SSM parameter to read the tooling account id needed for the assume-role trust policy.
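The environment-stack side can be sketched with two pure helpers. The SSM parameter name and policy shape here are assumptions; the real stack is CDK code:

```python
import json

# Assumed SSM parameter holding the tooling (CICD) account id; the real
# parameter name may differ.
TOOLING_ACCOUNT_SSM_PARAMETER = '/dataall/toolingAccount'


def should_deploy_integration_role(environment_type: str) -> bool:
    # Only environments created with environmentType=IntegrationTesting
    # get the extra dataall-integration-tests-role.
    return environment_type == 'IntegrationTesting'


def integration_role_trust_policy(tooling_account_id: str) -> str:
    # Trust policy letting the tooling account (read from SSM) assume the role.
    return json.dumps({
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Principal': {'AWS': f'arn:aws:iam::{tooling_account_id}:root'},
                'Action': 'sts:AssumeRole',
            }
        ],
    })
```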

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dlpzx dlpzx changed the title Feat/integration tests datasets Add Dataset integration tests - 1 Jul 2, 2024
@dlpzx dlpzx changed the title Add Dataset integration tests - 1 Add Dataset integration tests - Dataset CRUD + actions outside of data.all Jul 2, 2024
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch 2 times, most recently from 9dafa90 to 54d5ff8 Compare July 2, 2024 11:21
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 54d5ff8 to cd27097 Compare July 2, 2024 11:25
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 91a6afe to fa69dde Compare July 3, 2024 15:27
@dlpzx (Contributor, Author) commented Jul 8, 2024

Testing locally

  • Create Environment outside of testing - does not create testing integration role
  • Create Environment as part of tests - creates testing integration role
  • tests succeed and the created infra (bucket, Glue database, KMS key) gets deleted

Testing in AWS

  • CICD pipeline can assume integration roles and create imported dataset infra
  • tests succeed (attach screenshot)

@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 430f5b8 to 5ea8b6b Compare July 8, 2024 13:14
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 83f3a07 to 520a34e Compare July 8, 2024 13:50
@noah-paige noah-paige marked this pull request as ready for review July 9, 2024 03:00
@noah-paige (Contributor) commented: Tested in AWS and all checks are passing

@dlpzx dlpzx requested a review from petrkalos July 9, 2024 09:10
…egration-tests-datasets

# Conflicts:
#	tests_new/integration_tests/core/environment/global_conftest.py

Review comments on this snippet:

```python
# TODO: Come up with a better way to handle wait-in-progress, if applicable
# Use time.sleep() instead of a poller b/c of the case where no changes are found (i.e. no update required)
# check_stack_in_progress(client1, env_uri, stack_uri)
time.sleep(10)
```
Contributor:
I think 10 seconds is not enough to make sure that cdkproxy issues a cdk deploy. It takes about 30 seconds for Fargate to boot the container, another 30 seconds to execute the cdk deploy, and then probably a few more seconds for the cdk deploy to return `no diffs`.

The best solution would be to make updateStack return a taskId and then expose the APIs (that we already have in the backend) to track the status of this taskId.
But a good enough solution would be the following:
updateStack/getStack return a list of events. When we issue the updateStack we can record the latest event, then poll with getStack until we get a new event; at that point we will know for sure that the cdk deploy has started executing.
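The "good enough" approach could be sketched as follows; the `get_stack` callable and the shape of its `events` list stand in for the data.all API client, which this sketch does not reproduce exactly:

```python
# Hedged sketch: poll getStack until an event newer than the one recorded at
# updateStack time appears, i.e. until the cdk deploy has started.
import time


def wait_for_deploy_start(get_stack, stack_uri, last_event_id, timeout=120, interval=5):
    """Poll until get_stack reports an event newer than last_event_id."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        events = get_stack(stack_uri).get('events', [])
        if events and events[-1]['id'] != last_event_id:
            # A new event appeared: the deploy is underway.
            return events[-1]
        time.sleep(interval)
    raise TimeoutError(f'No new stack events for {stack_uri} within {timeout}s')
```

As noted below in the thread, this only works when the stack actually changes; a no-op cdk deploy may emit no new events, which is why the PR fell back to a longer sleep.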

Contributor:

When writing this I realised that if there are no changes in the stack, cdk deploy may not generate a new CFN event, so we need to double-check that.

Contributor:

You can disregard the comments above: when there are no changes in the stack, cdk deploy doesn't generate new events. As an alternative we can use getStackLogs.

Contributor Author:

For the time being I increased the sleep time to give ECS/CFN time to update

@dlpzx dlpzx requested a review from petrkalos July 9, 2024 12:52
@petrkalos (Contributor) left a comment:

I'd only suggest increasing the timeout to 2 minutes, as provisioning + running of cdkproxy takes about 1 minute.

@dlpzx dlpzx merged commit e9108ab into main Jul 9, 2024
10 checks passed
@dlpzx dlpzx deleted the feat/integration-tests-datasets branch July 17, 2024 12:25
dlpzx added a commit that referenced this pull request Sep 13, 2024
### Feature or Bugfix
- Feature: Testing

### Detail

In this PR we add new fixtures for S3 datasets that are used for the
S3/tables/folders tests, and also for the tests developed in
#1389:
- Fix the imported KMS dataset - there was an error in the KMS keys and in
the registration of the Glue database
- Folders as a separate fixture using the create_folder data.all API
- Tables as a separate fixture using boto3 calls to create the table and
upload data, and then the sync_tables data.all API - the data can be
queried!
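The table-fixture flow could look roughly like this. The column layout, bucket/key names, and the `create_table_fixture` helper are illustrative assumptions; only the overall create-table / upload-data / sync_tables sequence comes from the PR:

```python
# Hedged sketch: create a Glue table with boto3, upload a small CSV to the
# dataset bucket, then call the data.all sync_tables API (not shown here).


def glue_table_input(table_name: str, s3_location: str) -> dict:
    # Minimal TableInput for glue.create_table describing a CSV-backed table.
    return {
        'Name': table_name,
        'TableType': 'EXTERNAL_TABLE',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'id', 'Type': 'int'},
                {'Name': 'value', 'Type': 'string'},
            ],
            'Location': s3_location,
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                'Parameters': {'field.delim': ','},
            },
        },
    }


def create_table_fixture(session, database: str, bucket: str, table_name: str):
    # session is a boto3.Session assumed from the integration-tests role.
    s3_location = f's3://{bucket}/{table_name}/'
    session.client('glue').create_table(
        DatabaseName=database, TableInput=glue_table_input(table_name, s3_location)
    )
    session.client('s3').put_object(
        Bucket=bucket, Key=f'{table_name}/data.csv', Body=b'1,alpha\n2,beta\n'
    )
    # ...then the fixture calls the data.all sync_tables API for the dataset.
```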

This PR moves dataset_base testing scenarios to
datasets_base/test_dataset.py. Testing scenarios have been defined for
the S3 datasets and the remaining test scenarios for the datasets_base
APIs are defined with their signature and a TODO comment.

It also splits the S3 dataset tests into their corresponding API
subcategories (in `backend/.../s3_datasets/api`)
- test_s3_datasets
- test_s3_tables
- test_s3_tables_profiling
- test_s3_tables_columns
- test_s3_folders

Implement testing scenarios for `test_s3_folders` covering all APIs and
dataset types (parametrized tests). Note that to avoid duplication of
tests, unauthorized test cases are tested with only one of the dataset
types as the code executed is the same for all cases.

Implement testing scenarios for `test_s3_tables` covering all APIs and
dataset types (parametrized tests). Same as folders, unauthorized tests
are performed on a single dataset type. New tests include: sync_tables
with real tables, preview tables with real tables, preview unauthorized
depending on the confidentiality level, get_dataset_level,
list_dataset_tables

For `test_s3_datasets` only test_create_dataset_unauthorized is added,
but for the other existing tests we add tests for all dataset types
(parametrized tests).


#### Next steps
In follow-up PRs we should implement the missing commented TODO tests
for:
- datasets_base ---> list owned tests
- s3_datasets ---> list owned tests
- s3_tables ---> data filters tests
- s3_tables_profiling ---> some tests
- s3_tables_columns ---> all tests
- Review backwards compatibility tests and add table and folder test
cases

### Relates
- #1379

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <dlpzx@amazon.com>
Successfully merging this pull request may close these issues.

Integration tests executed on a real deployment as part of the CICD - Datasets
3 participants