
Add Dataset integration tests - Dataset CRUD + actions outside of data.all #1379

Merged · 20 commits · Jul 9, 2024

Conversation

@dlpzx (Contributor) commented Jul 2, 2024

Feature or Bugfix

  • Feature

Detail

It implements some tests for s3_datasets (check the full list in #1358).

For fresh deployments

  • Create Dataset
  • Import Dataset --> IMPORTANT: see the details below on AWS actions for testing
  • List Datasets
  • Get Dataset
  • Edit Dataset
  • Delete dataset - decision: I only added an explicit test for delete_unauthorized, since delete_dataset is already covered by the fixtures and deploying + deleting a dataset takes a long time. If needed we can introduce the test for better reporting
  • Access dataset assume role url
  • Generate dataset access token
  • Dataset upload data presigned url
  • Backwards compatibility - update dataset
  • Backwards compatibility - import dataset
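To illustrate the shape of these tests, here is a minimal hedged sketch of the delete_unauthorized case. The `delete_dataset` helper below is a stand-in for the real data.all GraphQL client wrapper, not the actual suite code:

```python
# Minimal sketch of an unauthorized-delete test. The helper simulates the
# data.all API wrapper, which in the real suite raises on authorization errors.
import pytest


def delete_dataset(client, dataset_uri):
    # Stand-in for the real GraphQL wrapper: a non-owner client triggers an
    # UnauthorizedOperation error instead of deleting the dataset.
    raise PermissionError('UnauthorizedOperation on DELETE_DATASET')


def test_delete_dataset_unauthorized():
    # The second client's group does not own the dataset, so the call must fail.
    with pytest.raises(PermissionError, match='UnauthorizedOperation'):
        delete_dataset(client='client2', dataset_uri='some-dataset-uri')
```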

🔦 AWS actions outside of data.all
There are some actions that in real life are performed outside of data.all. To run the tests we need to either perform these actions manually before the tests are executed, or use the AWS SDK to automate them. The most important actions performed outside of data.all are:

  • Creation of consumption roles
  • Creation of imported dataset bucket, kms key and glue database *IN THIS PR
  • Create VPCs for Notebooks
  • Validate shares - we assume the share request role for this

To create resources we need to assume a role in the environment account. We could assume the pivot role, but then we would need to ensure that it has CreateBucket... permissions, which is not the case. I have opted to create a separate isolated role dataall-integration-tests-role as part of the environment stack ONLY when we are creating environments during integration testing. As part of the global config of environments, users can use the boto3 session of this role to perform direct AWS calls in the environment account.
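A hedged sketch of how the tests could obtain a boto3 session for that role. The role name dataall-integration-tests-role comes from this PR; the session name and helper layout are illustrative assumptions:

```python
def integration_role_arn(environment_account_id: str) -> str:
    # Role name as described in this PR; standard IAM ARN layout.
    return f'arn:aws:iam::{environment_account_id}:role/dataall-integration-tests-role'


def get_environment_session(environment_account_id: str, region: str):
    """Assume the integration-tests role and return a boto3 Session for it."""
    import boto3  # imported lazily so the pure helper above works without boto3 installed

    sts = boto3.client('sts')
    creds = sts.assume_role(
        RoleArn=integration_role_arn(environment_account_id),
        RoleSessionName='dataall-integration-tests',
    )['Credentials']
    return boto3.Session(
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
        region_name=region,
    )
```

With such a session the fixtures can make direct AWS calls in the environment account, e.g. `session.client('s3').create_bucket(...)` for the imported dataset bucket.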

In https://github.com/data-dot-all/dataall/pull/1382/files we discussed some alternatives. In this PR we use the environmentType variable of the environment model, which was previously unused (it always defaulted to Data environments): the create environment API call takes environmentType = IntegrationTesting as input, and in the environment stack we check the environment type and deploy the integration test role accordingly.

Then we use an SSM parameter to read the tooling account id needed for the assume-role trust policy.
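The environment-stack side can be sketched with two pure helpers. The SSM parameter name and policy shape here are assumptions; the real stack is CDK code:

```python
import json

# Assumed SSM parameter holding the tooling (CICD) account id; the real
# parameter name may differ.
TOOLING_ACCOUNT_SSM_PARAMETER = '/dataall/toolingAccount'


def should_deploy_integration_role(environment_type: str) -> bool:
    # Only environments created with environmentType=IntegrationTesting
    # get the extra dataall-integration-tests-role.
    return environment_type == 'IntegrationTesting'


def integration_role_trust_policy(tooling_account_id: str) -> str:
    # Trust policy letting the tooling account (read from SSM) assume the role.
    return json.dumps({
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Principal': {'AWS': f'arn:aws:iam::{tooling_account_id}:root'},
                'Action': 'sts:AssumeRole',
            }
        ],
    })
```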

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dlpzx dlpzx changed the title Feat/integration tests datasets Add Dataset integration tests - 1 Jul 2, 2024
@dlpzx dlpzx changed the title Add Dataset integration tests - 1 Add Dataset integration tests - Dataset CRUD + actions outside of data.all Jul 2, 2024
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch 2 times, most recently from 9dafa90 to 54d5ff8 Compare July 2, 2024 11:21
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 54d5ff8 to cd27097 Compare July 2, 2024 11:25
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 91a6afe to fa69dde Compare July 3, 2024 15:27
@dlpzx (Contributor, Author) commented Jul 8, 2024

Testing locally

  • Create Environment outside of testing - does not create testing integration role
  • Create Environment as part of tests - creates testing integration role
  • tests succeed and the created infra (bucket, Glue database, KMS key) gets deleted

Testing in AWS

  • CICD pipeline can assume integration roles and create imported dataset infra
  • tests succeed (attach screenshot)

@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 430f5b8 to 5ea8b6b Compare July 8, 2024 13:14
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets branch from 83f3a07 to 520a34e Compare July 8, 2024 13:50
@noah-paige noah-paige marked this pull request as ready for review July 9, 2024 03:00
@noah-paige (Contributor) commented: Tested in AWS and all checks are passing

@dlpzx dlpzx requested a review from petrkalos July 9, 2024 09:10
…egration-tests-datasets

# Conflicts:
#	tests_new/integration_tests/core/environment/global_conftest.py

Review comments on this snippet:

```python
# TODO: Come up with a better way to handle wait-in-progress, if applicable
# Use time.sleep() instead of a poller b/c of the case where no changes are found (i.e. no update required)
# check_stack_in_progress(client1, env_uri, stack_uri)
time.sleep(10)
```
Contributor:
I think 10 seconds is not enough to make sure that cdkproxy issues a cdk deploy. It takes about 30 seconds for Fargate to boot the container, another 30 seconds to execute the cdk deploy, and then probably a few more seconds for the cdk deploy to return `no diffs`.

The best solution would be to make updateStack return a taskId and then expose the APIs (that we already have in the backend) to track the status of this taskId.
But a good enough solution would be the following:
updateStack/getStack return a list of events. When we issue the updateStack we can record the latest event, then poll with getStack until we get a new event; at that point we will know for sure that the cdk deploy has started executing.
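The "good enough" approach could be sketched as follows; the `get_stack` callable and the shape of its `events` list stand in for the data.all API client, which this sketch does not reproduce exactly:

```python
# Hedged sketch: poll getStack until an event newer than the one recorded at
# updateStack time appears, i.e. until the cdk deploy has started.
import time


def wait_for_deploy_start(get_stack, stack_uri, last_event_id, timeout=120, interval=5):
    """Poll until get_stack reports an event newer than last_event_id."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        events = get_stack(stack_uri).get('events', [])
        if events and events[-1]['id'] != last_event_id:
            # A new event appeared: the deploy is underway.
            return events[-1]
        time.sleep(interval)
    raise TimeoutError(f'No new stack events for {stack_uri} within {timeout}s')
```

As noted below in the thread, this only works when the stack actually changes; a no-op cdk deploy may emit no new events, which is why the PR fell back to a longer sleep.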

Contributor:

When writing this I realised that if there are no changes in the stack, cdk deploy may not generate a new CFN event, so we need to double-check that.

Contributor:

You can disregard the comments above: when there are no changes in the stack, cdk deploy doesn't generate new events. As an alternative we can use getStackLogs.

Contributor Author:

For the time being I increased the sleep time to give ECS/CFN time to update

@dlpzx dlpzx requested a review from petrkalos July 9, 2024 12:52
@petrkalos (Contributor) left a comment:

I'd only suggest increasing the timeout to 2 minutes, as provisioning + running of cdkproxy takes about 1 minute.

@dlpzx dlpzx merged commit e9108ab into main Jul 9, 2024
10 checks passed
@dlpzx dlpzx deleted the feat/integration-tests-datasets branch July 17, 2024 12:25
dlpzx added a commit that referenced this pull request Sep 13, 2024
### Feature or Bugfix
- Feature: Testing

### Detail

In this PR we add new fixtures for S3 datasets that are used for the
S3/tables/folders tests, and also for the tests developed in
#1389:
- Fix the imported KMS dataset - there was an error in the KMS keys and in
the registration of the Glue database
- Folders as a separate fixture using the create_folder data.all API
- Tables as a separate fixture using boto3 calls to create the table and
upload data, and then the sync_tables data.all API - the data can be
queried!
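The table-fixture flow could look roughly like this. The column layout, bucket/key names, and the `create_table_fixture` helper are illustrative assumptions; only the overall create-table / upload-data / sync_tables sequence comes from the PR:

```python
# Hedged sketch: create a Glue table with boto3, upload a small CSV to the
# dataset bucket, then call the data.all sync_tables API (not shown here).


def glue_table_input(table_name: str, s3_location: str) -> dict:
    # Minimal TableInput for glue.create_table describing a CSV-backed table.
    return {
        'Name': table_name,
        'TableType': 'EXTERNAL_TABLE',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'id', 'Type': 'int'},
                {'Name': 'value', 'Type': 'string'},
            ],
            'Location': s3_location,
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                'Parameters': {'field.delim': ','},
            },
        },
    }


def create_table_fixture(session, database: str, bucket: str, table_name: str):
    # session is a boto3.Session assumed from the integration-tests role.
    s3_location = f's3://{bucket}/{table_name}/'
    session.client('glue').create_table(
        DatabaseName=database, TableInput=glue_table_input(table_name, s3_location)
    )
    session.client('s3').put_object(
        Bucket=bucket, Key=f'{table_name}/data.csv', Body=b'1,alpha\n2,beta\n'
    )
    # ...then the fixture calls the data.all sync_tables API for the dataset.
```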

This PR moves dataset_base testing scenarios to
datasets_base/test_dataset.py. Testing scenarios have been defined for
the S3 datasets and the remaining test scenarios for the datasets_base
APIs are defined with their signature and a TODO comment.

It also splits the S3 dataset tests into their corresponding API
subcategories (in `backend/.../s3_datasets/api`)
- test_s3_datasets
- test_s3_tables
- test_s3_tables_profiling
- test_s3_tables_columns
- test_s3_folders

Implement testing scenarios for `test_s3_folders` covering all APIs and
dataset types (parametrized tests). Note that to avoid duplication of
tests, unauthorized test cases are tested with only one of the dataset
types as the code executed is the same for all cases.

Implement testing scenarios for `test_s3_tables` covering all APIs and
dataset types (parametrized tests). Same as folders, unauthorized tests
are performed on a single dataset type. New tests include: sync_tables
with real tables, preview tables with real tables, preview unauthorized
depending on the confidentiality level, get_dataset_level,
list_dataset_tables

For `test_s3_datasets` only test_create_dataset_unauthorized is added,
but for the other existing tests we add tests for all dataset types
(parametrized tests).


#### Next steps
In follow-up PRs we should implement the missing commented TODO tests
for:
- datasets_base ---> list owned tests
- s3_datasets ---> list owned tests
- s3_tables ---> data filters tests
- s3_tables_profiling ---> some tests
- s3_tables_columns ---> all tests
- Review backwards compatibility tests and add table and folder test
cases

### Relates
- #1379

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <dlpzx@amazon.com>
Successfully merging this pull request may close these issues.

Integration tests executed on a real deployment as part of the CICD - Datasets
3 participants