Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dataset integration tests - Tables, Folders #1391

Merged
merged 49 commits into from
Sep 13, 2024

Conversation

noah-paige
Copy link
Contributor

@noah-paige noah-paige commented Jul 9, 2024

Feature or Bugfix

  • Feature: Testing

Detail

In this PR we add new fixtures for S3 datasets that are used for the tests in S3/tables/folders but also for the tests developed in ##1389:

  • Fix imported KMS dataset - there was an error in the KMS keys and in the registration of the Glue database
  • Folders as separate fixture using create_folder data.all API
  • Tables as separate fixture using boto3 calls to create the table, upload data and then use sync_tables data.all API - the data can be queried!

This PR moves dataset_base testing scenarios to datasets_base/test_dataset.py. Testing scenarios have been defined for the S3 datasets and the remaining test scenarios for the datasets_base APIs are defined with their signature and a TODO comment.

It also splits the S3 dataset tests into their corresponding API subcategories (in backend/.../s3_datasets/api)

  • test_s3_datasets
  • test_s3_tables
  • test_s3_tables_profiling
  • test_s3_tables_columns
  • test_s3_folders

Implement testing scenarios for test_s3_folders covering all APIs and dataset types (parametrized tests). Note that to avoid duplication of tests, unauthorized test cases are tested with only one of the dataset types as the code executed is the same for all cases.

Implement testing scenarios for test_s3_tables covering all APIs and dataset types (parametrized tests). Same as folders, unauthorized tests are performed on a single dataset type. New tests include: sync_tables with real tables, preview tables with real tables, preview unauthorized depending on the confidentiality level, get_dataset_level, list_dataset_tables

For test_s3_datasets only test_create_dataset_unauthorized is added, but for other existing tests we add test for all dataset types (parametrized tests).

Next steps

In follow-up PRs we should implement the missing commented TODO tests for:

  • datasets_base ---> list owned tests
  • s3_datasets ---> list owned tests
  • s3_tables ---> data filters tests
  • s3_tables_profiling ---> some tests
  • s3_tables_columns ---> all tests
  • Review backwards compatibility tests and add table and folder test cases

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use a standard proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dlpzx dlpzx changed the title Feat/integration tests datasets pt2 Add Dataset integration tests - Tables, Folders Jul 9, 2024
dlpzx added 2 commits July 9, 2024 15:40
…egration-tests-datasets-pt2

# Conflicts:
#	tests_new/integration_tests/core/environment/global_conftest.py
#	tests_new/integration_tests/modules/s3_datasets/global_conftest.py
#	tests_new/integration_tests/modules/s3_datasets/test_s3_dataset.py
dataset_name, client, group, env, bucket=None, kms_alias=None, glue_database=None
):
dataset_name = 'persistent_s3_dataset1'
def get_or_create_persistent_s3_dataset(dataset_name, client, group, env, bucket=None, kms_alias='', glue_database=''):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataset_name is further used only for bucket naming and tags. Let's make it a dataset name as well. Otherwise all datasets has the same name TestDatasetCreated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@dlpzx dlpzx force-pushed the feat/integration-tests-datasets-pt2 branch from 6e9fbb1 to 5358677 Compare September 12, 2024 07:56
@dlpzx
Copy link
Contributor

dlpzx commented Sep 12, 2024

Tested with latest changes locally and in a real CICD in AWS

@dlpzx dlpzx force-pushed the feat/integration-tests-datasets-pt2 branch from 7e51219 to 4084a3b Compare September 12, 2024 12:05
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets-pt2 branch from 4084a3b to 0eb4f73 Compare September 12, 2024 12:41
@dlpzx dlpzx force-pushed the feat/integration-tests-datasets-pt2 branch from 0eb4f73 to 590909b Compare September 12, 2024 13:35
Copy link
Contributor

@SofiaSazonova SofiaSazonova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested in my deployment. Good to go

).contains('UnauthorizedOperation', 'PROFILE_DATASET_TABLE', dataset_uri)


def test_list_table_profiling_runs():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could mark them as @pytest.mark.skip(reason="no way of currently testing this") to avoid polluting the reports

@dlpzx dlpzx merged commit 405019d into main Sep 13, 2024
9 checks passed
noah-paige added a commit that referenced this pull request Sep 16, 2024
### Feature or Bugfix
<!-- please choose -->
- Feature


### Detail
- Adding integration tests for Dataset Table Data Filters

- PENDING TESTS PASSING IN DEV AWS ENV
- Merge after #1391

### Relates
- related to #1220 and
#1358


### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <dlpzx@amazon.com>
dlpzx added a commit that referenced this pull request Sep 18, 2024
#1533)

### Feature or Bugfix
- Feature: Tests
TO BE MERGED AFTER #1391

### Detail
Follow-up of #1391. This PR
adds:
- Tests for profiling jobs - because it is an easy submodule I decided
to "chain" the tests and make them one dependent on the next one. I
could also create a fixture for a profiling job (check warning,
profiling jobs cannot be deleted)
- Added missing tests in datasets_base - we still need to add redshift
datasets and other types of datasets every time there is a new dataset
added.
- Added missing tests in s3_datasets:
test_list_s3_datasets_owned_by_env_group.

⚠️ Issues discovered during testing.
They are not bugs, they are missing functionalities:
- Profiling jobs can never be deleted. It is just information on the RDS
database, but nevertheless it cannot be deleted.
- It would be nice to have an API that checks the status of a Glue
crawler

### Relates
- #1358
- #1391
### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Noah Paige <noahpaig@amazon.com>
@dlpzx dlpzx deleted the feat/integration-tests-datasets-pt2 branch September 19, 2024 09:28
dlpzx added a commit that referenced this pull request Sep 24, 2024
### Feature or Bugfix
- Feature: Testing

### Detail
Follow-up of #1391 

- Implement Table Column tests

### Relates
- #1358
- #1391 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: Noah Paige <noahpaig@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants