Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dataset integration tests - Tables, Folders #1391

Merged
merged 49 commits into from
Sep 13, 2024
Merged
Show file tree
Hide file tree
Changes from 41 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
b188538
Add integration tests for datasets - basic queries and conftest
dlpzx Jul 1, 2024
45d1407
add list + get queries, add persistent datasets, begin create/update/…
noah-paige Jul 1, 2024
cd27097
Add integration test role in Environment stack + session in conftest …
dlpzx Jul 2, 2024
d04b525
simplified conftests for datasets
dlpzx Jul 2, 2024
5e5507e
create integration role with region in name
noah-paige Jul 2, 2024
fa69dde
New environment type: IntegrationTests + ssm param with tooling accou…
dlpzx Jul 3, 2024
3e19596
Error on cdk add_to_policy
dlpzx Jul 3, 2024
c05de67
Add filter term include tags datasets
noah-paige Jul 4, 2024
8f2a918
Add sample data and tests for dataset role access
noah-paige Jul 4, 2024
9b2c711
Add sample data and tests for dataset role access
noah-paige Jul 4, 2024
2dcd60f
Add assume role permissions to codebuild role
dlpzx Jul 8, 2024
c261da7
Add naming checks in clients + create table
dlpzx Jul 8, 2024
1e9732b
Add permissions, confidentiality and commented tests
dlpzx Jul 8, 2024
5ea8b6b
revert persistent environment
dlpzx Jul 8, 2024
520a34e
Fix check_stack_ready in dataset creation
dlpzx Jul 8, 2024
972c883
Revert session environment and add tests
dlpzx Jul 8, 2024
7b1c942
fix integration role datasets
noah-paige Jul 8, 2024
d9042dc
Fix presigned URL upload test
noah-paige Jul 9, 2024
928c3aa
Merge remote-tracking branch 'refs/remotes/origin/main' into feat/int…
dlpzx Jul 9, 2024
b633938
Uncomment drafted table/folder tests
dlpzx Jul 9, 2024
2330021
Merge branch 'refs/heads/main' into feat/integration-tests-datasets-pt2
dlpzx Sep 5, 2024
a857d0a
Ruff and readme
dlpzx Sep 5, 2024
5968fd3
Split dataset tests and added signature of each test for all APIs. Fi…
dlpzx Sep 5, 2024
052bc7e
Added all dataset query definitions and placeholders for tests
dlpzx Sep 6, 2024
9ad774e
Started parametrization of tests
dlpzx Sep 6, 2024
146c45e
Started parametrization of tests
dlpzx Sep 6, 2024
0907ea3
Started parametrization of tests
dlpzx Sep 6, 2024
7f68d3b
Started parametrization of tests
dlpzx Sep 6, 2024
98c7667
Added persistent tables and folders
dlpzx Sep 9, 2024
92f23e3
Remove unnecessary tests in folders
dlpzx Sep 9, 2024
3907001
Fix issues with KMS datasets
dlpzx Sep 9, 2024
3553d12
Temporary changes for persistent datasets
dlpzx Sep 9, 2024
9be25bb
Add paramtrization in profiling, confidentiality, fix issue in glue t…
dlpzx Sep 9, 2024
1709bb5
Fix s3_table tests, parametrized dataset tests
dlpzx Sep 9, 2024
6889c47
Retouch preview_table tests
dlpzx Sep 9, 2024
3d67e2e
Fixed profiling tables tests
dlpzx Sep 9, 2024
4fd56af
Fixed profiling tables tests
dlpzx Sep 9, 2024
56b12c5
Fix everything except for persistent-sse-s3 tests
dlpzx Sep 9, 2024
ade89e8
Fix API query to filter by tags + add README detail
dlpzx Sep 10, 2024
72259c4
Wrong SSM parameter in README
dlpzx Sep 10, 2024
3b74b03
Merge remote-tracking branch 'refs/remotes/origin/main' into feat/int…
dlpzx Sep 10, 2024
196fb6e
Moving fixture parameters to conftest
dlpzx Sep 10, 2024
d3bb8be
Update requisite in README
dlpzx Sep 10, 2024
ec38ec0
PR review comments - functions to create AWS imported resources, names
dlpzx Sep 11, 2024
ccb6887
PR review comments - 2
dlpzx Sep 11, 2024
01c65c8
Merge branch 'refs/heads/main' into feat/integration-tests-datasets-pt2
dlpzx Sep 11, 2024
5358677
Issue persistent buckets
dlpzx Sep 11, 2024
590909b
Rewrite if-clause existing infra and resource for imported dataset
dlpzx Sep 12, 2024
0674531
Small return issue
dlpzx Sep 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 6 additions & 5 deletions backend/dataall/modules/datasets_base/db/dataset_repositories.py
Original file line number Diff line number Diff line change
Expand Up @@ -65,8 +65,8 @@ def _query_all_user_datasets(session, username, groups, all_subqueries: List[Que
term = filter['term']
query = query.filter(
or_(
DatasetBase.description.ilike(term + '%%'),
DatasetBase.label.ilike(term + '%%'),
DatasetBase.label.ilike('%' + term + '%'),
DatasetBase.description.ilike('%' + term + '%'),
DatasetBase.tags.contains(f'{{{term}}}'),
)
)
Expand All @@ -90,10 +90,12 @@ def _query_user_datasets(session, username, groups, filter) -> Query:
)
)
if filter and filter.get('term'):
term = filter['term']
query = query.filter(
or_(
DatasetBase.description.ilike(filter.get('term') + '%%'),
DatasetBase.label.ilike(filter.get('term') + '%%'),
DatasetBase.label.ilike('%' + term + '%'),
DatasetBase.description.ilike('%' + term + '%'),
DatasetBase.tags.contains(f'{{{term}}}'),
)
)
return query.order_by(DatasetBase.label).distinct(DatasetBase.datasetUri, DatasetBase.label)
Expand Down Expand Up @@ -125,7 +127,6 @@ def _query_environment_datasets(session, uri, filter) -> Query:
DatasetBase.label.ilike('%' + term + '%'),
DatasetBase.description.ilike('%' + term + '%'),
DatasetBase.tags.contains(f'{{{term}}}'),
DatasetBase.region.ilike('%' + term + '%'),
)
)
return query.order_by(DatasetBase.label)
9 changes: 7 additions & 2 deletions tests_new/integration_tests/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ Currently **we support only Cognito based deployments** but support for any IdP
## Pre-requisites

- A real deployment of data.all in AWS
- An SSM parameter (`/{resource_prefix/{env_name}/testdata`) with the following contents
- An SSM parameter (`/dataall/{env_name}/testdata`) in the DEPLOYMENT ACCOUNT with the following contents
dlpzx marked this conversation as resolved.
Show resolved Hide resolved
```
{
"users": {
Expand Down Expand Up @@ -85,4 +85,9 @@ You can also run the tests locally by...

## Coverage

At the moment integration tests only cover Organizations module as an example.
At the moment integration tests cover:
- Organizations
- Environments
- S3 Datasets
- Notebooks
- Worksheets
10 changes: 5 additions & 5 deletions tests_new/integration_tests/client.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,17 +8,17 @@
ENVNAME = os.getenv('ENVNAME', 'dev')


def _retry_if_connection_error(exception):
dlpzx marked this conversation as resolved.
Show resolved Hide resolved
"""Return True if we should retry, False otherwise"""
return isinstance(exception, requests.exceptions.ConnectionError) or isinstance(exception, requests.ReadTimeout)


class Client:
def __init__(self, username, password):
self.username = username
self.password = password
self.token = self._get_jwt_token()

@staticmethod
def _retry_if_connection_error(exception):
"""Return True if we should retry, False otherwise"""
return isinstance(exception, requests.exceptions.ConnectionError) or isinstance(exception, requests.ReadTimeout)

@retry(
retry_on_exception=_retry_if_connection_error,
stop_max_attempt_number=3,
Expand Down
52 changes: 52 additions & 0 deletions tests_new/integration_tests/modules/datasets_base/queries.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,3 +59,55 @@ def list_datasets(client, term=''):
}
response = client.query(query=query)
return response.data.listDatasets


def list_owned_datasets(client, term=''):
query = {
'operationName': 'listOwnedDatasets',
'variables': {'filter': {'term': term}},
'query': f"""
query listOwnedDatasets($filter: DatasetFilter) {{
listOwnedDatasets(filter: $filter) {{
count
page
pages
hasNext
hasPrevious
nodes {{
{DATASET_BASE_TYPE}
}}
}}
}}
""",
}
response = client.query(query=query)
return response.data.listOwnedDatasets


def list_datasets_created_in_environment(client, environment_uri, term=''):
query = {
'operationName': 'ListDatasets',
'variables': {'environmentUri': environment_uri, 'filter': {'term': term}},
'query': f"""
query ListDatasetsCreatedInEnvironment(
$filter: DatasetFilter
$environmentUri: String!
) {{
listDatasetsCreatedInEnvironment(
environmentUri: $environmentUri
filter: $filter
) {{
count
page
pages
hasNext
hasPrevious
nodes {{
{DATASET_BASE_TYPE}
}}
}}
}}
""",
}
response = client.query(query=query)
return response.data.listDatasetsCreatedInEnvironment
44 changes: 44 additions & 0 deletions tests_new/integration_tests/modules/datasets_base/test_dataset.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
import logging
from assertpy import assert_that

from integration_tests.modules.datasets_base.queries import (
list_datasets,
list_owned_datasets,
list_datasets_created_in_environment,
)

log = logging.getLogger(__name__)


def test_list_datasets(
client1, session_s3_dataset1, session_imported_sse_s3_dataset1, session_imported_kms_s3_dataset1, session_id
):
assert_that(list_datasets(client1, term=session_id).nodes).is_length(3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case further we have to add new datasets, may be it's better not to fix the number?
Anyway, it's better to check the datasets by uri, I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment stays for other tests with fixed number

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is a list, I wanted to check that the listing of items returns the expected items. But it is true that it can cause issues with other existing test resources. That's why the term=session_id filter is applied

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already has this problem with my share_tests, but in table tests. I created another table for sharing tests, so e.g. here I have 3 tables, not 2

def test_list_dataset_tables(client1, dataset_fixture_name, tables_fixture_name, request):
    tables = request.getfixturevalue(tables_fixture_name)
    dataset = request.getfixturevalue(dataset_fixture_name)
    response = list_dataset_tables(client1, dataset.datasetUri)
    assert_that(response.tables.count).is_equal_to(2)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be I should reuse the existing datasets and don't create others, but I'll revise it after this PR is merged.
Anyway, in future we may have to add more tables/folders/datasets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that dataset test objects are controlled inside the dataset tests. Other modules can use the fixtures defined in global_conftest. So yes, I think in shares you should be using the datasets/tables and folders from the datasets module



def test_list_datasets_unauthorized(
client2, session_s3_dataset1, session_imported_sse_s3_dataset1, session_imported_kms_s3_dataset1, session_id
):
assert_that(list_datasets(client2, term=session_id).nodes).is_length(0)


def test_list_owned_datasets( # TODO
client1, session_s3_dataset1, session_imported_sse_s3_dataset1, session_imported_kms_s3_dataset1, session_id
):
assert_that(list_owned_datasets(client1, term=session_id).nodes).is_length(3)


def test_list_owned_datasets_unauthorized( # TODO
client2, session_s3_dataset1, session_imported_sse_s3_dataset1, session_imported_kms_s3_dataset1, session_id
):
assert_that(list_owned_datasets(client2, term=session_id).nodes).is_length(0)


def test_list_datasets_created_in_environment():
# TODO
pass


def test_list_datasets_created_in_environment_unauthorized():
# TODO
pass
90 changes: 82 additions & 8 deletions tests_new/integration_tests/modules/s3_datasets/aws_clients.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import logging
import json
import re
import os
from botocore.exceptions import ClientError

log = logging.getLogger(__name__)
Expand All @@ -12,18 +13,34 @@ def __init__(self, session, region):
self._resource = session.resource('s3', region_name=region)
self._region = region

def create_bucket(self, bucket_name, kms_key_id=None):
def bucket_exists(self, bucket_name):
dlpzx marked this conversation as resolved.
Show resolved Hide resolved
"""
Check if an S3 bucket exists.
:param bucket_name: Name of the S3 bucket to check
:return: True if the bucket exists, False otherwise
"""
try:
self._client.head_bucket(Bucket=bucket_name)
return True
except ClientError as e:
if e.response['Error']['Code'] == '404':
return False
else:
log.error(f'Error checking if bucket {bucket_name} exists: {e}')
raise

def create_bucket(self, bucket_name, kms_key_arn=None):
"""
Create an S3 bucket.
:param bucket_name: Name of the S3 bucket to be created
:param kms_key_id: KMS key ID to use for encryption if encryption_type is 'aws:kms'
:param kms_key_arn: KMS key Arn to use for encryption if encryption_type is 'aws:kms'
:return: None
"""
bucket_name = re.sub('[^a-zA-Z0-9-]', '', bucket_name).lower()

encryption_type = 'aws:kms' if kms_key_id else 'AES256'
encryption_type = 'aws:kms' if kms_key_arn else 'AES256'
encryption_config = (
{'SSEAlgorithm': encryption_type, 'KMSMasterKeyID': kms_key_id}
{'SSEAlgorithm': encryption_type, 'KMSMasterKeyID': kms_key_arn}
if encryption_type == 'aws:kms'
else {'SSEAlgorithm': encryption_type}
)
Expand All @@ -41,7 +58,7 @@ def create_bucket(self, bucket_name, kms_key_id=None):
Bucket=bucket_name,
ServerSideEncryptionConfiguration={
'Rules': [
{'ApplyServerSideEncryptionByDefault': encryption_config, 'BucketKeyEnabled': False},
{'ApplyServerSideEncryptionByDefault': encryption_config, 'BucketKeyEnabled': True},
]
},
)
Expand All @@ -67,12 +84,48 @@ def delete_bucket(self, bucket_name):
except ClientError as e:
log.exception(f'Error deleting S3 bucket: {e}')

def upload_file_to_prefix(self, local_file_path, s3_path):
"""
Upload a file from a local path to an S3 bucket with a specified prefix.

:param local_file_path: Path to the local file to be uploaded
:param s3_path: S3 path where the file should be uploaded, including the bucket name and prefix
:return: None
"""
try:
bucket_name, prefix = s3_path.split('/', 1)
object_key = f'{prefix}/{os.path.basename(local_file_path)}'
self._client.upload_file(local_file_path, bucket_name, object_key)
except ClientError as e:
logging.error(f'Error uploading file to S3: {e}')
raise


class KMSClient:
def __init__(self, session, account_id, region):
self._client = session.client('kms', region_name=region)
self._account_id = account_id

def get_key_id_and_alias(self, alias_name):
"""
Get the key ID and alias name for a given alias.
:param alias_name: The alias name to look up
:return: A tuple containing the key ID and alias name if the alias exists, False otherwise
"""
try:
alias_name = alias_name.lower()
response = self._client.describe_key(KeyId=f'alias/{alias_name}')
key_id = response['KeyMetadata']['KeyId']
aliases = response['KeyMetadata']['Aliases']
for alias in aliases:
if alias['AliasName'] == f'alias/{alias_name}':
return key_id, alias['AliasName']
except ClientError as e:
if e.response['Error']['Code'] == 'NotFoundException':
return False, False
else:
log.exception(f'Error getting key ID and alias for {alias_name}: {e}')

def create_key_with_alias(self, alias_name):
try:
response = self._client.create_key()
Expand Down Expand Up @@ -155,6 +208,20 @@ class GlueClient:
def __init__(self, session, region):
self._client = session.client('glue', region_name=region)

def get_database(self, database_name):
"""
Check if a Glue database exists.
:param database_name: Name of the Glue database to check
:return: True if the database exists, False otherwise
"""
try:
database = self._client.get_database(Name=database_name)
return database
except self._client.exceptions.EntityNotFoundException:
return False
except ClientError as e:
log.exception(f'Error checking if database {database_name} exists: {e}')

def create_database(self, database_name, bucket):
try:
database_name = re.sub('[^a-zA-Z0-9_]', '', database_name).lower()
Expand All @@ -172,16 +239,23 @@ def create_table(self, database_name, bucket, table_name):
'Description': 'integration tests',
'StorageDescriptor': {
'Columns': [
{'Name': 'column1', 'Type': 'string'},
{'Name': 'column1', 'Type': 'int'},
{'Name': 'column2', 'Type': 'string'},
{'Name': 'column3', 'Type': 'string'},
],
'Location': f's3://{bucket}/',
'Location': f's3://{bucket}/{table_name}/',
'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
'Compressed': False,
'SerdeInfo': {
'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
'Parameters': {'field.delim': ','},
},
},
},
)
except ClientError as e:
log.exception(f'Error creating Glue database: {e}')
log.exception(f'Error creating Glue table: {e}')

def delete_database(self, database_name):
"""
Expand Down
Loading
Loading