modularization: Datasets modularization pt.5 #442

nikpodsh · 2023-05-02T14:37:50Z

Feature or Bugfix

Refactoring

Detail

Refactoring of the Dataset entity and its related code.
Refactoring for Votes.
Introduced DataPolicy (used the same way as ServicePolicy).
Extracted dataset-related permissions.
Used the new has_tenant_permission instead of has_tenant_perm, which avoids passing unused parameters.
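
As an illustration of the last point, here is a minimal sketch of a permission check that needs only the permission name. Everything in it is hypothetical (the `RequestContext` class, `set_context`, the decorator body, and the permission string), not the actual data.all implementation; it only shows how reading the caller from an ambient context removes the need to pass unused parameters.

```python
from dataclasses import dataclass, field
from functools import wraps


@dataclass
class RequestContext:
    """Hypothetical per-request context holding the caller's identity."""
    username: str
    tenant_permissions: set = field(default_factory=set)


_context = None  # hypothetical module-level holder, set once per request


def set_context(context):
    """Store the context for the current request (assumption for this sketch)."""
    global _context
    _context = context


def has_tenant_permission(permission):
    """Sketch of a decorator that checks a tenant permission from the context."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # The caller is read from the context, so the decorated function
            # does not need extra username/groups arguments.
            if _context is None or permission not in _context.tenant_permissions:
                raise PermissionError(f"{permission} is required")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@has_tenant_permission("MANAGE_DATASETS")  # illustrative permission name
def create_dataset(label):
    return {"label": label}
```

With this shape, resolvers keep only the parameters they actually use; the trade-off is that the context has to be set before any decorated function is called.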

Relates

#412 and #295

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Moved dataset table column to modules

Renamed table column to the python's convention format

Added dataset module to config.json

Moved database table service

Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable

Moved DatasetTableColumn to modules

Currently, only async handlers require dedicated loading. Long-running tasks (scheduled tasks) might not need to have a dedicated loading mode

Extracted code from glue to glue_column_handler. Added handlers importing for datasets.

Extracted the code for dataset table handler

Extracted the long-running task for datasets

Extracted the subscription service into datasets

Extracted the handler to get table columns

Needed for migration for modules

Fixed tests and added new for dataset module

Glossaries had different target types and had to be treated differently

Created API for glossaries to use modularization

Added and fixed tests

Moved FeedRegistry to gql since it's a more appropriate place for it. Started using the registry to provide types. Renaming and small fixes.

Solve circular dependency for redshift. It should go away after the migration of redshift
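
The feed registry mentioned in the notes above (a registry that provides the types for feeds) can be pictured with a small sketch. This is an assumption-laden illustration, not the code from this PR: `FeedDefinition`, the method names, and the placeholder model classes are all hypothetical. The idea is that each module registers its own models, so the core code no longer has to import them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Type


@dataclass(frozen=True)
class FeedDefinition:
    """Hypothetical pairing of a feed target type name with its model class."""
    target_type: str
    model: Type


class FeedRegistry:
    """Sketch of a registry that modules fill in when they are loaded."""
    _definitions: Dict[str, FeedDefinition] = {}

    @classmethod
    def register(cls, definition: FeedDefinition) -> None:
        cls._definitions[definition.target_type] = definition

    @classmethod
    def find_model(cls, target_type: str) -> Type:
        return cls._definitions[target_type].model

    @classmethod
    def find_target(cls, obj) -> Optional[str]:
        # Resolve the type name for an object, as a union type resolver would.
        for definition in cls._definitions.values():
            if isinstance(obj, definition.model):
                return definition.target_type
        return None

    @classmethod
    def types(cls) -> List[str]:
        # Names that could be used to build a GraphQL union at runtime.
        return [d.target_type for d in cls._definitions.values()]


# Placeholder models standing in for the real ORM classes.
class DatasetTable:
    pass


class DatasetTableColumn:
    pass


# A module would run this at import time.
FeedRegistry.register(FeedDefinition("DatasetTable", DatasetTable))
FeedRegistry.register(FeedDefinition("DatasetTableColumn", DatasetTableColumn))
```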

dbalintx · 2023-05-05T13:34:12Z

backend/dataall/modules/datasets/services/dataset_service.py

@@ -623,3 +619,82 @@ def count_dataset_tables(session, dataset_uri):
            .filter(DatasetTable.datasetUri == dataset_uri)
            .count()
        )
+
+    @staticmethod
+    def query_environment_group_datasets(session, envUri, groupUri, filter) -> Query:


These query functions should be in dataset_repository (following your design pattern), but as we discussed this could also be in a followup :)

Yes, I will extract these methods after migrating the sharing.

backend/dataall/api/Objects/Stack/stack_helper.py

dlpzx · 2023-05-08T09:51:06Z

backend/dataall/modules/datasets/tasks/bucket_policy_updater.py

Unrelated to modularization: this task needs to be reviewed. It is triggered every 15 mins and it was originally conceived to add statements to the pre-existing bucket policy of imported datasets. Through this task we can enforce more restrictive access to imported buckets

Yes, it looks strange. The task runs every 15 mins to update policies for imported datasets. Most of the time it will do nothing, but just incur cost for running.
I wonder why we can't update the policies during the importing of datasets?

backend/dataall/modules/datasets/aws/lf_table_client.py

backend/dataall/modules/datasets/handlers/glue_column_handler.py

backend/dataall/modules/datasets/handlers/glue_dataset_handler.py

backend/dataall/modules/datasets/handlers/glue_table_handler.py

backend/dataall/modules/datasets/handlers/sns_dataset_handler.py

backend/dataall/modules/datasets/handlers/s3_location_handler.py

backend/dataall/api/Objects/Environment/resolvers.py

backend/dataall/aws/handlers/glue.py

dlpzx · 2023-05-08T12:05:18Z

backend/dataall/cdkproxy/stacks/policies/data_policy.py

@@ -86,11 +84,14 @@ def generate_admins_data_access_policy(self) -> iam.Policy:

        return policy

-    def generate_data_access_policy(self) -> iam.Policy:
+    def generate_data_access_policy(self, session) -> iam.Policy:
        """


why do we need to pass the session? I see now, you are using the get_statements defined in the subclass! got it
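
For readers following this thread, the pattern being discussed can be sketched as below: the base class builds the policy and delegates to a `get_statements` hook that the subclass implements, which is why the session has to be passed through. This is only a sketch under assumptions: it returns a plain policy document instead of an `iam.Policy`, and `FakeSession`, `list_bucket_names`, and the constructor argument are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataPolicy(ABC):
    """Sketch of a base policy; subclasses provide the statements."""

    def __init__(self, role_name: str):
        self.role_name = role_name

    def generate_data_access_policy(self, session) -> Dict[str, Any]:
        # The session is only forwarded to the hook; the base class
        # itself does not query anything.
        statements = self.get_statements(session)
        return {"Version": "2012-10-17", "Statement": statements}

    @abstractmethod
    def get_statements(self, session) -> List[Dict[str, Any]]:
        """Return the IAM statements for this kind of resource."""


class DatasetDataPolicy(DataPolicy):
    def get_statements(self, session) -> List[Dict[str, Any]]:
        # Hypothetical lookup of the buckets the role may access.
        buckets = session.list_bucket_names()
        resources = []
        for name in buckets:
            resources.append(f"arn:aws:s3:::{name}")
            resources.append(f"arn:aws:s3:::{name}/*")
        return [{"Effect": "Allow", "Action": ["s3:List*", "s3:Get*"], "Resource": resources}]


class FakeSession:
    """Stand-in for a database session, for this sketch only."""

    def list_bucket_names(self):
        return ["imported-bucket"]
```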

backend/dataall/db/api/share_object.py

backend/dataall/modules/datasets/cdk/dataset_data_policy.py

dlpzx · 2023-05-08T13:24:43Z

tests/api/test_dataset_location.py

@@ -70,10 +73,8 @@ def test_get_dataset(client, dataset1, env1, user, group):




What is the advantage of using MagicMock?

Honestly? It just works :)
I couldn't find a better way to mock an instance of a class. I tried a few other ways to patch an instance, but they all ended in FAILED. This one worked :)
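
For context, one common way to mock an instance with `MagicMock` is to replace the class itself with a mock whose return value is the mocked instance. The sketch below is not the test from this PR: `S3Client`, `add_location`, and the region string are hypothetical. It patches the module globals with `patch.dict` so that it runs as a single file; in a real test suite one would more likely target the import path with `patch("package.module.S3Client")`.

```python
from unittest.mock import MagicMock, patch


class S3Client:
    """Hypothetical client; the real method would call AWS."""

    def __init__(self, region):
        self.region = region

    def create_prefix(self, bucket, prefix):
        raise RuntimeError("would call AWS")


def add_location(bucket, prefix):
    # Code under test: it creates its own client instance internally.
    client = S3Client("eu-west-1")
    return client.create_prefix(bucket, prefix)


def test_add_location():
    # Mocked instance, restricted to the attributes S3Client defines.
    instance = MagicMock(spec=S3Client)
    instance.create_prefix.return_value = True

    # Mocked class: calling it returns the mocked instance above.
    client_class = MagicMock(return_value=instance)

    with patch.dict(globals(), {"S3Client": client_class}):
        assert add_location("bucket", "raw/") is True

    client_class.assert_called_once_with("eu-west-1")
    instance.create_prefix.assert_called_once_with("bucket", "raw/")
```

The benefit over a hand-written fake is that the mock records its calls, so the test can assert both how the instance was constructed and how it was used.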

dlpzx

thank you!

…ronment stack

nikpodsh · 2023-05-09T09:47:20Z

Added new commits:
Most of them address review remarks, but I also restored the alarm triggering as we discussed (and added a line that triggers the alarm instead of only creating a message) @dlpzx
Extracted DatasetRole (kudos to @dbalintx for noticing it)
Created a way to extend EnvironmentSetup (which is an environment stack) and extracted the glue profiler since it looks related to datasets. I didn't do that for the others, like lakeformationdefaultsettings and gluedatabasecustomresource, since I wasn't sure whether they are only dataset-specific.


nikpodsh added 30 commits April 11, 2023 11:50

Initialization of dataset module

3a5e0de

Refactoring of datasets

a50a02f

Moved dataset table column to modules

Refactoring of datasets

be14986

Renamed table column to the python's convention format

Refactoring of datasets

06f82ad

Added dataset module to config.json

Fixed leftover in loader

38145ae

Dataset refactoring

f0e146a

Moved database table service

Dataset refactoring

b039163

Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable

Dataset refactoring

b7922ed

Moved DatasetTableColumn to modules

Notebooks doesn't require tasks

1771bca

Renamed tasks to handlers

3d1603f

Currently, only async handlers require dedicated loading. Long-running tasks (scheduled tasks) might not need to have a dedicated loading mode

Dataset refactoring

fb6b515

Extracted code from glue to glue_column_handler. Added handlers importing for datasets.

Dataset refactoring

e3596a5

Extracted the code for dataset table handler

Dataset refactoring

3af2ecf

Extracted the long-running task for datasets

Dataset refactoring

1a063b2

Extracted the subscription service into datasets

Dataset refactoring

b733714

Extracted the handler to get table columns

Extracted feed registry

2a4e2e0

Needed for migration for modules

Extracted feed and glossary registry and created a model registry

c15d090

Dataset refactoring

052a2b1

Fixed tests and added new for dataset module

Fixed and unignored test_tables_sync

d984483

Split model registry into feed and glossaries

dc0c935

Glossaries had different target types and had to be treated differently

Abstraction for glossaries

727e353

Created API for glossaries to use modularization

Fixed leftovers

49fbb41

Datasets refactoring

7d029e7

Added and fixed tests

Added runtime type registration for Union GraphQL type

be527eb

Changed Feed type registration mechanism

3daf2aa

Moved FeedRegistry to gql since it's a more appropriate place for it. Started using the registry to provide types. Renaming and small fixes.

Added TODO for future refactoring

db3bfd3

Solve circular dependency for redshift. It should go away after the migration of redshift

Added GlossaryRegistry for Union scheme

13b6e92

Changed import in redshift module

144dfea

No need for Utils yet

d43b9b3

Fixed linting

39b244c

dbalintx reviewed May 5, 2023
