
modularization: Dataset Modularization pt.1 #413

Merged
merged 32 commits into data-dot-all:modularization-main
Apr 24, 2023

Conversation

Contributor

@nikpodsh commented Apr 13, 2023

Feature or Bugfix

  • Refactoring (Modularization)

Relates

  • Related issues data-dot-all#295 and data-dot-all#412

Short Summary

First part of migration of Dataset (DatasetTableColumn) TL;DR :)

Long Summary

Datasets are huge. It's one of the central modules, spread everywhere across the application. Migrating the entire Dataset piece would be a very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break this work down into "small" steps to make it more convenient to review.
Dataset's API consists of the following items:

  • Dataset
  • DatasetTable
  • DatasetTableColumn
  • DatasetLocation
  • DatasetProfiling

In this PR, there is only the creation of the Dataset module and the migration of DatasetTableColumn (and some pieces related to it). Why? Because the plan was to migrate it, see what issues would come up along the way, and address them here. The refactoring of DatasetTableColumn will be in another PR.
The issues:

  1. Glossaries
  2. Feed
  3. Long tasks for datasets
  4. Redshift

Glossaries rely on a GraphQL UNION of different types (including datasets). I created an abstraction for glossary registration. There was an idea to change the frontend instead, but that would require a lot of work.
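For illustration, a rough sketch of what such a registration abstraction could look like (the class and method names below are assumptions, not necessarily what this PR ships): modules register their types with a central registry, so the core glossary code no longer needs to import dataset models directly.

```python
# Hypothetical sketch of a glossary-target registry; the real names in the PR may differ.
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class GlossaryDefinition:
    target_type: str   # e.g. "DatasetTableColumn"
    object_type: str   # GraphQL type name used in the UNION
    model: Type        # model class that resolves the target


class GlossaryRegistry:
    """Core glossary code resolves targets through this registry instead of importing modules."""
    _definitions: Dict[str, GlossaryDefinition] = {}

    @classmethod
    def register(cls, definition: GlossaryDefinition) -> None:
        cls._definitions[definition.target_type] = definition

    @classmethod
    def find_model(cls, target_type: str) -> Type:
        return cls._definitions[target_type].model


# A module would register its own types when it is loaded, e.g.:
# GlossaryRegistry.register(GlossaryDefinition("DatasetTableColumn", "DatasetTableColumn", DatasetTableColumn))
```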

Feed: same as glossaries, solved in a similar way. For Feed, changing the frontend API is more feasible, but I wanted it to be consistent with glossaries.

Long-running tasks for datasets: they were migrated into the tasks folder and don't require dedicated loading of their code (at least for now). But there are two concerns:

  1. The deployment uses direct module folder references to run them (e.g. dataall.modules.datasets...., so when a module is deactivated, we shouldn't deploy these tasks either). I left a TODO to address this in the future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies.
  2. There is a reference to redshift from the long-running tasks; this should be addressed in the redshift module.

Redshift: it has some references to datasets, so there will be either dependencies among modules or a small amount of code duplication (if redshift doesn't rely heavily on datasets); this will be addressed in the redshift module.

Other changes:
Fixed and improved some tests
Extracted the glue handler code related to DatasetTableColumn
Renamed the import mode from tasks to handlers for the async lambda.
A few hacks that will go away with the next refactoring :)

Next steps:
Part2 in preview :)
Extract the rest of the datasets functionality (perhaps in a few steps)
Refactor the extracted modules the same way as notebooks
Extract tests to follow the same structure.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Moved dataset table column to modules
Renamed table column to follow Python's naming convention
Added dataset module to config.json
Moved the database table service
Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable
Moved DatasetTableColumn to modules
Currently, only async handlers require dedicated loading. Long-running (scheduled) tasks might not need a dedicated loading mode
Extracted code from glue to glue_column_handler
Added handler imports for datasets
Extracted the code for the dataset table handler
Extracted the long-running task for datasets
Extracted the subscription service into datasets
Extracted the handler to get table columns
Needed for the migration of modules
Fixed tests and added new ones for the dataset module
Glossaries had different target types and had to be treated differently
Created an API for glossaries to use modularization
Added and fixed tests
Moved FeedRegistry to gql since it's a more appropriate place for it
Started using the registry to provide types
Renaming and small fixes
Solved the circular dependency for redshift. It should go away after the migration of redshift
@@ -174,12 +175,13 @@ def __init__(
update_bucket_policies_task.task.security_groups
)

@dlpzx (Contributor), Apr 17, 2023:

This is a #TODO, but I think it is not that complicated to add or read parameters. My concern is: how do we design something that scales? Are we going to have deployment changes/architectural components that change or are added based on the modules selected? Brainstorming now, we have two types of features:

a) Features that add functionalities to the pipeline or the whole architecture (e.g. tooling networking, pivotrole)
b) Features that modify or add modules (shared_quicksight_dashboards)

With this we have a couple of options.

  1. Current: cdk.json (for a and b)
  2. cdk.json (for a) + config.json (for b)
  3. cdk.json with a section for modules (b)

Contributor Author:

Yes, it shouldn't be difficult to implement, but I'd rather do it in a dedicated PR after migrating the other dataset/sharing tasks.

I think we should not replicate changes between config.json and cdk.json, otherwise it'd be a mess of configuration. So, I suggest placing all information about modules (even if it can impact deployment) in config.json and consulting it during deployment.
Of course, this approach has its own drawbacks, but it should be easy to implement and, more importantly, easy to change if we decide to go with the other approach.
What do you think?
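To make the suggestion concrete, here is a minimal sketch of consulting a modules section in config.json at deployment time (the schema and helper below are assumptions for illustration, not the shipped code):

```python
# Hypothetical example: the deployment reads config.json and skips the tasks/stacks
# of inactive modules. The real config.json schema may differ.
import json
from pathlib import Path
from typing import List

# Assumed shape:
# {"modules": {"notebooks": {"active": true}, "datasets": {"active": true}}}

def get_active_modules(config_path: str = "config.json") -> List[str]:
    config = json.loads(Path(config_path).read_text())
    return [name for name, props in config.get("modules", {}).items() if props.get("active")]


# The CDK app could then decide what to deploy, e.g.:
# if "datasets" in get_active_modules():
#     add_dataset_long_running_task_stacks(app)
```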

@@ -18,3 +18,6 @@ class DatasetTableColumn(Resource, Base):
columnType = Column(
String, default='column'
) # can be either "column" or "partition"

def uri(self):
return self.columnUri
Contributor:

I like that the uri is a class method. Are we going to do this with other data.all objects, and then use them in permissions?

Contributor Author:

I created this method for Glossaries, but it could also be used in permissions in the future.
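As a rough illustration of the permissions idea (everything below is hypothetical, not existing data.all code): any object exposing uri() can be handled generically, without the caller knowing which column stores the URI.

```python
# Hypothetical sketch: resources expose uri(), so a generic permission check
# does not need to know about columnUri, datasetUri, etc.
class FakeColumn:
    def __init__(self, columnUri: str):
        self.columnUri = columnUri

    def uri(self) -> str:
        return self.columnUri


def check_resource_permission(username: str, resource, permission_name: str) -> None:
    # A real implementation would look up resource policies for resource.uri();
    # here we only show that the URI is resolved generically.
    print(f"checking {permission_name} for {username} on {resource.uri()}")


check_resource_permission("alice", FakeColumn("uri-123"), "UPDATE_COLUMN_DESCRIPTION")
```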

f'Failed to update table column {column.name} description: {e}'
)
raise e

Contributor:

This method does not make much sense here. We have the glue_table and glue_column handlers, and this is an lf_table function that sits inside the glue_column handler.

In total we have two functionalities:

  • glue (tables and columns) schema handler
  • lake formation permissions handler

Contributor Author:

Agree! I was planning to split the handlers code into two pieces: handlers and clients. Handlers is the code that's needed for @Worker.handler, and clients will contain the API for making calls to AWS. I wanted to do it in the refactoring PR, but since you pointed it out, I will do it here, at least for LF.
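Roughly what that handler/client split could look like (a sketch with assumed names; only @Worker.handler comes from the existing code, and the Lake Formation grant shown is illustrative):

```python
# Hypothetical sketch of the handlers/clients split. The client wraps AWS SDK calls;
# the handler (registered with @Worker.handler in data.all) stays thin and delegates.
import boto3


class LakeFormationClient:
    """Illustrative client wrapping calls to AWS Lake Formation."""

    def __init__(self, region: str):
        self._client = boto3.client("lakeformation", region_name=region)

    def grant_column_select(self, principal_arn: str, database: str, table: str, columns):
        # Illustrative grant; real parameters depend on the permission being shared.
        return self._client.grant_permissions(
            Principal={"DataLakePrincipalIdentifier": principal_arn},
            Resource={"TableWithColumns": {
                "DatabaseName": database, "Name": table, "ColumnNames": list(columns)}},
            Permissions=["SELECT"],
        )


# The handler would then only unpack the task and call the client, e.g.:
# @Worker.handler("lakeformation.column.grant")
# def grant_table_column_access(engine, task):
#     client = LakeFormationClient(task.payload["region"])
#     return client.grant_column_select(task.payload["principal"], ...)
```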

@@ -224,7 +224,7 @@ def get_dataset_table_by_uri(session, table_uri):
return table

@staticmethod
def sync(session, datasetUri, glue_tables=None):
def sync_existing_tables(session, datasetUri, glue_tables=None):
Contributor:

:)

Contributor Author:

Tried my best :)

TASKS = "tasks"
API = auto()
CDK = auto()
HANDLERS = auto()

Contributor:

I like this, it defaults to the lowercase name, right?

Contributor Author:

auto() is used to apply default values to enum members. It's often seen when the exact value is not important.
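For reference, a small standalone example of how auto() can yield lowercase string values when _generate_next_value_ is overridden (whether ImportMode does exactly this isn't visible in this diff):

```python
from enum import Enum, auto


class ImportMode(Enum):
    # auto() calls _generate_next_value_; overriding it here makes the
    # generated value the lowercased member name.
    def _generate_next_value_(name, start, count, last_values):
        return name.lower()

    TASKS = "tasks"
    API = auto()
    CDK = auto()
    HANDLERS = auto()


assert ImportMode.API.value == "api"
assert ImportMode.HANDLERS.value == "handlers"
```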

@@ -57,7 +57,7 @@ def load_modules(modes: List[ImportMode]) -> None:
log.info(f"Module {name} is not active. Skipping...")
continue

Contributor:

active does not need to be checked here because we already check it on line 56

Contributor:

We will always have a loader per module right?

Contributor Author:

No, we have only one loader that loads all the modules one after another, depending on the mode.
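A condensed sketch of that single-loader idea (the config shape and module path layout below are assumptions for illustration):

```python
# Hypothetical sketch: one loader walks the configured modules and imports the
# sub-packages that match the requested ImportMode(s). Real data.all code differs.
import importlib
from enum import Enum, auto
from typing import Dict, List


class ImportMode(Enum):
    API = auto()
    CDK = auto()
    HANDLERS = auto()
    TASKS = auto()


def load_modules(modes: List[ImportMode], modules_config: Dict[str, dict]) -> None:
    for name, props in modules_config.items():
        if not props.get("active"):
            print(f"Module {name} is not active. Skipping...")
            continue
        for mode in modes:
            # e.g. dataall.modules.datasets.handlers for ImportMode.HANDLERS;
            # the import fails loudly if the module does not provide that sub-package.
            importlib.import_module(f"dataall.modules.{name}.{mode.name.lower()}")


# load_modules([ImportMode.HANDLERS], {"datasets": {"active": True}})
```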

@nikpodsh merged commit 3c4ab2d into data-dot-all:modularization-main Apr 24, 2023
@dlpzx changed the title Dataset Modularization pt.1 → modularization: Dataset Modularization pt.1 May 24, 2023
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this pull request May 25, 2023