Dataset Modularization pt.2 #1

nikpodsh · 2023-04-14T14:12:34Z

Feature or Bugfix

Refactoring

Detail

Small refactoring of DatasetProfilingRun and moving it to the dataset module

Relates

data-dot-all#413, data-dot-all#295, and data-dot-all#412

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Moving datasets profiling to datasets modules

Renaming profiling

Renaming table_column_model to models to easier import other models

Moving DatasetProfilingRun model

Moving dataset profiling service and renaming it

Extracted glue_profiling_handler

Deleted DatasetTableProfilingJob since could not find any usage of it

Returned the name to model after renaming the service

### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues data-dot-all#295 and data-dot-all#412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues data-dot-all#295 and data-dot-all#412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](nikpodsh#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nikpodsh added 7 commits April 14, 2023 15:24

Datasets refactoring

cb3800a

Moving datasets profiling to datasets modules

Datasets refactoring

dd8e597

Renaming profiling

Datasets refactoring

8ca7bea

Renaming table_column_model to models to easier import other models

Datasets refactoring

e36ab3b

Moving DatasetProfilingRun model

Datasets refactoring

31720c2

Moving dataset profiling service and renaming it

Datasets refactoring

8a907df

Extracted glue_profiling_handler

Datasets refactoring

561da72

Deleted DatasetTableProfilingJob since could not find any usage of it

nikpodsh mentioned this pull request Apr 14, 2023

modularization: Dataset Modularization pt.1 data-dot-all/dataall#413

Merged

nikpodsh and others added 3 commits April 17, 2023 10:37

Datasets refactoring

47a38cc

Returned the name to model after renaming the service

Resolved code conflict

2ac3ae7

nikpodsh closed this May 4, 2023

nikpodsh deleted the datasets-mod-part2 branch May 4, 2023 12:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset Modularization pt.2 #1

Dataset Modularization pt.2 #1

nikpodsh commented Apr 14, 2023

Dataset Modularization pt.2 #1

Dataset Modularization pt.2 #1

Conversation

nikpodsh commented Apr 14, 2023

Feature or Bugfix

Detail

Relates