forked from data-dot-all/dataall
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset Modularization pt.2 #1
Closed
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Moving datasets profiling to datasets modules
Renaming profiling
Renaming table_column_model to models to easier import other models
Moving DatasetProfilingRun model
Moving dataset profiling service and renaming it
Extracted glue_profiling_handler
Deleted DatasetTableProfilingJob since could not find any usage of it
Returned the name to model after renaming the service
### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues data-dot-all#295 and data-dot-all#412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
dlpzx
pushed a commit
to dlpzx/aws-dataall
that referenced
this pull request
May 25, 2023
### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues data-dot-all#295 and data-dot-all#412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](nikpodsh#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Feature or Bugfix
Detail
Small refactoring of DatasetProfilingRun and moving it to the dataset module
Relates
data-dot-all#413, data-dot-all#295, and data-dot-all#412
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.