modularization: Datasets modularization pt.2 #432
Conversation
Moved dataset table column to modules
Renamed table column to follow the Python naming convention
Added dataset module to config.json
Moved database table service
Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable
Moved DatasetTableColumn to modules
Currently, only async handlers require dedicated loading. Long-running (scheduled) tasks might not need a dedicated loading mode.
Extracted code from glue to glue_column_handler
Added handler imports for datasets
Extracted the code for dataset table handler
Extracted the long-running task for datasets
Extracted the subscription service into datasets
Extracted the handler to get table columns
Needed for the migration to modules
Fixed tests and added new ones for the dataset module
Glossaries had different target types and had to be treated differently
Created API for glossaries to use modularization
Added and fixed tests
Moved FeedRegistry to gql since it is a more appropriate place for it
Started using the registry to provide types
Renaming and small fixes
Solved the circular dependency for Redshift. It should go away after the migration of Redshift.
Moved dataset profiling to the datasets module
Renamed profiling
Renamed table_column_model to models to make importing other models easier
Moved the DatasetProfilingRun model
Moved the dataset profiling service and renamed it
Extracted glue_profiling_handler
Deleted DatasetTableProfilingJob since no usage of it could be found
Restored the model's name after renaming the service
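The loading notes above (a dataset module registered in config.json, async handlers needing dedicated loading) could be sketched roughly as follows. All names here (ImportMode, modules_to_import, the config shape) are hypothetical illustrations, not data.all's actual loader API:

```python
from enum import Enum, auto

# Hypothetical sketch: not data.all's real loader API.
class ImportMode(Enum):
    API = auto()       # GraphQL API definitions
    HANDLERS = auto()  # async handlers: the one mode that needs dedicated loading
    TASKS = auto()     # long-running scheduled tasks

# A config.json-style structure listing which modules are active.
CONFIG = {"modules": {"datasets": {"active": True}}}

def modules_to_import(modes, config=CONFIG, prefix="dataall.modules"):
    """Return the dotted paths a loader would import for the given modes."""
    paths = []
    for name, settings in config["modules"].items():
        if not settings.get("active"):
            continue  # inactive modules are skipped entirely
        for mode in modes:
            # e.g. dataall.modules.datasets.api, ...handlers, ...tasks
            paths.append(f"{prefix}.{name}.{mode.name.lower()}")
    return paths
```

A real loader would then import each returned path; keeping the mode explicit lets the async-handler entry point load only what it needs.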
@@ -1,6 +1,7 @@
"""The GraphQL schema of datasets and related functionality"""
from dataall.modules.datasets.api import (
    table_column
Why did we change the naming convention? The api GraphQL Objects were capitalized before, right?
That's correct, and if you think it should remain the existing way, please let me know.
I thought this modularization would be a good opportunity to start following the Python convention (underscores instead of capitalization).
But as I said, if you feel it's not correct (or capitalization is a unique style of data.all :)) I will revert it.
Super small comment - I think we could standardize the names further: api.profiling, models.db.DatasetProfilingRun, services.dataset_profiling. For the handler, I think we should have a Glue client in commons and a profiling_handler. What do you think?
About the naming: I wrote these long names, like dataset_profiling_service, because they make it easier to navigate across files and they avoid name clashes in the IDE when a lot of files are open. The downside is that the names get longer. If you think it would be more beneficial to have something like services.profiling rather than services.profiling_service (and the same for the rest), I will rename them.
About the handler: No, unfortunately we can't keep Glue in commons, because it depends on the models of different modules. The only way to solve this is to split it and move the module-related glue code into the modules. Otherwise, we will have circular imports.
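The split described in this comment could look roughly like the sketch below. All file paths and names here are hypothetical, chosen only to illustrate why the dependency must point from the module to commons and never the other way:

```python
# Hypothetical sketch of splitting glue code to avoid circular imports:
# a model-agnostic Glue client stays in commons, while model-aware glue
# code lives inside the module, so commons never imports module models.

# --- commons/aws/glue_client.py (knows nothing about module models) ---
class GlueClient:
    def get_table_columns(self, database: str, table: str):
        # Real code would call the AWS Glue API via boto3;
        # the returned shape here is illustrative only.
        return [{"Name": "label", "Type": "string"}]

# --- modules/datasets/handlers/glue_column_handler.py ---
class DatasetTableColumn:
    """Stand-in for the module's own models.DatasetTableColumn."""
    def __init__(self, name, typ):
        self.name, self.typ = name, typ

def sync_table_columns(client, database, table):
    # The module maps raw Glue columns onto its own models; the import
    # dependency points module -> commons, never commons -> module.
    return [DatasetTableColumn(c["Name"], c["Type"])
            for c in client.get_table_columns(database, table)]
```

With this shape, commons can be imported by every module without knowing any of them, which is exactly what breaks the cycle.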
I am adding the description of the PR here just to check that I have a clear view of it: In this PR we are extracting the profiling piece from Datasets. The profiling feature uses the Glue profiling job that is deployed as part of the Dataset stack. Users trigger a run of this job from the UI "profiling" tab and the results are plotted in the UI.
Profiling is one feature that I see customers enabling/disabling for Datasets. If they disable it, the associated Glue job deployed in the Dataset stack should also not be deployed. But we can think about that once the Dataset modularization is complete.
Thanks for that :)
I thought about it this way: why do we need to separate them?
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- data-dot-all#295 and data-dot-all#412

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.