modularization: Dataset Modularization pt.1 #413
Conversation
Moved dataset table column to modules
Renamed table column to follow Python's naming convention
Added dataset module to config.json
Moved database table service
Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable
Moved DatasetTableColumn to modules
Currently, only async handlers require dedicated loading. Long-running tasks (scheduled tasks) might not need to have a dedicated loading mode
Extracted code from glue to glue_column_handler; added handlers importing for datasets
Extracted the code for dataset table handler
Extracted the long-running task for datasets
Extracted the subscription service into datasets
Extracted the handler to get table columns
Needed for the migration of modules
Fixed tests and added new for dataset module
Glossaries had different target types and had to be treated differently
Created API for glossaries to use modularization
Added and fixed tests
Moved FeedRegistry to gql since it's a more appropriate place for it; started using the registry to provide types; renaming and small fixes
Solved circular dependency for redshift. It should go away after the migration of redshift
@@ -174,12 +175,13 @@ def __init__(
        update_bucket_policies_task.task.security_groups
    )
This is a #TODO, but I think it is not that complicated to add or read parameters. My concern is: how do we design something that scales? Are we going to have deployment changes/architectural components that change or are added based on the modules selected? Brainstorming now, we have two types of features:
a) Features that add functionalities to the pipeline or the whole architecture (e.g. tooling networking, pivotrole)
b) Features that modify or add modules (shared_quicksight_dashboards)
With this, we have a couple of options (the second one is sketched after this list):
- Current: cdk.json (for a and b)
- cdk.json (for a) + config.json (for b)
- cdk.json with a section for modules (b)
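A hypothetical illustration of the second option: config.json owns the module switches (b) while cdk.json keeps the infrastructure-wide flags (a). The file layout and module names below are assumptions, not the final design.

```python
# Hypothetical config.json (assumed shape, not the final design):
# {"modules": {"datasets": {"active": true}, "notebooks": {"active": false}}}
import json

with open("config.json") as f:
    config = json.load(f)

# Collect the modules whose switch is on.
active_modules = [
    name
    for name, settings in config.get("modules", {}).items()
    if settings.get("active")
]
print(f"Active modules: {active_modules}")
```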
Yes, it shouldn't be difficult to implement, but I'd rather implement it in a dedicated PR after migrating the other dataset/sharing tasks
I think that we should not replicate changes between config.json and cdk.json; otherwise it'd be a mess of configuration. So, I suggest placing all information about modules (even if it can impact deployment) in config.json and consulting it during the deployment (see the sketch below).
Of course, this approach has its own drawbacks, but it should be easy to implement and, more importantly, easy to change if we decide to go with the other approach.
What do you think?
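A minimal sketch of what "consulting config.json during the deployment" could look like, building on the config shape assumed earlier; `is_module_active` and the guarded task are illustrative assumptions, not data.all's actual deployment code.

```python
import json
from pathlib import Path

def is_module_active(module_name: str) -> bool:
    # Single source of truth: the same config.json the application reads.
    modules = json.loads(Path("config.json").read_text()).get("modules", {})
    return modules.get(module_name, {}).get("active", False)

# Deployment code would then guard module-owned resources behind this check:
if is_module_active("datasets"):
    # synthesize the datasets long-running task here
    pass
```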
@@ -18,3 +18,6 @@ class DatasetTableColumn(Resource, Base):
    columnType = Column(
        String, default='column'
    )  # can be either "column" or "partition"
+
+   def uri(self):
+       return self.columnUri
I like that the uri is a class method. Are we going to do this with other data.all objects? and then use them in permissions...
I created this method for Glossaries, but it could also be used in permissions in the future
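A self-contained sketch of the `uri()` convention being discussed; the classes are simplified stand-ins for the SQLAlchemy models, and `register_glossary_target` is a hypothetical consumer.

```python
# Simplified stand-ins for the models; each exposes a uniform uri().
class DatasetTableColumn:
    def __init__(self, column_uri: str):
        self.columnUri = column_uri

    def uri(self) -> str:
        return self.columnUri

class Dataset:
    def __init__(self, dataset_uri: str):
        self.datasetUri = dataset_uri

    def uri(self) -> str:
        return self.datasetUri

def register_glossary_target(target) -> str:
    # Generic code (glossaries, later maybe permissions) no longer needs
    # to know which *Uri attribute each model exposes.
    return target.uri()

print(register_glossary_target(DatasetTableColumn("col-123")))
print(register_glossary_target(Dataset("ds-456")))
```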
            f'Failed to update table column {column.name} description: {e}'
        )
        raise e
This method does not make much sense here. We have the glue_table and the glue_column handlers, and this is an lf_table function that sits inside the glue_column handler.
In total we have 2 functionalities:
- glue (tables and columns) schema handler
- lake formation permissions handler
Agree! I was planning to split the handlers code into two pieces: `handlers` and `clients`. Handlers is the code that's needed for `@Worker.handler`, and clients will contain the API to make calls to AWS. I wanted to do it in the refactoring PR, but since you pointed it out, I will do it here, at least for LF (see the sketch below).
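A runnable sketch of that handlers/clients split; `Worker`, the task shape, and the client are simplified stand-ins for data.all's real `@Worker.handler` machinery and the boto3 calls that would live in the client layer.

```python
from dataclasses import dataclass

# Stand-in for data.all's Worker: registers handler functions by path.
class Worker:
    handlers = {}

    @classmethod
    def handler(cls, path):
        def decorator(fn):
            cls.handlers[path] = fn
            return fn
        return decorator

@dataclass
class Task:
    database: str
    table: str
    principal: str

# clients: the only layer that would talk to AWS (boto3 calls would live here)
class LakeFormationClient:
    def __init__(self, database: str):
        self.database = database

    def grant_table_select(self, principal: str, table: str) -> None:
        print(f"GRANT SELECT ON {self.database}.{table} TO {principal}")

# handlers: just Worker wiring, no AWS details
@Worker.handler(path='lakeformation.grant_table')
def grant_table(task: Task) -> None:
    LakeFormationClient(task.database).grant_table_select(task.principal, task.table)

Worker.handlers['lakeformation.grant_table'](Task('db1', 'tbl1', 'analyst-role'))
```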
@@ -224,7 +224,7 @@ def get_dataset_table_by_uri(session, table_uri):
        return table

    @staticmethod
-   def sync(session, datasetUri, glue_tables=None):
+   def sync_existing_tables(session, datasetUri, glue_tables=None):
:)
Tried my best :)
TASKS = "tasks" | ||
API = auto() | ||
CDK = auto() | ||
HANDLERS = auto() | ||
|
I like this, it defaults to the lowercase member name, right?
It's used to apply default values to enums. It's often seen when the exact value is not important
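For illustration, `auto()` yields the lowercase member name once `_generate_next_value_` is overridden; whether `ImportMode` does exactly this is an assumption, but it matches the lowercase behavior discussed above.

```python
from enum import Enum, auto

class ImportMode(Enum):
    # Must be defined before any members; makes auto() return the
    # lowercase member name instead of an integer.
    def _generate_next_value_(name, start, count, last_values):
        return name.lower()

    TASKS = "tasks"
    API = auto()
    CDK = auto()
    HANDLERS = auto()

print(ImportMode.API.value)       # "api"
print(ImportMode.HANDLERS.value)  # "handlers"
```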
@@ -57,7 +57,7 @@ def load_modules(modes: List[ImportMode]) -> None:
            log.info(f"Module {name} is not active. Skipping...")
            continue
active does not need to be checked here because we already check it on line 56
We will always have a loader per module, right?
No, we have only one loader that loads all modules sequentially, depending on the mode.
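A condensed, runnable sketch of that single loader; the config shape and the import path convention (`dataall.modules.<name>.<mode>`) are assumptions based on the snippets above.

```python
import importlib
import logging
from enum import Enum, auto
from typing import List

log = logging.getLogger(__name__)

class ImportMode(Enum):
    API = auto()
    CDK = auto()
    HANDLERS = auto()

MODULES = {  # normally read from config.json
    "datasets": {"active": True},
    "notebooks": {"active": False},
}

def load_modules(modes: List[ImportMode]) -> None:
    # One loader for all modules: skip inactive ones, then import each
    # active module once per requested mode.
    for name, settings in MODULES.items():
        if not settings["active"]:
            log.info(f"Module {name} is not active. Skipping...")
            continue
        for mode in modes:
            # e.g. imports dataall.modules.datasets.api, .cdk, or .handlers
            importlib.import_module(f"dataall.modules.{name}.{mode.name.lower()}")
```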
### Feature or Bugfix
- Refactoring (Modularization)

### Relates
- Related issues data-dot-all#295 and data-dot-all#412

### Short Summary
First part of the migration of `Dataset` (`DatasetTableColumn`). TL;DR :)

### Long Summary
Datasets are huge. It's one of the central modules, spread everywhere across the application. Migrating the entire Dataset piece would be a very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break this work down into "small" steps to make it more convenient to review.
Dataset's API consists of the following items:
* `Dataset`
* `DatasetTable`
* `DatasetTableColumn`
* `DatasetLocation`
* `DatasetProfiling`
In this PR, there is only the creation of the `Dataset` module and the migration of `DatasetTableColumn` (and some pieces related to it). Why? Because the plan was to migrate it, see what issues came up along the way, and address them here. The refactoring of `DatasetTableColumn` will be in another PR.

The issues:
1) Glossaries
2) Feed
3) Long tasks for datasets
4) Redshift
Glossaries rely on a GraphQL UNION of different types (including datasets). Created an abstraction for glossary registration, roughly like the sketch below. There was an idea to change the frontend instead, but it'd require a lot of work.
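A hedged sketch of what such a registration abstraction might look like; the names (`GlossaryRegistry`, `Definition`) and shapes are illustrative, not the exact data.all API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Definition:
    target_type: str          # discriminator used by the GraphQL UNION
    find_by_uri: Callable[[str], object]  # module-provided lookup

class GlossaryRegistry:
    _definitions: Dict[str, Definition] = {}

    @classmethod
    def register(cls, definition: Definition) -> None:
        # Each module registers its own target type at load time, so the
        # glossary code never imports module models directly.
        cls._definitions[definition.target_type] = definition

    @classmethod
    def find_object(cls, target_type: str, uri: str):
        return cls._definitions[target_type].find_by_uri(uri)

# The datasets module would register itself when it is loaded:
GlossaryRegistry.register(Definition(
    target_type="DatasetTableColumn",
    find_by_uri=lambda uri: {"columnUri": uri},  # stand-in for a DB lookup
))

print(GlossaryRegistry.find_object("DatasetTableColumn", "col-123"))
```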
Feed: same as glossaries. Solved in a similar way. For feed, changing the frontend API is more feasible, but I wanted it to be consistent with glossaries.
Long tasks for datasets: they migrated into the tasks folder and don't require dedicated loading of their code (at least for now). But there are two concerns:
1) The deployment uses direct module folder references to run them (e.g. `dataall.modules.datasets....`), so when a module is deactivated, we shouldn't deploy its tasks either. I left a TODO to address this in the future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies.
2) There is a reference to `redshift` from the long-running tasks = should be addressed in the `redshift` module.

Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely heavily on `datasets`) = will be addressed in the `redshift` module.

Other changes:
* Fixed and improved some tests
* Extracted glue handler code related to `DatasetTableColumn`
* Renamed import mode from tasks to handlers for the async lambda
* A few hacks that will go away with the next refactoring :)
Next steps:
* [Part2](nikpodsh#1) in preview :)
* Extract the rest of the datasets functionality (perhaps in a few steps)
* Refactor extractor modules the same way as notebooks
* Extract tests to follow the same structure
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.