
modularization: Dataset Modularization pt.1 #413

Merged
merged 32 commits into data-dot-all:modularization-main
Apr 24, 2023

Conversation

Contributor

@nikpodsh commented Apr 13, 2023

Feature or Bugfix

  • Refactoring (Modularization)

Relates

  • Related issues data-dot-all#295 and data-dot-all#412

Short Summary

First part of migration of Dataset (DatasetTableColumn) TL;DR :)

Long Summary

Datasets are huge. It's one of the central modules, spread everywhere across the application. Migrating the entire Dataset piece would be a very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break this work down into "small" steps to make it more convenient to review.
Dataset's API consists of the following items:

  • Dataset
  • DatasetTable
  • DatasetTableColumn
  • DatasetLocation
  • DatasetProfiling

In this PR, there is only the creation of the Dataset module and the migration of DatasetTableColumn (and some pieces related to it). Why? Because the plan was to migrate it, see what issues would come up along the way, and address them here. The refactoring of DatasetTableColumn will be in another PR.
The issues:

  1. Glossaries
  2. Feed
  3. Long tasks for datasets
  4. Redshift

Glossaries rely on a GraphQL UNION of different types (including datasets). I created an abstraction for glossary registration. There was an idea to change the frontend instead, but that would require a lot of work.
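For illustration, a rough sketch of what such a registration abstraction could look like (the class and method names below are assumptions, not necessarily what this PR ships): modules register their types with a central registry, so the core glossary code no longer needs to import dataset models directly.

```python
# Hypothetical sketch of a glossary-target registry; the real names in the PR may differ.
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class GlossaryDefinition:
    target_type: str   # e.g. "DatasetTableColumn"
    object_type: str   # GraphQL type name used in the UNION
    model: Type        # model class that resolves the target


class GlossaryRegistry:
    """Core glossary code resolves targets through this registry instead of importing modules."""
    _definitions: Dict[str, GlossaryDefinition] = {}

    @classmethod
    def register(cls, definition: GlossaryDefinition) -> None:
        cls._definitions[definition.target_type] = definition

    @classmethod
    def find_model(cls, target_type: str) -> Type:
        return cls._definitions[target_type].model


# A module would register its own types when it is loaded, e.g.:
# GlossaryRegistry.register(GlossaryDefinition("DatasetTableColumn", "DatasetTableColumn", DatasetTableColumn))
```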

Feed: same as glossaries, solved in a similar way. For Feed, changing the frontend API is more feasible, but I wanted it to be consistent with glossaries.

Long-running tasks for datasets: they were migrated into the tasks folder and don't require dedicated loading of their code (at least for now). But there are two concerns:

  1. The deployment uses direct module folder references to run them (e.g. dataall.modules.datasets...., so when a module is deactivated, we shouldn't deploy these tasks either). I left a TODO to address this in the future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies.
  2. There is a reference to redshift from the long-running tasks; this should be addressed in the redshift module.

Redshift: it has some references to datasets, so there will be either dependencies among modules or a small amount of code duplication (if redshift doesn't rely heavily on datasets); this will be addressed in the redshift module.

Other changes:
Fixed and improved some tests
Extracted the glue handler code related to DatasetTableColumn
Renamed the import mode from tasks to handlers for the async lambda.
A few hacks that will go away with the next refactoring :)

Next steps:
Part2 in preview :)
Extract the rest of the datasets functionality (perhaps in a few steps)
Refactor the extracted modules the same way as notebooks
Extract tests to follow the same structure.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Moved dataset table column to modules
Renamed table column to follow Python's naming convention
Added dataset module to config.json
Moved the database table service
Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable
Moved DatasetTableColumn to modules
Currently, only async handlers require dedicated loading. Long-running (scheduled) tasks might not need a dedicated loading mode
Extracted code from glue to glue_column_handler
Added handler imports for datasets
Extracted the code for the dataset table handler
Extracted the long-running task for datasets
Extracted the subscription service into datasets
Extracted the handler to get table columns
Needed for the migration of modules
Fixed tests and added new ones for the dataset module
Glossaries had different target types and had to be treated differently
Created an API for glossaries to use modularization
Added and fixed tests
Moved FeedRegistry to gql since it's a more appropriate place for it
Started using the registry to provide types
Renaming and small fixes
Solved the circular dependency for redshift. It should go away after the migration of redshift
@@ -174,12 +175,13 @@ def __init__(
update_bucket_policies_task.task.security_groups
)

@dlpzx (Contributor), Apr 17, 2023:

This is a #TODO, but I think it is not that complicated to add or read parameters. My concern is: how do we design something that scales? Are we going to have deployment changes/architectural components that change or are added based on the modules selected? Brainstorming now, we have two types of features:

a) Features that add functionalities to the pipeline or the whole architecture (e.g. tooling networking, pivotrole)
b) Features that modify or add modules (shared_quicksight_dashboards)

With this we have a couple of options.

  1. Current: cdk.json (for a and b)
  2. cdk.json (for a) + config.json (for b)
  3. cdk.json with a section for modules (b)

Contributor Author:

Yes, it shouldn't be difficult to implement, but I'd rather do it in a dedicated PR after migrating the other dataset/sharing tasks.

I think we should not replicate changes between config.json and cdk.json, otherwise it'd be a mess of configuration. So, I suggest placing all information about modules (even if it can impact deployment) in config.json and consulting it during deployment.
Of course, this approach has its own drawbacks, but it should be easy to implement and, more importantly, easy to change if we decide to go with the other approach.
What do you think?
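To make the suggestion concrete, here is a minimal sketch of consulting a modules section in config.json at deployment time (the schema and helper below are assumptions for illustration, not the shipped code):

```python
# Hypothetical example: the deployment reads config.json and skips the tasks/stacks
# of inactive modules. The real config.json schema may differ.
import json
from pathlib import Path
from typing import List

# Assumed shape:
# {"modules": {"notebooks": {"active": true}, "datasets": {"active": true}}}

def get_active_modules(config_path: str = "config.json") -> List[str]:
    config = json.loads(Path(config_path).read_text())
    return [name for name, props in config.get("modules", {}).items() if props.get("active")]


# The CDK app could then decide what to deploy, e.g.:
# if "datasets" in get_active_modules():
#     add_dataset_long_running_task_stacks(app)
```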

@@ -18,3 +18,6 @@ class DatasetTableColumn(Resource, Base):
columnType = Column(
String, default='column'
) # can be either "column" or "partition"

def uri(self):
return self.columnUri
Contributor:

I like that the uri is a class method. Are we going to do this with other data.all objects, and then use them in permissions?

Contributor Author:

I created this method for Glossaries, but it could also be used in permissions in the future.
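As a rough illustration of the permissions idea (everything below is hypothetical, not existing data.all code): any object exposing uri() can be handled generically, without the caller knowing which column stores the URI.

```python
# Hypothetical sketch: resources expose uri(), so a generic permission check
# does not need to know about columnUri, datasetUri, etc.
class FakeColumn:
    def __init__(self, columnUri: str):
        self.columnUri = columnUri

    def uri(self) -> str:
        return self.columnUri


def check_resource_permission(username: str, resource, permission_name: str) -> None:
    # A real implementation would look up resource policies for resource.uri();
    # here we only show that the URI is resolved generically.
    print(f"checking {permission_name} for {username} on {resource.uri()}")


check_resource_permission("alice", FakeColumn("uri-123"), "UPDATE_COLUMN_DESCRIPTION")
```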

f'Failed to update table column {column.name} description: {e}'
)
raise e

Contributor:

This method does not make much sense here. We have the glue_table and glue_column handlers, and this is an lf_table function that sits inside the glue_column handler.

In total we have two functionalities:

  • glue (tables and columns) schema handler
  • lake formation permissions handler

Contributor Author:

Agree! I was planning to split the handlers code into two pieces: handlers and clients. Handlers is the code that's needed for @Worker.handler, and clients will contain the API for making calls to AWS. I wanted to do it in the refactoring PR, but since you pointed it out, I will do it here, at least for LF.
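Roughly what that handler/client split could look like (a sketch with assumed names; only @Worker.handler comes from the existing code, and the Lake Formation grant shown is illustrative):

```python
# Hypothetical sketch of the handlers/clients split. The client wraps AWS SDK calls;
# the handler (registered with @Worker.handler in data.all) stays thin and delegates.
import boto3


class LakeFormationClient:
    """Illustrative client wrapping calls to AWS Lake Formation."""

    def __init__(self, region: str):
        self._client = boto3.client("lakeformation", region_name=region)

    def grant_column_select(self, principal_arn: str, database: str, table: str, columns):
        # Illustrative grant; real parameters depend on the permission being shared.
        return self._client.grant_permissions(
            Principal={"DataLakePrincipalIdentifier": principal_arn},
            Resource={"TableWithColumns": {
                "DatabaseName": database, "Name": table, "ColumnNames": list(columns)}},
            Permissions=["SELECT"],
        )


# The handler would then only unpack the task and call the client, e.g.:
# @Worker.handler("lakeformation.column.grant")
# def grant_table_column_access(engine, task):
#     client = LakeFormationClient(task.payload["region"])
#     return client.grant_column_select(task.payload["principal"], ...)
```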

@@ -224,7 +224,7 @@ def get_dataset_table_by_uri(session, table_uri):
return table

@staticmethod
def sync(session, datasetUri, glue_tables=None):
def sync_existing_tables(session, datasetUri, glue_tables=None):
Contributor:

:)

Contributor Author:

Tried my best :)

TASKS = "tasks"
API = auto()
CDK = auto()
HANDLERS = auto()

Contributor:

I like this, it defaults to the lowercase name, right?

Contributor Author:

auto() is used to apply default values to enum members. It's often seen when the exact value is not important.
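For reference, a small standalone example of how auto() can yield lowercase string values when _generate_next_value_ is overridden (whether ImportMode does exactly this isn't visible in this diff):

```python
from enum import Enum, auto


class ImportMode(Enum):
    # auto() calls _generate_next_value_; overriding it here makes the
    # generated value the lowercased member name.
    def _generate_next_value_(name, start, count, last_values):
        return name.lower()

    TASKS = "tasks"
    API = auto()
    CDK = auto()
    HANDLERS = auto()


assert ImportMode.API.value == "api"
assert ImportMode.HANDLERS.value == "handlers"
```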

@@ -57,7 +57,7 @@ def load_modules(modes: List[ImportMode]) -> None:
log.info(f"Module {name} is not active. Skipping...")
continue

Contributor:

active does not need to be checked here because we already check it on line 56

Contributor:

We will always have a loader per module right?

Contributor Author:

No, we have only one loader that loads all the modules one after another, depending on the mode.
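A condensed sketch of that single-loader idea (the config shape and module path layout below are assumptions for illustration):

```python
# Hypothetical sketch: one loader walks the configured modules and imports the
# sub-packages that match the requested ImportMode(s). Real data.all code differs.
import importlib
from enum import Enum, auto
from typing import Dict, List


class ImportMode(Enum):
    API = auto()
    CDK = auto()
    HANDLERS = auto()
    TASKS = auto()


def load_modules(modes: List[ImportMode], modules_config: Dict[str, dict]) -> None:
    for name, props in modules_config.items():
        if not props.get("active"):
            print(f"Module {name} is not active. Skipping...")
            continue
        for mode in modes:
            # e.g. dataall.modules.datasets.handlers for ImportMode.HANDLERS;
            # the import fails loudly if the module does not provide that sub-package.
            importlib.import_module(f"dataall.modules.{name}.{mode.name.lower()}")


# load_modules([ImportMode.HANDLERS], {"datasets": {"active": True}})
```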

@nikpodsh merged commit 3c4ab2d into data-dot-all:modularization-main Apr 24, 2023
@dlpzx changed the title Dataset Modularization pt.1 → modularization: Dataset Modularization pt.1 May 24, 2023
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this pull request May 25, 2023