
modularization: Datasets modularization pt.2 #432

Merged

Conversation

nikpodsh
Contributor

Feature or Bugfix

  • Refactoring

Detail

Refactoring of DatasetProfilingRun

Relates

  • #295 and #412

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Moved dataset table column to modules
Renamed table column to follow Python's naming convention (snake_case)
Added dataset module to config.json (see the sketch after this list)
Moved database table service
Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable
Moved DatasetTableColumn to modules
Currently, only async handlers require dedicated loading. Long-running (scheduled) tasks might not need a dedicated loading mode
Extracted code from glue to glue_column_handler
Added handlers importing for datasets
Extracted the code for dataset table handler
Extracted the long-running task for datasets
Extracted the subscription service into datasets
Extracted the handler to get table columns
Needed for the migration of modules
Fixed tests and added new ones for the dataset module
Glossaries had different target types and had to be treated differently
Created API for glossaries to use modularization
Added and fixed tests
Moved FeedRegistry to gql since it's a more appropriate place for it
Started using registry to provide types
Renaming and small fixes
Solved a circular dependency for redshift; it should go away after the migration of redshift
Moving datasets profiling to datasets modules
Renaming profiling
Renaming table_column_model to models to make it easier to import other models
Moving DatasetProfilingRun model
Moving dataset profiling service and renaming it
Extracted glue_profiling_handler
Deleted DatasetTableProfilingJob since I could not find any usage of it
Returned the name to the model after renaming the service
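
Since several items above touch module registration (the config.json entry, handler imports for datasets, and dedicated loading for async handlers), here is a minimal sketch of how that wiring could fit together. The config keys, the `load_modules` helper, and the `in_async_lambda` flag are illustrative assumptions, not code from this PR.

```python
# Illustrative sketch only: config keys and helper names are assumptions, not data.all code.
import importlib
import json


def load_modules(config_path: str, in_async_lambda: bool) -> None:
    """Import only the modules marked active in config.json."""
    with open(config_path) as f:
        config = json.load(f)

    for name, settings in config.get("modules", {}).items():
        if not settings.get("active", False):
            continue
        # The GraphQL lambda only needs the module's api/services packages...
        importlib.import_module(f"dataall.modules.{name}")
        # ...while the async lambda additionally loads the module's handlers,
        # which is why handlers get a dedicated loading step.
        if in_async_lambda:
            importlib.import_module(f"dataall.modules.{name}.handlers")
```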
@@ -1,6 +1,7 @@
"""The GraphQL schema of datasets and related functionality"""
from dataall.modules.datasets.api import (
table_column
Contributor

Why did we change the naming convention? The api GraphQL objects were capitalized before, right?

Contributor Author

That's correct, and if you think it should remain the existing way, please let me know.
I thought this modularization was a good opportunity to start following the Python convention (underscores instead of capitalization).
But as I said, if you feel it's not correct (or that capitalization is a unique style of data.all :) ), I will change it back.

Contributor

super small comment - I think we could standardize the names further: api.profiling, models.db.DatasetProfilingRun, services.dataset_profiling. For the handler, I think we should have a Glue client in commons and a profiling_handler, what do you think?

Contributor Author

About the naming: I used long names like dataset_profiling_service because it makes it easier to navigate across files and it solves name clashes in the IDE when a lot of files are open. The downside is that the names get longer. If you think it would be more beneficial to have something like services.profiling rather than services.profiling_service (and the same for the rest), I will rename them.

About the handler: no, unfortunately we can't keep Glue in commons, because it has a dependency on the models of different modules. The only way to solve it is to split it and put the module-related Glue code into the modules. Otherwise, we will have circular imports.
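
To make the circular-import point concrete, here is a hedged sketch: a Glue handler that lives inside the module can import the module's own models, whereas a shared Glue client in commons could not do so without creating a cycle. The file and model names mentioned in this PR (glue_column_handler, DatasetTableColumn) are reused below; everything else is illustrative.

```python
# dataall/modules/datasets/handlers/glue_column_handler.py (illustrative sketch)
import boto3

# A module-local import is fine: the module may depend on the core package,
# but not the other way around.
from dataall.modules.datasets.db.models import DatasetTableColumn


def update_column_description(column: DatasetTableColumn, database: str, table: str) -> None:
    """Module-specific Glue code: pushes the column description to AWS Glue."""
    glue = boto3.client("glue")
    # ... build the updated table input from `column` and call glue.update_table(...)


# If this code sat in a shared `commons` Glue client instead, commons would have to
# import dataall.modules.datasets.db.models, while the module already imports commons:
# a circular dependency.
```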

@dlpzx
Contributor

dlpzx commented Apr 26, 2023

I am adding the description of the PR here just to check that I have a clear view of it:

In this PR we are extracting the Profiling piece from Datasets. The profiling feature uses the Glue profiling job that is deployed as part of the Dataset stack. Users trigger a run of this job from the "profiling" tab in the UI, and the results are plotted in the UI.

  • db.models.specific_name models are added under modules.datasets.db.models
  • Since in this case we have more than one GraphQL object, we are placing each one in a package in the modules.datasets.api directory
  • Moved resolver code to modules.datasets.services
  • In this PR the handlers related to profiling jobs have been extracted to a package called handlers inside the module (roughly as in the import sketch below), but I guess this is going to be in the aws module, right @nikpodsh?

Profiling is one feature that I see customers enabling/disabling for Datasets. If they disable it, the associated Glue job deployed in the Dataset stack should also not be deployed. But we can think about that once the Dataset modularization is complete.
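
Putting the bullets above together, the import paths implied by this PR would look roughly as follows. Names explicitly mentioned in the PR are reused; the exact file and package names are assumptions.

```python
# Rough, illustrative import paths after this PR (exact file/class names are assumptions):
from dataall.modules.datasets.db.models import DatasetProfilingRun, DatasetTableColumn
from dataall.modules.datasets.api import table_column                      # one package per GraphQL object
from dataall.modules.datasets.services import dataset_profiling_service    # moved and renamed service
from dataall.modules.datasets.handlers import glue_profiling_handler       # extracted Glue profiling handler
```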

@nikpodsh
Contributor Author

I am adding the description of the PR here just to check that I have a clear view of it:

Thanks for that :)

In this PR the handlers related to profiling jobs have been extracted to a package called handlers inside the module, but I guess this is going to be in the aws module, right @nikpodsh?

I thought about it this way:
handlers contains the code that runs in the async lambda (the one we send a message to via SQS).
aws contains the code (clients) that sends requests to AWS. It's an abstraction and API for all requests to AWS. Requests to AWS can be sent from both the async and GraphQL lambdas.

Why do we need to separate them?

  1. If the code for handlers and clients (aws) is in the same package, then the GraphQL lambda will automatically load the async lambda's code whenever it needs to send a direct request.
  2. It's better layer separation: imagine you get a CloudWatch log saying that a parameter is missing for some request to AWS. If we have a dedicated package that keeps the API for sending requests, we know where to look. The same goes if the error is in the async lambda. :) (A rough sketch of this split follows below.)
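
A hedged sketch of that split, using hypothetical class and function names (the real client and handler in the codebase may be organized differently):

```python
# dataall/modules/datasets/aws/glue_profiler_client.py (hypothetical name)
# The aws layer: the only place that actually talks to the AWS API.
import boto3


class GlueProfilerClient:
    def __init__(self, region: str):
        self._glue = boto3.client("glue", region_name=region)

    def start_profiling_run(self, job_name: str, arguments: dict) -> str:
        response = self._glue.start_job_run(JobName=job_name, Arguments=arguments)
        return response["JobRunId"]


# dataall/modules/datasets/handlers/glue_profiling_handler.py (hypothetical content)
# The handlers layer: runs in the async lambda, consuming tasks delivered via SQS,
# and delegates the actual AWS call to the client above.
def start_profiling_run_handler(task: dict) -> dict:
    client = GlueProfilerClient(region=task["region"])
    run_id = client.start_profiling_run(task["job_name"], task.get("arguments", {}))
    return {"run_id": run_id}
```

Because the GraphQL lambda can import GlueProfilerClient directly for synchronous calls, while only the async lambda imports the handlers package, neither lambda has to load the other's code.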

@nikpodsh nikpodsh merged commit 74a249a into data-dot-all:modularization-main May 2, 2023
@nikpodsh nikpodsh deleted the datasets-mod-part2 branch May 4, 2023 12:14
@dlpzx dlpzx changed the title Datasets modularization pt.2 modularization: Datasets modularization pt.2 May 24, 2023
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this pull request May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- data-dot-all#295 and data-dot-all#412 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.