
modularization: Datasets modularization pt.2 #432

Merged

Conversation

nikpodsh
Contributor

Feature or Bugfix

  • Refactoring

Detail

Refactoring of DatasetProfilingRun

Relates

  • #295 and #412

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Moved dataset table column to modules
Renamed table column to follow Python's naming convention (snake_case)
Added dataset module to config.json (see the sketch after this list)
Moved database table service
Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable
Moved DatasetTableColumn to modules
Currently, only async handlers require dedicated loading. Long-running (scheduled) tasks might not need a dedicated loading mode
Extracted code from glue to glue_column_handler
Added handlers importing for datasets
Extracted the code for dataset table handler
Extracted the long-running task for datasets
Extracted the subscription service into datasets
Extracted the handler to get table columns
Needed for the migration of modules
Fixed tests and added new ones for the dataset module
Glossaries had different target types and had to be treated differently
Created API for glossaries to use modularization
Added and fixed tests
Moved FeedRegistry to gql since it's a more appropriate place for it
Started using registry to provide types
Renaming and small fixes
Solved a circular dependency for redshift; it should go away after the migration of redshift
Moving datasets profiling to datasets modules
Renaming profiling
Renaming table_column_model to models to make it easier to import other models
Moving DatasetProfilingRun model
Moving dataset profiling service and renaming it
Extracted glue_profiling_handler
Deleted DatasetTableProfilingJob since I could not find any usage of it
Returned the name to the model after renaming the service
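
Since several items above touch module registration (the config.json entry, handler imports for datasets, and dedicated loading for async handlers), here is a minimal sketch of how that wiring could fit together. The config keys, the `load_modules` helper, and the `in_async_lambda` flag are illustrative assumptions, not code from this PR.

```python
# Illustrative sketch only: config keys and helper names are assumptions, not data.all code.
import importlib
import json


def load_modules(config_path: str, in_async_lambda: bool) -> None:
    """Import only the modules marked active in config.json."""
    with open(config_path) as f:
        config = json.load(f)

    for name, settings in config.get("modules", {}).items():
        if not settings.get("active", False):
            continue
        # The GraphQL lambda only needs the module's api/services packages...
        importlib.import_module(f"dataall.modules.{name}")
        # ...while the async lambda additionally loads the module's handlers,
        # which is why handlers get a dedicated loading step.
        if in_async_lambda:
            importlib.import_module(f"dataall.modules.{name}.handlers")
```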
@@ -1,6 +1,7 @@
"""The GraphQL schema of datasets and related functionality"""
from dataall.modules.datasets.api import (
table_column
Contributor

Why did we change the naming convention? The api GraphQL objects were capitalized before, right?

Contributor Author

That's correct, and if you think it should remain the existing way, please let me know.
I thought this modularization was a good opportunity to start following the Python convention (underscores instead of capitalization).
But as I said, if you feel it's not correct (or that capitalization is a unique style of data.all :) ), I will change it back.

Contributor

super small comment - I think we could standardize the names further: api.profiling, models.db.DatasetProfilingRun, services.dataset_profiling. For the handler, I think we should have a Glue client in commons and a profiling_handler, what do you think?

Contributor Author

About the naming: I used long names like dataset_profiling_service because it makes it easier to navigate across files and it solves name clashes in the IDE when a lot of files are open. The downside is that the names get longer. If you think it would be more beneficial to have something like services.profiling rather than services.profiling_service (and the same for the rest), I will rename them.

About the handler: no, unfortunately we can't keep Glue in commons, because it has a dependency on the models of different modules. The only way to solve it is to split it and put the module-related Glue code into the modules. Otherwise, we will have circular imports.
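
To make the circular-import point concrete, here is a hedged sketch: a Glue handler that lives inside the module can import the module's own models, whereas a shared Glue client in commons could not do so without creating a cycle. The file and model names mentioned in this PR (glue_column_handler, DatasetTableColumn) are reused below; everything else is illustrative.

```python
# dataall/modules/datasets/handlers/glue_column_handler.py (illustrative sketch)
import boto3

# A module-local import is fine: the module may depend on the core package,
# but not the other way around.
from dataall.modules.datasets.db.models import DatasetTableColumn


def update_column_description(column: DatasetTableColumn, database: str, table: str) -> None:
    """Module-specific Glue code: pushes the column description to AWS Glue."""
    glue = boto3.client("glue")
    # ... build the updated table input from `column` and call glue.update_table(...)


# If this code sat in a shared `commons` Glue client instead, commons would have to
# import dataall.modules.datasets.db.models, while the module already imports commons:
# a circular dependency.
```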

@dlpzx
Contributor

dlpzx commented Apr 26, 2023

I am adding the description of the PR here just to check that I have a clear view of it:

In this PR we are extracting the Profiling piece from Datasets. The profiling feature uses the Glue profiling job that is deployed as part of the Dataset stack. Users trigger a run of this job from the "profiling" tab in the UI, and the results are plotted in the UI.

  • db.models.specific_name models are added under modules.datasets.db.models
  • Since in this case we have more than one GraphQL object, we are placing each one in a package in the modules.datasets.api directory
  • Moved resolver code to modules.datasets.services
  • In this PR the handlers related to profiling jobs have been extracted to a package called handlers inside the module (roughly as in the import sketch below), but I guess this is going to be in the aws module, right @nikpodsh?

Profiling is one feature that I see customers enabling/disabling for Datasets. If they disable it, the associated Glue job deployed in the Dataset stack should also not be deployed. But we can think about that once the Dataset modularization is complete.
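
Putting the bullets above together, the import paths implied by this PR would look roughly as follows. Names explicitly mentioned in the PR are reused; the exact file and package names are assumptions.

```python
# Rough, illustrative import paths after this PR (exact file/class names are assumptions):
from dataall.modules.datasets.db.models import DatasetProfilingRun, DatasetTableColumn
from dataall.modules.datasets.api import table_column                      # one package per GraphQL object
from dataall.modules.datasets.services import dataset_profiling_service    # moved and renamed service
from dataall.modules.datasets.handlers import glue_profiling_handler       # extracted Glue profiling handler
```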

@nikpodsh
Contributor Author

I am adding the description of the PR here just to check that I have a clear view of it:

Thanks for that :)

In this PR the handlers related to profiling jobs have been extracted to a package called handlers inside the module, but I guess this is going to be in the aws module, right @nikpodsh?

I thought about it this way:
handlers contains the code that runs in the async lambda (the one we send a message to via SQS).
aws contains the code (clients) that sends requests to AWS. It's an abstraction and API for all requests to AWS. Requests to AWS can be sent from both the async and GraphQL lambdas.

Why do we need to separate them?

  1. If the code for handlers and clients (aws) is in the same package, then the GraphQL lambda will automatically load the async lambda's code whenever it needs to send a direct request.
  2. It's better layer separation: imagine you get a CloudWatch log saying that a parameter is missing for some request to AWS. If we have a dedicated package that keeps the API for sending requests, we know where to look. The same goes if the error is in the async lambda. :) (A rough sketch of this split follows below.)
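
A hedged sketch of that split, using hypothetical class and function names (the real client and handler in the codebase may be organized differently):

```python
# dataall/modules/datasets/aws/glue_profiler_client.py (hypothetical name)
# The aws layer: the only place that actually talks to the AWS API.
import boto3


class GlueProfilerClient:
    def __init__(self, region: str):
        self._glue = boto3.client("glue", region_name=region)

    def start_profiling_run(self, job_name: str, arguments: dict) -> str:
        response = self._glue.start_job_run(JobName=job_name, Arguments=arguments)
        return response["JobRunId"]


# dataall/modules/datasets/handlers/glue_profiling_handler.py (hypothetical content)
# The handlers layer: runs in the async lambda, consuming tasks delivered via SQS,
# and delegates the actual AWS call to the client above.
def start_profiling_run_handler(task: dict) -> dict:
    client = GlueProfilerClient(region=task["region"])
    run_id = client.start_profiling_run(task["job_name"], task.get("arguments", {}))
    return {"run_id": run_id}
```

Because the GraphQL lambda can import GlueProfilerClient directly for synchronous calls, while only the async lambda imports the handlers package, neither lambda has to load the other's code.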

@nikpodsh nikpodsh merged commit 74a249a into data-dot-all:modularization-main May 2, 2023
@nikpodsh nikpodsh deleted the datasets-mod-part2 branch May 4, 2023 12:14
@dlpzx dlpzx changed the title Datasets modularization pt.2 modularization: Datasets modularization pt.2 May 24, 2023
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this pull request May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- data-dot-all#295 and data-dot-all#412 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.