Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modularization: Datasets modularization pt.5 #442

Merged
merged 124 commits into from
May 10, 2023

Conversation

nikpodsh
Copy link
Contributor

@nikpodsh nikpodsh commented May 2, 2023

Feature or Bugfix

  • Refactoring

Detail

Refactoring of the Dataset entity and related to it code.
Refactoring for Votes
Introduced DataPolicy (the same way as ServicePolicy was used used)
Extracted dataset related permissions.
Used new has_tenant_permission instead of has_tenant_perm that allows not to pass unused parameters

Relates

#412 and #295

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Moved dataset table column to modules
Renamed table column to the python's convention format
Added dataset module to config.json
Moved database table service
Renamed DatasetTable to DatasetTableService to avoid collisions with models.DatasetTable
Moved DatasetTableColumn to modules
Currently, only async handlers require dedicated loading. Long-running tasks (scheduled tasks) might not need to have a dedicated loading mode
Extracted code from glue to glue_column_handler
Added handlers importing for datasets
Extracted the code for dataset table handler
Extracted the long-running task for datasets
Extracted the subscription service into datasets
Extracted the handler to get table columns
Needed for migration for modules
Fixed tests and added new for dataset module
Glossaries had different target types and had to be treated differently
Created API for glossaries to use modularization
Added and fixed tests
Moved FeedRegistry to gql since it's more appropriate place for this
Started using registry to provide types
Renaming and small fixes
Solve circular dependecy for redshift. It should go away after the migration of redshift
@@ -623,3 +619,82 @@ def count_dataset_tables(session, dataset_uri):
.filter(DatasetTable.datasetUri == dataset_uri)
.count()
)

@staticmethod
def query_environment_group_datasets(session, envUri, groupUri, filter) -> Query:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These query functions should be in dataset_repository (following your design pattern), but as we discussed this could also be in a followup :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will extract this methods after migrating the sharing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to modularization: this task needs to be reviewed. It is triggered every 15 mins and it was originally conceived to add statements to the pre-existing bucket policy of imported datasets. Through this task we can enforce more restrictive access to imported buckets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it looks strange. The task runs every 15 mins to update policies for imported datasets. Most of the time it will do nothing, but just incur cost for running.
I wonder why we can't update the policies during the importing of datasets?

@@ -86,11 +84,14 @@ def generate_admins_data_access_policy(self) -> iam.Policy:

return policy

def generate_data_access_policy(self) -> iam.Policy:
def generate_data_access_policy(self, session) -> iam.Policy:
"""
Copy link
Contributor

@dlpzx dlpzx May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to pass the session? I see now, you are using the get_statements defined in the subclass! got it

@@ -70,10 +73,8 @@ def test_get_dataset(client, dataset1, env1, user, group):


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of using MagicMock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly? it's just working :)
I couldn't find a better way to mock instance of class. Tried a few other ways to patch an instance, but got only FAILED. This one worked)

Copy link
Contributor

@dlpzx dlpzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!

@nikpodsh
Copy link
Contributor Author

nikpodsh commented May 9, 2023

Added new commits:
Most of them were review remarks, but I also returned triggering the alarms as we discussed (+ added a line to trigger alarm not only create a message) @dlpzx
Extracted DatasetRole (kudos to @dbalintx for noticing it)
Created a way to extend EnvironmentSetup (which is an environment stack) and extracted glue profiler since it looks related to datasets. I didn't do that for the others likelakeformationdefaultsettings and gluedatabasecustomresource since wasn't sure if there are only dataset-specific.

@nikpodsh nikpodsh merged commit d9b99b3 into data-dot-all:modularization-main May 10, 2023
@dlpzx dlpzx changed the title Datasets modularization pt.5 modularization: Datasets modularization pt.5 May 24, 2023
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this pull request May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the Dataset entity and related to it code.
Refactoring for Votes
Introduced DataPolicy (the same way as ServicePolicy was used used)
Extracted dataset related permissions.
Used new `has_tenant_permission` instead of `has_tenant_perm` that
allows not to pass unused parameters

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants