-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generic dataset module and specific s3_datasets module - part 2 (Create datasets_base and move enums) #1257
Conversation
86f1431
to
7d66809
Compare
…eric-dataset-model-refactoring-2
Tested in AWS:
👀 Because how the loader is configured for CDK and CDK_CLI_EXTENSION load modes it is needed to add the |
@@ -9,6 +9,9 @@ | |||
"datapipelines": { | |||
"active": true | |||
}, | |||
"datasets_base": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally we would have dataset_base
be automatically activated if any of the child dataset modules (i.e. s3_datasets
) is active and not have to expose it here on the config.json
Configuring the dataset_base
module as active or not does not really mean anything as it is really the child dataset modules that are meaningful to activate or not
I am not sure how easy it is to do such a thing or if it requires a big shift in the loader logic but if an easy fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw your comment that it is required here because of how we load ImportModes of CDK and CDK_CLI_EXTENSION - but can you explain why that is / is there an easy way to resolve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, datasets_base
will be added in the config.json at some point (check the full design steps in #1123) because some parameters are relevant for all datasets no only for s3_datasets or redshift_datasets (e.g. confidentiality). That is why I did not give it more thought.
But if you are curious, the issue is a bit hidden, the loader
has some methods to check that it initialized the modules correctly. With the current implementation I think that a module cannot have a module interface CDK and not have a CDK_CLI_EXTENSION if it is not a config.json module. I think the problem is not that much on the loader but on the way we are using it in the cdk proxy. I will have a closer look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In earlier versions such as v2.4.0
I think we never exposed datasets_base
as an module in config.json
and we had the same type of __init__.py
function where it was only supported by CDK and not CDK_CLI_EXTENSION
Trying to understand why we could do that then but can not now - playing around a bit on how these modules are imported using the loader class
Either way I tested these changes and it looks good so will approve PR but do some more digging here on the side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found the issue as to why there is an error when we remove datasets_base
from config.json
:
- When we
load_modules(modes={ImportMode.CDK_CLI_EXTENSION})
- We first import all modules
in_config
as part ofdef _load_modules(...)
- This would theoretically include
s3_datasets
anddataset_sharing
at this point in time - When we import
dataset_sharing
we have the following line in top level of__init__.py
that subsequently importss3_datasets
anddatasets_base
modules
- We first import all modules
from dataall.modules.dataset_sharing.db.share_object_repositories import ShareEnvironmentResource
- Then when we
_check_loading_correct(...)
in loader.py- it fails on L252 with
ImportError
sincedatasets_base
is insys.modules.keys()
but is not inchecked_modules_names
(sincedatasets_base
is not in config / expected_load)
- it fails on L252 with
Previously we never exposed datasets_base
or datasets_sharing
as configurable modules in config.json
so we avoided this problem. Moving the line
from dataall.modules.dataset_sharing.db.share_object_repositories import ShareEnvironmentResource
in backend/dataall/modules/dataset_sharing/__init__.py
to within the the def __init__(self):
in class SharingApiModuleInterface()
resolves this issue on loading modules but may need some additional testing to ensure share APIs still working as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing some digging on the loader class and exposing datasets_base
in config.json but tested this PR and changes look good
@noah-paige I will implement the import change in another PR, probably with other modules that also have global imports in their init files |
### Feature or Bugfix - Bugfix ### Detail The loader class that loads modules dependending on the module interface loads all module init files and then uses the interfaces to load specific parts of the module or to register/perform other operations. The issue is that in some modules we are importing some parts of the package at the top level of the `__init__` file, which means that no matter what the module interface is, those imports will be executed. Let's see it with an example. I am loading modules using the `CDK_CLI_EXTENSION` module in the `cdk_cli_wrapper`. The loader loads all init files of the modules in the config.json, and then uses the interface to load the pipelines module only. The issue is that in the init files of the modules it has also imported things like the `DashboardRepository` or `SagemakerStudioEnvironmentResource` for example. Pieces of code that are not needed in the `CDK_CLI_EXTENSION`. This would be a minor issue, some extra pieces of code are loaded, okey. The bigger problem comes from "child-modules". The loader after loading everything checks if what is has been loaded has its correct interface or if it is a config.json module. Modules such as `feed` that are not directly specified in the config.json could be imported indirectly through some of this top-level imports in the init files and result in failure of the loading. We have 2 options: 1. Make our loader aware of the source of each file it imports 2. Be strict with how we import in the init files In this PR I went for 2 because our loader is complex enough and is already name-sensitive. I think adding more logic would make it very difficult to handle and it might have unexpected results. ### Relates - #1257 - in this PR we realized about the issue ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
…te DatasetBase db model and S3Dataset model) (#1258) ### Feature or Bugfix⚠️ This PR should be merged after #1257. - Feature - Refactoring ### Detail As explained in the design for #1123 we are trying to implement a generic `datasets_base` module that can be used by any type of datasets in a generic way. **This PR does**: - Adds a generic `DatasetBase` model in datasets_base.db that is used in s3_datasets.db to build the `S3Dataset` model using joined table inheritance in [sqlalchemy](https://docs.sqlalchemy.org/en/20/orm/inheritance.html) - Rename all usages of Dataset to S3Dataset (in the future some will be returned to DatasetBase, but for the moment we will keep them as S3Dataset) - Add migration script that backfills `datasets` table and renames `s3_datasets` --->⚠️ In the process of migrating we are doing some "scary" operations on the dataset table, if for any reason the migration encounters any issue it could result in catastrophic loss of information --> for this reason this [PR](#1267) implements RDS snapshots on migrations. **This PR does not**: - Feed registration stays as: `FeedRegistry.register(FeedDefinition('Dataset', S3Dataset))` using `Dataset` with the `S3Dataset` resource type. It is out of the scope of this PR to migrate the Feed definition. - Exactly the same for the GlossaryRegistry registration. We keep `object_type='Dataset'` to avoid backwards compatibility issues. - It does not change the resourceType for permissions. We keep using a generic `Dataset` as target for S3 permissions. If we are to split permissions into DatasetBase permissions and S3Dataset permissions we would do it on a different PR #### Remarks Inserting new items of S3Dataset does not require any changes. SQL Alchemy joined inheritance automatically inserts data in the parent table and then another one to the child table as explained in this stackoverflow [link](https://stackoverflow.com/questions/39926937/sqlalchemy-how-to-insert-a-joined-table-inherited-class-instance-when-the-pare) (I was not able to find it in the official docs) ### Relates - #1123 - #955 - #1267 ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Feature or Bugfix
Detail
As explained in the design for #1123 we are trying to implement a generic
datasets_base
module that can be used by any type of datasets in a generic way.This PR:
datasets_base
module consisting of 3 packages:db
,api
,services
. And adds the__init__
file.s3_datasets
todatasets_base
in the__init__
file of thes3_datasets
moduleRelates
Security
Please answer the questions below briefly where applicable, or write
N/A
. Based onOWASP 10.
fetching data from storage outside the application (e.g. a database, an S3 bucket)?
eval
or similar functions are used?By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.