-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generic dataset module and specific s3_datasets module - part 3 (Create DatasetBase db model and S3Dataset model) #1258
Conversation
… creation renamed table s3_dataset
…eric-dataset-model-refactoring-2
…t-model-refactoring-2' into feat/generic-dataset-model-refactoring-3
…eric-dataset-model-refactoring-3 # Conflicts: # backend/dataall/modules/dataset_sharing/services/dataset_sharing_service.py # backend/dataall/modules/s3_datasets/api/dataset/resolvers.py # backend/dataall/modules/s3_datasets/db/dataset_models.py # backend/dataall/modules/s3_datasets/services/dataset_service.py # backend/dataall/modules/s3_datasets/services/dataset_table_service.py
Testing (before changes from PR review)
|
...nd/migrations/versions/458572580709_remove_dataset_table_read_permissions_from_env_admins.py
Show resolved
Hide resolved
backend/migrations/versions/c6d01930179d_add_backfill_read_folder_permissions.py
Show resolved
Hide resolved
backend/migrations/versions/d059eead99c2_rename_dataset_table_as_s3_dataset.py
Show resolved
Hide resolved
backend/migrations/versions/d059eead99c2_rename_dataset_table_as_s3_dataset.py
Show resolved
Hide resolved
Overall left some minor comments - additionally tested out the migrations scripts back-filling a single dataset and all works locally for me as well Will do one last look through first thing tomorrow once you are finished with you check list of testing as well |
session.commit() | ||
session.close() | ||
|
||
# Update non-nullable columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On downgrade I am getting an error column "label" of relation "s3_dataset" contains null values
even after we set the label
value in the above for loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will re-test with the changes from PR review, see the results below
Testing after changes from review
|
backend/migrations/versions/5cdcf6cc1d73_create_and_backfill_dataset_table.py
Outdated
Show resolved
Hide resolved
### Feature or Bugfix - Feature ### Detail Alembic migrations can get complex and in some cases we are using alembic for not only schema migrations but also data migrations. When moving columns with data from one table to another we might accidentally make a mistake in a migration script. We strive to test all migration scripts and avoid bugs in such sensitive operations, but to protect users from the catastrophic situation in which there is a bug, a service issue or any other exceptional situation this PR introduces the creation of manual database snapshots before running alembic migration scripts. This PR modifies the db_migration handler that is triggered with every backendStack update. It checks if there are new migration scripts (if the current head in the database is different from the new head in the code). If True, it will create a cluster snapshot. Remarks: - Snapshots cannot be created when the cluster in not `available`, the PR introduces a check to wait for this condition. If the Lambda timeout is reached waiting for the cluster, then the CICD pipeline will fail and will need to be retried - During the creation of an snapshot we can still run alembic migration scripts - Snapshots are incremental, the first time will take a long time, but new snapshots will be faster ### Relates - #1258 - This PR is a good example of complex data migration operations. ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Last last things: WHen running alembic autogenerate migration I get the following
Our models and migration scripts may not be in sync? |
Changing datasetUri in
fixes the foreign key change generated by alembic |
Hello hello, thanks @noah-paige for such a deep review :)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tested and looks good - approving
…te DatasetBaseRepository and move DatasetLock) (#1276) ### Feature or Bugfix⚠️ merge after #1258 - Refactoring ### Detail As explained in the design for #1123 we are trying to implement a generic `datasets_base` module that can be used by any type of datasets in a generic way. In this small PR: - we move the generic DatasetLock model to datasets_base - move the DatasetLock db operations to databasets_base DatasetBaseRepository - move activity to DatasetBaseRepository ### Relates - #1123 - #955 ### Security Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/). - Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? - Is the input sanitized? - What precautions are you taking before deserializing the data you consume? - Is injection prevented by parametrizing queries? - Have you ensured no `eval` or similar functions are used? - Does this PR introduce any functionality or component that requires authorization? - How have you ensured it respects the existing AuthN/AuthZ mechanisms? - Are you logging failed auth attempts? - Are you using or adding any cryptographic features? - Do you use a standard proven implementations? - Are the used keys controlled by the customer? Where are they stored? - Are you introducing any new policies/roles/users? - Have you used the least-privilege principle? How? By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Feature or Bugfix
Detail
As explained in the design for #1123 we are trying to implement a generic
datasets_base
module that can be used by any type of datasets in a generic way.This PR does:
DatasetBase
model in datasets_base.db that is used in s3_datasets.db to build theS3Dataset
model using joined table inheritance in sqlalchemydatasets
table and renamess3_datasets
--->This PR does not:
FeedRegistry.register(FeedDefinition('Dataset', S3Dataset))
usingDataset
with theS3Dataset
resource type. It is out of the scope of this PR to migrate the Feed definition.object_type='Dataset'
to avoid backwards compatibility issues.Dataset
as target for S3 permissions. If we are to split permissions into DatasetBase permissions and S3Dataset permissions we would do it on a different PRRemarks
Inserting new items of S3Dataset does not require any changes. SQL Alchemy joined inheritance automatically inserts data in the parent table and then another one to the child table as explained in this stackoverflow link (I was not able to find it in the official docs)
Relates
Security
Please answer the questions below briefly where applicable, or write
N/A
. Based onOWASP 10.
fetching data from storage outside the application (e.g. a database, an S3 bucket)?
eval
or similar functions are used?By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.