
Generic dataset module and specific s3_datasets module - part 3 (Create DatasetBase db model and S3Dataset model) #1258

Merged: 28 commits merged into main from feat/generic-dataset-model-refactoring-3 on May 17, 2024

Conversation

dlpzx (Contributor) commented May 7, 2024

Feature or Bugfix

⚠️ This PR should be merged after #1257.

  • Feature
  • Refactoring

Detail

As explained in the design for #1123, we are trying to implement a generic datasets_base module that can be used by any type of dataset.

This PR does:

  • Adds a generic DatasetBase model in datasets_base.db that is used in s3_datasets.db to build the S3Dataset model via joined table inheritance in SQLAlchemy (see the sketch after this list)
  • Renames all usages of Dataset to S3Dataset (in the future some will be reverted to DatasetBase, but for the moment we keep them as S3Dataset)
  • Adds a migration script that backfills the new dataset parent table and renames the pre-existing table to s3_dataset ---> ⚠️ the migration performs some "scary" operations on the dataset table; if it encountered any issue for any reason, the result could be catastrophic loss of information. For this reason this PR implements RDS snapshots on migrations.
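
A minimal sketch of the joined table inheritance pattern from the first bullet (these are not the actual data.all models; columns other than datasetUri and datasetType are illustrative):

    from sqlalchemy import Column, ForeignKey, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class DatasetBase(Base):
        __tablename__ = 'dataset'
        datasetUri = Column(String, primary_key=True)
        datasetType = Column(String, nullable=False)  # discriminator column
        label = Column(String, nullable=False)        # illustrative shared column
        __mapper_args__ = {'polymorphic_on': datasetType, 'polymorphic_identity': 'dataset'}

    class S3Dataset(DatasetBase):
        __tablename__ = 's3_dataset'
        # The child primary key is also a foreign key to the parent table
        datasetUri = Column(String, ForeignKey('dataset.datasetUri'), primary_key=True)
        S3BucketName = Column(String)  # illustrative S3-specific column
        __mapper_args__ = {'polymorphic_identity': 'S3'}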

This PR does not:

  • Feed registration stays as FeedRegistry.register(FeedDefinition('Dataset', S3Dataset)), keeping the 'Dataset' type name for the S3Dataset resource type. Migrating the Feed definition is out of the scope of this PR.
  • Exactly the same for the GlossaryRegistry registration: we keep object_type='Dataset' to avoid backwards-compatibility issues.
  • It does not change the resourceType for permissions. We keep using a generic Dataset as the target for S3 permissions. If we split permissions into DatasetBase permissions and S3Dataset permissions, we will do it in a different PR.

Remarks

Inserting new S3Dataset items does not require any changes. SQLAlchemy joined inheritance automatically emits one insert into the parent table and then another into the child table, as explained in a Stack Overflow answer (I was not able to find this in the official docs); the behavior is illustrated below.
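
Using the sketch models above, this behavior is easy to observe with SQLAlchemy's echo flag; adding one S3Dataset emits an INSERT into dataset followed by an INSERT into s3_dataset:

    from sqlalchemy import create_engine
    from sqlalchemy.orm import Session

    engine = create_engine('sqlite://', echo=True)  # echo=True logs the emitted SQL
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(S3Dataset(datasetUri='uri-1', label='my-dataset', S3BucketName='my-bucket'))
        session.commit()  # logs INSERT INTO dataset (...) then INSERT INTO s3_dataset (...)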

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

dlpzx added 2 commits May 15, 2024
…t-model-refactoring-2' into feat/generic-dataset-model-refactoring-3
…eric-dataset-model-refactoring-3

# Conflicts:
#	backend/dataall/modules/dataset_sharing/services/dataset_sharing_service.py
#	backend/dataall/modules/s3_datasets/api/dataset/resolvers.py
#	backend/dataall/modules/s3_datasets/db/dataset_models.py
#	backend/dataall/modules/s3_datasets/services/dataset_service.py
#	backend/dataall/modules/s3_datasets/services/dataset_table_service.py
@dlpzx dlpzx marked this pull request as ready for review May 15, 2024 09:17
@dlpzx dlpzx changed the title WIP - Generic dataset module and specific s3_datasets module - part 3 (Create DatasetBase db model and S3Dataset model) Generic dataset module and specific s3_datasets module - part 3 (Create DatasetBase db model and S3Dataset model) May 15, 2024
dlpzx (Contributor, Author) commented May 15, 2024

Testing (before changes from PR review)

  • locally update and list pre-existing datasets
  • locally update and create a new dataset - check that it is created successfully in the database, is indexed in the catalog, and that its permissions are created for the resource type Datasets
  • locally run all migration scripts from scratch - without data
  • locally downgrade each of the migration scripts one by one - without data
  • locally downgrade both migration scripts at once - without data
  • in a pre-existing AWS deployment, merge this branch and check the migration is correctly executed
  • in a pre-existing AWS deployment, check previous datasets are listed and can be accessed (checking permissions)
  • in a pre-existing AWS deployment, create a new Dataset

@dlpzx dlpzx requested a review from noah-paige May 15, 2024 16:26
noah-paige (Contributor) commented:

Overall left some minor comments - additionally, I tested the migration scripts back-filling a single dataset and everything works locally for me as well

Will do one last look through first thing tomorrow once you are finished with your testing checklist as well

Inline review comment on this migration script excerpt:

    session.commit()
    session.close()

    # Update non-nullable columns
noah-paige (Contributor):

On downgrade I am getting the error column "label" of relation "s3_dataset" contains null values, even after we set the label value in the for loop above

dlpzx (Contributor, Author):

I will re-test with the changes from the PR review; see the results below
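
For context, this kind of downgrade failure usually means some rows were still NULL (or the backfill was not yet applied) when the NOT NULL constraint was re-introduced. A minimal, hypothetical sketch of the usual Alembic pattern that avoids it; the table, column, and fallback value are illustrative, not the actual script:

    import sqlalchemy as sa
    from alembic import op


    def downgrade():
        # Backfill with op.execute() so the UPDATE is applied before the
        # NOT NULL constraint comes back; the fallback value is illustrative
        op.execute('UPDATE s3_dataset SET label = "datasetUri" WHERE label IS NULL')
        op.alter_column('s3_dataset', 'label', existing_type=sa.String(), nullable=False)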

dlpzx (Contributor, Author) commented May 16, 2024

Testing after changes from review

  • locally create datasets with migrations up to revision 458572580709, then update the schema with alembic migration d059eead99c2 (the corresponding Alembic invocations are sketched after this list) -> check that datasetType is backfilled, foreign keys are updated, and the dataset table is renamed
  • downgrade back to 458572580709 -> check that datasetType is deleted and the datasettype enum object as well. Check that we now have a foreign key fk_dataset_env_uri and the dataset table is called dataset
  • upgrade to head -> check that the new dataset table is backfilled and the datasettypes enum is used for the dataset type
  • same for downgrade
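
For reference, these steps map onto plain Alembic upgrade/downgrade invocations; a sketch using the Python command API (the config path is illustrative, and the CLI equivalents alembic upgrade / alembic downgrade behave the same):

    from alembic import command
    from alembic.config import Config

    cfg = Config('alembic.ini')             # illustrative config path
    command.upgrade(cfg, '458572580709')    # migrate up to the earlier revision
    command.upgrade(cfg, 'd059eead99c2')    # apply the new migration
    command.downgrade(cfg, '458572580709')  # downgrade back one revision
    command.upgrade(cfg, 'head')            # upgrade to head again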

dlpzx added a commit that referenced this pull request May 16, 2024
### Feature or Bugfix
- Feature

### Detail
Alembic migrations can get complex, and in some cases we use
alembic not only for schema migrations but also for data migrations. When
moving columns with data from one table to another we might accidentally
make a mistake in a migration script. We strive to test all migration
scripts and avoid bugs in such sensitive operations, but to protect
users from the catastrophic situation in which there is a bug, a service
issue or any other exceptional situation, this PR introduces the creation
of manual database snapshots before running alembic migration scripts.

This PR modifies the db_migration handler that is triggered with every
backendStack update. It checks whether there are new migration scripts
(i.e. whether the current head in the database differs from the new head
in the code). If so, it creates a cluster snapshot.

Remarks:
- Snapshots cannot be created while the cluster is not `available`; the
PR introduces a check that waits for this condition. If the Lambda timeout
is reached while waiting for the cluster, the CICD pipeline will fail and
will need to be retried
- During the creation of a snapshot we can still run alembic migration
scripts
- Snapshots are incremental: the first one will take a long time, but
subsequent snapshots will be faster
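
A minimal sketch of this logic, assuming boto3 against an Aurora cluster (function and identifier names are illustrative, not the actual handler code):

    import time

    import boto3
    from alembic.config import Config
    from alembic.runtime.migration import MigrationContext
    from alembic.script import ScriptDirectory


    def has_pending_migrations(engine, alembic_ini_path):
        # Compare the head revision in the code with the current revision in the DB
        script = ScriptDirectory.from_config(Config(alembic_ini_path))
        with engine.connect() as conn:
            current = MigrationContext.configure(conn).get_current_revision()
        return current != script.get_current_head()


    def create_pre_migration_snapshot(cluster_id, snapshot_id):
        rds = boto3.client('rds')
        # Snapshots cannot be created unless the cluster is 'available';
        # if the Lambda times out while waiting, the CICD pipeline is retried
        while True:
            cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)['DBClusters'][0]
            if cluster['Status'] == 'available':
                break
            time.sleep(30)
        # Alembic scripts can keep running while the snapshot is being created
        rds.create_db_cluster_snapshot(
            DBClusterSnapshotIdentifier=snapshot_id,
            DBClusterIdentifier=cluster_id,
        )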

### Relates
- #1258 - This PR is a good example of complex data migration
operations.

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
noah-paige (Contributor) commented:

Last last things:

When running alembic autogenerate migration I get the following:

def upgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.alter_column('dataset', 'datasetType',
               existing_type=postgresql.ENUM('S3', name='datasettypes'),
               type_=sa.Enum('S3', name='datasettype'),
               existing_nullable=False)
    op.drop_constraint('s3_dataset_bucket_datasetUri_fkey', 'dataset_bucket', type_='foreignkey')
    op.create_foreign_key(None, 'dataset_bucket', 'dataset', ['datasetUri'], ['datasetUri'], ondelete='CASCADE')
    # ### end Alembic commands ###

Our models and migration scripts may not be in sync?


noah-paige commented May 17, 2024

Changing datasetUri in the dataset_bucket table model at backend/dataall/modules/s3_datasets/db/dataset_models.py to

    datasetUri = Column(String, ForeignKey('s3_dataset.datasetUri', ondelete='CASCADE'), nullable=False)

fixes the foreign key change generated by alembic

dlpzx (Contributor, Author) commented May 17, 2024

Hello hello, thanks @noah-paige for such a deep review :)

  • foreign_key in dataset_bucket - I can make the change, it looks reasonable
  • postgresql.ENUM vs sa.Enum - there is a problem with sa.Enum: it forces the creation of the enum object itself. I used postgresql.ENUM because it has the option create_type=False, which reuses the existing datasettypes object; otherwise the migration fails. A sketch follows below.
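
For illustration, a sketch of the kind of column definition being described (not the exact migration script; the add_column call is hypothetical):

    import sqlalchemy as sa
    from alembic import op
    from sqlalchemy.dialects import postgresql


    def upgrade():
        # create_type=False makes the migration reuse the existing
        # 'datasettypes' enum instead of emitting CREATE TYPE (which would
        # fail because the type already exists); sa.Enum has no such option
        op.add_column(
            'dataset',
            sa.Column(
                'datasetType',
                postgresql.ENUM('S3', name='datasettypes', create_type=False),
                nullable=False,
            ),
        )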

noah-paige (Contributor) left a review comment:

tested and looks good - approving

@dlpzx dlpzx merged commit 7bf62d7 into main May 17, 2024
9 checks passed
dlpzx added a commit that referenced this pull request May 21, 2024
…te DatasetBaseRepository and move DatasetLock) (#1276)

### Feature or Bugfix
⚠️ merge after #1258 
- Refactoring

### Detail
As explained in the design for #1123 we are trying to implement a
generic `datasets_base` module that can be used by any type of dataset.

In this small PR:
- we move the generic DatasetLock model to datasets_base
- we move the DatasetLock db operations to the datasets_base DatasetBaseRepository
- we move activity to DatasetBaseRepository

### Relates
- #1123 
- #955 

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@dlpzx dlpzx deleted the feat/generic-dataset-model-refactoring-3 branch May 22, 2024 06:56