Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(L2GPrediction): schema validation #642

Merged
merged 3 commits into from
Jun 14, 2024

Conversation

project-defiant
Copy link
Contributor

@project-defiant project-defiant commented Jun 14, 2024

✨ Context

During the 24.06 data release results check team found feature_matrix published at

gsutil du -sh gs://genetics_etl_python_playground/releases/24.06/locus_to_gene_feature_matrix
10.16 GiB    gs://genetics_etl_python_playground/releases/24.06/locus_to_gene_feature_matrix

The resulting table schema

root
 |-- studyLocusId: long (nullable = true)
 |-- geneId: string (nullable = true)
 |-- distanceTssMean: float (nullable = true)
 |-- distanceTssMinimum: float (nullable = true)
 |-- eqtlColocClppMaximum: float (nullable = true)
 |-- eqtlColocClppMaximumNeighborhood: float (nullable = true)
 |-- eqtlColocLlrMaximum: float (nullable = true)
 |-- eqtlColocLlrMaximumNeighborhood: float (nullable = true)
 |-- pqtlColocClppMaximum: float (nullable = true)
 |-- pqtlColocClppMaximumNeighborhood: float (nullable = true)
 |-- pqtlColocLlrMaximum: float (nullable = true)
 |-- pqtlColocLlrMaximumNeighborhood: float (nullable = true)
 |-- sqtlColocClppMaximum: float (nullable = true)
 |-- sqtlColocClppMaximumNeighborhood: float (nullable = true)
 |-- sqtlColocLlrMaximum: float (nullable = true)
 |-- sqtlColocLlrMaximumNeighborhood: float (nullable = true)
 |-- tuqtlColocClppMaximum: float (nullable = true)
 |-- tuqtlColocClppMaximumNeighborhood: float (nullable = true)
 |-- tuqtlColocLlrMaximum: float (nullable = true)
 |-- tuqtlColocLlrMaximumNeighborhood: float (nullable = true)
 |-- vepMaximum: float (nullable = true)
 |-- vepMaximumNeighborhood: float (nullable = true)
 |-- vepMean: float (nullable = true)
 |-- vepMeanNeighborhood: float (nullable = true)
 |-- studyId: string (nullable = true)
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- beta: double (nullable = true)
 |-- zScore: double (nullable = true)
 |-- pValueMantissa: float (nullable = true)
 |-- pValueExponent: integer (nullable = true)
 |-- effectAlleleFrequencyFromSource: float (nullable = true)
 |-- standardError: double (nullable = true)
 |-- subStudyDescription: string (nullable = true)
 |-- qualityControls: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- finemappingMethod: string (nullable = true)
 |-- credibleSetIndex: integer (nullable = true)
 |-- credibleSetlog10BF: double (nullable = true)
 |-- purityMeanR2: double (nullable = true)
 |-- purityMinR2: double (nullable = true)
 |-- locusStart: integer (nullable = true)
 |-- locusEnd: integer (nullable = true)
 |-- sampleSize: integer (nullable = true)
 |-- ldSet: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- tagVariantId: string (nullable = true)
 |    |    |-- r2Overall: double (nullable = true)
 |-- locus: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- is95CredibleSet: boolean (nullable = true)
 |    |    |-- is99CredibleSet: boolean (nullable = true)
 |    |    |-- logBF: double (nullable = true)
 |    |    |-- posteriorProbability: double (nullable = true)
 |    |    |-- variantId: string (nullable = true)
 |    |    |-- pValueMantissa: float (nullable = true)
 |    |    |-- pValueExponent: integer (nullable = true)
 |    |    |-- beta: double (nullable = true)
 |    |    |-- standardError: double (nullable = true)
 |    |    |-- r2Overall: double (nullable = true)

is different then the expected one in the L2GFeatureMatrix

Dataset object introduces the __post_init__ method which call the schema validation after the construction of the object. In the context of the child class L2GFeatureMatrix the __post_init__ method from it's parent class - Dataset was overwritten without call to the validate_schema method.

Construction of L2GFeatureMatrix did not validate the schema. This issue only persists when feature matrix is construcred by it's default dataclass constructor. it does not exist when trying to read the feature matrix from file with from_parquet method.

Changing this behavior would introduce schema mismatch in the L2GPrediction from_credible_set method, as the schema of join result from L2GFeatureMatrix and credible sets from StudyLocus will append new columns to the object.

To resolve this issues following steps were added as described in next section.

The results for new feature matrix after rerunning:

gsutil du -sh gs://genetics_etl_python_playground/releases/24.06/locus_to_gene_feature_matrix_2 # new
227.07 MiB   gs://genetics_etl_python_playground/releases/24.06/locus_to_gene_feature_matrix_2
gsutil du -sh gs://genetics_etl_python_playground/releases/24.06/locus_to_gene_feature_matrix  # old
10.16 GiB    gs://genetics_etl_python_playground/releases/24.06/locus_to_gene_feature_matrix

🛠 What does this PR implement

  • Addition of validate_schema method call to all classes with __post_init__ method that inherit from Dataset class
  • Selecting only studyLocusId field from credible sets when using them as join filtering to unify the columns
  • Change to L2GFeatureMatrix schema provided to the constructor from L2GPrediction to L2GFeatureMatrix

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@project-defiant project-defiant requested a review from addramir June 14, 2024 10:38
@project-defiant
Copy link
Contributor Author

@addramir this is the fix for the issue we discovered yesterday

Copy link
Contributor

@addramir addramir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to me.

Copy link
Contributor

@ireneisdoomed ireneisdoomed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it makes sense! So we were already validating objects after initialising a dataset, the problem was that some classes were overwriting Dataset's post_init.
Cool, thank you!
I'd suggest renaming the PR title to a fix, rather than a feature.

@project-defiant project-defiant changed the title feat(L2GPrediction): schema validation fix(L2GPrediction): schema validation Jun 14, 2024
@project-defiant project-defiant merged commit 7625a79 into dev Jun 14, 2024
4 checks passed
@project-defiant project-defiant deleted the szsz-fix-feature-matrix-schema branch June 14, 2024 12:29
project-defiant added a commit that referenced this pull request Jun 14, 2024
* feat(dataset): schema mismatch issue
* feat(L2GPrediction): schema unification
* fix: swapped data types

---------

Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk>
project-defiant pushed a commit that referenced this pull request Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants