fix(L2GPrediction): schema validation #642
@addramir this is the fix for the issue we discovered yesterday
All looks good to me.
Now it makes sense! So we were already validating objects after initialising a dataset, the problem was that some classes were overwriting Dataset's post_init.
Cool, thank you!
I'd suggest renaming the PR title to a fix, rather than a feature.
* feat(dataset): schema mismatch issue
* feat(L2GPrediction): schema unification
* fix: swapped data types

---------

Co-authored-by: Szymon Szyszkowski <ss60@mib117351s.internal.sanger.ac.uk>
This reverts commit 9b2cb5a.
✨ Context
During the 24.06 data release results check, the team found that the schema of the published `feature_matrix` table differs from the one expected by `L2GFeatureMatrix`.

The `Dataset` object introduces a `__post_init__` method that calls schema validation after the object is constructed. In the child class `L2GFeatureMatrix`, the `__post_init__` method of its parent class, `Dataset`, was overridden without a call to the `validate_schema` method. As a result, constructing an `L2GFeatureMatrix` did not validate the schema. The issue only occurs when the feature matrix is constructed with its default dataclass constructor; it does not occur when the feature matrix is read from a file with the `from_parquet` method.

Changing this behavior on its own would introduce a schema mismatch in the `L2GPrediction` `from_credible_set` method, because joining the `L2GFeatureMatrix` with credible sets from `StudyLocus` appends new columns to the object.

To resolve these issues, the steps described in the next section were added.
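The override problem described above can be reproduced with plain dataclasses. The following is only a minimal sketch: the `Dataset` class here is a toy stand-in, and `BrokenFeatureMatrix`, `FixedFeatureMatrix`, `rows`, `expected_columns`, and `label` are hypothetical names chosen for illustration, not the project's real classes or fields.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Toy stand-in for the parent class (assumption, not the real implementation)."""

    rows: list
    expected_columns: tuple = ("studyLocusId", "geneId")

    def __post_init__(self) -> None:
        # The parent validates the schema right after construction.
        self.validate_schema()

    def validate_schema(self) -> None:
        # Reject any row that carries a column outside the expected schema.
        for row in self.rows:
            unexpected = set(row) - set(self.expected_columns)
            if unexpected:
                raise ValueError(f"Unexpected columns: {sorted(unexpected)}")


@dataclass
class BrokenFeatureMatrix(Dataset):
    """Hypothetical child that overrides __post_init__ and forgets to validate."""

    def __post_init__(self) -> None:
        # The parent's hook is replaced, so validate_schema() never runs.
        self.label = "feature matrix"


@dataclass
class FixedFeatureMatrix(Dataset):
    """Hypothetical child that overrides __post_init__ and still validates."""

    def __post_init__(self) -> None:
        self.label = "feature matrix"
        # Explicit call restores validation on construction.
        self.validate_schema()


bad_rows = [{"studyLocusId": "s1", "geneId": "g1", "extraColumn": 1}]

# Constructs without complaint even though the schema is wrong.
broken = BrokenFeatureMatrix(rows=bad_rows)

# Raises ValueError because validation runs again.
try:
    FixedFeatureMatrix(rows=bad_rows)
except ValueError as error:
    print(f"Validation caught the mismatch: {error}")
```

Calling `super().__post_init__()` in the child would have the same effect in this sketch; the explicit `validate_schema()` call mirrors the wording of this PR.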
The results for the new feature matrix after rerunning:
🛠 What does this PR implement
* `validate_schema` method call to all classes with a `__post_init__` method that inherit from the `Dataset` class
* `studyLocusId` field from credible sets when using them as a join filter, to unify the columns

🙈 Missing