ADT: Add Biomarker/Pathology GX validation suites #153
…mn_values_to_match_json_schema
…run in notebook, error when run with adt --upload
…pecify how to add nested columns
@beatrizsaldana can you please link to the GX reports generated by the new suites?
"validator.expect_column_values_to_be_of_type(\"type\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"type\")\n",
"# allows all alphanumeric characters, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s_.-]+$\")"
Same question. Even better: do we have a limited list of values that can go in this field (e.g. ["Type A", "Type B", "Type C"...]), or is it a free-for-all?
@JessterB do you know the answer to this question?
It is not a free-for-all; each of the two datasets has a unique set of valid values.
What are they and should we make a list of them instead of using regex?
You should be able to use the unique set of values in the type column in the current files as your list. There's code in one of the other GX suites that Brad wrote that tells GX to compare values to a list of predefined values. Off the top of my head, I don't remember which suite to point you to, though. Biodomains probably has it?
Yup, you are thinking of expect_column_values_to_be_in_set, which is used in this suite (among others).
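For anyone following along, here is a rough sketch in plain Python (not GX itself) of why a fixed value set is stricter than the regex in the suite. The values in ALLOWED_TYPES and the misspelled entry are made up for illustration; the real list would come from the unique entries in the type column of the current files.

```python
import re

# Hypothetical values for illustration only; the real list would be the
# unique entries in the "type" column of the current data files.
ALLOWED_TYPES = {"Type A", "Type B", "Type C"}

# The permissive pattern quoted from the suite under review.
TYPE_PATTERN = re.compile(r"^[A-Za-z0-9\s_.-]+$")


def unexpected_by_regex(values):
    """Return the values that the regex check would flag."""
    return [v for v in values if not TYPE_PATTERN.match(v)]


def unexpected_by_set(values, allowed=ALLOWED_TYPES):
    """Return the values that a fixed value-set check would flag."""
    return [v for v in values if v not in allowed]


# "Tyep C" is a deliberate misspelling: it is well-formed text, so the
# regex accepts it, but it is not one of the allowed values.
observed = ["Type A", "Type B", "Tyep C"]
regex_failures = unexpected_by_regex(observed)
set_failures = unexpected_by_set(observed)

# In the suite notebook the equivalent expectation would be something like:
# validator.expect_column_values_to_be_in_set("type", sorted(ALLOWED_TYPES))
```

The regex only checks which characters are used, so a misspelled value slips through, while the set check catches it.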
gx_suite_definitions/pathology.ipynb
Outdated
"validator.expect_column_values_to_be_of_type(\"model\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"model\")\n",
"# allows all alphanumeric characters, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s\\(\\)\\*_.-]+$\")\n"
Assume my questions in the biomarkers file also apply here.
src/agoradatatools/great_expectations/gx/json_schemas/immunohisto/points.json
CONTRIBUTING.md
Outdated
#### Nested Columns

If the transform includes nested columns (example: `druggability` column in `gene_info` transform), please follow these four steps:
1. Add the nested column name to the `gx_nested_columns` flag for the specific transform. This will convert the column values to a JSON parsable string.
Mention that this goes in the config file.
"outputs": [],
"source": [
"# unique entries ExpectSelectColumnValuesToBeUniqueWithinRecord\n",
"validator.expect_select_column_values_to_be_unique_within_record(column_list=[\"model\", \"type\", \"age_death\", \"tissue\", \"units\"])"
Let's remove units from this list; we should never have the same model/type/age/tissue with different units...
Should I remove it from this list or also from the column-grouping in the transform?
Remove it from this multi-value unique expectation only.
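To make the intent concrete: as I read the comment above, the combination of model, type, age_death, and tissue should identify a row, so two rows that differ only in units are a data problem. Below is a minimal plain-Python sketch of that row-level check, with made-up records. It is not the GX implementation. As far as I can tell from the GX docs, expect_select_column_values_to_be_unique_within_record compares the listed columns' values within each record, while expect_compound_columns_to_be_unique checks that the combination is unique across rows, so it may be worth double-checking which one matches the intent here.

```python
from collections import Counter

# Columns that, per the review comment, should jointly identify a row.
KEY_COLUMNS = ("model", "type", "age_death", "tissue")


def duplicated_keys(records, key_columns=KEY_COLUMNS):
    """Return every key combination that appears in more than one record."""
    counts = Counter(tuple(r[c] for c in key_columns) for r in records)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical records for illustration only. The first two rows share
# model/type/age_death/tissue but disagree on units.
records = [
    {"model": "M1", "type": "Type A", "age_death": 4, "tissue": "cortex", "units": "pg/mg"},
    {"model": "M1", "type": "Type A", "age_death": 4, "tissue": "cortex", "units": "ng/mg"},
    {"model": "M1", "type": "Type B", "age_death": 4, "tissue": "cortex", "units": "pg/mg"},
]
conflicts = duplicated_keys(records)
```

If units stayed in the key, the first two rows would count as distinct and the conflict would go unnoticed.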
src/agoradatatools/great_expectations/gx/json_schemas/immunohisto/points.json
…sed those unique values for gx validation for biomarkers and pathology datasets
…ecord to address PR comment
Quality Gate passed
lgtm!
Jira Ticket
Adding Great Expectations validation for the Model AD Biomarker and Pathology data transforms. The points column is a nested column, so a points.json file was made to validate the column using the expect_column_values_to_match_json_schema() function.
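A rough sketch of the nested-column idea, for readers unfamiliar with it: the nested value is serialized to a JSON string, and each string is then checked against an expected shape. The field names below are hypothetical and the real points.json may look quite different. The check here is a hand-rolled stand-in, not a JSON Schema validator and not what GX runs internally.

```python
import json

# Hypothetical shape of one entry in the nested "points" column.
# The actual points.json schema in this PR may differ.
REQUIRED_FIELDS = {"genotype": str, "sex": str, "value": (int, float)}


def to_json_string(nested_value):
    """Serialize a nested column value so it can be validated as JSON text."""
    return json.dumps(nested_value)


def violations(json_text, required=REQUIRED_FIELDS):
    """Return a list of problems found in a JSON-encoded list of point objects."""
    points = json.loads(json_text)
    if not isinstance(points, list):
        return ["top-level value is not a list"]
    problems = []
    for i, point in enumerate(points):
        if not isinstance(point, dict):
            problems.append(f"points[{i}] is not an object")
            continue
        for field, expected_type in required.items():
            if field not in point:
                problems.append(f"points[{i}] is missing '{field}'")
            elif not isinstance(point[field], expected_type):
                problems.append(f"points[{i}].{field} has the wrong type")
    return problems


# Made-up example values: one well-formed cell and one malformed cell.
good = to_json_string([{"genotype": "G1", "sex": "female", "value": 1.5}])
bad = to_json_string([{"genotype": "G1", "value": "high"}])
good_problems = violations(good)
bad_problems = violations(bad)

# In the suite notebook, the real check would pass the loaded schema to GX:
# validator.expect_column_values_to_match_json_schema("points", json_schema=schema)
```

In the actual suite, the schema file does this work declaratively, so the expected shape lives in points.json rather than in code.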