ADT: Add Biomarker/Pathology GX validation suites #153

beatrizsaldana · 2024-10-29T17:29:01Z

Adds Great Expectations (GX) validation suites for the Model AD Biomarker and Pathology data transforms. Because points is a nested column, a points.json schema file was added so the column can be validated with the expect_column_values_to_match_json_schema() function.
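To make the per-cell check concrete, here is a minimal sketch of what validating a stringified nested column against a schema amounts to. It is a simplified stand-in, not GX's implementation: it only understands `type`, `required`, `properties`, and `items`, whereas GX relies on a full JSON Schema validator. The schema and the field names `genotype` and `value` are hypothetical, since the real points.json is not shown in this thread.

```python
import json

# Hypothetical schema: the real points.json is not shown in this PR excerpt.
POINTS_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["genotype", "value"],
        "properties": {
            "genotype": {"type": "string"},
            "value": {"type": "number"},
        },
    },
}

# Python types accepted for each schema type this sketch supports.
_TYPES = {"array": list, "object": dict, "string": str, "number": (int, float)}


def conforms(value, schema):
    """Check a value against a tiny subset of JSON Schema."""
    expected = schema.get("type")
    if expected is not None:
        # bool subclasses int in Python and is not a supported type here,
        # so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _TYPES[expected]):
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub_schema):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    return True


def cell_matches(cell, schema):
    """A cell passes if it parses as JSON and the parsed value conforms."""
    try:
        return conforms(json.loads(cell), schema)
    except (TypeError, ValueError):
        return False


good = json.dumps([{"genotype": "5xFAD", "value": 1.25}])
bad = json.dumps([{"genotype": "5xFAD"}])  # missing "value"
print(cell_matches(good, POINTS_SCHEMA), cell_matches(bad, POINTS_SCHEMA))  # True False
```

The point of the sketch is that each cell must first be a JSON-parsable string before any schema can be applied to it.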

BWMac

@beatrizsaldana can you please link to the GX reports generated by the new suites?

gx_suite_definitions/biomarkers.ipynb

jaclynbeck-sage · 2024-10-31T20:25:05Z

gx_suite_definitions/biomarkers.ipynb

+    "validator.expect_column_values_to_be_of_type(\"type\", \"str\")\n",
+    "validator.expect_column_values_to_not_be_null(\"type\")\n",
+    "# allows all alphanumeric characters, whitespace, underscores, periods, and dashes\n",
+    "validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s_.-]+$\")"


Same question. Even better: do we have a limited list of values that can go in this field (e.g. ["Type A", "Type B", "Type C", ...]), or is it a free-for-all?

@JessterB do you know the answer to this question?

It is not a free-for-all; each of the two datasets has a unique set of valid values.

What are they and should we make a list of them instead of using regex?

You should be able to use the unique set of values in the type column of the current files as your list. One of the other GX suites that Brad wrote has code that tells GX to compare values against a list of predefined values. Off the top of my head I don't remember which suite to point you to, though. Biodomains probably has it?

Yup, you are thinking of expect_column_values_to_be_in_set which is used in this suite (among others).
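For illustration, a small sketch of why a value list is stricter than the regex. The rows and type names are made up; the real list would come from the unique values in the current files. The GX call appears only as a comment, since this snippet runs without a validator.

```python
import re

# Hypothetical rows; the real values come from the current biomarkers/pathology files.
rows = [
    {"model": "5xFAD", "type": "Insoluble Abeta40"},
    {"model": "5xFAD", "type": "Soluble Abeta42"},
    {"model": "3xTg-AD", "type": "NfL"},
]

# Build the allowed list from the values already present in the column.
allowed_types = sorted({row["type"] for row in rows})

# The permissive pattern from the original suite.
pattern = re.compile(r"^[A-Za-z0-9\s_.-]+$")


def unexpected_values(values, allowed):
    """Return the values that are not in the allowed set, preserving order."""
    allowed = set(allowed)
    return [v for v in values if v not in allowed]


candidate = "Insoluble Abeta4O"  # typo: letter O instead of zero
print(bool(pattern.match(candidate)))                 # True: the regex accepts the typo
print(unexpected_values([candidate], allowed_types))  # ['Insoluble Abeta4O']: the set check flags it

# In a suite notebook the corresponding call would be along the lines of:
# validator.expect_column_values_to_be_in_set("type", value_set=allowed_types)
```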

gx_suite_definitions/biomarkers.ipynb

jaclynbeck-sage · 2024-10-31T20:30:21Z

gx_suite_definitions/pathology.ipynb

+    "validator.expect_column_values_to_be_of_type(\"model\", \"str\")\n",
+    "validator.expect_column_values_to_not_be_null(\"model\")\n",
+    "# allows all alphanumeric characters, whitespace, parentheses, asterisks, underscores, periods, and dashes\n",
+    "validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s\\(\\)\\*_.-]+$\")\n"


Assume my questions in the biomarkers file also apply here.

src/agoradatatools/great_expectations/gx/json_schemas/immunohisto/points.json

gx_suite_definitions/biomarkers.ipynb

jaclynbeck-sage · 2024-10-31T21:44:25Z

CONTRIBUTING.md

+#### Nested Columns
+
+If the transform includes nested columns (example: `druggability` column in `gene_info` transform), please follow these four steps:
+1. Add the nested column name to the `gx_nested_columns` flag for the specific transform. This will convert the column values to a JSON-parsable string.


Mention that this goes in the config file.
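As a rough picture of what "convert the column values to a JSON-parsable string" means, here is a minimal sketch. The `stringify_nested` helper and the record's field names are hypothetical and only stand in for whatever the tool does with the columns listed under `gx_nested_columns`.

```python
import json

# Hypothetical record with a nested column; field names are illustrative only.
record = {
    "model": "5xFAD",
    "points": [{"genotype": "5xFAD", "value": 1.25}],
}

# Stands in for the columns listed under gx_nested_columns in the config file.
NESTED_COLUMNS = ["points"]


def stringify_nested(row, nested_columns):
    """Return a copy of the row with each nested column serialized to a JSON string."""
    out = dict(row)
    for column in nested_columns:
        out[column] = json.dumps(row[column])
    return out


flat = stringify_nested(record, NESTED_COLUMNS)
print(type(flat["points"]).__name__)                   # str
print(json.loads(flat["points"]) == record["points"])  # True
```

Once the column holds strings, a per-value JSON schema expectation can parse each cell and check it.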

gx_suite_definitions/biomarkers.ipynb

JessterB · 2024-11-01T18:41:41Z

gx_suite_definitions/biomarkers.ipynb

+    "validator.expect_column_values_to_be_of_type(\"type\", \"str\")\n",
+    "validator.expect_column_values_to_not_be_null(\"type\")\n",
+    "# allows all alphanumeric characters, whitespace, underscores, periods, and dashes\n",
+    "validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s_.-]+$\")"


It is not a free-for-all; each of the two datasets has a unique set of valid values.

gx_suite_definitions/biomarkers.ipynb

JessterB · 2024-11-01T18:47:27Z

gx_suite_definitions/biomarkers.ipynb

+   "outputs": [],
+   "source": [
+    "# unique entries ExpectSelectColumnValuesToBeUniqueWithinRecord\n",
+    "validator.expect_select_column_values_to_be_unique_within_record(column_list=[\"model\", \"type\", \"age_death\", \"tissue\", \"units\"])"


Let's remove units from this list; we should never have the same model/type/age/tissue with different units...

Should I remove it from this list or also from the column-grouping in the transform?

Remove it from this multi-value unique expectation only.
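A quick sketch of the reasoning, using made-up rows: with units in the key, two rows that differ only in units look distinct, so the problem goes unnoticed. Without units, the collision surfaces. The `duplicated_keys` helper is hypothetical and plain Python, not a GX call.

```python
from collections import Counter

# Hypothetical rows; the first two differ only in units.
rows = [
    {"model": "5xFAD", "type": "NfL", "age_death": 4, "tissue": "plasma", "units": "pg/mL"},
    {"model": "5xFAD", "type": "NfL", "age_death": 4, "tissue": "plasma", "units": "ng/mL"},
    {"model": "5xFAD", "type": "NfL", "age_death": 12, "tissue": "plasma", "units": "pg/mL"},
]


def duplicated_keys(rows, key_columns):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


with_units = ["model", "type", "age_death", "tissue", "units"]
without_units = ["model", "type", "age_death", "tissue"]

print(duplicated_keys(rows, with_units))     # []
print(duplicated_keys(rows, without_units))  # [('5xFAD', 'NfL', 4, 'plasma')]
```

One thing that may be worth double-checking: as far as I recall from the GX docs, expect_select_column_values_to_be_unique_within_record compares the listed columns against each other within a single row, while expect_compound_columns_to_be_unique checks that the combination is unique across rows. The sketch above illustrates the across-rows intent described in this thread.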

src/agoradatatools/great_expectations/gx/json_schemas/immunohisto/points.json

JessterB · 2024-11-06T18:12:55Z

gx_suite_definitions/biomarkers.ipynb

+   "outputs": [],
+   "source": [
+    "# unique entries ExpectSelectColumnValuesToBeUniqueWithinRecord\n",
+    "validator.expect_select_column_values_to_be_unique_within_record(column_list=[\"model\", \"type\", \"age_death\", \"tissue\", \"units\"])"


Remove it from this multi-value unique expectation only.

gx_suite_definitions/pathology.ipynb

sonarcloud · 2024-11-06T19:03:13Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

JessterB

lgtm!

beatrizsaldana and others added 6 commits October 18, 2024 15:07

Added biomarkers notebook to gx expectations

7431a6a

Stored expectation results

8869e1b

gx validation failing, committing to save work

735678c

I cannot get this to work: points has failed expectations expect_column_values_to_match_json_schema

576f7f4

Added gx_enabled: true to pathology dataset in modelad_test_config.yaml

38447dd

Updating pathology gx expectations

7099261

beatrizsaldana changed the title ~~Beatrizsaldana/mg 108/gx validation suites~~ ADT: Add Biomarker/Pathology GX validation suites Oct 29, 2024

Beatriz Saldana added 3 commits October 31, 2024 11:49

Updated gx validation for biomarkrs and pathology, 100% success when run in notebook, error when run with adt --upload

57e9e60

Added gx_nested_columns to config

b924940

Added gx_nested_columns flag to config and updated documentation to specify how to add nested columns

e94573c

beatrizsaldana marked this pull request as ready for review October 31, 2024 19:53

beatrizsaldana requested review from BWMac, BryanFauble, JessterB and jaclynbeck-sage October 31, 2024 19:53

BWMac reviewed Oct 31, 2024

jaclynbeck-sage reviewed Oct 31, 2024

gx_suite_definitions/biomarkers.ipynb (outdated)

jaclynbeck-sage reviewed Oct 31, 2024

gx_suite_definitions/biomarkers.ipynb

jaclynbeck-sage reviewed Oct 31, 2024

gx_suite_definitions/biomarkers.ipynb (outdated)

jaclynbeck-sage reviewed Oct 31, 2024

src/agoradatatools/great_expectations/gx/json_schemas/immunohisto/points.json

Updated documentation

ebe63a0

jaclynbeck-sage reviewed Oct 31, 2024

gx_suite_definitions/biomarkers.ipynb

jaclynbeck-sage reviewed Oct 31, 2024

thomasyu888 removed the request for review from BryanFauble October 31, 2024 23:12

BWMac requested a review from a team November 1, 2024 18:00

JessterB requested changes Nov 1, 2024

BWMac self-requested a review November 4, 2024 15:19

BWMac mentioned this pull request Nov 5, 2024

[AG-1455] Documentation Refresh #156

Merged

Beatriz Saldana added 2 commits November 6, 2024 09:33

Created a notebook to obtain unique values for specified fields and used those unique values for gx validation for biomarkers and pathology datasets

1a19f86

pre-commit oops

7d4564a

beatrizsaldana requested review from JessterB and jaclynbeck-sage November 6, 2024 17:46

JessterB reviewed Nov 6, 2024

Removing units from expect_select_column_values_to_be_unique_within_record to address PR comment

9580c6f

beatrizsaldana requested a review from JessterB November 6, 2024 19:04

JessterB approved these changes Nov 6, 2024

beatrizsaldana merged commit 6fd726d into dev Nov 6, 2024
8 checks passed
