ADT: Add Biomarker/Pathology GX validation suites #153

Merged: 13 commits, Nov 6, 2024
Changes from 9 commits
2 changes: 2 additions & 0 deletions .gitignore
@@ -139,3 +139,5 @@ test_staging_dir/

# dev config file
dev_config.yaml

.vscode/
18 changes: 18 additions & 0 deletions CONTRIBUTING.md
@@ -181,6 +181,24 @@ This repository is currently home to three custom expectations that were created

These expectations are defined in the `/great_expectations/gx/plugins/expectations` folder. To add more custom expectations, follow the instructions [here](https://docs.greatexpectations.io/docs/guides/expectations/custom_expectations_lp).

#### Nested Columns

If the transform includes nested columns (for example, the `druggability` column in the `gene_info` transform), the following steps are required:
1. The nested column name must be specified in the config under `gx_nested_columns`.
```
gx_nested_columns:
- nested_column_name
```
2. When creating the validator object in the gx_suite_definitions notebook, the nested column(s) must be included in the `nested_columns` list.
```
df = pd.read_json(data_file)
nested_columns = ['nested_column_name']
df = GreatExpectationsRunner.convert_nested_columns_to_json(df, nested_columns)
validator = context.sources.pandas_default.read_dataframe(df)
validator.expectation_suite_name = "suite_name"
```
3. A JSON file containing the expected schema must be added: `src/agoradatatools/great_expectations/gx/json_schemas/transform_name/column_name.json`. Use the [JSON schema tool](https://jsonschema.net/app/schemas/0) to create the schema template for your nested column.
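
The three steps above fit together roughly as follows. This is a minimal, stdlib-only sketch rather than the repository's implementation: `nested_to_json` is a hypothetical stand-in for `GreatExpectationsRunner.convert_nested_columns_to_json`, and the sample record is made up for illustration. It shows the idea behind step 2: each nested value is serialized to a JSON string, which is presumably the form the JSON schema from step 3 is checked against.

```python
import json


def nested_to_json(records, nested_columns):
    """Hypothetical helper: serialize nested values to JSON strings."""
    converted = []
    for record in records:
        row = dict(record)  # shallow copy so the input is left untouched
        for column in nested_columns:
            row[column] = json.dumps(row[column])
        converted.append(row)
    return converted


# Made-up record with one nested column
records = [{"name": "example", "nested_column_name": [{"key": "a", "value": 1}]}]
converted = nested_to_json(records, ["nested_column_name"])

# The nested value is now a string that round-trips through json.loads
restored = json.loads(converted[0]["nested_column_name"])
```

After conversion the nested column holds plain strings, so a string type check and a JSON-schema check can both be applied to it.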

### DockerHub

Rather than using GitHub actions to build and push Docker images to DockerHub, the Docker images are automatically built in DockerHub. This requires the `sagebiodockerhub` GitHub user to be an Admin of this repo. You can view the docker build [here](https://hub.docker.com/r/sagebionetworks/agora-data-tools).
239 changes: 239 additions & 0 deletions gx_suite_definitions/biomarkers.ipynb
@@ -0,0 +1,239 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import synapseclient\n",
"\n",
"import great_expectations as gx\n",
"import pandas as pd\n",
"import json\n",
"\n",
"context = gx.get_context(project_root_dir='../src/agoradatatools/great_expectations')\n",
"\n",
"from agoradatatools.gx import GreatExpectationsRunner\n",
"from expectations.expect_column_values_to_have_list_length import ExpectColumnValuesToHaveListLength\n",
"from expectations.expect_column_values_to_have_list_members import ExpectColumnValuesToHaveListMembers\n",
"from expectations.expect_column_values_to_have_list_members_of_type import ExpectColumnValuesToHaveListMembersOfType"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create Expectation Suite for Biomarkers Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"syn = synapseclient.Synapse()\n",
"syn.login()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"biomarkers_data_file = syn.get(\"syn63540269\").path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Validator Object on Data File"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_json(biomarkers_data_file)\n",
"nested_columns = ['points']\n",
"df = GreatExpectationsRunner.convert_nested_columns_to_json(df, nested_columns)\n",
"validator = context.sources.pandas_default.read_dataframe(df)\n",
"validator.expectation_suite_name = \"biomarkers\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add Expectations to Validator Object For Each Column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# points\n",
"validator.expect_column_values_to_be_of_type(\"points\", \"str\")\n",
"with open(\"../src/agoradatatools/great_expectations/gx/json_schemas/immunohisto/points.json\", \"r\") as file:\n",
" points_schema = json.load(file)\n",
"validator.expect_column_values_to_match_json_schema(\"points\", json_schema=points_schema)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# model\n",
"validator.expect_column_values_to_be_of_type(\"model\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"model\")\n",
"# allows alphanumeric characters, whitespace, asterisks, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"model\", \"^[A-Za-z0-9\\s\\*_.-]+$\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# type\n",
"validator.expect_column_values_to_be_of_type(\"type\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"type\")\n",
"# allows alphanumeric characters, whitespace, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"type\", \"^[A-Za-z0-9\\s_.-]+$\")"
Contributor

@jaclynbeck-sage, Oct 31, 2024

Same question. Even better: do we have a limited list of values that can go in this field (e.g. ["Type A", "Type B", "Type C"...]), or is it a free-for-all?

Member Author

@JessterB do you know the answer to this question?

Contributor

It is not a free-for-all; each of the two datasets has its own set of valid values.

Member Author

What are they, and should we make a list of them instead of using a regex?

Contributor

You should be able to use the unique set of values in the `type` column of the current files as your list. One of the other GX suites that Brad wrote has code that tells GX to compare values against a list of pre-defined values. Off the top of my head I don't remember which suite to point you to, though. Biodomains probably has it?

Contributor

@BWMac, Nov 6, 2024

Yup, you are thinking of `expect_column_values_to_be_in_set`, which is used in this suite (among others).
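
A minimal sketch of that approach, assuming the valid values are taken from the current data file as suggested earlier in this thread. The rows and the `Type A`/`Type B` values are placeholders, not real biomarker types, and `values_not_in_set` is a hypothetical helper that only mimics the check in plain Python. The GX call is shown as a comment because it needs a live validator.

```python
# Placeholder rows standing in for the current data file
records = [
    {"model": "M1", "type": "Type A"},
    {"model": "M2", "type": "Type B"},
    {"model": "M3", "type": "Type A"},
]

# Derive the allowed values from the "type" column, sorted for a stable suite definition
allowed_types = sorted({row["type"] for row in records})


def values_not_in_set(rows, column, value_set):
    """Hypothetical helper: return the values in `column` that fall outside `value_set`."""
    return [row[column] for row in rows if row[column] not in value_set]


# A new row with an unlisted type is reported as unexpected
new_rows = records + [{"model": "M4", "type": "Type Z"}]
unexpected = values_not_in_set(new_rows, "type", allowed_types)

# In the notebook, the equivalent check would be:
# validator.expect_column_values_to_be_in_set("type", value_set=allowed_types)
```

A fixed value set is stricter than the regex: a misspelled type that still matches the character class would be flagged.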

]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# units\n",
"validator.expect_column_values_to_be_of_type(\"units\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"units\")\n",
"# allows alphanumeric characters, forward slashes, whitespace, asterisks, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"units\", \"^[A-Za-z0-9\\/\\s\\*_.-]+$\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# age_death\n",
"validator.expect_column_values_to_be_of_type(\"age_death\", \"int\")\n",
"validator.expect_column_values_to_not_be_null(\"age_death\")\n",
"validator.expect_column_values_to_be_between(\"age_death\", min_value=0, strict_min=True, max_value=100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# tissue\n",
"validator.expect_column_values_to_be_of_type(\"tissue\", \"str\")\n",
"validator.expect_column_values_to_not_be_null(\"tissue\")\n",
"# allows alphanumeric characters, forward slashes, whitespace, asterisks, underscores, periods, and dashes\n",
"validator.expect_column_values_to_match_regex(\"tissue\", \"^[A-Za-z0-9\\/\\s\\*_.-]+$\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# unique entries ExpectSelectColumnValuesToBeUniqueWithinRecord\n",
"validator.expect_select_column_values_to_be_unique_within_record(column_list=[\"model\", \"type\", \"age_death\", \"tissue\", \"units\"])"
Contributor

Let's remove units from this list; we should never have the same model/type/age/tissue with different units...

Member Author

Should I remove it from this list or also from the column-grouping in the transform?

Contributor

Remove it from this multi-value unique expectation only.
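
For reference, a sketch of the two notions of uniqueness in play here, in plain Python with made-up rows. As far as I understand the GX semantics, `expect_select_column_values_to_be_unique_within_record` checks that the listed columns hold distinct values within each row, whereas uniqueness of the column combination across rows is what `expect_compound_columns_to_be_unique` checks; this is worth confirming against the GX docs. The helper names and the tissue and unit values below are hypothetical.

```python
from collections import Counter

columns = ["model", "type", "age_death", "tissue"]

# Made-up rows: the first two share the same model/type/age_death/tissue combination
rows = [
    {"model": "M1", "type": "T1", "age_death": 4, "tissue": "cortex", "units": "pg/mg"},
    {"model": "M1", "type": "T1", "age_death": 4, "tissue": "cortex", "units": "ng/mg"},
    {"model": "M2", "type": "T1", "age_death": 4, "tissue": "cortex", "units": "pg/mg"},
]


def unique_within_record(row, cols):
    """True when no two of the listed columns hold the same value in this row."""
    values = [row[c] for c in cols]
    return len(values) == len(set(values))


def duplicated_combinations(rows, cols):
    """Return the column-value combinations that occur in more than one row."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return [combo for combo, n in counts.items() if n > 1]


within_record_ok = all(unique_within_record(r, columns) for r in rows)
duplicates = duplicated_combinations(rows, columns)
```

Every row passes the within-record check here, yet the first two rows repeat the same combination with different units, which is the case the comment above wants to rule out.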

]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"validator.save_expectation_suite(discard_failed_expectations=False)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Checkpoint and View Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"checkpoint = context.add_or_update_checkpoint(\n",
" name=\"agora-test-checkpoint\",\n",
" validator=validator,\n",
")\n",
"checkpoint_result = checkpoint.run()\n",
"context.view_validation_result(checkpoint_result)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build Data Docs - Click on Expectation Suite to View All Expectations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"context.build_data_docs()\n",
"context.open_data_docs()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
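
The regex expectations in the notebook above can be sanity-checked outside GX. This is a minimal sketch using Python's `re` module with made-up sample values. The pattern is the one used for `units` and `tissue`, written here as a raw string so that `\s` and `\*` reach the regex engine without relying on Python passing unrecognized string escapes through unchanged.

```python
import re

# Same character class as the units/tissue expectations, as a raw string
pattern = re.compile(r"^[A-Za-z0-9\/\s\*_.-]+$")

# Made-up sample values: three that should pass, two that should fail
samples = ["pg/mg", "Relative density*", "cerebral cortex", "bad;value", ""]
results = {value: bool(pattern.match(value)) for value in samples}
```

The empty string fails because `+` requires at least one character, and `bad;value` fails because `;` is not in the character class.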