Commit ed02417: updates documentation
BWMac committed Nov 5, 2024 · 1 parent b189c1f
Showing 4 changed files with 109 additions and 78 deletions.
28 changes: 25 additions & 3 deletions CONTRIBUTING.md
@@ -167,9 +167,9 @@ This package uses [Great Expectations](https://greatexpectations.io/) to validat

1. Create a new expectation suite by defining the expectations for the dataset in a Jupyter Notebook inside the `gx_suite_definitions` folder. Use `metabolomics.ipynb` as an example. You can find a catalog of existing expectations [here](https://greatexpectations.io/expectations/).
1. Run the notebook to generate the new expectation suite. It should populate as a JSON file in the `/great_expectations/expectations` folder.
1. Add support for running Great Expectations on a dataset by adding `gx_enabled: true` to the configuration for the dataset in both `test_config.yaml` and `config.yaml`. After updating the config files, reports should be uploaded to the proper locations ([Prod](https://www.synapse.org/#!Synapse:syn52948668), [Testing](https://www.synapse.org/#!Synapse:syn52948670)) when data processing is complete.
- You can prevent Great Expectations from running for a dataset by removing `gx_enabled: true` from the configuration for the dataset.
1. Test data processing by running `adt test_config.yaml` and ensure that HTML reports with all expectations are generated and uploaded to the proper folder in Synapse.
1. Add support for running Great Expectations on a dataset by adding `gx_enabled: true` to the configuration for the dataset in both `test_config.yaml` and `config.yaml`. Ensure that the `gx_folder` and `gx_table` keys are present in the configuration file and contain valid Synapse IDs for the GX reports and GX table, respectively.
- You can prevent Great Expectations from running for a dataset by setting `gx_enabled: false` in the configuration for the dataset.
1. Test data processing by running `adt test_config.yaml --upload` and ensure that HTML reports with all expectations are generated and uploaded to the proper folder in Synapse.
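
The configuration change described above might look like the following sketch; the dataset name and Synapse IDs are placeholders, not real entries from the config files:

```yaml
# Top-level keys: required whenever any dataset has gx_enabled: true.
gx_folder: syn00000001   # placeholder - folder that receives the HTML GX reports
gx_table: syn00000002    # placeholder - table that receives GX reporting rows

datasets:
  - my_dataset:          # hypothetical dataset name
      # ... files, provenance, destination, etc. ...
      gx_enabled: true   # set to false to skip GX validation for this dataset
```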

#### Custom Expectations

@@ -181,6 +181,28 @@ This repository is currently home to three custom expectations that were created

These expectations are defined in the `/great_expectations/gx/plugins/expectations` folder. To add more custom expectations, follow the instructions [here](https://docs.greatexpectations.io/docs/guides/expectations/custom_expectations_lp).

#### Nested Columns

If the transform includes nested columns (for example, the `druggability` column in the `gene_info` transform), please follow these steps:
1. Add the nested column name to the `gx_nested_columns` key in the configuration file for the specific transform. This will convert the column values to a JSON-parsable string.
```yaml
gx_nested_columns:
- <nested_column_name>
```
1. When creating the validator object in the gx_suite_definitions notebook, the nested column(s) must be included in the `nested_columns` list.
```python
df = pd.read_json(<data_file>)
nested_columns = ['<nested_column_name>']
df = GreatExpectationsRunner.convert_nested_columns_to_json(df, nested_columns)
validator = context.sources.pandas_default.read_dataframe(df)
validator.expectation_suite_name = "<suite_name>"
```
1. When validating the value type of the nested column, make sure to specify it as a string (see Step 1 for reasoning):
```python
validator.expect_column_values_to_be_of_type("<nested_column_name>", "str")
```
1. A JSON file containing the expected schema must be added here: `src/agoradatatools/great_expectations/gx/json_schemas/<transform_name>/<column_name>.json`. Use the [JSON schema tool](https://jsonschema.net/app/schemas/0) to create the schema template for your nested column.
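
The conversion behind these steps can be sketched as follows. The `convert_nested_columns_to_json` function below is a hypothetical stand-in for the real `GreatExpectationsRunner` helper, assumed here to serialize each nested value with `json.dumps`; the column and data are made up for illustration:

```python
import json

import pandas as pd


def convert_nested_columns_to_json(
    df: pd.DataFrame, nested_columns: list
) -> pd.DataFrame:
    """Hypothetical stand-in: serialize each nested (dict/list) value to a
    JSON string so GX can validate the column as type "str" against a
    JSON schema."""
    for col in nested_columns:
        df[col] = df[col].apply(json.dumps)
    return df


# Placeholder data with one nested column.
df = pd.DataFrame(
    {
        "gene": ["A", "B"],
        "druggability": [{"score": 1}, {"score": 2}],
    }
)
df = convert_nested_columns_to_json(df, ["druggability"])
print(df["druggability"].tolist())  # ['{"score": 1}', '{"score": 2}']
```

After the conversion, the column holds plain strings, which is why the expectation in the step above checks for type `"str"` rather than a dict.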

### DockerHub

Rather than using GitHub actions to build and push Docker images to DockerHub, the Docker images are automatically built in DockerHub. This requires the `sagebiodockerhub` GitHub user to be an Admin of this repo. You can view the docker build [here](https://hub.docker.com/r/sagebionetworks/agora-data-tools).
11 changes: 10 additions & 1 deletion README.md
@@ -123,12 +123,21 @@ python -m pytest
Parameters:
- `destination`: Defines the default target location (folder) that the generated json files are written to; this value can be overridden on a per-dataset basis
- `staging_path`: Defines the location of the staging folder that the generated json files are written to
- `gx_folder`: Defines the Synapse ID of the folder that generated GX reports are written to
- `gx_folder`: Defines the Synapse ID of the folder that generated GX reports are written to. This key must always be present in the config file and must hold a valid Synapse ID whenever `gx_enabled` is set to `true` for any dataset. If the key is missing from the config file, or if it is set to `none` while `gx_enabled` is `true` for any dataset, an error will be thrown.
- `gx_table`: Defines the Synapse ID of the table that generated GX reporting is posted to. This key must always be present in the config file and must hold a valid Synapse ID whenever `gx_enabled` is set to `true` for any dataset. If the key is missing from the config file, or if it is set to `none` while `gx_enabled` is `true` for any dataset, an error will be thrown.
- `sources/<source>`: Source files for each dataset are defined in the `sources` section of the config file.
- `sources/<source>/<source>_files`: A list of source file information for the dataset.
- `sources/<source>/<source>_files/name`: The name of the source file/dataset.
- `sources/<source>/<source>_files/id`: The Synapse ID of the source file. Dot notation is supported to indicate the version of the file to use.
- `sources/<source>/<source>_files/format`: The format of the source file.
- `datasets/<dataset>`: Each generated json file is named `<dataset>.json`
- `datasets/<dataset>/files`: A list of source files for the dataset
- `name`: The name of the source file (this name is the reference the code will use to retrieve a file from the configuration)
- `id`: Synapse id of the file
- `format`: The format of the source file
- `datasets/<dataset>/final_format`: The format of the generated output file.
- `datasets/<dataset>/gx_enabled`: Whether or not GX validation should be run on the dataset. `true` will run GX validation, `false` or the absence of this key will skip GX validation.
- `datasets/<dataset>/gx_nested_columns`: A list of nested columns that should be validated using GX nested validation. Omitting this key, or supplying an invalid list of columns, will result in an error because the nested fields will not be converted to a JSON-parsable string prior to validation. This key is not needed if `gx_enabled` is not set to `true` or if the dataset has no nested fields.
- `datasets/<dataset>/provenance`: The Synapse id of each entity that the dataset is derived from, used to populate the generated file's Synapse provenance. (The Synapse API calls this "Activity")
- `datasets/<dataset>/destination`: Override the default destination for a specific dataset by specifying a synID, or use `*dest` to use the default destination
- `datasets/<dataset>/column_rename`: Columns to be renamed prior to data transformation
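
Putting these keys together, a minimal config might look like the following sketch; every name and Synapse ID here is a placeholder, not a real entry:

```yaml
destination: &dest syn00000000      # placeholder Synapse IDs throughout
staging_path: ./staging
gx_folder: syn00000001
gx_table: syn00000002
datasets:
  - my_dataset:                     # output file will be my_dataset.json
      files:
        - name: my_source_file
          id: syn00000003.2         # dot notation pins the file version
          format: csv
      final_format: json
      gx_enabled: true
      gx_nested_columns:
        - my_nested_column
      provenance:
        - syn00000003.2
      destination: *dest            # *dest uses the default destination
```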
12 changes: 6 additions & 6 deletions src/agoradatatools/process.py
@@ -309,7 +309,7 @@ def process_all_files(
"LOCAL",
"--platform",
"-p",
help="Platform that is running the process. Must be one of LOCAL, GITHUB, or NEXTFLOW (Optional).",
help="Platform that is running the process. Must be one of LOCAL, GITHUB, or NEXTFLOW (Optional, defaults to LOCAL).",
show_default=True,
)
run_id_opt = Option(
@@ -323,17 +323,17 @@
False,
"--upload",
"-u",
help="Toggles whether or not files will be uploaded to Synapse. The absence of this option means "
"that neither output data files nor GX reports will be uploaded to Synapse. Setting "
"`--upload` in the command will cause both to be uploaded. This option is used to control "
"the upload behavior of the process.",
help="Boolean value that toggles whether or not files will be uploaded to Synapse. The absence of this option means "
"`False` - that neither output data files nor GX reports will be uploaded to Synapse. Setting "
"`--upload` in the command will cause both to be uploaded. (Optional, defaults to False)",
show_default=True,
)
synapse_auth_opt = Option(
None,
"--token",
"-t",
help="Synapse authentication token. Defaults to environment variable $SYNAPSE_AUTH_TOKEN via syn.login() functionality",
help="Synapse authentication token. (Required, defaults to environment variable SYNAPSE_AUTH_TOKEN via syn.login() functionality "
"https://python-docs.synapse.org/reference/client/?h=syn.login#synapseclient.Synapse.login)",
show_default=False,
)

Expand Down
136 changes: 68 additions & 68 deletions test_config.yaml
@@ -1,6 +1,6 @@
destination: &dest syn17015333
staging_path: ./staging
gx_folder: syn52948670
gx_folder: none
gx_table: syn60527065
sources:
- genes_biodomains:
@@ -141,73 +141,73 @@ datasets:
destination: *dest
gx_enabled: true

- gene_info:
files:
- name: gene_metadata
id: syn25953363.13
format: feather
- name: igap
id: syn12514826.5
format: csv
- name: eqtl
id: syn12514912.3
format: csv
- <<: *agora_proteomics_files
- <<: *agora_proteomics_tmt_files
- <<: *agora_proteomics_srm_files
- <<: *rna_diff_expr_data_files
- name: target_list
id: syn12540368.51
format: csv
- name: median_expression
id: syn27211878.2
format: csv
- name: druggability
id: syn13363443.11
format: csv
- <<: *genes_biodomains_files
- name: tep_adi_info
id: syn51942280.3
format: csv
final_format: json
custom_transformations:
adjusted_p_value_threshold: 0.05
protein_level_threshold: 0.05
column_rename:
ensg: ensembl_gene_id
ensembl_id: ensembl_gene_id
geneid: ensembl_gene_id
has_eqtl: is_eqtl
minimumlogcpm: min
quartile1logcpm: first_quartile
medianlogcpm: median
meanlogcpm: mean
quartile3logcpm: third_quartile
maximumlogcpm: max
possible_replacement: ensembl_possible_replacements
permalink: ensembl_permalink
provenance:
- syn25953363.13
- syn12514826.5
- syn12514912.3
- *agora_proteomics_provenance
- *agora_proteomics_tmt_provenance
- *agora_proteomics_srm_provenance
- *rna_diff_expr_data_provenance
- syn12540368.51
- syn27211878.2
- syn13363443.11
- *genes_biodomains_provenance
- syn51942280.3
agora_rename:
symbol: hgnc_symbol
destination: *dest
gx_enabled: true
gx_nested_columns:
- target_nominations
- median_expression
- druggability
- ensembl_info
# - gene_info:
# files:
# - name: gene_metadata
# id: syn25953363.13
# format: feather
# - name: igap
# id: syn12514826.5
# format: csv
# - name: eqtl
# id: syn12514912.3
# format: csv
# - <<: *agora_proteomics_files
# - <<: *agora_proteomics_tmt_files
# - <<: *agora_proteomics_srm_files
# - <<: *rna_diff_expr_data_files
# - name: target_list
# id: syn12540368.51
# format: csv
# - name: median_expression
# id: syn27211878.2
# format: csv
# - name: druggability
# id: syn13363443.11
# format: csv
# - <<: *genes_biodomains_files
# - name: tep_adi_info
# id: syn51942280.3
# format: csv
# final_format: json
# custom_transformations:
# adjusted_p_value_threshold: 0.05
# protein_level_threshold: 0.05
# column_rename:
# ensg: ensembl_gene_id
# ensembl_id: ensembl_gene_id
# geneid: ensembl_gene_id
# has_eqtl: is_eqtl
# minimumlogcpm: min
# quartile1logcpm: first_quartile
# medianlogcpm: median
# meanlogcpm: mean
# quartile3logcpm: third_quartile
# maximumlogcpm: max
# possible_replacement: ensembl_possible_replacements
# permalink: ensembl_permalink
# provenance:
# - syn25953363.13
# - syn12514826.5
# - syn12514912.3
# - *agora_proteomics_provenance
# - *agora_proteomics_tmt_provenance
# - *agora_proteomics_srm_provenance
# - *rna_diff_expr_data_provenance
# - syn12540368.51
# - syn27211878.2
# - syn13363443.11
# - *genes_biodomains_provenance
# - syn51942280.3
# agora_rename:
# symbol: hgnc_symbol
# destination: *dest
# gx_enabled: true
# gx_nested_columns:
# - target_nominations
# - median_expression
# - druggability
# - ensembl_info

- team_info:
files:
