Commit ed02417: updates documentation
BWMac committed Nov 5, 2024 · 1 parent b189c1f
Showing 4 changed files with 109 additions and 78 deletions.
28 changes: 25 additions & 3 deletions CONTRIBUTING.md
@@ -167,9 +167,9 @@ This package uses [Great Expectations](https://greatexpectations.io/) to validat

1. Create a new expectation suite by defining the expectations for the dataset in a Jupyter Notebook inside the `gx_suite_definitions` folder. Use `metabolomics.ipynb` as an example. You can find a catalog of existing expectations [here](https://greatexpectations.io/expectations/).
1. Run the notebook to generate the new expectation suite. It should populate as a JSON file in the `/great_expectations/expectations` folder.
1. Add support for running Great Expectations on a dataset by adding `gx_enabled: true` to the configuration for the dataset in both `test_config.yaml` and `config.yaml`. After updating the config files, reports should be uploaded to the proper locations ([Prod](https://www.synapse.org/#!Synapse:syn52948668), [Testing](https://www.synapse.org/#!Synapse:syn52948670)) when data processing is complete.
- You can prevent Great Expectations from running for a dataset by removing `gx_enabled: true` from the configuration for the dataset.
1. Test data processing by running `adt test_config.yaml` and ensure that HTML reports with all expectations are generated and uploaded to the proper folder in Synapse.
1. Add support for running Great Expectations on a dataset by adding `gx_enabled: true` to the configuration for the dataset in both `test_config.yaml` and `config.yaml`. Ensure that the `gx_folder` and `gx_table` keys are present in the configuration file and contain valid Synapse IDs for the GX reports and GX table, respectively.
- You can prevent Great Expectations from running for a dataset by setting `gx_enabled: false` in the configuration for the dataset.
1. Test data processing by running `adt test_config.yaml --upload` and ensure that HTML reports with all expectations are generated and uploaded to the proper folder in Synapse.
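
The configuration change described above might look like the following sketch; the dataset name and Synapse IDs are placeholders, not real entries from the config files:

```yaml
# Top-level keys: required whenever any dataset has gx_enabled: true.
gx_folder: syn00000001   # placeholder - folder that receives the HTML GX reports
gx_table: syn00000002    # placeholder - table that receives GX reporting rows

datasets:
  - my_dataset:          # hypothetical dataset name
      # ... files, provenance, destination, etc. ...
      gx_enabled: true   # set to false to skip GX validation for this dataset
```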

#### Custom Expectations

@@ -181,6 +181,28 @@ This repository is currently home to three custom expectations that were created

These expectations are defined in the `/great_expectations/gx/plugins/expectations` folder. To add more custom expectations, follow the instructions [here](https://docs.greatexpectations.io/docs/guides/expectations/custom_expectations_lp).

#### Nested Columns

If the transform includes nested columns (for example, the `druggability` column in the `gene_info` transform), please follow these steps:
1. Add the nested column name to the `gx_nested_columns` key in the configuration file for the specific transform. This will convert the column values to a JSON-parsable string.
```yaml
gx_nested_columns:
- <nested_column_name>
```
1. When creating the validator object in the gx_suite_definitions notebook, the nested column(s) must be included in the `nested_columns` list.
```python
df = pd.read_json(<data_file>)
nested_columns = ['<nested_column_name>']
df = GreatExpectationsRunner.convert_nested_columns_to_json(df, nested_columns)
validator = context.sources.pandas_default.read_dataframe(df)
validator.expectation_suite_name = "<suite_name>"
```
1. When validating the value type of the nested column, make sure to specify it as a string (see Step 1 for reasoning):
```python
validator.expect_column_values_to_be_of_type("<nested_column_name>", "str")
```
1. A JSON file containing the expected schema must be added here: `src/agoradatatools/great_expectations/gx/json_schemas/<transform_name>/<column_name>.json`. Use the [JSON schema tool](https://jsonschema.net/app/schemas/0) to create the schema template for your nested column.
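
The conversion behind these steps can be sketched as follows. The `convert_nested_columns_to_json` function below is a hypothetical stand-in for the real `GreatExpectationsRunner` helper, assumed here to serialize each nested value with `json.dumps`; the column and data are made up for illustration:

```python
import json

import pandas as pd


def convert_nested_columns_to_json(
    df: pd.DataFrame, nested_columns: list
) -> pd.DataFrame:
    """Hypothetical stand-in: serialize each nested (dict/list) value to a
    JSON string so GX can validate the column as type "str" against a
    JSON schema."""
    for col in nested_columns:
        df[col] = df[col].apply(json.dumps)
    return df


# Placeholder data with one nested column.
df = pd.DataFrame(
    {
        "gene": ["A", "B"],
        "druggability": [{"score": 1}, {"score": 2}],
    }
)
df = convert_nested_columns_to_json(df, ["druggability"])
print(df["druggability"].tolist())  # ['{"score": 1}', '{"score": 2}']
```

After the conversion, the column holds plain strings, which is why the expectation in the step above checks for type `"str"` rather than a dict.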

### DockerHub

Rather than using GitHub actions to build and push Docker images to DockerHub, the Docker images are automatically built in DockerHub. This requires the `sagebiodockerhub` GitHub user to be an Admin of this repo. You can view the docker build [here](https://hub.docker.com/r/sagebionetworks/agora-data-tools).
11 changes: 10 additions & 1 deletion README.md
@@ -123,12 +123,21 @@ python -m pytest
Parameters:
- `destination`: Defines the default target location (folder) that the generated json files are written to; this value can be overridden on a per-dataset basis
- `staging_path`: Defines the location of the staging folder that the generated json files are written to
- `gx_folder`: Defines the Synapse ID of the folder that generated GX reports are written to
- `gx_folder`: Defines the Synapse ID of the folder that generated GX reports are written to. This key must always be present in the config file and must hold a valid Synapse ID whenever `gx_enabled` is set to `true` for any dataset. If the key is missing from the config file, or if it is set to `none` while `gx_enabled` is `true` for any dataset, an error will be thrown.
- `gx_table`: Defines the Synapse ID of the table that generated GX reporting is posted to. This key must always be present in the config file and must hold a valid Synapse ID whenever `gx_enabled` is set to `true` for any dataset. If the key is missing from the config file, or if it is set to `none` while `gx_enabled` is `true` for any dataset, an error will be thrown.
- `sources/<source>`: Source files for each dataset are defined in the `sources` section of the config file.
- `sources/<source>/<source>_files`: A list of source file information for the dataset.
- `sources/<source>/<source>_files/name`: The name of the source file/dataset.
- `sources/<source>/<source>_files/id`: The Synapse ID of the source file. Dot notation is supported to indicate the version of the file to use.
- `sources/<source>/<source>_files/format`: The format of the source file.
- `datasets/<dataset>`: Each generated json file is named `<dataset>.json`
- `datasets/<dataset>/files`: A list of source files for the dataset
- `name`: The name of the source file (this name is the reference the code will use to retrieve a file from the configuration)
- `id`: Synapse id of the file
- `format`: The format of the source file
- `datasets/<dataset>/final_format`: The format of the generated output file.
- `datasets/<dataset>/gx_enabled`: Whether or not GX validation should be run on the dataset. `true` will run GX validation, `false` or the absence of this key will skip GX validation.
- `datasets/<dataset>/gx_nested_columns`: A list of nested columns that should be validated using GX nested validation. Omitting this key, or supplying an invalid list of columns, will result in an error because the nested fields will not be converted to a JSON-parsable string prior to validation. This key is not needed if `gx_enabled` is not set to `true` or if the dataset has no nested fields.
- `datasets/<dataset>/provenance`: The Synapse id of each entity that the dataset is derived from, used to populate the generated file's Synapse provenance. (The Synapse API calls this "Activity")
- `datasets/<dataset>/destination`: Override the default destination for a specific dataset by specifying a synID, or use `*dest` to use the default destination
- `datasets/<dataset>/column_rename`: Columns to be renamed prior to data transformation
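
Putting these keys together, a minimal config might look like the following sketch; every name and Synapse ID here is a placeholder, not a real entry:

```yaml
destination: &dest syn00000000      # placeholder Synapse IDs throughout
staging_path: ./staging
gx_folder: syn00000001
gx_table: syn00000002
datasets:
  - my_dataset:                     # output file will be my_dataset.json
      files:
        - name: my_source_file
          id: syn00000003.2         # dot notation pins the file version
          format: csv
      final_format: json
      gx_enabled: true
      gx_nested_columns:
        - my_nested_column
      provenance:
        - syn00000003.2
      destination: *dest            # *dest uses the default destination
```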
12 changes: 6 additions & 6 deletions src/agoradatatools/process.py
@@ -309,7 +309,7 @@ def process_all_files(
"LOCAL",
"--platform",
"-p",
help="Platform that is running the process. Must be one of LOCAL, GITHUB, or NEXTFLOW (Optional).",
help="Platform that is running the process. Must be one of LOCAL, GITHUB, or NEXTFLOW (Optional, defaults to LOCAL).",
show_default=True,
)
run_id_opt = Option(
@@ -323,17 +323,17 @@
False,
"--upload",
"-u",
help="Toggles whether or not files will be uploaded to Synapse. The absence of this option means "
"that neither output data files nor GX reports will be uploaded to Synapse. Setting "
"`--upload` in the command will cause both to be uploaded. This option is used to control "
"the upload behavior of the process.",
help="Boolean value that toggles whether or not files will be uploaded to Synapse. The absence of this option means "
"`False` - that neither output data files nor GX reports will be uploaded to Synapse. Setting "
"`--upload` in the command will cause both to be uploaded. (Optional, defaults to False)",
show_default=True,
)
synapse_auth_opt = Option(
None,
"--token",
"-t",
help="Synapse authentication token. Defaults to environment variable $SYNAPSE_AUTH_TOKEN via syn.login() functionality",
help="Synapse authentication token. (Required, defaults to environment variable SYNAPSE_AUTH_TOKEN via syn.login() functionality "
"https://python-docs.synapse.org/reference/client/?h=syn.login#synapseclient.Synapse.login)",
show_default=False,
)

Expand Down
136 changes: 68 additions & 68 deletions test_config.yaml
@@ -1,6 +1,6 @@
destination: &dest syn17015333
staging_path: ./staging
gx_folder: syn52948670
gx_folder: none
gx_table: syn60527065
sources:
- genes_biodomains:
@@ -141,73 +141,73 @@ datasets:
destination: *dest
gx_enabled: true

- gene_info:
files:
- name: gene_metadata
id: syn25953363.13
format: feather
- name: igap
id: syn12514826.5
format: csv
- name: eqtl
id: syn12514912.3
format: csv
- <<: *agora_proteomics_files
- <<: *agora_proteomics_tmt_files
- <<: *agora_proteomics_srm_files
- <<: *rna_diff_expr_data_files
- name: target_list
id: syn12540368.51
format: csv
- name: median_expression
id: syn27211878.2
format: csv
- name: druggability
id: syn13363443.11
format: csv
- <<: *genes_biodomains_files
- name: tep_adi_info
id: syn51942280.3
format: csv
final_format: json
custom_transformations:
adjusted_p_value_threshold: 0.05
protein_level_threshold: 0.05
column_rename:
ensg: ensembl_gene_id
ensembl_id: ensembl_gene_id
geneid: ensembl_gene_id
has_eqtl: is_eqtl
minimumlogcpm: min
quartile1logcpm: first_quartile
medianlogcpm: median
meanlogcpm: mean
quartile3logcpm: third_quartile
maximumlogcpm: max
possible_replacement: ensembl_possible_replacements
permalink: ensembl_permalink
provenance:
- syn25953363.13
- syn12514826.5
- syn12514912.3
- *agora_proteomics_provenance
- *agora_proteomics_tmt_provenance
- *agora_proteomics_srm_provenance
- *rna_diff_expr_data_provenance
- syn12540368.51
- syn27211878.2
- syn13363443.11
- *genes_biodomains_provenance
- syn51942280.3
agora_rename:
symbol: hgnc_symbol
destination: *dest
gx_enabled: true
gx_nested_columns:
- target_nominations
- median_expression
- druggability
- ensembl_info
# - gene_info:
# files:
# - name: gene_metadata
# id: syn25953363.13
# format: feather
# - name: igap
# id: syn12514826.5
# format: csv
# - name: eqtl
# id: syn12514912.3
# format: csv
# - <<: *agora_proteomics_files
# - <<: *agora_proteomics_tmt_files
# - <<: *agora_proteomics_srm_files
# - <<: *rna_diff_expr_data_files
# - name: target_list
# id: syn12540368.51
# format: csv
# - name: median_expression
# id: syn27211878.2
# format: csv
# - name: druggability
# id: syn13363443.11
# format: csv
# - <<: *genes_biodomains_files
# - name: tep_adi_info
# id: syn51942280.3
# format: csv
# final_format: json
# custom_transformations:
# adjusted_p_value_threshold: 0.05
# protein_level_threshold: 0.05
# column_rename:
# ensg: ensembl_gene_id
# ensembl_id: ensembl_gene_id
# geneid: ensembl_gene_id
# has_eqtl: is_eqtl
# minimumlogcpm: min
# quartile1logcpm: first_quartile
# medianlogcpm: median
# meanlogcpm: mean
# quartile3logcpm: third_quartile
# maximumlogcpm: max
# possible_replacement: ensembl_possible_replacements
# permalink: ensembl_permalink
# provenance:
# - syn25953363.13
# - syn12514826.5
# - syn12514912.3
# - *agora_proteomics_provenance
# - *agora_proteomics_tmt_provenance
# - *agora_proteomics_srm_provenance
# - *rna_diff_expr_data_provenance
# - syn12540368.51
# - syn27211878.2
# - syn13363443.11
# - *genes_biodomains_provenance
# - syn51942280.3
# agora_rename:
# symbol: hgnc_symbol
# destination: *dest
# gx_enabled: true
# gx_nested_columns:
# - target_nominations
# - median_expression
# - druggability
# - ensembl_info

- team_info:
files:
