fix: add json2tsv script

PharmGKB · Apr 17, 2024 · dfabdc1 · dfabdc1
1 parent 57a956b
commit dfabdc1
Show file tree

Hide file tree

Showing 4 changed files with 513 additions and 23 deletions.
diff --git a/docs/using/Multi-Sample-Analysis.md b/docs/using/Multi-Sample-Analysis.md
@@ -173,25 +173,48 @@ We also provide [an accessory python script](https://github.com/PharmGKB/PharmCA
 
 | Sample   | Gene    | Phenotype       | Activity_Score | Diplotype\*                                     | DPYD_RYR1_Variants    | DPYD_RYR1_Variant_Functions      | DPYD_RYR1_Variant_Genotypes | Haplotype_1             | Haplotype_2            | Haplotype_1_Functions | Haplotype_2_Functions | Haplotype_1_Variants                                                             | Haplotype_2_Variants                                                  | Missing_Positions | Uncallable_Haplotypes |
 |----------|---------|-----------------|----------------|-------------------------------------------------|-----------------------|----------------------------------|-----------------------------|-------------------------|------------------------|-----------------------|-----------------------|----------------------------------------------------------------------------------|-----------------------------------------------------------------------|-------------------|-----------------------|
-| sample_1  | ABCG2   |                 |                | rs2231142 reference (G)/rs2231142 reference (G) | |                       |                                  |                             | rs2231142 reference (G) | rs2231142 reference (G) |                       | 88131171:G                                                                       | 88131171:G                                                            |                   |                       |
-| sample_1 | CYP3A5  |                 | | \*1/\*1                                         |                                                 |                       |                                  |                             | \*1                     | \*1                    |                       |                       | | | 99660516;99676198 |                       |
+| sample_1  | ABCG2   |                 |                | rs2231142 reference (G)/rs2231142 reference (G) | |                                     |                             | rs2231142 reference (G) | rs2231142 reference (G) |        |                       |                                                                        |                                                             |                   |                       |
+| sample_1 | CYP3A5  |                 | | \*1/\*1                                         |                        |                                  |                             | \*1                     | \*1                    |                       |                       | | | 99660516;99676198 |                       |
 | sample_1  | SLCO1B1 | Decreased Function | | \*1/\*15                                        |                                                 |                       |                                   | \*1                     | \*15                   | Normal function       | No function           |  | 21178615:C | 21176804 | \*37                  |
-| sample_1  | CYP2C9 | Normal Metabolizer | 2.0            |                                                 |                       |                                  |                             | \*1                     | \*1                    | Normal function       | Normal function           |  |  |  |                   |
-| sample_1  | RYR1 | Uncertain Susceptibility |                | c.13513G>C (heterozygous)                       | Reference;c.13513G>C  | Normal function;Normal function  | 38566986:C                  |                         |                     |       |            |  |  | 38433867;38440747;38440796;<...truncated for visual clarity...> |     c.12115A>T;c.6349G>C;c.178G>T;<...truncated for visual clarity...>              |
+| sample_1  | CYP2C9 | Normal Metabolizer | 2.0            |                                                 |                       |                                  |                             | \*1                     | \*1                    | Normal function       | Normal function       |  |  |  |                   |
+| sample_1  | RYR1 | Uncertain Susceptibility |                | c.13513G>C (heterozygous)                       | Reference;c.13513G>C  | Normal function;Normal function  | 38566986:C                  |                         |                     |       |                       |  |  | 38433867;38440747;38440796;<...truncated for visual clarity...> |     c.12115A>T;c.6349G>C;c.178G>T;<...truncated for visual clarity...>              |
 
 [*] This column only shows [the effectively phased _DPYD_ and _RYR1_ diplotypes](/methods/Gene-Definition-Exceptions/). If the column is empty, please check out other designated columns for _DPYD_ or _RYR1_ variants.
 
 #### Extracting the PharmCAT JSON data into a TSV file
 ```shell
-conda env create -f src/environment.yml
-conda activate json2tsv
+# the yaml file is under the PharmCAT/src/scripts/ folder
+conda env create -f pharmcat_scripts.yaml
+conda activate pharmcat_scripts
 # extract json content into a tsv file
-python3 src/json2tsv_pharmcat.py \
-  -i results/pharmcat_all/ \
-  -a </path/to/pharmcat/allele/definition/json/> \
-  -o results/
+python3 json2tsv_pharmcat.py \
+  -i </path/to/PharmCAT/output/folder/> \
+  -a </path/to/pharmcat/allele/definition/json/*_translation.json> \
+  -g CYP2C19,DPYD \
+  -S <sample.txt> \
+  -c -cp 4 \
+  -o </results/>
 ```
 
+**Mandatory** argument:
+-i `<path/to/PharmCAT/output/folder/>`
+: Path to the directory that contains the PharmCAT Named Allele Matcher and Phenotyper JSON outputs.
+
+**Optional** argument:
+-a `</path/to/*_translation.json>`
+: Path to the PharmCAT allele definition JSON files. Directory path should be included. Pattern matching is acceptable to read more than one sample.
+
+-G `gene1,gene2`
+: A list of genes separated by comma. The script will only process results for these genes.
+
+-S `<txt_file>`
+: A file of samples. One sample per line. The script will only process results for samples listed in this file.
+
+-c
+: Enable concurrent mode. 
+
+-cp `<num processes>`
+: The maximum number of processes to use if concurrent mode is enabled.
 
 ---
 

diff --git a/src/scripts/json2tsv/json2tsv_pharmcat.py b/src/scripts/json2tsv/json2tsv_pharmcat.py
@@ -0,0 +1,218 @@
+#!/usr/bin/env python3
+
+__author__ = 'BinglanLi'
+
+import sys
+import json
+import concurrent.futures
+import pandas as pd
+from glob import glob
+from pathlib import Path
+from utilities import extract_pcat_json, get_json_file_names
+from concurrent.futures import ALL_COMPLETED
+from timeit import default_timer as timer
+
+
+if __name__ == "__main__":
+    import argparse
+
+    # describe the tool
+    parser = argparse.ArgumentParser(description='Extract results from PharmCAT JSONs into a TSV-formatted file.')
+
+    # list input arguments
+    parser.add_argument("-i", "--input-dir", type=str, metavar='<dir>',
+                        help="Directory path to PharmCAT JSONs.")
+    parser.add_argument('-S', '--sample-file', type=str, metavar='<txt_file>',
+                        help='(Optional) a file containing a list of sample IDs, one sample at a line.')
+    parser.add_argument("-m", "--matcher-json-pattern", type=str, metavar='.*match.json',
+                        default='*match.json',
+                        help="(Optional) file name pattern of the Named Allele Matcher JSON files.")
+    parser.add_argument("-p", "--phenotyper-json-pattern", type=str, metavar='.*phenotype.json',
+                        default='*phenotype.json',
+                        help="(Optional) file name pattern of the Phenotyper JSON files.")
+    parser.add_argument("-a", "--allele-definition-files", type=str,
+                        metavar='</path/to/*_translation.json>',
+                        default='../../main/resources/org/pharmgkb/pharmcat/definition/alleles/*_translation.json',
+                        help="(Optional) Path to PharmCAT allele definition JSON files. "
+                             "Directory path should be included. "
+                             "Pattern matching strings are acceptable.")
+    parser.add_argument("-gs", "--guideline-source", type=str, metavar='CPIC/DPWG',
+                        default='CPIC',
+                        help="(Optional) Guideline source to extract, default = CPIC.")
+    parser.add_argument("-g", "--genes", type=str, metavar='gene1,gene2',
+                        default=None,
+                        help="(Optional) List of genes to be process, separated by comma.")
+
+    # output args
+    output_group = parser.add_argument_group('Output arguments')
+    output_group.add_argument("-o", "--output-dir", type=str, metavar='<dir>',
+                              help="(Optional) directory for outputs.  Defaults to the directory of the input.")
+    output_group.add_argument("-bf", "--base-filename", type=str, metavar='<name>',
+                              default='pharmcat_json2tsv',
+                              help="(Optional) output prefix (without file extensions), "
+                                   "by default \"pharmcat_json2tsv\".")
+
+    # concurrency args
+    concurrency_group = parser.add_argument_group('Concurrency arguments')
+    concurrency_group.add_argument("-c", "--concurrent-mode", action="store_true",
+                                   help="(Optional) use multiple processes - maximum number of processes spawned will "
+                                        "default to to two less than the number of cpu cores.")
+    concurrency_group.add_argument("-cp", "--max-concurrent-processes", type=int, metavar='<num processes>',
+                                   default=None,
+                                   help='(Optional) the maximum number of processes to use when concurrent mode ' +
+                                        'is enabled.')
+
+    parser.add_argument("-v", "--verbose", action="count", default=0,
+                        help="(Optional) print more verbose messages")
+
+    # parse arguments
+    args = parser.parse_args()
+
+    try:
+        # parse args into internal variables
+        m_input_dir: Path = Path(args.input_dir).resolve()
+        if not m_input_dir.is_dir():
+            print(f'-i/--input-dir is not a valid directory path: {m_input_dir}')
+            sys.exit(1)
+        print(f'Reading from the input directory {m_input_dir}')
+
+        # define output file name
+        m_output_basename: str = args.base_filename
+        # define output file path
+        if args.output_dir:
+            m_output_dir: Path = Path(args.output_dir).resolve()
+        else:
+            m_output_dir: Path = Path.cwd()
+            print(f'Output directory is not specified.\n'
+                  f'\tSave to the current working directory:{m_output_dir}')
+        # define the full output file path
+        m_output_file: Path = m_output_dir / (m_output_basename + '.tsv')
+
+        # get the list of matcher and phenotyper jsons
+        m_matcher_jsons: list[str] = glob(str(m_input_dir.joinpath(args.matcher_json_pattern)))
+        m_phenotyper_jsons: list[str] = glob(str(m_input_dir.joinpath(args.phenotyper_json_pattern)))
+        # get sample list based on json file names
+        matcher_json_samples: list[str] = [x.split('.')[-3] for x in m_matcher_jsons]
+        phenotyper_json_samples: list[str] = [x.split('.')[-3] for x in m_phenotyper_jsons]
+        # further narrow down sample list based on sample_file
+        m_samples: list[str] = []
+        if args.sample_file:
+            # todo: replace this with preprocessor.util.read_sample_file
+            print(f'Reading samples from file: {args.sample_file}')
+            with open(args.sample_file, 'r') as f:
+                # this assumes a small, non-memory-intensive sample file
+                m_samples = [line.strip() for line in f.readlines()]
+            json_samples: list[str] = list(set(matcher_json_samples) & set(phenotyper_json_samples) & set(m_samples))
+        else:
+            print('No sample file found. Reading all samples.')
+            json_samples: list[str] = list(set(matcher_json_samples) & set(phenotyper_json_samples))
+        # print out the total number of samples
+        print(f'\tFound n = {len(json_samples)}')
+
+        # specify genes to be processed
+        m_genes: list[str] = []
+        if args.guideline_source == 'CPIC':
+            m_genes = ["ABCG2", "CACNA1S", "CFTR", "CYP2B6", "CYP2C9", "CYP2C19", "CYP3A5", "CYP4F2",
+                       "DPYD", "G6PD", "IFNL3", "NUDT15", "RYR1", "SLCO1B1", "TPMT", "UGT1A1",
+                       "VKORC1", "CYP2D6", "HLA-A", "HLA-B", "MT-RNR1"]
+        elif args.guideline_source == 'DPWG':
+            m_genes = ["ABCG2", "CYP2B6", "CYP2C9", "CYP2C19", "CYP3A4", "CYP3A5", "DPYD", "NUDT15",
+                       "SLCO1B1", "TPMT", "UGT1A1", "VKORC1", "CYP2D6"]
+        # further narrow down the list of genes to be process if --genes is specified
+        if args.genes is not None:
+            m_selected_genes: list[str] = [x for x in args.genes.split(",") if x in m_genes]
+            m_genes = m_selected_genes
+
+        # read allele definition json files
+        print(f'Looking for the allele definition JSON files: {args.allele_definition_files}')
+        m_allele_definition_jsons: list[str] = glob(args.allele_definition_files)
+        # read reference alleles at each allele-defining position
+        allele_definition_references: dict[str, list[str]] = {}
+        for json_file in m_allele_definition_jsons:
+            print(f'\tReading {json_file}')
+            # get the gene name
+            gene: str = json_file.split('/')[-1].split('_')[0]
+            # read file
+            with open(json_file, 'r') as f:
+                json_data: dict = json.load(f)
+            # get reference genotype list
+            positions: list[str] = [str(entry['position']) for entry in json_data['variants']]
+            genotypes: list[str] = [entry['ref'] for entry in json_data['variants']]
+            allele_definition_references[gene] = [x + ':' + y for x, y in zip(positions, genotypes)]
+
+        # set up concurrent mode
+        m_max_processes: int = 1
+        if args.concurrent_mode:
+            print('Concurrent mode enabled...')
+            m_max_processes = args.max_concurrent_processes
+        elif args.max_concurrent_processes is not None:
+            print("-cp/--max_processes will be ignored (not running in multiprocess mode)")
+
+        start = timer()
+
+        # extract json results for each sample
+        summary_results: dict[str, list[str]] = {
+            'sample': [], 'gene': [], 'phenotype': [], 'activity_score': [],
+            'diplotype': [],
+            'dpyd_ryr1_variants': [], 'dpyd_ryr1_variant_functions': [], 'dpyd_ryr1_variant_genotypes': [],
+            'haplotype_1': [], 'haplotype_2': [],
+            'haplotype_1_functions': [], 'haplotype_2_functions': [],
+            'haplotype_1_variants': [], 'haplotype_2_variants': [],
+            'missing_positions': [], 'uncallable_haplotypes': []
+        }
+        if args.concurrent_mode:
+            with concurrent.futures.ProcessPoolExecutor(max_workers=m_max_processes) as e:
+                futures = []
+                print(f'Processing samples:')
+                for one_sample in json_samples:
+                    # get the matcher and phenotyper json names for the sample
+                    matcher_file, phenotyper_file = get_json_file_names(one_sample, m_matcher_jsons, m_phenotyper_jsons)
+                    futures.append(e.submit(extract_pcat_json, matcher_file, phenotyper_file, m_genes, one_sample,
+                                            allele_definition_references, args.guideline_source))
+                concurrent.futures.wait(futures, return_when=ALL_COMPLETED)
+                for future in futures:
+                    tmp_results = future.result()
+                    # concatenate temporary dictionary with the summary dictionary
+                    for key in summary_results:
+                        summary_results[key].extend(tmp_results[key])
+        else:
+            print(f'Processing samples.')
+            for one_sample in json_samples:
+                # get the matcher and phenotyper json names for the sample
+                matcher_file, phenotyper_file = get_json_file_names(one_sample, m_matcher_jsons, m_phenotyper_jsons)
+
+                # extract results for a sample
+                tmp_results: dict[str, list[str]] = extract_pcat_json(
+                    matcher_json=matcher_file,
+                    phenotyper_json=phenotyper_file,
+                    genes=m_genes,
+                    sample_id=one_sample,
+                    reference_genotypes=allele_definition_references,
+                    guideline_source=args.guideline_source
+                )
+
+                # concatenate temporary dictionary with the summary dictionary
+                for key in summary_results:
+                    summary_results[key].extend(tmp_results[key])
+
+        # convert the summary dictionary to a pandas data frame
+        rez = pd.DataFrame.from_dict(summary_results)
+        # remove null
+        rez.loc[rez.phenotype == 'n/a', 'phenotype'] = ''
+        # write to an output
+        cols: list[str] = ['Sample', 'Gene', 'Phenotype', 'Activity_Score',
+                           'Diplotype',
+                           'DPYD_RYR1_Variants', 'DPYD_RYR1_Variant_Functions', 'DPYD_RYR1_Variant_Genotypes',
+                           'Haplotype_1', 'Haplotype_2',
+                           'Haplotype_1_Functions', 'Haplotype_2_Functions',
+                           'Haplotype_1_Variants', 'Haplotype_2_Variants',
+                           'Missing_Positions', 'Uncallable_Haplotypes']
+        rez.to_csv(m_output_file.absolute(), mode='w', sep="\t", header=cols, index=False)
+
+        end = timer()
+        print("Done.")
+        print("Preprocessed input VCF in %.2f seconds" % (end - start))
+
+    except Exception as e:
+        print(e)
+        sys.exit(1)