Merge pull request #217 from polio-nanopore/ev-dev

Configurable categories for grouping by, configurable header fields in detailed csv- but dynamically picks.
polio-nanopore · Jan 29, 2024 · 2ca977d · 2ca977d
2 parents cafdbae + 19c6b2d
commit 2ca977d
Show file tree

Hide file tree

Showing 15 changed files with 223 additions and 72 deletions.
diff --git a/README.md b/README.md
@@ -388,6 +388,66 @@ and piranha will check which ones you have installed with your version of medaka
 >```Where a version of Guppy has been used without an exactly corresponding medaka model, the medaka model with the highest version equal to or less than the guppy version should be selected.```
 
 
+## Process enterovirus sequences
+
+Piranha now has the flexibility to cluster sequencing reads by a specified reference group field (`-rg / --reference-group-field`, or `reference_group_field` in a config file). By default, this field is set to `ddns_group` and any records in the reference file must contain this as an annotation in the header description. The reference file that is supplied with piranha contains this field in all records and the values fall into the following categories:
+
+- Sabin1-related
+- Sabin2-related
+- Sabin3-related
+- WPV1
+- WPV2
+- WPV3
+- NonPolioEV
+
+These fields focus on poliovirus detection and all non-polio enterovirus sequences are collaposed into a single category. 
+
+If your use of piranha extends to other enteroviruses, it is now possible to run a more tailored analysis. The reference file that is supplied with piranha has the following  format:
+
+```
+>Poliovirus1-Sabin_AY184219 ddns_group=Sabin1-related species=Sabin1-related cluster=Sabin1-related
+GGGTTAG...CCACATAT
+>CoxsackievirusA21_EF015028 ddns_group=NonPolioEV species=CoxsackievirusA21 cluster=CoxsackievirusA21
+AGATGCA...CAGTGAGA
+```
+
+With the `-rg` flag, you can use the reference file supplied to cluster by ddns_group (default), species or cluster. Then, piranha will dynamically calculate which categories are present during the run. For example, running piranha with `-rg species` will enable clusters of CoxsackievirusA21, or other enterovirus species to be identified and have a consensus sequence generated during piranha analysis.
+
+It is also possible to supply a custom reference file with references and fields of interest to piranha in the following format:
+
+```
+>REF1 my_custom_field=group1
+ACTGT....CTAGCTA
+>REF2 my_custom_field=group1
+ACTGT....CTAGCTA
+>REF3 my_custom_field=group1
+ACTGT....CTAGCTA
+>REF4 my_custom_field=group2
+ACTGT....CTAGCTA
+>REF5 my_custom_field=group2
+ACTGT....CTAGCTA
+```
+and in this example, if `-rg my_custom_field` is supplied, any reads that map to REF1, REF2 or REF3 will be clustered into 'group1' and any reads that map to REF4 or REF5 will be clustered into 'group2'.
+
+Any values will reads mapping will be shown in the final reports.
+
+Pitfalls to note:
+
+1. A caveat with this new approach is that our reference file has been constructed with poliovirus primarily in mind. There are quite a few enterovirus references in the file, covering 113 `species` groups, but there is likely to be gaps in the broad enterovirus diversity covered in this reference file. If you have species of interest you want to be detected, it may be worth supplying a custom reference file. 
+
+2. Enteroviruses are a very diverse group of viruses, it may well be that portions of a read match to something in our reference set, but the rest is too divergent to map. In this case, lowering the alignment block length minimum requirement may enable partial sequences to be recovered. Here, the limitation is the reference-based approach and supplementing the reference file with additional reference sequences may resolve this.
+
+3. Mapping against a reference set that contains multiple sequences that are very similar may significantly decrease the alignment quality score from minimap2. This case arises when there are 2 equally likely matches in the database to your sequencing reaad and the software cannot confidently call the best hit. In the default reference set, we only include a single Sabin1, Sabin2 and Sabin3 reference respectively, as including VDPV sequences in the database may confuse the mapping software. By default, reads with low alignment mapping scores will be filtered out. By lowering the mapping quality threshold, these split cases may be included in the analysis for enteroviruses.
+
+Suggested configuration settings can be found [here](https://github.com/polio-nanopore/piranha/blob/main/docs/config.ev.yaml) and we would recommend to only use this mode if the user can sufficiently appreciate the limitations the reference file composition may have on detection.
+
+With the supplied config file, piranha can be run as follows:
+
+```
+piranha -c config.ev.yaml
+```
+
+
 ## Experimental haplotype calling pipeline
 
 There is now an experimental haplotype calling pipeline that uses freebayes for initial variant calling and flopp for read phasing with the called variants. It has a number internal QC steps for merging identical haplotypes. This pipeline requires further validation, but can theoretically produce multiple consensus sequences for each poliovirus population present within a sample. The pipeline has mostly be tested on VDPVs of two mixtures, but will be further assessed.

diff --git a/config.yaml b/config.yaml
@@ -5,15 +5,21 @@ runname: "MIN003"
 barcodes_csv: "piranha/test/pak_run/barcodes01.csv"
 readdir: "piranha/test/pak_run/demultiplexed"
 
+min_read_length: 500
+max_read_length: 2300
 threads: 1
 
 negative_control: "negative"
 positive_control: "positive"
 
 # run_phylo: True
-supplementary_datadir: "piranha/test/supp_data"
+# supplementary_datadir: "piranha/test/supp_data"
 # update_local_database: True
-local_database_threshold: 5
+# local_database_threshold: 5
+reference_group_field: "species"
+min_aln_block: 0
+min_read_pcent: 0
+
 
 overwrite: True
 no_temp: True
diff --git a/docs/config.ev.yaml b/docs/config.ev.yaml
@@ -0,0 +1,21 @@
+username: "Aine O'Toole"
+institute: "IEE"
+runname: "MIN003"
+
+barcodes_csv: "/localdisk/home/data/min003/barcodes.csv"
+readdir: "/localdisk/home/data/min003/fastq_pass"
+
+min_read_length: 500
+max_read_length: 2300
+min_aln_block: 0
+min_read_pcent: 0
+
+threads: 1
+
+negative_control: "negative"
+positive_control: "positive"
+
+reference_group_field: "species"
+
+overwrite: True
+no_temp: True
diff --git a/environment.yml b/environment.yml
@@ -5,7 +5,7 @@ channels:
   - defaults
 dependencies:
   - python=3.10
-  - bcftools=1.11
+  - bcftools=1.18
   - coreutils=9.1
   - flopp=0.2.0
   - freebayes=1.3.6
@@ -15,6 +15,6 @@ dependencies:
   - minimap2=2.17
   - pyabpoa
   - rasusa=0.7.1
-  - samtools=1.11
+  - samtools=1.18
   - snakemake-minimal
   - tabix=1.11
diff --git a/piranha/analysis/phylo_functions.py b/piranha/analysis/phylo_functions.py
@@ -198,7 +198,7 @@ def update_local_database(sample_sequences,detailed_csv,new_db_seqs,new_db_metad
             for i in desc_list:
                 if i.startswith("variant_count"):
                     count = i.split("=")[1]
-                elif i.startswith(VALUE_REFERENCE_MATCH_FIELD):
+                elif i.startswith(VALUE_REFERENCE_GROUP_FIELD):
                     match_field = i.split("=")[1]
 
             if "Sabin" in match_field:

diff --git a/piranha/analysis/preprocessing.py b/piranha/analysis/preprocessing.py
@@ -41,23 +41,23 @@ def gather_filter_reads_by_length(dir_in,barcode,reads_out,config):
         print(green(f"Total passed reads {barcode}:"),len(fastq_records))
         SeqIO.write(fastq_records,fw, "fastq")
 
-def parse_match_field(description):
+def parse_match_field(description,reference_match_field):
     for item in str(description).split(" "):
-        if item.startswith(VALUE_REFERENCE_MATCH_FIELD):
+        if item.startswith(reference_match_field):
             match_field = item.split("=")[1]
             return match_field
 
-def make_ref_match_field_map(references,positive_references,include_positive_references):
+def make_ref_match_field_map(references,reference_match_field,positive_references,include_positive_references):
     ref_map = {}
     for record in SeqIO.parse(references,KEY_FASTA):
         match_field = ""
         if include_positive_references:
             if record.id in positive_references:
                 ref_map[record.id] = "p" #quicker for parsing below
             else:
-                ref_map[record.id] = parse_match_field(record.description)
+                ref_map[record.id] = parse_match_field(record.description,reference_match_field)
         else:
-            ref_map[record.id] = parse_match_field(record.description)
+            ref_map[record.id] = parse_match_field(record.description,reference_match_field)
     return ref_map
 
 def make_match_field_to_reference_group_map(ref_map):
@@ -239,12 +239,17 @@ def parse_paf_file(paf_file,
                     barcode,
                     analysis_mode,
                     min_map_quality,
+                    reference_match_field,
                     config):
 
     if is_non_zero_file(paf_file):
 
-        permissive_ref_name_map = make_ref_match_field_map(references_sequences,positive_references,include_positive_references)
-        ref_name_map = make_match_field_to_reference_group_map(permissive_ref_name_map)
+        permissive_ref_name_map = make_ref_match_field_map(references_sequences,reference_match_field,positive_references,include_positive_references)
+        if reference_match_field == VALUE_REFERENCE_GROUP_FIELD:
+            # only run the wibble match if using default value for ref group (now ddns_group)
+            ref_name_map = make_match_field_to_reference_group_map(permissive_ref_name_map)
+        else:
+            ref_name_map = permissive_ref_name_map
 
         min_aln_block = config[KEY_MIN_ALN_BLOCK]
 
@@ -281,19 +286,15 @@ def diversity_report(input_files,csv_out,summary_out,ref_file,config):
     summary_rows = {}
     refs_out = collections.defaultdict(list)
 
-    SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS = SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_VP1
-
-    if config[KEY_ANALYSIS_MODE] == VALUE_ANALYSIS_MODE_WG:
-        SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS = SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_WG
-
+    sample_composition_table_header_fields = config[KEY_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS]
 
     with open(barcodes_csv,"r") as f:
         reader = csv.DictReader(f)
         for row in reader:
             summary_rows[row[KEY_BARCODE]] = {KEY_BARCODE: row[KEY_BARCODE],
                                             KEY_SAMPLE: row[KEY_SAMPLE]
                                             }
-            for field in SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS:
+            for field in sample_composition_table_header_fields:
                 if field not in summary_rows[row[KEY_BARCODE]]: # the rest are ref counters
                     summary_rows[row[KEY_BARCODE]][field] = 0
 
@@ -318,7 +319,7 @@ def diversity_report(input_files,csv_out,summary_out,ref_file,config):
 
 
     with open(summary_out,"w") as fw2:
-        writer = csv.DictWriter(fw2, fieldnames=SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS, lineterminator="\n")
+        writer = csv.DictWriter(fw2, fieldnames=sample_composition_table_header_fields, lineterminator="\n")
         writer.writeheader()
         for barcode in summary_rows:
             row = summary_rows[barcode]

diff --git a/piranha/analysis/stool_functions.py b/piranha/analysis/stool_functions.py
@@ -66,7 +66,7 @@ def gather_fasta_files(summary_info, barcodes_csv, input_cns_list,all_metdata,ru
 
                     record_id += f" {KEY_BARCODE}={barcode}"
                     record_id += f" {KEY_REFERENCE}={ref}"
-                    record_id += f" {VALUE_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"
+                    record_id += f" {VALUE_REFERENCE_GROUP_FIELD}={info[KEY_REFERENCE_GROUP]}"
 
                     if runname:
                         record_id += f" {KEY_RUNNAME}={runname}"

diff --git a/piranha/command.py b/piranha/command.py
@@ -37,6 +37,7 @@ def main(sysargs = sys.argv[1:]):
     i_group.add_argument('-i','--readdir',help="Path to the directory containing fastq read files",dest="readdir")
     i_group.add_argument('-b','--barcodes-csv',help="CSV file describing which barcodes were used on which sample",dest="barcodes_csv")
     i_group.add_argument("-r","--reference-sequences",action="store",dest="reference_sequences",help="Custom reference sequences file.")
+    i_group.add_argument("-rg","--reference-group-field",action="store",help=f"Specify reference description field to group references by. Default: `{VALUE_REFERENCE_GROUP_FIELD}`")
     i_group.add_argument("-nc","--negative-control",action="store",help=f"Sample name of negative control. If multiple samples, supply as comma-separated string of sample names. E.g. `sample01,sample02` Default: `{VALUE_NEGATIVE[0]}`")
     i_group.add_argument("-pc","--positive-control",action="store",help=f"Sample name of positive control. If multiple samples, supply as comma-separated string of sample names. E.g. `sample01,sample02`. Default: `{VALUE_POSITIVE[0]}`")
     i_group.add_argument("-pr","--positive-references",action="store",help=f"Comma separated string of sequences in the reference file to class as positive control sequences.")
@@ -153,6 +154,7 @@ def main(sysargs = sys.argv[1:]):
     input_qc.parse_input_group(args.barcodes_csv,
                                 args.readdir,
                                 args.reference_sequences,
+                                args.reference_group_field,
                                 config)
 
     input_qc.control_group_parsing(args.positive_control,

diff --git a/piranha/input_parsing/initialising.py b/piranha/input_parsing/initialising.py
@@ -37,6 +37,12 @@ def get_defaults():
                     KEY_MIN_READS:VALUE_MIN_READS,   # where to pad to using datafunk
                     KEY_MIN_PCENT:VALUE_MIN_PCENT,
 
+                    KEY_REFERENCE_GROUP_FIELD:VALUE_REFERENCE_GROUP_FIELD,
+                    KEY_REFERENCE_GROUP_VALUES:VALUE_REFERENCE_GROUP_VALUES,
+
+                    KEY_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS:[],
+                    KEY_DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS:[],
+
                     KEY_MIN_READ_LENGTH:READ_LENGTH_DICT[VALUE_ANALYSIS_MODE][0],
                     KEY_MAX_READ_LENGTH:READ_LENGTH_DICT[VALUE_ANALYSIS_MODE][1],
                     KEY_MIN_MAP_QUALITY:VALUE_MIN_MAP_QUALITY,

diff --git a/piranha/input_parsing/input_qc.py b/piranha/input_parsing/input_qc.py
@@ -108,7 +108,7 @@ def parse_barcodes_csv(barcodes_csv,config):
 #             total +=1
 #             passed = False
 #             for field in record.description.split(" "):
-#                 if field.startswith(VALUE_REFERENCE_MATCH_FIELD):
+#                 if field.startswith(VALUE_REFERENCE_GROUP_FIELD):
 #                     passed=True
 #                     continue
 #             if not passed:
@@ -118,7 +118,7 @@ def parse_barcodes_csv(barcodes_csv,config):
 #         sys.exit(-1)
 
 #     if incorrect >= 1:
-#         sys.stderr.write(cyan(f"Supplementary sequences file lacks `{VALUE_REFERENCE_MATCH_FIELD}` annotation in header of {incorrect} out of {total} sequences parsed.\n"))
+#         sys.stderr.write(cyan(f"Supplementary sequences file lacks `{VALUE_REFERENCE_GROUP_FIELD}` annotation in header of {incorrect} out of {total} sequences parsed.\n"))
 #         sys.exit(-1)
 #     else:
 #         print(green("Supplementary sequences:"), total, "sequences parsed.")
@@ -146,7 +146,7 @@ def parse_fasta_file(supplementary_datadir,supp_file,seq_records,no_reference_gr
         total_seqs["total"] +=1
         ref_group = ""
         for field in record.description.split(" "):
-            if field.startswith(VALUE_REFERENCE_MATCH_FIELD):
+            if field.startswith(VALUE_REFERENCE_GROUP_FIELD):
                 ref_group = field.split("=")[1]
 
         if ref_group not in config[KEY_REFERENCES_FOR_CNS]:
@@ -159,7 +159,7 @@ def parse_fasta_file(supplementary_datadir,supp_file,seq_records,no_reference_gr
 def check_there_are_seqs(total_seqs,supplementary_datadir,no_reference_group,config):
     if total_seqs["total"]==0:
         sys.stderr.write(cyan(f"Error: No sequence files matched in `{supplementary_datadir}`.\nEnsure the directory provided contains FASTA files with appropriate annotations in the header.\n"))
-        sys.stderr.write(cyan(f"Header must specify one of {config[KEY_REFERENCES_FOR_CNS]} under {VALUE_REFERENCE_MATCH_FIELD}=X, where X is the appropriate group to be included in phylo pipeline.\n"))
+        sys.stderr.write(cyan(f"Header must specify one of {config[KEY_REFERENCES_FOR_CNS]} under {VALUE_REFERENCE_GROUP_FIELD}=X, where X is the appropriate group to be included in phylo pipeline.\n"))
         sys.exit(-1)
 
     elif no_reference_group:
@@ -338,7 +338,17 @@ def parse_read_dir(readdir,config):
             print(green(f"Barcode {barcode}:\t") + f"0 fastq files")
             print(cyan(f"Warning: No read files identified for barcode `{barcode}`.\nThis may be a negative control or a failed sample, but be aware it will not be analysed."))
 
-def parse_input_group(barcodes_csv,readdir,reference_sequences,config):
+def parse_ref_group_values(description,ref_group_key):
+    fields = description.split(" ")
+
+    ref_group = ""
+    for i in fields:
+        if i.startswith(ref_group_key):
+            ref_group=i.split("=")[1]
+
+    return ref_group
+
+def parse_input_group(barcodes_csv,readdir,reference_sequences,reference_group_field,config):
 
     parse_barcodes_csv(barcodes_csv,config)
 
@@ -347,11 +357,49 @@ def parse_input_group(barcodes_csv,readdir,reference_sequences,config):
     misc.add_file_to_config(KEY_REFERENCE_SEQUENCES,reference_sequences,config)
     misc.check_path_exists(config[KEY_REFERENCE_SEQUENCES])
 
+    misc.add_arg_to_config(KEY_REFERENCE_GROUP_FIELD,reference_group_field,config)
     #check they have unique identifiers
     seq_ids = collections.Counter()
+    ref_group_field_in_headers = True
+    ref_group_values = set()
     for record in SeqIO.parse(config[KEY_REFERENCE_SEQUENCES],"fasta"):
+        ref_group_value = parse_ref_group_values(record.description,config[KEY_REFERENCE_GROUP_FIELD])
+        ref_group_values.add(ref_group_value)
+
+        if not config[KEY_REFERENCE_GROUP_FIELD] in record.description:
+            ref_group_field_in_headers = False
         seq_ids[record.id]+=1
 
+    if not ref_group_field_in_headers:
+        sys.stderr.write(cyan(f"\nReference fasta file contains records without the specified `--reference-group-field` ({config[KEY_REFERENCE_GROUP_FIELD]}).\n"))
+        sys.exit(-1)
+
+    print(f"{len(ref_group_values)}" + green(f" reference group values identified in reference file."))
+
+    config[KEY_REFERENCE_GROUP_VALUES] = ref_group_values
+
+    sample_composition_table_header_fields = SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_BASIC
+    detailed_sample_composition_table_header_fields = []
+
+    for i in sorted(config[KEY_REFERENCE_GROUP_VALUES]):
+        sample_composition_table_header_fields.append(i)
+
+        for j in DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_TEMPLATE:
+            detailed_sample_composition_table_header_fields.append(f"{i}|{j}")
+
+    for i in SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_ADDITIONAL:
+        if i not in sample_composition_table_header_fields:
+            sample_composition_table_header_fields.append(i)
+            for j in DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_TEMPLATE:
+                detailed_sample_composition_table_header_fields.append(f"{i}|{j}")
+
+    for i in DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_ADDITIONAL:
+        detailed_sample_composition_table_header_fields.append(i)
+
+    config[KEY_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS] = sample_composition_table_header_fields
+    config[KEY_DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS] = detailed_sample_composition_table_header_fields
+
+
     more_than_once = []
     for seq in seq_ids:
         if seq_ids[seq]>1: