Skip to content

Commit

Permalink
Merge pull request #217 from polio-nanopore/ev-dev
Browse files Browse the repository at this point in the history
Configurable categories for grouping by, configurable header fields in detailed csv- but dynamically picks.
  • Loading branch information
aineniamh authored Jan 29, 2024
2 parents cafdbae + 19c6b2d commit 2ca977d
Show file tree
Hide file tree
Showing 15 changed files with 223 additions and 72 deletions.
60 changes: 60 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -388,6 +388,66 @@ and piranha will check which ones you have installed with your version of medaka
>```Where a version of Guppy has been used without an exactly corresponding medaka model, the medaka model with the highest version equal to or less than the guppy version should be selected.```

## Process enterovirus sequences

Piranha now has the flexibility to cluster sequencing reads by a specified reference group field (`-rg / --reference-group-field`, or `reference_group_field` in a config file). By default, this field is set to `ddns_group` and any records in the reference file must contain this as an annotation in the header description. The reference file that is supplied with piranha contains this field in all records and the values fall into the following categories:

- Sabin1-related
- Sabin2-related
- Sabin3-related
- WPV1
- WPV2
- WPV3
- NonPolioEV

These fields focus on poliovirus detection and all non-polio enterovirus sequences are collaposed into a single category.

If your use of piranha extends to other enteroviruses, it is now possible to run a more tailored analysis. The reference file that is supplied with piranha has the following format:

```
>Poliovirus1-Sabin_AY184219 ddns_group=Sabin1-related species=Sabin1-related cluster=Sabin1-related
GGGTTAG...CCACATAT
>CoxsackievirusA21_EF015028 ddns_group=NonPolioEV species=CoxsackievirusA21 cluster=CoxsackievirusA21
AGATGCA...CAGTGAGA
```

With the `-rg` flag, you can use the reference file supplied to cluster by ddns_group (default), species or cluster. Then, piranha will dynamically calculate which categories are present during the run. For example, running piranha with `-rg species` will enable clusters of CoxsackievirusA21, or other enterovirus species to be identified and have a consensus sequence generated during piranha analysis.

It is also possible to supply a custom reference file with references and fields of interest to piranha in the following format:

```
>REF1 my_custom_field=group1
ACTGT....CTAGCTA
>REF2 my_custom_field=group1
ACTGT....CTAGCTA
>REF3 my_custom_field=group1
ACTGT....CTAGCTA
>REF4 my_custom_field=group2
ACTGT....CTAGCTA
>REF5 my_custom_field=group2
ACTGT....CTAGCTA
```
and in this example, if `-rg my_custom_field` is supplied, any reads that map to REF1, REF2 or REF3 will be clustered into 'group1' and any reads that map to REF4 or REF5 will be clustered into 'group2'.

Any values will reads mapping will be shown in the final reports.

Pitfalls to note:

1. A caveat with this new approach is that our reference file has been constructed with poliovirus primarily in mind. There are quite a few enterovirus references in the file, covering 113 `species` groups, but there is likely to be gaps in the broad enterovirus diversity covered in this reference file. If you have species of interest you want to be detected, it may be worth supplying a custom reference file.

2. Enteroviruses are a very diverse group of viruses, it may well be that portions of a read match to something in our reference set, but the rest is too divergent to map. In this case, lowering the alignment block length minimum requirement may enable partial sequences to be recovered. Here, the limitation is the reference-based approach and supplementing the reference file with additional reference sequences may resolve this.

3. Mapping against a reference set that contains multiple sequences that are very similar may significantly decrease the alignment quality score from minimap2. This case arises when there are 2 equally likely matches in the database to your sequencing reaad and the software cannot confidently call the best hit. In the default reference set, we only include a single Sabin1, Sabin2 and Sabin3 reference respectively, as including VDPV sequences in the database may confuse the mapping software. By default, reads with low alignment mapping scores will be filtered out. By lowering the mapping quality threshold, these split cases may be included in the analysis for enteroviruses.

Suggested configuration settings can be found [here](https://github.com/polio-nanopore/piranha/blob/main/docs/config.ev.yaml) and we would recommend to only use this mode if the user can sufficiently appreciate the limitations the reference file composition may have on detection.

With the supplied config file, piranha can be run as follows:

```
piranha -c config.ev.yaml
```


## Experimental haplotype calling pipeline

There is now an experimental haplotype calling pipeline that uses freebayes for initial variant calling and flopp for read phasing with the called variants. It has a number internal QC steps for merging identical haplotypes. This pipeline requires further validation, but can theoretically produce multiple consensus sequences for each poliovirus population present within a sample. The pipeline has mostly be tested on VDPVs of two mixtures, but will be further assessed.
Expand Down
10 changes: 8 additions & 2 deletions config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,15 +5,21 @@ runname: "MIN003"
barcodes_csv: "piranha/test/pak_run/barcodes01.csv"
readdir: "piranha/test/pak_run/demultiplexed"

min_read_length: 500
max_read_length: 2300
threads: 1

negative_control: "negative"
positive_control: "positive"

# run_phylo: True
supplementary_datadir: "piranha/test/supp_data"
# supplementary_datadir: "piranha/test/supp_data"
# update_local_database: True
local_database_threshold: 5
# local_database_threshold: 5
reference_group_field: "species"
min_aln_block: 0
min_read_pcent: 0


overwrite: True
no_temp: True
21 changes: 21 additions & 0 deletions docs/config.ev.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
username: "Aine O'Toole"
institute: "IEE"
runname: "MIN003"

barcodes_csv: "/localdisk/home/data/min003/barcodes.csv"
readdir: "/localdisk/home/data/min003/fastq_pass"

min_read_length: 500
max_read_length: 2300
min_aln_block: 0
min_read_pcent: 0

threads: 1

negative_control: "negative"
positive_control: "positive"

reference_group_field: "species"

overwrite: True
no_temp: True
4 changes: 2 additions & 2 deletions environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ channels:
- defaults
dependencies:
- python=3.10
- bcftools=1.11
- bcftools=1.18
- coreutils=9.1
- flopp=0.2.0
- freebayes=1.3.6
Expand All @@ -15,6 +15,6 @@ dependencies:
- minimap2=2.17
- pyabpoa
- rasusa=0.7.1
- samtools=1.11
- samtools=1.18
- snakemake-minimal
- tabix=1.11
2 changes: 1 addition & 1 deletion piranha/analysis/phylo_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -198,7 +198,7 @@ def update_local_database(sample_sequences,detailed_csv,new_db_seqs,new_db_metad
for i in desc_list:
if i.startswith("variant_count"):
count = i.split("=")[1]
elif i.startswith(VALUE_REFERENCE_MATCH_FIELD):
elif i.startswith(VALUE_REFERENCE_GROUP_FIELD):
match_field = i.split("=")[1]

if "Sabin" in match_field:
Expand Down
29 changes: 15 additions & 14 deletions piranha/analysis/preprocessing.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,23 +41,23 @@ def gather_filter_reads_by_length(dir_in,barcode,reads_out,config):
print(green(f"Total passed reads {barcode}:"),len(fastq_records))
SeqIO.write(fastq_records,fw, "fastq")

def parse_match_field(description):
def parse_match_field(description,reference_match_field):
for item in str(description).split(" "):
if item.startswith(VALUE_REFERENCE_MATCH_FIELD):
if item.startswith(reference_match_field):
match_field = item.split("=")[1]
return match_field

def make_ref_match_field_map(references,positive_references,include_positive_references):
def make_ref_match_field_map(references,reference_match_field,positive_references,include_positive_references):
ref_map = {}
for record in SeqIO.parse(references,KEY_FASTA):
match_field = ""
if include_positive_references:
if record.id in positive_references:
ref_map[record.id] = "p" #quicker for parsing below
else:
ref_map[record.id] = parse_match_field(record.description)
ref_map[record.id] = parse_match_field(record.description,reference_match_field)
else:
ref_map[record.id] = parse_match_field(record.description)
ref_map[record.id] = parse_match_field(record.description,reference_match_field)
return ref_map

def make_match_field_to_reference_group_map(ref_map):
Expand Down Expand Up @@ -239,12 +239,17 @@ def parse_paf_file(paf_file,
barcode,
analysis_mode,
min_map_quality,
reference_match_field,
config):

if is_non_zero_file(paf_file):

permissive_ref_name_map = make_ref_match_field_map(references_sequences,positive_references,include_positive_references)
ref_name_map = make_match_field_to_reference_group_map(permissive_ref_name_map)
permissive_ref_name_map = make_ref_match_field_map(references_sequences,reference_match_field,positive_references,include_positive_references)
if reference_match_field == VALUE_REFERENCE_GROUP_FIELD:
# only run the wibble match if using default value for ref group (now ddns_group)
ref_name_map = make_match_field_to_reference_group_map(permissive_ref_name_map)
else:
ref_name_map = permissive_ref_name_map

min_aln_block = config[KEY_MIN_ALN_BLOCK]

Expand Down Expand Up @@ -281,19 +286,15 @@ def diversity_report(input_files,csv_out,summary_out,ref_file,config):
summary_rows = {}
refs_out = collections.defaultdict(list)

SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS = SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_VP1

if config[KEY_ANALYSIS_MODE] == VALUE_ANALYSIS_MODE_WG:
SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS = SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_WG

sample_composition_table_header_fields = config[KEY_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS]

with open(barcodes_csv,"r") as f:
reader = csv.DictReader(f)
for row in reader:
summary_rows[row[KEY_BARCODE]] = {KEY_BARCODE: row[KEY_BARCODE],
KEY_SAMPLE: row[KEY_SAMPLE]
}
for field in SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS:
for field in sample_composition_table_header_fields:
if field not in summary_rows[row[KEY_BARCODE]]: # the rest are ref counters
summary_rows[row[KEY_BARCODE]][field] = 0

Expand All @@ -318,7 +319,7 @@ def diversity_report(input_files,csv_out,summary_out,ref_file,config):


with open(summary_out,"w") as fw2:
writer = csv.DictWriter(fw2, fieldnames=SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS, lineterminator="\n")
writer = csv.DictWriter(fw2, fieldnames=sample_composition_table_header_fields, lineterminator="\n")
writer.writeheader()
for barcode in summary_rows:
row = summary_rows[barcode]
Expand Down
2 changes: 1 addition & 1 deletion piranha/analysis/stool_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ def gather_fasta_files(summary_info, barcodes_csv, input_cns_list,all_metdata,ru

record_id += f" {KEY_BARCODE}={barcode}"
record_id += f" {KEY_REFERENCE}={ref}"
record_id += f" {VALUE_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"
record_id += f" {VALUE_REFERENCE_GROUP_FIELD}={info[KEY_REFERENCE_GROUP]}"

if runname:
record_id += f" {KEY_RUNNAME}={runname}"
Expand Down
2 changes: 2 additions & 0 deletions piranha/command.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ def main(sysargs = sys.argv[1:]):
i_group.add_argument('-i','--readdir',help="Path to the directory containing fastq read files",dest="readdir")
i_group.add_argument('-b','--barcodes-csv',help="CSV file describing which barcodes were used on which sample",dest="barcodes_csv")
i_group.add_argument("-r","--reference-sequences",action="store",dest="reference_sequences",help="Custom reference sequences file.")
i_group.add_argument("-rg","--reference-group-field",action="store",help=f"Specify reference description field to group references by. Default: `{VALUE_REFERENCE_GROUP_FIELD}`")
i_group.add_argument("-nc","--negative-control",action="store",help=f"Sample name of negative control. If multiple samples, supply as comma-separated string of sample names. E.g. `sample01,sample02` Default: `{VALUE_NEGATIVE[0]}`")
i_group.add_argument("-pc","--positive-control",action="store",help=f"Sample name of positive control. If multiple samples, supply as comma-separated string of sample names. E.g. `sample01,sample02`. Default: `{VALUE_POSITIVE[0]}`")
i_group.add_argument("-pr","--positive-references",action="store",help=f"Comma separated string of sequences in the reference file to class as positive control sequences.")
Expand Down Expand Up @@ -153,6 +154,7 @@ def main(sysargs = sys.argv[1:]):
input_qc.parse_input_group(args.barcodes_csv,
args.readdir,
args.reference_sequences,
args.reference_group_field,
config)

input_qc.control_group_parsing(args.positive_control,
Expand Down
6 changes: 6 additions & 0 deletions piranha/input_parsing/initialising.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,12 @@ def get_defaults():
KEY_MIN_READS:VALUE_MIN_READS, # where to pad to using datafunk
KEY_MIN_PCENT:VALUE_MIN_PCENT,

KEY_REFERENCE_GROUP_FIELD:VALUE_REFERENCE_GROUP_FIELD,
KEY_REFERENCE_GROUP_VALUES:VALUE_REFERENCE_GROUP_VALUES,

KEY_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS:[],
KEY_DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS:[],

KEY_MIN_READ_LENGTH:READ_LENGTH_DICT[VALUE_ANALYSIS_MODE][0],
KEY_MAX_READ_LENGTH:READ_LENGTH_DICT[VALUE_ANALYSIS_MODE][1],
KEY_MIN_MAP_QUALITY:VALUE_MIN_MAP_QUALITY,
Expand Down
58 changes: 53 additions & 5 deletions piranha/input_parsing/input_qc.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ def parse_barcodes_csv(barcodes_csv,config):
# total +=1
# passed = False
# for field in record.description.split(" "):
# if field.startswith(VALUE_REFERENCE_MATCH_FIELD):
# if field.startswith(VALUE_REFERENCE_GROUP_FIELD):
# passed=True
# continue
# if not passed:
Expand All @@ -118,7 +118,7 @@ def parse_barcodes_csv(barcodes_csv,config):
# sys.exit(-1)

# if incorrect >= 1:
# sys.stderr.write(cyan(f"Supplementary sequences file lacks `{VALUE_REFERENCE_MATCH_FIELD}` annotation in header of {incorrect} out of {total} sequences parsed.\n"))
# sys.stderr.write(cyan(f"Supplementary sequences file lacks `{VALUE_REFERENCE_GROUP_FIELD}` annotation in header of {incorrect} out of {total} sequences parsed.\n"))
# sys.exit(-1)
# else:
# print(green("Supplementary sequences:"), total, "sequences parsed.")
Expand Down Expand Up @@ -146,7 +146,7 @@ def parse_fasta_file(supplementary_datadir,supp_file,seq_records,no_reference_gr
total_seqs["total"] +=1
ref_group = ""
for field in record.description.split(" "):
if field.startswith(VALUE_REFERENCE_MATCH_FIELD):
if field.startswith(VALUE_REFERENCE_GROUP_FIELD):
ref_group = field.split("=")[1]

if ref_group not in config[KEY_REFERENCES_FOR_CNS]:
Expand All @@ -159,7 +159,7 @@ def parse_fasta_file(supplementary_datadir,supp_file,seq_records,no_reference_gr
def check_there_are_seqs(total_seqs,supplementary_datadir,no_reference_group,config):
if total_seqs["total"]==0:
sys.stderr.write(cyan(f"Error: No sequence files matched in `{supplementary_datadir}`.\nEnsure the directory provided contains FASTA files with appropriate annotations in the header.\n"))
sys.stderr.write(cyan(f"Header must specify one of {config[KEY_REFERENCES_FOR_CNS]} under {VALUE_REFERENCE_MATCH_FIELD}=X, where X is the appropriate group to be included in phylo pipeline.\n"))
sys.stderr.write(cyan(f"Header must specify one of {config[KEY_REFERENCES_FOR_CNS]} under {VALUE_REFERENCE_GROUP_FIELD}=X, where X is the appropriate group to be included in phylo pipeline.\n"))
sys.exit(-1)

elif no_reference_group:
Expand Down Expand Up @@ -338,7 +338,17 @@ def parse_read_dir(readdir,config):
print(green(f"Barcode {barcode}:\t") + f"0 fastq files")
print(cyan(f"Warning: No read files identified for barcode `{barcode}`.\nThis may be a negative control or a failed sample, but be aware it will not be analysed."))

def parse_input_group(barcodes_csv,readdir,reference_sequences,config):
def parse_ref_group_values(description,ref_group_key):
fields = description.split(" ")

ref_group = ""
for i in fields:
if i.startswith(ref_group_key):
ref_group=i.split("=")[1]

return ref_group

def parse_input_group(barcodes_csv,readdir,reference_sequences,reference_group_field,config):

parse_barcodes_csv(barcodes_csv,config)

Expand All @@ -347,11 +357,49 @@ def parse_input_group(barcodes_csv,readdir,reference_sequences,config):
misc.add_file_to_config(KEY_REFERENCE_SEQUENCES,reference_sequences,config)
misc.check_path_exists(config[KEY_REFERENCE_SEQUENCES])

misc.add_arg_to_config(KEY_REFERENCE_GROUP_FIELD,reference_group_field,config)
#check they have unique identifiers
seq_ids = collections.Counter()
ref_group_field_in_headers = True
ref_group_values = set()
for record in SeqIO.parse(config[KEY_REFERENCE_SEQUENCES],"fasta"):
ref_group_value = parse_ref_group_values(record.description,config[KEY_REFERENCE_GROUP_FIELD])
ref_group_values.add(ref_group_value)

if not config[KEY_REFERENCE_GROUP_FIELD] in record.description:
ref_group_field_in_headers = False
seq_ids[record.id]+=1

if not ref_group_field_in_headers:
sys.stderr.write(cyan(f"\nReference fasta file contains records without the specified `--reference-group-field` ({config[KEY_REFERENCE_GROUP_FIELD]}).\n"))
sys.exit(-1)

print(f"{len(ref_group_values)}" + green(f" reference group values identified in reference file."))

config[KEY_REFERENCE_GROUP_VALUES] = ref_group_values

sample_composition_table_header_fields = SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_BASIC
detailed_sample_composition_table_header_fields = []

for i in sorted(config[KEY_REFERENCE_GROUP_VALUES]):
sample_composition_table_header_fields.append(i)

for j in DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_TEMPLATE:
detailed_sample_composition_table_header_fields.append(f"{i}|{j}")

for i in SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_ADDITIONAL:
if i not in sample_composition_table_header_fields:
sample_composition_table_header_fields.append(i)
for j in DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_TEMPLATE:
detailed_sample_composition_table_header_fields.append(f"{i}|{j}")

for i in DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS_ADDITIONAL:
detailed_sample_composition_table_header_fields.append(i)

config[KEY_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS] = sample_composition_table_header_fields
config[KEY_DETAILED_SAMPLE_COMPOSITION_TABLE_HEADER_FIELDS] = detailed_sample_composition_table_header_fields


more_than_once = []
for seq in seq_ids:
if seq_ids[seq]>1:
Expand Down
Loading

0 comments on commit 2ca977d

Please sign in to comment.