Increase min depth of coverage for Flu assembly using IRMA #676

molly-hetheringtonrauth · 2024-11-19T18:56:56Z

🆒

📌 Explain the Request

Regarding the flu module of wf_theiacov_illumina_pe.wdl:

Based on task_irma.wdl, you are pulling the consensus FASTA files from the main sample directory that IRMA outputs. These FASTA files use a minimum depth of coverage of 1x for base calling. Based on my conversations with CDC, you want to set MIN_CONS_SUPPORT="50" in the IRMA config file, which applies a minimum depth of coverage of 50x for base calling.

Here's the tricky part: even once you have set the config file, the consensus FASTA files in the main sample directory still only reflect base calling at a minimum depth of 1x. The consensus FASTA files that reflect base calling at 50x are in the amended_consensus directory. The files in amended_consensus also use IUPAC nucleotide codes for mixed base calls (I'm not sure what the minor allele frequency has to be for a position to be considered a mixed base call). Finally, the consensus FASTA files in amended_consensus are named according to their GenBank segment number (note the difference in segment number for PB1 and PB2 between Flu A and Flu B):

FluA=["PB2"]="1" ["PB1"]="2" ["PA"]="3" ["HA"]="4" ["NP"]="5" ["NA"]="6" ["MP"]="7" ["NS"]="8" 
FluB=["PB1"]="1" ["PB2"]="2" ["PA"]="3" ["HA"]="4" ["NP"]="5" ["NA"]="6" ["MP"]="7" ["NS"]="8"
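
For anyone translating the numbered files back to segment names, here is a minimal Python sketch of the mapping above. The helper names and the `<sample>_<segment number>.fa` filename pattern are assumptions for illustration; the thread only says the files are named by GenBank segment number, so check the pattern against your own IRMA output.

```python
# GenBank segment numbering as listed above.
# Note that PB1 and PB2 swap positions between Flu A and Flu B.
SEGMENT_NUMBERS = {
    "A": {"PB2": 1, "PB1": 2, "PA": 3, "HA": 4, "NP": 5, "NA": 6, "MP": 7, "NS": 8},
    "B": {"PB1": 1, "PB2": 2, "PA": 3, "HA": 4, "NP": 5, "NA": 6, "MP": 7, "NS": 8},
}


def segment_name(flu_type: str, number: int) -> str:
    """Return the segment name for a GenBank segment number ("A" or "B")."""
    by_number = {num: name for name, num in SEGMENT_NUMBERS[flu_type].items()}
    return by_number[number]


def parse_amended_filename(filename: str, flu_type: str) -> tuple[str, str]:
    """Split a filename into (sample, segment name).

    ASSUMPTION: files look like "<sample>_<segment number>.fa".
    """
    stem = filename.rsplit(".", 1)[0]        # drop the extension
    sample, number = stem.rsplit("_", 1)     # the last underscore separates the number
    return sample, segment_name(flu_type, int(number))
```

With this, `parse_amended_filename("sample01_1.fa", "A")` resolves to PB2, while the same file number for a Flu B sample resolves to PB1.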

You can see how I handled the consensus FASTA files from the amended_consensus directory in our CDPHE-bioinformatics/CDPHE-influenza GitHub repo.


kevinlibuit · 2024-12-02T20:19:18Z

Thank you for pointing this out, @molly-hetheringtonrauth. Definitely a lot of non-intuitive action happening under the hood here.

We appreciate the insight and are taking a closer look at things now!

kapsakcj · 2025-01-17T21:39:15Z

This is super helpful, @molly-hetheringtonrauth. Thanks so much for making an issue here.

We're going to be updating our IRMA task and it's useful to see how you're running the tool.

I may reach out to you with questions in the next couple of weeks.

molly-hetheringtonrauth · 2025-01-19T18:30:50Z

Happy to help! And looking forward to the updated tool!

kapsakcj · 2025-01-24T17:43:35Z

@molly-hetheringtonrauth quick question.

Roughly how often do you see IRMA output non-ATGCN characters in the amended_consensus/*.fa FASTA files? I'm thinking about the IUPAC characters for degenerate bases: https://github.com/CDPHE-bioinformatics/CDPHE-influenza/blob/51b8d62e992fd853908623e8b9f2ada407836893/tasks/irma_task.wdl#L58

I don't believe I've seen it happen at all before, but it could just be the datasets I'm working with... (or I'm not looking closely enough!)

molly-hetheringtonrauth · 2025-01-24T18:20:50Z

@kapsakcj, like you I haven't seen many degenerate bases (but also haven't looked closely). However, I prefer not to work with degenerate bases, which is why I changed all the degenerate bases to N. Do you have thoughts about this?
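
The masking step described here can be sketched in Python as follows. This is not the code from the CDPHE-influenza repo (see the link earlier in the thread for that); the function name is hypothetical, and the sketch only shows the idea of turning every IUPAC ambiguity code into N while leaving header lines alone.

```python
# IUPAC ambiguity codes other than N (two- and three-base codes).
IUPAC_AMBIGUOUS = set("RYSWKMBDHV")


def mask_degenerate(fasta_text: str) -> str:
    """Replace degenerate bases with N in sequence lines; keep headers unchanged."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header line: never modified
        else:
            out.append(
                "".join("N" if c.upper() in IUPAC_AMBIGUOUS else c for c in line)
            )
    return "\n".join(out) + "\n"
```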

Also, I'm currently working on a new branch, mhr-coinfection, that pulls out the secondary assembly IRMA creates so that we can flag potential coinfections (e.g. H1/H5 coinfections). Does Theiagen's pipeline do that?

kapsakcj · 2025-01-24T18:28:36Z

Yeah, I agree. I think Ns would be more amenable to downstream analyses like VADR, nextclade, and tree-building software. I'm not sure whether those tools can handle IUPAC characters.

We have not considered co-infection samples & secondary assemblies. The TheiaCoV workflow currently doesn't do anything with those but maybe that's worth exploring. Thanks for bringing this up!

kapsakcj · 2025-01-27T23:39:09Z

@molly-hetheringtonrauth FYI, the default threshold for calling mixed alleles is 0.2 (20%) and is controlled by the IRMA param MIN_AMBIG. We're exposing this option to the user in the TheiaCoV workflow.
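
To make the two thresholds concrete, here is a simplified Python illustration of how a 20% mixed-allele cutoff and a 50x minimum depth could combine at a single position. This is not IRMA's code or algorithm; the function, its parameter names, and the way the thresholds interact are all assumptions for illustration, and only the 0.2 and 50 values come from this thread.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(counts: dict[str, int], min_ambig: float = 0.2, min_depth: int = 50) -> str:
    """Toy base caller for one position (ASSUMPTION: not how IRMA implements it).

    counts maps each of A/C/G/T to its read count at this position.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too little coverage to call a base
    top = max(counts, key=counts.get)
    # Keep the most common base plus any base at or above the mixed-allele cutoff.
    kept = {base for base, n in counts.items() if n / depth >= min_ambig}
    kept.add(top)
    return IUPAC[frozenset(kept)]
```

Under these assumptions, a position with 75 A reads and 25 G reads is called R, one with 95 A and 5 G is called A, and one with only 30 reads in total is called N.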

I take back what I said previously: we are planning to output FASTAs with degenerate bases, as they don't seem to negatively impact downstream tools like VADR, nextclade, and others.

After a closer look, I do see degenerate bases in IRMA output at a low frequency: anywhere from 0 to 20 degenerate bases across the whole genome, though it will vary depending on your input FASTQ files.
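
A quick way to check this on your own data is to tally the ambiguity codes in a consensus FASTA. The sketch below is a minimal, hypothetical helper (not part of TheiaCoV or IRMA) that takes the FASTA contents as a string and counts IUPAC ambiguity codes other than N.

```python
from collections import Counter

# IUPAC ambiguity codes other than N.
DEGENERATE = set("RYSWKMBDHV")


def count_degenerate(fasta_text: str) -> Counter:
    """Count degenerate bases across all sequence lines, skipping headers."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # header line
        for c in line.upper():
            if c in DEGENERATE:
                counts[c] += 1
    return counts
```

Calling `sum(count_degenerate(text).values())` on the concatenated segments gives the whole-genome total to compare against the range mentioned above.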

It's the periods (.) that cause issues with downstream tools like VADR. We've seen this in the past and have some sed commands to replace them with Ns.
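
The sed commands themselves aren't shown in this thread, so the following is a Python stand-in for the same idea rather than the workflow's actual implementation. The function name is hypothetical; the key detail is that only sequence lines are touched, so periods in header lines survive.

```python
def replace_periods(fasta_text: str) -> str:
    """Replace '.' with 'N' in sequence lines only; headers are left as-is."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # a header may legitimately contain periods
        else:
            out.append(line.replace(".", "N"))
    return "\n".join(out) + "\n"
```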

Still a work in progress, but you can see our dev branch for this work here: main...smw-flu-dev

sage-wright self-assigned this Dec 26, 2024

kapsakcj linked a pull request Feb 4, 2025 that will close this issue

TheiaCoV Flu optimizations #749

Draft

13 tasks
