Increase min depth of coverage for Flu assembly using IRMA #676

Open
molly-hetheringtonrauth opened this issue Nov 19, 2024 · 7 comments · May be fixed by #749

@molly-hetheringtonrauth

🆒

📌 Explain the Request

Regarding the flu module of wf_theiacov_illumina_pe.wdl:

Based on task_irma.wdl, you are pulling the consensus FASTA files from the main sample directory that IRMA outputs. These FASTA files use a minimum depth of coverage of 1x for base calling. Based on my conversations with CDC, you want to set MIN_CONS_SUPPORT="50" in the IRMA config file, which sets the minimum depth of coverage for base calling to 50x.

Here's the tricky part: even after you set this in the config file, the consensus FASTA files in the main sample directory still only reflect base calling at a minimum depth of 1x. The consensus FASTA files that reflect base calling at a minimum depth of 50x are in the amended_consensus directory.

Additionally, the consensus FASTA files in the amended_consensus directory use IUPAC nucleotide codes for mixed base calls (I'm not sure what the minor allele frequency has to be for a position to be considered a mixed base call). Finally, the consensus FASTA files in the amended_consensus directory are named according to their GenBank segment number (note the difference in segment number for PB1 and PB2 between Flu A and Flu B):

FluA=["PB2"]="1" ["PB1"]="2" ["PA"]="3" ["HA"]="4" ["NP"]="5" ["NA"]="6" ["MP"]="7" ["NS"]="8" 
FluB=["PB1"]="1" ["PB2"]="2" ["PA"]="3" ["HA"]="4" ["NP"]="5" ["NA"]="6" ["MP"]="7" ["NS"]="8" 
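As a rough illustration of the numbering above, here is a minimal Python sketch that turns a segment-numbered file name back into a segment name. The function name `segment_from_filename` and the assumed `<sample>_<number>.fa` file name pattern are hypothetical and used for illustration only; check them against your own amended_consensus output.

```python
import re

# Segment numbering as listed in this issue; PB1 and PB2 swap numbers
# between influenza A and influenza B.
SEGMENT_NUMBERS = {
    "A": {"PB2": 1, "PB1": 2, "PA": 3, "HA": 4, "NP": 5, "NA": 6, "MP": 7, "NS": 8},
    "B": {"PB1": 1, "PB2": 2, "PA": 3, "HA": 4, "NP": 5, "NA": 6, "MP": 7, "NS": 8},
}

# Invert each mapping (number -> name) so a file's number can be looked up.
SEGMENT_NAMES = {
    flu_type: {number: name for name, number in mapping.items()}
    for flu_type, mapping in SEGMENT_NUMBERS.items()
}


def segment_from_filename(filename: str, flu_type: str) -> str:
    """Return the segment name for a segment-numbered consensus file.

    Assumption: the file name ends in `_<segment number>.fa`.
    """
    match = re.search(r"_(\d)\.fa$", filename)
    if match is None:
        raise ValueError(f"no segment number found in {filename!r}")
    return SEGMENT_NAMES[flu_type][int(match.group(1))]
```

For example, `segment_from_filename("sample01_1.fa", "A")` gives PB2, while the same file name with `"B"` gives PB1, which is the swap called out above.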

You can see how I handled using the consensus FASTA files in the amended_consensus directory in our CDPHE-bioinformatics/CDPHE-influenza GitHub repo.

@kevinlibuit
Contributor

Thank you for pointing this out, @molly-hetheringtonrauth. Definitely a lot of non-intuitive action happening under the hood here.

We appreciate the insight and are taking a closer look at things now!

@sage-wright self-assigned this Dec 26, 2024
@kapsakcj
Contributor

This is super helpful, @molly-hetheringtonrauth. Thanks so much for making an issue here.

We're going to be updating our IRMA task and it's useful to see how you're running the tool.

I may reach out to you with questions in the next couple of weeks.

@molly-hetheringtonrauth
Author

Happy to help! And looking forward to the updated tool!

@kapsakcj
Contributor

@molly-hetheringtonrauth quick question.

Roughly how often do you see IRMA output non-ATGCN characters in the amended_consensus/*.fa FASTA files? I'm thinking about the IUPAC characters for degenerate bases: https://github.com/CDPHE-bioinformatics/CDPHE-influenza/blob/51b8d62e992fd853908623e8b9f2ada407836893/tasks/irma_task.wdl#L58

I don't believe I've seen it happen at all before, but it could just be the datasets I'm working with... (or I'm not looking closely enough!)

@molly-hetheringtonrauth
Author

@kapsakcj, like you, I haven't seen many degenerate bases (but I also haven't looked closely). However, I prefer not to work with degenerate bases, so that's why I changed all the degenerate bases to N. Do you have thoughts about this?
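A minimal Python sketch of that kind of masking, for illustration: it replaces IUPAC ambiguity codes on sequence lines with N and leaves header lines alone. This is not the code from the CDPHE-influenza task; the function name `mask_degenerate_bases` is hypothetical, and gap or period characters are deliberately left untouched here.

```python
import re

# IUPAC ambiguity codes (two- and three-base codes), either case.
# A, C, G, T, and N are left as they are.
_IUPAC_AMBIGUOUS = re.compile(r"[RYSWKMBDHVryswkmbdhv]")


def mask_degenerate_bases(fasta_text: str) -> str:
    """Replace IUPAC ambiguity codes on sequence lines with N."""
    masked_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header lines may contain any letters; do not modify them.
            masked_lines.append(line)
        else:
            masked_lines.append(_IUPAC_AMBIGUOUS.sub("N", line))
    return "\n".join(masked_lines) + "\n"
```

Skipping header lines matters because sample names often contain letters such as R, Y, or S that would otherwise be rewritten.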

Also, I'm currently working on a new branch, mhr-coinfection, that pulls out the secondary assembly that IRMA creates so that we can flag potential coinfections (e.g. H1/H5 coinfections). Does Theiagen's pipeline do that?

@kapsakcj
Contributor

Yeah, I agree. I think Ns would be more amenable to downstream analyses with tools like VADR, nextclade, and tree building software. I'm not sure whether those tools are capable of handling degenerate characters.

We have not considered co-infection samples & secondary assemblies. The TheiaCoV workflow currently doesn't do anything with those but maybe that's worth exploring. Thanks for bringing this up!

@kapsakcj
Contributor

@molly-hetheringtonrauth FYI, the default value for mixed alleles is 0.2 (20%), and it is controlled by the IRMA param MIN_AMBIG. We're exposing this option to the user in the TheiaCoV workflow.
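To make the two settings named in this thread concrete, here is a small Python sketch that renders them as `KEY="value"` lines in the style quoted at the top of the issue. The function name `render_irma_overrides` and the range checks are hypothetical; how the resulting text is handed to IRMA is not shown and depends on your setup.

```python
def render_irma_overrides(min_cons_support: int = 50, min_ambig: float = 0.2) -> str:
    """Render MIN_CONS_SUPPORT and MIN_AMBIG as KEY="value" lines.

    Defaults follow the values mentioned in this thread: 50x minimum
    depth for base calling and 0.2 for mixed alleles.
    """
    # Illustrative sanity checks only; these limits are assumptions,
    # not limits taken from IRMA itself.
    if min_cons_support < 1:
        raise ValueError("min_cons_support must be at least 1")
    if not 0.0 < min_ambig < 1.0:
        raise ValueError("min_ambig must be between 0 and 1")
    return (
        f'MIN_CONS_SUPPORT="{min_cons_support}"\n'
        f'MIN_AMBIG="{min_ambig}"\n'
    )
```

Calling `render_irma_overrides()` with no arguments produces the 50x and 0.2 settings discussed here; passing other values is how a workflow could expose them as user inputs.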

I take back what I said previously: we are planning to go the route of outputting FASTAs with degenerate bases, as they don't seem to negatively impact downstream tools like VADR, nextclade, and others.

After a closer look, I do see degenerate bases output from IRMA at a low frequency: anywhere from 0 to 20 degenerate bases in the whole genome, but it will really vary depending on your input FASTQ files.

It's the periods (.) that cause issues with downstream tools like VADR. We've seen this in the past and have some sed commands to replace those with Ns.
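The sed commands themselves are not shown in this thread. As a stand-in, here is a Python sketch of the same idea: replace periods on sequence lines with N while keeping IUPAC codes and header lines unchanged. The function name `replace_periods_with_n` is hypothetical.

```python
def replace_periods_with_n(fasta_text: str) -> str:
    """Replace '.' with 'N' on sequence lines only."""
    cleaned_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Sample names in headers can legitimately contain periods.
            cleaned_lines.append(line)
        else:
            cleaned_lines.append(line.replace(".", "N"))
    return "\n".join(cleaned_lines) + "\n"
```

Restricting the replacement to sequence lines is the design choice to note: a blanket substitution over the whole file would also rewrite periods in sample names.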

Still a work in progress, but you can see our dev branch for this work here: main...smw-flu-dev

@kapsakcj linked a pull request Feb 4, 2025 that will close this issue
13 tasks
4 participants