Increase min depth of coverage for Flu assembly using IRMA #676
Comments
Thank you for pointing this out, @molly-hetheringtonrauth. Definitely a lot of non-intuitive action happening under the hood here. We appreciate the insight and are taking a closer look at things now!
This is super helpful, @molly-hetheringtonrauth. Thanks so much for making an issue here. We're going to be updating our IRMA task and it's useful to see how you're running the tool. I may reach out to you with questions in the next couple of weeks.
Happy to help! And looking forward to the updated tool!
@molly-hetheringtonrauth quick question. Roughly how often do you see IRMA output non-ATGCN characters in the I don't believe I've seen it happen at all before, but it could just be the datasets I'm working with... (or I'm not looking close enough!)
@kapsakcj, like you I haven't seen many degenerate bases (but also haven't looked closely); however, I prefer not to work with degenerate bases, so that's why I changed all the degenerate bases to N. Do you have thoughts about this? Also currently working on a new branch
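For readers who want to try the same masking step, here is a minimal sketch of replacing degenerate bases with N. It is not the code from the CDPHE repository; the function names `mask_degenerate` and `mask_fasta` are hypothetical, and it assumes "degenerate" means the ten IUPAC ambiguity codes other than N (R, Y, S, W, K, M, B, D, H, V).

```python
import re

# Assumption: "degenerate" means the IUPAC ambiguity codes other than N.
# A, C, G, T, N and any other character (gaps, periods) are left untouched.
IUPAC_AMBIGUITY = re.compile(r"[RYSWKMBDHVryswkmbdhv]")


def mask_degenerate(sequence: str) -> str:
    """Return the sequence with every IUPAC ambiguity code replaced by N."""
    return IUPAC_AMBIGUITY.sub("N", sequence)


def mask_fasta(fasta_text: str) -> str:
    """Mask the sequence lines of a FASTA string, leaving header lines as-is."""
    masked_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            masked_lines.append(line)  # headers may legitimately contain R, Y, ...
        else:
            masked_lines.append(mask_degenerate(line))
    return "\n".join(masked_lines) + "\n"
```

Header lines are skipped on purpose, since a sample name containing letters such as R or Y should not be altered.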
Yeah I agree, I think Ns would be more amenable to downstream analyses like with VADR, nextclade, and tree building software. I'm not sure if they are capable of handling those characters. We have not considered co-infection samples & secondary assemblies. The TheiaCoV workflow currently doesn't do anything with those but maybe that's worth exploring. Thanks for bringing this up!
@molly-hetheringtonrauth FYI the default value for mixed alleles is I take back what I said previously: we are planning to go the route of outputting FASTAs with degenerate bases, as they don't seem to negatively impact downstream tools like VADR, nextclade, and others. After a closer look, I do see degenerate bases output from IRMA at a low frequency. Anywhere from 0-20 degenerate bases are present in the whole genome, but it will really vary depending on your input FASTQ files. It's the periods Still a work in progress, but you can see our dev branch for this work here: main...smw-flu-dev
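To check how many degenerate bases a given assembly contains, something like the following sketch could be used. It is illustrative only: `count_degenerate` is a hypothetical helper, not part of IRMA or TheiaCoV, and it assumes that degenerate bases are the IUPAC ambiguity codes other than N.

```python
from collections import Counter

# Assumption: the IUPAC ambiguity codes other than N count as degenerate.
IUPAC_AMBIGUITY = set("RYSWKMBDHV")


def count_degenerate(fasta_text: str) -> Counter:
    """Count each IUPAC ambiguity code across all sequence lines of a FASTA string."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip header lines
        counts.update(ch for ch in line.upper() if ch in IUPAC_AMBIGUITY)
    return counts
```

Summing the returned counts (`sum(counts.values())`) gives a per-genome total that can be compared against the 0-20 range mentioned above.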
📌 Explain the Request
Regarding the flu module of the `wf_theicov_illumina_pe.wdl`: Based on the `task_irma.wdl`, you are pulling the consensus fasta files from the main sample directory output from IRMA. These fasta files use a min depth of coverage of 1x for base calling. Based on my conversations with CDC, you want to set `MIN_CONS_SUPPORT="50"` in the IRMA config file, which will use a min depth of coverage of 50x for base calling. And here's the tricky part: once you have set the config file, the consensus fasta files in the main sample directory output from IRMA will still only reflect base calling using a min depth of coverage of 1x. The consensus fasta files in the `amended_consensus` directory are the ones that reflect base calling using a min depth of 50x. Additionally, the consensus fasta files in the `amended_consensus` directory will also use IUPAC nucleotides for mixed base calls (I'm not sure what the minor allele frequency has to be in order for it to be considered a mixed base call). And finally, the consensus fasta files in the `amended_consensus` directory will be named according to their GenBank segment number (note the difference in segment number for PB1 and PB2 for Flu A and Flu B).

You can see how I handled using the consensus fasta files in the `amended_consensus` directory at our CDPHE-bioinformatics/CDPHE-influenza/ GitHub repo.
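As a rough illustration of the naming point above, the sketch below maps files in `amended_consensus` back to gene segment names. Several things here are assumptions rather than details taken from this issue: the `<sample>_<number>.fa` filename pattern, the segment tables (the usual GenBank numbering, where segment 1 is PB2 for Flu A and PB1 for Flu B), and the helper names `segment_number` and `amended_consensus_by_segment`, which are hypothetical. It is not the approach used in the CDPHE repository.

```python
import re
from pathlib import Path

# Assumed GenBank segment numbering; note PB2/PB1 are swapped between A and B.
SEGMENTS = {
    "A": {1: "PB2", 2: "PB1", 3: "PA", 4: "HA", 5: "NP", 6: "NA", 7: "MP", 8: "NS"},
    "B": {1: "PB1", 2: "PB2", 3: "PA", 4: "HA", 5: "NP", 6: "NA", 7: "MP", 8: "NS"},
}

# Assumed filename pattern: <sample>_<segment number>.fa
_AMENDED_NAME = re.compile(r".+_(\d+)\.fa")


def segment_number(fasta_path):
    """Return the trailing segment number of an amended consensus file, or None."""
    match = _AMENDED_NAME.fullmatch(Path(fasta_path).name)
    return int(match.group(1)) if match else None


def amended_consensus_by_segment(irma_sample_dir, flu_type):
    """Map segment name -> path for files under <irma_sample_dir>/amended_consensus."""
    table = SEGMENTS[flu_type]
    found = {}
    for path in sorted(Path(irma_sample_dir, "amended_consensus").glob("*.fa")):
        number = segment_number(path)
        if number in table:  # ignore files that do not follow the assumed pattern
            found[table[number]] = path
    return found
```

For example, `amended_consensus_by_segment("sample1", "A")` would return a dictionary keyed by segment name for whichever numbered files exist under `sample1/amended_consensus`.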