Replies: 6 comments 2 replies
-
Tagging @andrewprzh |
Beta Was this translation helpful? Give feedback.
-
Dear Martin Thank you for your question and thanks for pointing out inconsistencies in the manuals! Here is a brief description of hybrid workflow in SPAdes (and rnaSPAdes):
Regarding the polyA and adapters. rnaSPAdes does remove polyA k-mers during graph construction, but does not trim long reads in any way. Since long reads are mapped to the graph, my guess is that polyA tails and adapters will simply remain unmapped in most cases. However, if trimming can be easily done, I'd suggest to feed trimmed reads to ensure these artifacts do not affect the assembly. rnaSPAdes does not detect polyAs or adapters, so if reads are fed using --pacbio they are assumed to be PacBio reads with possible truncation at the ends. When reads are provided via --fl-rna, rnaSPAdes assumes that they capture the entire transcript and outputs the entire corresponding path in the graph (without duplicates of course). And yes, this option was designed specifically to support FLNCs. So in your case I'd probably extract FLNC reads (not HQ isoforms) and feed them with --fl-rna. Don't hesitate to ask more question if something remains unclear. |
Beta Was this translation helpful? Give feedback.
-
Dear Andrey, Thanks a lot for the information, this was very enlightening. First, I would like to note that PacBio FLNC reads are not necessarily true full-length reads. There are IsoSeq library preparation protocols available that can be used to select for full-length mRNA transcripts, based on an integer polyA tail AND a 5' cap. But, if such protocol is NOT used, than IsoSeq also captures mRNA transcripts that are partially 5'-degraded (I think the official PacBio IsoSeq protocol does not include 5' cap selection). In such cases, FLNC reads have integer, flanking cDNA primers and a poly(A) tails, but also include partially 5'-degraded mRNA. Not sure if this will effect the outcome of the --fl-rna option. Why I started this discussion: Comparison of Hybrid assembly vs. Illumina-only assembly, using rnaQUAST:
Alignment of Illumina-only assembly and hybrid assembly to reference genome (non-model organism, circa 150 Mbp):
It appears that the --fl-rna option is not aware about strand specific information, and I am curious if you encountered similar problems. We will now re-run hybrid rnaSPAdes with --pacbio and CCS reads (not using --fl-rna). I hope this makes a difference. Best, |
Beta Was this translation helpful? Give feedback.
-
Hi, We compared now the --pacbio and --fl-rna option of hybrid rnaSPAdes.
Based on this, it appears to me that the current best practice should be to use PacBio CCS reads and the --pacbio option for hybrid rnaSPAdes. Did you observed similar problems with --fl-rna? Best, |
Beta Was this translation helpful? Give feedback.
-
Dear Martin I'm very sorry for a delayed response. Thanks for pointing out details of IsoSeq protocols! As far as I see in the code, there shouldn't be a big problem. When there are multiple reads from the same isoform, rnaSPAdes selects the longest path in the graph, and all subpaths (originated from 5' degraded molecules) are removed. Thus, I expect the most complete isoform to be reported. Thank you for sharing you results. That is, indeed discouraging and encouraging at same time depending on which IGV panel I look :) In fact, I cannot say I have experienced something like this before, but I see a clear flaw here. Would it be possible to share some data so I could dig into the problem? E.g. subset of reads that corresponds to the shown genome region? Out of curiosity, why de novo assembly was used when the reference genome and annotation are available? All the best |
Beta Was this translation helpful? Give feedback.
-
Hi Andrey, Thanks a lot for the reply. Yes, I would be happy to share some data. I will send it to you in private. For your question: The reference genome is from a non-model organism with high genomic variability; the assembly is fragmented, with unclassified repeats and duplications and possible contaminations. Also the annotations are rather uncertain. I was afraid of introducing artifacts when using the reference genome for the transcriptome assembly (but I never tried). Good luck with IsoQuant, this is surely a worthwhile effort! (especially when PacBio sequencing depth will eventually become competitive). Best, |
Beta Was this translation helpful? Give feedback.
-
Dear SPAdes team,
I am a bit puzzled about the best practice for the hybrid rnaSPAdes algorithm, when used with PacBio IsoSeq long reads and Illumina short reads.
Form the different spades manuals (and publications), I understood that I can choose for PacBio long read integration between the options --pacbio, -s and --fl-rna.
The main spades manual (https://github.com/ablab/spades) explains to use the --pacbio option ONLY with PacBio CLR reads (which are not the right reads for hybrid rnaSPAdes, since the PacBio IsoSeq long read sequencing protocol works in CCS mode, and not CLR). Therefore, it appears that the -s option is the right choice for PacBio CCS read integration.
The main spades manual further says that "PacBio CCS/Reads of Insert reads [...] can be simply provided as single reads to SPAdes." (with -s). I just want to make sure: PacBio CCS reads, as produced by the SMRT Link software, are Q20 filtered, include non-full length reads, and contain poly(A) tails and cDNA primer artifacts. Is the spades assembler taking care of the poly(A) tails and cDNA primers? If not, the SMRT Link software has the option to filter the CCS reads for those that are having both cDNA primers and a poly(A) tail. The SMRT link software can further trim those sequences, generating full-length reads, called FLNC reads. Means, in case that the spades assembler does not account for poly(A) tails and cDNA primers in CCS reads, would you say that using the PacBio FLNC reads is the best practice for hybrid rnaSPAdes? Should the FLNC reads be added as FASTQ?
Now it gets confusing. Within the rnaSPAdes manual, the option of using the -s flag is not mentioned. From the rnaSPAdes manual, it seems that the --pacbio option is the right choice for hybrid rnaSPAdes (though we learned in the main manual that the --pacbio option is for CLR reads, and not for CCS reads). Does it mean that the --pacbio option can actually handle CCS reads (and as said above, should the CCS reads be trimmed from poly(A) tails and cDNA primers?)?
Finally, the rnaSPAdes manual says that besides using the --pacbio option, full length PacBio reads can be additionally added with the --fl-rna option. With this option, I could now either add the above mentioned PacBio full length FLNC reads, or PacBio full length HQ isoforms, both produced by the SMRT Link software (the latter are FLNC reads that underwent rigorous clustering, consensus calling, and filtering, and have therefore a much higher quality than the FLNC reads, but miss some lowly expressed transcripts). I wonder now what is more important for the --fl-rna option: having rather a mass of PacBio FLNC full length reads, or preferentially only those with high quality? (it is probably the responsibility of the user to test the right options; but I would be curious about your opinion). Considering all the above mentioned, the best practice appears to be using hybrid rnaSPAes with --pacbio FLNC.fastq AND --fl-rna FLNC.fasta (?)
Please don't understand me wrong, I love rnaSPAdes and the idea of the hybrid assembly is awesome. I am just curious about the right choice (I am using rnaQUAST to evaluate assemblies, but I just wonder what makes conceptually more sense).
Best,
Martin
Beta Was this translation helpful? Give feedback.
All reactions