Does TrimGalore handle the Illumina T-overhang #127
Hi Paul, This is very weird. I have never ever seen RNA-seq libraries where the first base of both R1 and R2 were Trim Galore certainly doesn't look for a T at the start, but just in case you ever come across a library with 100%
If you ever encounter this data, maybe you could share a FastQC base composition plot in here?
I received the Illumina bulletin link from the sequencing core. I also spoke with our local technical support specialist from Illumina about this. He said the sequences of the sequencing primers are proprietary, so I never got a straight answer about where the sequencing actually starts. He said that newer chemistries and sequencing machines (I don't have a list of which ones do this and which do not) handle this T-overhang at the sequencing level during base calling. He told me that both TruSeq and the newly branded "Illumina" kits use A-tailing in the library preps. So I think this situation has always existed, but as of 2020 they are starting to talk about it. Possible evidence of this comes from this popular page (https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html; from 2016), which was updated on 3/31/21. Take a look at the very bottom, where they discuss the T-overhang. I've read this page many times over my career and never saw that last line. Currently, I'm not seeing 100% T at the 1st position. I analyzed 3 RNA-seq experiments using the Illumina mRNA kit vs. 3 using TruSeq. With the Illumina kit, the first base was dominated by T and A, but T ranged from 8 to 55% and A from 17 to 62%. This confuses me, as I expected the first base to have a very high T frequency. For the TruSeq kits, the A and T percentages were very low at the first base, so there is something different going on between the two kits. One more note: the
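For anyone who wants to reproduce this kind of first-base check without FastQC, here is a minimal Python sketch. It assumes a plain four-line-per-record FASTQ; the function name and the example reads are made up for illustration and are not data from this thread.

```python
from collections import Counter


def first_base_percentages(fastq_lines):
    """Return {base: percent} for the first base of every read.

    Assumes standard 4-line FASTQ records (header, sequence, '+', qualities).
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            seq = line.strip()
            if seq:
                counts[seq[0].upper()] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: 100.0 * n / total for base, n in counts.items()}


# Hypothetical toy records, only to show the call
example = [
    "@r1", "TACGT", "+", "IIIII",
    "@r2", "AACGT", "+", "IIIII",
    "@r3", "TGGGT", "+", "IIIII",
    "@r4", "CGGGT", "+", "IIIII",
]
print(first_base_percentages(example))
```

For a real file, the same function can be fed the lines of `gzip.open(path, "rt")`, ideally limited to the first 100K reads or so to keep it quick.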
I also talked with Simon about this more this morning, this is definitely nothing we have ever seen, and that is since 2009 and counting... I don't think I am bound to keeping proprietary sequences a secret, so as I understand the sequencing primer sequences are:
As you can see, both of them contain the T as the last priming base, so the 100% is never read but should be part of the priming itself. The first base that is read comes from the fragment of interest, and not from the A-tailing process. Let me know if you discover anything new. Cheers, Felix
And we have solid proof that this is correct, as the default Illumina sequence Trim Galore looks for is
Thank you for that info. The primer sequences you posted are the ones that I found in the Adapter PDF (https://support-docs.illumina.com/SHARE/AdapterSeq/illumina-adapter-sequences.pdf), but notice that they are in the Obsolete section on page 73 (page 81 of the PDF overall). That is when my tech rep said the primer sequences are proprietary (maybe they have been changed?). Another thing to point out is that the adapter sequences found in the
but in the TruSeq sample sheets is
Since the adapter begins with a C for Illumina and an A for TruSeq, perhaps this is why there is an enrichment for T and A in the 1st base for Illumina kits and for C and G in the 1st base for TruSeq. So the reason for having high T and A with Illumina kits (in my 3 RNA-seq experiments; maybe others would have a different result) has nothing to do with the T-overhang. Since TrimGalore uses the TruSeq adapter prefix as default and since Illumina is rebranding to Illumina kits (which I'm assuming will use the different adapter
I think it should all be fine. The mRNA kit sequence you posted is the Nextera sequence, which is part of the auto-detection. You can also specify it manually if you wanted to (but there shouldn't be any need to do so).
I think there are ultimately 2 different aspects here:
In a nutshell, for the time being I am pretty sure all is fine, and you can go ahead using Trim Galore - and it will just do the right thing. PS: I am happy to re-visit this judgement if someone shows evidence that the system has indeed been changed.
I have new information from our field application scientist from Illumina. He emailed the following:
This doesn't state whether other sequencing machines like NovaSeq are skipping the first base. Also, the 3rd point validated my concern about the Nextera adapter. Since the Nextera adapter does not start with an A, the mono-A read-through will not be clipped. Thoughts?
My thoughts are:
Again, so far I have not seen any evidence that anything would need attention; I think it all still just works fine. Does that make sense?
I agree with your 1st and 2nd points. We are currently not trimming the first base on Illumina kit libraries that show an enrichment for T and A, but will consider trimming when the T percentage approaches 100%, like you mentioned. To clarify your 3rd point: A-tailing occurs in all the library preparations: TruSeq, Illumina, etc. (you can find this information in the kit manuals). TruSeq sequencing used sequencing primers and adapter trimming sequences that abrogated the need to pay attention to the T/A overhang (see FAS comment 1 in my post from yesterday). But now, with the Illumina kit, the sequencing starts at the T-overhang (but it is possibly skipped by the sequencing machine) and the adapter sequence that Illumina recommends (webpage link) for this kit is In any case, I don't see anything new that TrimGalore should account for, nor do I expect any changes. I was interested in bringing this to your attention and that of others who might find it useful. Thank you
Hi Paul, it is indeed much appreciated to bring this up, just in case there are new developments it is certainly good to keep up-to-date! Regarding point 3 one last time, I just checked some of our recently run RNA-seq runs, and found e.g. the following:
From what I can see here, it certainly doesn't look like there was a trailing A before the Nextera adapter sequence... I was always under the impression that the older (TruSeq, standard Illumina) adapters required end-repair and A-tailing, whereas newer library generation protocols use enzymatic steps (tagmentation etc.) that do everything, abrogating the need for A-tailing (and hence there is also no need to take care of it afterwards). I'd be interested to see if you have a dramatically different ratio of bases just before the removed Nextera adapter sequence in your samples?
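One way to look at the ratio of bases just before the adapter is to tally the base immediately upstream of each adapter occurrence in the raw reads. The sketch below is only an illustration: it assumes the Nextera prefix `CTGTCTCTTATA` (to my knowledge the sequence Trim Galore uses for Nextera; it is not quoted in this thread as captured), it only counts exact full-length matches, and the reads are invented.

```python
from collections import Counter

# Assumed adapter prefix; swap in whichever sequence applies to your kit
NEXTERA_PREFIX = "CTGTCTCTTATA"


def bases_before_adapter(reads, adapter=NEXTERA_PREFIX):
    """Count the base directly upstream of the first exact adapter match."""
    counts = Counter()
    for seq in reads:
        pos = seq.find(adapter)
        if pos > 0:  # pos == 0 has no upstream base, -1 means no match
            counts[seq[pos - 1]] += 1
    return counts


# Hypothetical reads, not real data
reads = [
    "ACGTA" + NEXTERA_PREFIX + "CAC",
    "GGGTA" + NEXTERA_PREFIX,
    "TTTTC" + NEXTERA_PREFIX + "C",
    "ACGTACGT",  # no adapter present
]
counts = bases_before_adapter(reads)
print(counts.most_common())
```

A strong excess of A in this tally, compared with the overall base composition of the reads, would point towards A-tailing read-through.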
Here might be more details as well:
illumina-stranded-total-rna-prep-reference-guide-1000000124514-02.pdf (catalog number: 20040525)
I do not know much about tagmentation and its molecular characteristics. I think Nextera kits are for DNA. For RNA, the newly branded "Illumina" kits use the same adapter sequence (now poorly named 'nextera') as used in the Nextera kits. Also take a look at this bulletin (figure 1), which I think I sent to you before, that shows the A-tailing on these kits. I wouldn't expect you to see the A-tailing in actual Nextera reads. I'm curious to see what I find in my Illumina kit reads. I'm going to ask the sequencing core for the untrimmed fastqs and I'll let you know what I find.
Excellent, thanks for the extra details
Here are my results for the first 100K sequences of the raw R1 and R2 fastq for a sample using the Illumina kit. The elusive A-tailing exists!
I hope Illumina will advertise this by updating their web page of adapters to use for trimming: https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html. I’ve stressed this to my field applications specialist several times now. To be precise, maybe TrimGalore does need to be updated with an option for Illumina kits by adding an A to the 5’ end of the Nextera adapter
That's fascinating, thanks for posting! I think you are right that this warrants an additional option for this type of RNA-seq kit. I am not sure it needs to be included in the auto-detection, but there should be an option one can set manually. (For more context, I am a little worried that this would interfere with the Nextera detection/removal: if this adapter with a leading A were by chance found more often than the Nextera sequence, all other sequence contexts would evade the Nextera trimming (unless the extra base counts towards the error rate, which is 0.1 by default).) Would you agree?
Hi Paul, I have just added an option Please note that this stranded Illumina kit is not part of the auto-detection, as I fear it would potentially do more harm than good...
To your previous comment above: I'm sorry, but I can't offer a reasonable argument since I don't truly understand how TG works. Great news about adding the Illumina stranded option. Sorry again, but I won't be able to share the sequences, though I might be able to get an example dataset from Illumina. Here is the test that I performed on the same set of 100K sequences above #127 (comment)
Here is the output file from TG raw.txt. I'm surprised to see that ~45% of the sequences have the adapter. I think it is due to the aggressive default setting of looking for partial matches. Thank you for adding the option
That's right, it is the aggressive trimming Trim Galore performs by default (more than half of the sequences that had any kind of adapter trimmed were trimmed by 1-3bp):
I'm glad it seems to be working as you expected, and don't worry about a test set of files; I am sure at some point one will become available.
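To make the effect of partial matching concrete, here is a simplified Python sketch of 3' adapter removal with a minimum overlap. It is not how Trim Galore or Cutadapt are implemented (they use error-tolerant alignment); it only does exact matching, and the adapter sequence and reads are assumptions for illustration. The `min_overlap` parameter plays the role of Trim Galore's `--stringency` setting, which defaults to 1.

```python
ADAPTER = "CTGTCTCTTATA"  # assumed Nextera prefix, for illustration only


def trim_3prime_adapter(seq, adapter=ADAPTER, min_overlap=1):
    """Remove a 3' adapter using exact matching only.

    A full adapter occurrence is cut wherever it starts; otherwise the
    longest read suffix that equals an adapter prefix of at least
    `min_overlap` bases is cut.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    longest = min(len(seq), len(adapter) - 1)
    shortest = max(min_overlap, 1)
    for k in range(longest, shortest - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq


# A read that merely ends in 'C' loses one base at the default overlap of 1
print(trim_3prime_adapter("ACGTGGAC"))
# Requiring 3 bases of overlap leaves the same read untouched
print(trim_3prime_adapter("ACGTGGAC", min_overlap=3))
```

With an overlap of 1, roughly a quarter of reads that contain no adapter at all would still lose their last base simply because it happens to equal the first adapter base, which fits the large share of 1-3bp trims in the report.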
Hi @FelixKrueger, thanks for the very nice tool and for adding this option. I was just running a stranded mRNA Illumina prep through it, and it looks like the tool is picking up some of the T-overhang but not all cases? When I run FastQC there is still some base bias at the first cycle. Then, when I look in IGV, there are stacks of reads showing a T or A at the same position and IGV is calling it as a variant, but I'm pretty sure it's just this T-overhang that didn't get trimmed off. What do you think? I'm new to data analysis, though, so I could be missing something. Maybe I should have also used what you commented at the start of this thread? GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R1.fastq.gz_trimming_report.txt
Thanks for your report. At first glance it would appear that Read 2 always starts with a T, which is currently not spotted or removed. This is something that hadn't been spotted before as the original poster's data was single-end.... The
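As an illustration of what removing a leading overhang base involves, here is a small Python sketch that clips the first base (and its quality score) from a FASTQ record, either unconditionally or only when the read starts with a given base. The function name and records are hypothetical, and this is not the maintainer's fix. For fixed-length 5' clipping, Trim Galore itself provides the `--clip_R1` and `--clip_R2` options; whether they are the right remedy for a given library is a separate question.

```python
def clip_first_base(record, only_if=None):
    """Clip one 5' base from a (header, seq, plus, qual) FASTQ record.

    If `only_if` is given, the base is clipped only when the read starts
    with that base; otherwise it is always clipped.
    """
    header, seq, plus, qual = record
    if seq and (only_if is None or seq[0] == only_if):
        return (header, seq[1:], plus, qual[1:])
    return record


# Hypothetical Read 2 records
r2_with_t = ("@pair1/2", "TACGT", "+", "IIIII")
r2_without_t = ("@pair2/2", "GACGT", "+", "IIIII")

print(clip_first_base(r2_with_t, only_if="T"))
print(clip_first_base(r2_without_t, only_if="T"))
```

Conditional clipping changes read lengths unevenly across a file, so unconditional clipping of the first cycle may be the simpler choice when nearly every read starts with the overhang base.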
I recently came across this bulletin from Illumina. https://www.illumina.com/content/dam/illumina/gcs/assembled-assets/marketing-literature/illumina-stranded-rna-t-overhang-tech-note-470-2020-010/illumina-stranded-rna-t-overhang-tech-note-470-2020-010.pdf
It says to remove the T-overhang on the 5' end of R1 and R2. Does TrimGalore look for this and remove it? If not, what options do I use to remove the T-overhang with TrimGalore? I could not find this information in the manual or through web searches.