Does TrimGalore handle the Illumina T-overhang #127

pcantalupo · 2022-01-07T15:52:12Z

I recently came across this bulletin from Illumina. https://www.illumina.com/content/dam/illumina/gcs/assembled-assets/marketing-literature/illumina-stranded-rna-t-overhang-tech-note-470-2020-010/illumina-stranded-rna-t-overhang-tech-note-470-2020-010.pdf

It says to remove the T-overhang on the 5' end of R1 and R2. Does TrimGalore look for this and remove it? If not, What options do I use to remove the T-overhang with TrimGalore? I could not find this information in the manual or web searches.

FelixKrueger · 2022-01-07T17:04:14Z

Hi Paul,

This is very weird. I have never ever seen RNA-seq libraries where the first base of both R1 and R2 were Ts, and I am sure I seen a thousands of them... If the post wasn't by lllumina themselves and reasonably up-to-date, I would probably just not believe it. I can of course follow the logic, and can also see that 100% Ts might be a problem, but maybe the sequencing primers simply start reading 1 bp later, and this start at the very first genomic base rather than a the T in the adapter?

Trim Galore certainly doesn't look for a T at the start, but just in case you ever come across a library with 100% T as the first base(s), you can simply remove it using

--clip_r1 1 for single-end, or
--clip_r1 1 --clip_r2 1 for paired-end data

If you ever encounter this data, maybe you could share a FastQC base composition plot in here?

pcantalupo · 2022-01-10T13:42:44Z

I received the Illumina bulletin link from the sequencing core. Plus I spoke with our local technical support specialist from Illumina about this. He said the sequences of the sequencing primers are proprietary so I never got a straight answer about where the sequencing actually starts.

He talked about newer chemistries and sequencing machines (but I don't have a list of which do this or do not) are handling this T-overhang at the sequencing level during base calling. He told me that both TruSeq and the new branding "Illumina" kits use A-tailing in the library preps. So, I think this situation as always existed but as of 2020 they are starting to talk about it. Possible evidence of this is from this popular link (https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html; from 2016). It was updated in 3/31/21. Take a look at the very bottom where they discuss the T-overhang. I've read this page many times over my career and never saw that last line.

Currently, I'm not seeing 100% T at 1st position. I analyzed 3 RNAseq using Illumina mRNA kit vs. 3 using TruSeq. The Illumina first base percentage was dominated by T and A but T ranges from 8 to 55% and A from 17 to 62%. This confuses me as I expect the first base to have a very high T frequency. For the TruSeq kits, A and T percentage was very low at the first base so there is something different going on between the two kits.

One more note: the SampleSheet.csv for the Illumina mRNA libraries show that the read length is 74 but for TruSeq libraries, the read length is 76. So, I'm wondering if the sequencing core is automatically trimming the T-overhang and that no further processing is necessary. I'm following up with the sequencing core to get more info about their chemistry and machine that was used. I'll let you know how it goes

FelixKrueger · 2022-01-10T13:52:24Z

I also talked with Simon about this more this morning, this is definitely nothing we have ever seen, and that is since 2009 and counting...

I don't think I am bound to keeping proprietary sequences a secret, so as I understand the sequencing primer sequences are:

Multiplexing Read 1 Sequencing Primer
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Multiplexing Read 2 Sequencing Primer
5' GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

As you can see, both of them contain the T as the last priming base, so the 100% is never read but should be part of the priming itself. The first base that is read comes from the fragment of interest, and not from the A-tailing process. Let me know if you discover anything new. Cheers, Felix

FelixKrueger · 2022-01-10T13:54:28Z

And we have solid proof that this correct, as the default Illumina sequence Trim Galore looks for is AGATCGGAAGAGC, so the exact reverse complement of both sequencing primers are found as read-through contaminants....

pcantalupo · 2022-01-10T14:52:14Z

Thank you for that info. The primer sequences you posted are the ones that I found in the Adapter PDF (https://support-docs.illumina.com/SHARE/AdapterSeq/illumina-adapter-sequences.pdf) but notice they are in the Obsolete section on page 73 (page 81 of the PDF overall). That is when my tech rep said the primer sequences are proprietary (maybe they have been changed?).

Another thing to point out it that the adapter sequences found in the SampleSheet.csv for the Illumina mRNA kits is

[Settings]
Adapter,CTGTCTCTTATACACATCT

but in the TruSeq sample sheets is

[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Since the Adapter begins with a C for Illumina and an A for TruSeq, perhaps this is why there is an enrichment for T and A in the 1st base for Illumina kits and for C and G in 1st base for TruSeq. So the reason for having high T and A with Illumina kits (in my 3 RNAseq expts; maybe others would have a different result) has nothing to do with the T-overhang.

Since TrimGalore uses the TruSeq adapter prefix as default and that since Illumina is rebranding to Illumina kits (which I'm assuming will use the different adapter CTGTCTCTTATACACATCT), will this cause issues for TrimGalore? or would the automatic adapter detection pick this up?

FelixKrueger · 2022-01-10T16:02:54Z

I think it should all be fine. The mRNA kit sequence you posted is the Nextera sequence, which is part of the auto-detection. You can also specify it manually if you wanted to (but there shouldn't be any need to do so).

--nextera               Adapter sequence to be trimmed is the first 12bp of the Nextera adapter
                        'CTGTCTCTTATA' instead of the default auto-detection of adapter sequence.

I think there are ultimately 2 different aspects here:

The adapter sequence which is ligated to the 3' end of sequences, which is removed as read-through adapter contamination. This would be the Nextera or standard Illumina adapter sequence you pasted above, and I don't think anything has changed there.
a potential sequencing bias by selecting sequencing primers that bind to a region in the adapter that would result in sequencing 100% of the same base as the first base pair. Since this known to cause issues with Illumina sequencers' calibration, and would add an un-necessary and predictable base, I just can't believe anyone would have introduced this change. Also the fact that we have never heard of this before would further argue that this is simply not happening.

In a nutshell, for the time being I am pretty sure all is fine, and you can go ahead using Trim Galore - and it will just do the right thing.

PS: I am happy to re-visit this judgement if someone shows evidence that the system has indeed been changed.

pcantalupo · 2022-01-11T18:16:15Z

I have new information from our field application scientist from Illumina. He emailed the following:

Thank you again for taking time to discuss further. I wanted to follow up on our conversation:

TruSeq Stranded mRNA/Total RNA libraries use TruSeq adapters so the first base/cycle of sequencing does not need to be trimmed as this is accommodated in the sequencing design.

Illumina RNA Stranded mRNA/Total RNA libraries utilize a different adapter sequence so the mono-T at cycle 1 will be included in sequencing and will need to be trimmed. NextSeq 2000’s utilize a custom sequencing recipe that will not sequence this first base so trimming will not need to occur if this recipe is used.

Mono-A at the end of each fragment when reading through inserts for Illumina RNA Stranded mRNA/Total RNA libraries: this extra base should not matter for alignment, depending on the aligner you are using. If needed, the last base of each fragment can be excluded from mapping however.

This doesn't state whether other sequencing machines like NovaSeq are skipping the first base. Also, the 3rd point validated my concern about the Nextera adapter. Since Nextera adapter does not start with an A, the mono-A read through will not be clipped.

Thoughts?

FelixKrueger · 2022-01-12T17:52:18Z

My thoughts are:

If the first base is a T in 100% of cases it is certainly a technical bias that should be removed (even though a single base probably has almost no adverse effects (this would be different for bisulfite applications)). If you don't see this 100% bias - it's a non-issue
Trim Galore trims very stringently, so if the adapter used was the Illumina standard adapter and the last base is an A, it will be trimmed off
If you used Nextera adapters, no A-tailing is carried out. Instead, any trailing C would be removed

Again, so far I have not seen any evidence that anything would need attention, I think it still all just work fine. Does that make sense?

pcantalupo · 2022-01-12T20:35:52Z

I agree with your 1st and 2nd point. We are currently not implementing trimming the first base on Illumina kit libraries that show an enrichment for T and A but will consider trimming when the T percentage approaches 100% like you mentioned

To clarify your 3rd point. A-tailing is occurring in all the library preparations: Truseq, Illumina, etc... (you can find this information in the kit manuals). TruSeq sequencing used sequencing primers and adapter trimming sequences that abrogated the need to pay attention to the T/A overhang (see FAS comment 1 in my post from yesterday). But now, with the Illumina kit the sequencing is starting at the T-overhang (but possibly skipped by the sequencing machine) and the adapter sequence that Illumina recommends (webpage link) for this kit is CTGTCTCTTATACACATCT. Since this adapter does not begin with an A, it will not trim the A-tailing base as indicated by the FAS comment 3 in my post from yesterday. Like he said, the extra A at the end won't cause much problems with alignment. It is odd that Illumina is recommending to trim the 5' T-overhang but not mentioning the need for trimming the 3' A-tailing (probably as a result of it only occurring rarely on read-through sequences).

In any case, I don't see anything new that TrimGalore should account for nor I expect any changes. I was interested in brining this to your attention and others who might find it useful. Thank you

FelixKrueger · 2022-01-13T14:02:16Z

Hi Paul, it is indeed much appreciated to bring this up, just in case there are new developments it is certainly good to keep up-to-date!

Regarding point 3 one last time, I just checked some of our recently run RNA-seq runs, and found e.g. the following:

Using Nextera adapter for trimming (count: 20712). Second best hit was smallRNA (count: 0)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
...
This is cutadapt 3.5 with Python 3.9.7
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA === Summary ===

Total reads processed:               1,647,999
=== Adapter 1 ===

Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 590885 times

Bases preceding removed adapters:
  A: 23.4%
  C: 27.4%
  G: 22.7%
  T: 26.6%
  none/other: 0.0%

from what I can see here it certainly doesn't look like there was a trailing A before the Nextera adapter sequence...

I was always under the impression that the older (TruSeq, standard Illumina) adapters required end-repair and A-tailing, however newer library generation protocols use enzymatic steps (tagmentation etc) that do everything, abrogating the need for A-tailing (and hence there is also no need to take care of it afterwards).

I'd be interested to see if you have a dramatically different ratio of bases just before the removed adapter Nextera adapter sequence in your samples?

FelixKrueger · 2022-01-13T14:03:07Z

Here might be more details as well:
https://www.illumina.com/techniques/sequencing/ngs-library-prep/ligation.html

pcantalupo · 2022-01-13T16:04:16Z

illumina-stranded-total-rna-prep-reference-guide-1000000124514-02.pdf (catalog number: 20040525)
Here is the manual of the kit that was used by my sequencing core. There is no mention of tagmentation. See page 17 where is says:

Adenylate 3ʹ Ends
This step adds an adenine (A) nucleotide to the 3ʹ ends of the blunt fragments to prevent them from ligating to each other during adapter ligation. A corresponding thymine (T) nucleotide on the 3ʹ end of the adapter provides a complementary overhang for ligating the adapter to the fragment.

I do not know much about tagmentation and its molecular characteristics. I think Nextera kits are for DNA. For RNA the new branding name "Illumina" kits are using the same adapter sequence (now poorly named 'nextera') as used in the Nextera kits.

Also take a look at this bulletin (figure 1) which I think I sent to you before that shows the A-tailing on these kits.

I wouldn't expect you to see the A-tailing in actual Nextera reads. I'm curious to see what I find in my Illumina kit reads. I'm going to ask the sequencing core for the untrimmed fastqs and I'll let you know what I find.

FelixKrueger · 2022-01-13T17:11:28Z

Excellent, thanks for the extra details

pcantalupo · 2022-01-14T14:34:42Z

Here are my results for the first 100K sequences of the raw R1 and R2 fastq for a sample using the Illumina kit. The elusive A-tailing exists!

sed -n '2~4p' raw1_100K.fastq | perl -ne 'if (/(.)CTGTCTCTTATACACATCT/) { print $1,$/ }  ' | sort | uniq -c | awk '{a[NR]=$0;x+=(b[NR]=$1)}END{while(++i<=NR)print a[i]"\t"100*b[i]/x"%"}' | awk '{print $2"\t"$1"\t"$3}'
A	14508	85.5828%
C	17	0.100283%
G	54	0.318546%
T	2373	13.9983%

sed -n '2~4p' raw2_100K.fastq | perl -ne 'if (/(.)CTGTCTCTTATACACATCT/) { print $1,$/ }  ' | sort | uniq -c | awk '{a[NR]=$0;x+=(b[NR]=$1)}END{while(++i<=NR)print a[i]"\t"100*b[i]/x"%"}' | awk '{print $2"\t"$1"\t"$3}'
A	13885	97.8851%
C	262	1.84702%
G	13	0.0916461%
T	25	0.176243%

I hope Illumia will advertise this by updating their website of adapters to use for trimming https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html. I’ve stressed this to my field applications specialist several times now.

To be precise, maybe TrimGalore does need updated with an option for Illumia kits by adding an A to the 5’ end of the Nextera adapter CTGTCTCTTATACACATCT --> ACTGTCTCTTATACACATCT?

FelixKrueger · 2022-01-14T14:51:49Z

That's fascinating, thanks for posting!

I think you are right that this warrants an additional option for this type of RNA-seq kits. I am not sure it needs to be included in the auto-detection,but there should be option one can set manually.

(For more context, I am a little worried that this would interfere with the Nextera detection/removal because if this adapter with a starting A would by chance be found more often than the Nextera sequence, all other sequence contexts would evade the Nextera trimming (unless this counts towards the error rate which is 0.1 by default)). Would you agree?

… Addresses #127.

FelixKrueger · 2022-01-15T10:21:02Z

Hi Paul,

I have just added an option --stranded_illumina to allow trimming of the additional elusive A-tail. Can you checkout the current development version and see if it treats your samples well? If possible at all, I'd be interested in getting a small subset (100K sequences?) of one of your libraries (paired-end if possible) so I could add them here to the test_files folder? But don't worry if you are not at liberty to share.

Please note that this stranded Illumina kit is not part of the auto-detection, as I fear it would potentially cause more harm than do good...

pcantalupo · 2022-01-15T12:39:20Z

(For more context, I am a little worried that this would interfere with the Nextera detection/removal because if this adapter with a starting A would by chance be found more often than the Nextera sequence, all other sequence contexts would evade the Nextera trimming (unless this counts towards the error rate which is 0.1 by default))

To your previous comment above, I'm sorry but can't offer a reasonable argument since I don't truly understand how TG works.

Great news about adding the illumina stranded option. Sorry again but won't be able to share the sequences but I might be able to get an example dataset from Illumina.

Here is the test that I performed on the same set of 100K sequences above #127 (comment)

TrimGalore/trim_galore --stranded_illumina --paired raw1_100K.fastq raw2_100K.fastq > raw.txt 2>&1

Here is the output file from TG raw.txt. I'm surprised to see that ~45% of the sequences have the adapter. I think it is due to the aggressive default setting of looking for partial matches.

Thank you for adding the option

FelixKrueger · 2022-01-17T11:07:01Z

That's right, it is the agressive trimming Trim Galore performs by default (more than half of the sequence that had any kind of adapter trimmed were trimmed by 1-3bp):

Overview of removed sequences
length	count	expect	max.err	error counts
1	20751	25000.0	0	20751
2	4773	6250.0	0	4773
3	1420	1562.5	0	1420
...

I'm glad it seems to be working as you expected, and don't worry about a test set of files, I am sure some point one will become available.

senidr · 2022-05-24T02:10:45Z

Hi @FelixKrueger thanks for the very nice tool and adding this option. I was just running through a stranded mrna illumina prep through. And looks like the tool is picking up some of the t overhang but not all cases? when i run fast qc there is still som bias of base at first cycle. Then when i look in IGV there is some stack of reads showing up as a T or A and igv is calling it as a variant, but im pretty sure its just this T overhang that didnt get trimmed off? What do you think? Im new to data analysis though so i could be missing something.
I ran your tool with the following command.
~/TrimGalore/trim_galore --stranded_illumina --paired -o /home/uqsidris/DATA_RUNS/illumina_data/AGRF_CAGRF220510556_KFTTH/trimmed /home/uqsidris/DATA_RUNS/illumina_data/AGRF_CAGRF220510556_KFTTH/GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R1.fastq.gz /home/uqsidris/DATA_RUNS/illumina_data/AGRF_CAGRF220510556_KFTTH/GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R2.fastq.gz

Maybe i should have also used what you commented at the start of this thread?
--clip_r1 1 --clip_r2 1
Ill try adding that to the command.

GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R1.fastq.gz_trimming_report.txt
GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R2_val_2_fastqc.zip
GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R2_fastqc.zip

FelixKrueger · 2022-05-24T09:09:33Z

Thanks for your report. At first glance it would appear that Read 2 always starts with a T, which is currently not spotted or removed. This is something that hadn't been spotted before as the original poster's data was single-end.... The --clip_ option is probably the right thing to specify, and this should probably also happen automatically. Would you be able to share the two untrimmed files with me so I can take a look?

FelixKrueger closed this as completed Jan 10, 2022

FelixKrueger pushed a commit that referenced this issue Jan 15, 2022

Added option --stranded_illumina for stranded mRNA and Total RNA kit.…

072ecf9

… Addresses #127.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Does TrimGalore handle the Illumina T-overhang #127

Does TrimGalore handle the Illumina T-overhang #127

pcantalupo commented Jan 7, 2022

FelixKrueger commented Jan 7, 2022

pcantalupo commented Jan 10, 2022

FelixKrueger commented Jan 10, 2022

FelixKrueger commented Jan 10, 2022

pcantalupo commented Jan 10, 2022

FelixKrueger commented Jan 10, 2022

pcantalupo commented Jan 11, 2022

FelixKrueger commented Jan 12, 2022 •

edited

Loading

pcantalupo commented Jan 12, 2022

FelixKrueger commented Jan 13, 2022

FelixKrueger commented Jan 13, 2022

pcantalupo commented Jan 13, 2022 •

edited

Loading

FelixKrueger commented Jan 13, 2022

pcantalupo commented Jan 14, 2022

FelixKrueger commented Jan 14, 2022

FelixKrueger commented Jan 15, 2022

pcantalupo commented Jan 15, 2022

FelixKrueger commented Jan 17, 2022

senidr commented May 24, 2022 •

edited

Loading

FelixKrueger commented May 24, 2022

Does TrimGalore handle the Illumina T-overhang #127

Does TrimGalore handle the Illumina T-overhang #127

Comments

pcantalupo commented Jan 7, 2022

FelixKrueger commented Jan 7, 2022

pcantalupo commented Jan 10, 2022

FelixKrueger commented Jan 10, 2022

FelixKrueger commented Jan 10, 2022

pcantalupo commented Jan 10, 2022

FelixKrueger commented Jan 10, 2022

pcantalupo commented Jan 11, 2022

FelixKrueger commented Jan 12, 2022 • edited Loading

pcantalupo commented Jan 12, 2022

FelixKrueger commented Jan 13, 2022

FelixKrueger commented Jan 13, 2022

pcantalupo commented Jan 13, 2022 • edited Loading

FelixKrueger commented Jan 13, 2022

pcantalupo commented Jan 14, 2022

FelixKrueger commented Jan 14, 2022

FelixKrueger commented Jan 15, 2022

pcantalupo commented Jan 15, 2022

FelixKrueger commented Jan 17, 2022

senidr commented May 24, 2022 • edited Loading

FelixKrueger commented May 24, 2022

FelixKrueger commented Jan 12, 2022 •

edited

Loading

pcantalupo commented Jan 13, 2022 •

edited

Loading

senidr commented May 24, 2022 •

edited

Loading