Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does TrimGalore handle the Illumina T-overhang #127

Closed
pcantalupo opened this issue Jan 7, 2022 · 20 comments
Closed

Does TrimGalore handle the Illumina T-overhang #127

pcantalupo opened this issue Jan 7, 2022 · 20 comments

Comments

@pcantalupo
Copy link

I recently came across this bulletin from Illumina. https://www.illumina.com/content/dam/illumina/gcs/assembled-assets/marketing-literature/illumina-stranded-rna-t-overhang-tech-note-470-2020-010/illumina-stranded-rna-t-overhang-tech-note-470-2020-010.pdf

It says to remove the T-overhang on the 5' end of R1 and R2. Does TrimGalore look for this and remove it? If not, What options do I use to remove the T-overhang with TrimGalore? I could not find this information in the manual or web searches.

@FelixKrueger
Copy link
Owner

Hi Paul,

This is very weird. I have never ever seen RNA-seq libraries where the first base of both R1 and R2 were Ts, and I am sure I seen a thousands of them... If the post wasn't by lllumina themselves and reasonably up-to-date, I would probably just not believe it. I can of course follow the logic, and can also see that 100% Ts might be a problem, but maybe the sequencing primers simply start reading 1 bp later, and this start at the very first genomic base rather than a the T in the adapter?

Trim Galore certainly doesn't look for a T at the start, but just in case you ever come across a library with 100% T as the first base(s), you can simply remove it using

--clip_r1 1 for single-end, or
--clip_r1 1 --clip_r2 1 for paired-end data

If you ever encounter this data, maybe you could share a FastQC base composition plot in here?

@pcantalupo
Copy link
Author

I received the Illumina bulletin link from the sequencing core. Plus I spoke with our local technical support specialist from Illumina about this. He said the sequences of the sequencing primers are proprietary so I never got a straight answer about where the sequencing actually starts.

He talked about newer chemistries and sequencing machines (but I don't have a list of which do this or do not) are handling this T-overhang at the sequencing level during base calling. He told me that both TruSeq and the new branding "Illumina" kits use A-tailing in the library preps. So, I think this situation as always existed but as of 2020 they are starting to talk about it. Possible evidence of this is from this popular link (https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html; from 2016). It was updated in 3/31/21. Take a look at the very bottom where they discuss the T-overhang. I've read this page many times over my career and never saw that last line.

Currently, I'm not seeing 100% T at 1st position. I analyzed 3 RNAseq using Illumina mRNA kit vs. 3 using TruSeq. The Illumina first base percentage was dominated by T and A but T ranges from 8 to 55% and A from 17 to 62%. This confuses me as I expect the first base to have a very high T frequency. For the TruSeq kits, A and T percentage was very low at the first base so there is something different going on between the two kits.

One more note: the SampleSheet.csv for the Illumina mRNA libraries show that the read length is 74 but for TruSeq libraries, the read length is 76. So, I'm wondering if the sequencing core is automatically trimming the T-overhang and that no further processing is necessary. I'm following up with the sequencing core to get more info about their chemistry and machine that was used. I'll let you know how it goes

@FelixKrueger
Copy link
Owner

I also talked with Simon about this more this morning, this is definitely nothing we have ever seen, and that is since 2009 and counting...

I don't think I am bound to keeping proprietary sequences a secret, so as I understand the sequencing primer sequences are:

Multiplexing Read 1 Sequencing Primer
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Multiplexing Read 2 Sequencing Primer
5' GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

As you can see, both of them contain the T as the last priming base, so the 100% is never read but should be part of the priming itself. The first base that is read comes from the fragment of interest, and not from the A-tailing process. Let me know if you discover anything new. Cheers, Felix

@FelixKrueger
Copy link
Owner

And we have solid proof that this correct, as the default Illumina sequence Trim Galore looks for is AGATCGGAAGAGC, so the exact reverse complement of both sequencing primers are found as read-through contaminants....

@pcantalupo
Copy link
Author

Thank you for that info. The primer sequences you posted are the ones that I found in the Adapter PDF (https://support-docs.illumina.com/SHARE/AdapterSeq/illumina-adapter-sequences.pdf) but notice they are in the Obsolete section on page 73 (page 81 of the PDF overall). That is when my tech rep said the primer sequences are proprietary (maybe they have been changed?).

Another thing to point out it that the adapter sequences found in the SampleSheet.csv for the Illumina mRNA kits is

[Settings]
Adapter,CTGTCTCTTATACACATCT

but in the TruSeq sample sheets is

[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Since the Adapter begins with a C for Illumina and an A for TruSeq, perhaps this is why there is an enrichment for T and A in the 1st base for Illumina kits and for C and G in 1st base for TruSeq. So the reason for having high T and A with Illumina kits (in my 3 RNAseq expts; maybe others would have a different result) has nothing to do with the T-overhang.

Since TrimGalore uses the TruSeq adapter prefix as default and that since Illumina is rebranding to Illumina kits (which I'm assuming will use the different adapter CTGTCTCTTATACACATCT), will this cause issues for TrimGalore? or would the automatic adapter detection pick this up?

@FelixKrueger
Copy link
Owner

I think it should all be fine. The mRNA kit sequence you posted is the Nextera sequence, which is part of the auto-detection. You can also specify it manually if you wanted to (but there shouldn't be any need to do so).

--nextera               Adapter sequence to be trimmed is the first 12bp of the Nextera adapter
                        'CTGTCTCTTATA' instead of the default auto-detection of adapter sequence.

I think there are ultimately 2 different aspects here:

  1. The adapter sequence which is ligated to the 3' end of sequences, which is removed as read-through adapter contamination. This would be the Nextera or standard Illumina adapter sequence you pasted above, and I don't think anything has changed there.
  2. a potential sequencing bias by selecting sequencing primers that bind to a region in the adapter that would result in sequencing 100% of the same base as the first base pair. Since this known to cause issues with Illumina sequencers' calibration, and would add an un-necessary and predictable base, I just can't believe anyone would have introduced this change. Also the fact that we have never heard of this before would further argue that this is simply not happening.

In a nutshell, for the time being I am pretty sure all is fine, and you can go ahead using Trim Galore - and it will just do the right thing.

PS: I am happy to re-visit this judgement if someone shows evidence that the system has indeed been changed.

@pcantalupo
Copy link
Author

I have new information from our field application scientist from Illumina. He emailed the following:

Thank you again for taking time to discuss further. I wanted to follow up on our conversation:

  • TruSeq Stranded mRNA/Total RNA libraries use TruSeq adapters so the first base/cycle of sequencing does not need to be trimmed as this is accommodated in the sequencing design.
  • Illumina RNA Stranded mRNA/Total RNA libraries utilize a different adapter sequence so the mono-T at cycle 1 will be included in sequencing and will need to be trimmed. NextSeq 2000’s utilize a custom sequencing recipe that will not sequence this first base so trimming will not need to occur if this recipe is used.
  • Mono-A at the end of each fragment when reading through inserts for Illumina RNA Stranded mRNA/Total RNA libraries: this extra base should not matter for alignment, depending on the aligner you are using. If needed, the last base of each fragment can be excluded from mapping however.

This doesn't state whether other sequencing machines like NovaSeq are skipping the first base. Also, the 3rd point validated my concern about the Nextera adapter. Since Nextera adapter does not start with an A, the mono-A read through will not be clipped.

Thoughts?

@FelixKrueger
Copy link
Owner

FelixKrueger commented Jan 12, 2022

My thoughts are:

  • If the first base is a T in 100% of cases it is certainly a technical bias that should be removed (even though a single base probably has almost no adverse effects (this would be different for bisulfite applications)). If you don't see this 100% bias - it's a non-issue

  • Trim Galore trims very stringently, so if the adapter used was the Illumina standard adapter and the last base is an A, it will be trimmed off

  • If you used Nextera adapters, no A-tailing is carried out. Instead, any trailing C would be removed

Again, so far I have not seen any evidence that anything would need attention, I think it still all just work fine. Does that make sense?

@pcantalupo
Copy link
Author

I agree with your 1st and 2nd point. We are currently not implementing trimming the first base on Illumina kit libraries that show an enrichment for T and A but will consider trimming when the T percentage approaches 100% like you mentioned

To clarify your 3rd point. A-tailing is occurring in all the library preparations: Truseq, Illumina, etc... (you can find this information in the kit manuals). TruSeq sequencing used sequencing primers and adapter trimming sequences that abrogated the need to pay attention to the T/A overhang (see FAS comment 1 in my post from yesterday). But now, with the Illumina kit the sequencing is starting at the T-overhang (but possibly skipped by the sequencing machine) and the adapter sequence that Illumina recommends (webpage link) for this kit is CTGTCTCTTATACACATCT. Since this adapter does not begin with an A, it will not trim the A-tailing base as indicated by the FAS comment 3 in my post from yesterday. Like he said, the extra A at the end won't cause much problems with alignment. It is odd that Illumina is recommending to trim the 5' T-overhang but not mentioning the need for trimming the 3' A-tailing (probably as a result of it only occurring rarely on read-through sequences).

In any case, I don't see anything new that TrimGalore should account for nor I expect any changes. I was interested in brining this to your attention and others who might find it useful. Thank you

@FelixKrueger
Copy link
Owner

Hi Paul, it is indeed much appreciated to bring this up, just in case there are new developments it is certainly good to keep up-to-date!

Regarding point 3 one last time, I just checked some of our recently run RNA-seq runs, and found e.g. the following:

Using Nextera adapter for trimming (count: 20712). Second best hit was smallRNA (count: 0)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
...
This is cutadapt 3.5 with Python 3.9.7
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA === Summary ===

Total reads processed:               1,647,999
=== Adapter 1 ===

Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 590885 times

Bases preceding removed adapters:
  A: 23.4%
  C: 27.4%
  G: 22.7%
  T: 26.6%
  none/other: 0.0%

from what I can see here it certainly doesn't look like there was a trailing A before the Nextera adapter sequence...

I was always under the impression that the older (TruSeq, standard Illumina) adapters required end-repair and A-tailing, however newer library generation protocols use enzymatic steps (tagmentation etc) that do everything, abrogating the need for A-tailing (and hence there is also no need to take care of it afterwards).

I'd be interested to see if you have a dramatically different ratio of bases just before the removed adapter Nextera adapter sequence in your samples?

@FelixKrueger
Copy link
Owner

Here might be more details as well:
https://www.illumina.com/techniques/sequencing/ngs-library-prep/ligation.html

@pcantalupo
Copy link
Author

pcantalupo commented Jan 13, 2022

illumina-stranded-total-rna-prep-reference-guide-1000000124514-02.pdf (catalog number: 20040525)
Here is the manual of the kit that was used by my sequencing core. There is no mention of tagmentation. See page 17 where is says:

Adenylate 3ʹ Ends
This step adds an adenine (A) nucleotide to the 3ʹ ends of the blunt fragments to prevent them from ligating to each other during adapter ligation. A corresponding thymine (T) nucleotide on the 3ʹ end of the adapter provides a complementary overhang for ligating the adapter to the fragment.

I do not know much about tagmentation and its molecular characteristics. I think Nextera kits are for DNA. For RNA the new branding name "Illumina" kits are using the same adapter sequence (now poorly named 'nextera') as used in the Nextera kits.

Also take a look at this bulletin (figure 1) which I think I sent to you before that shows the A-tailing on these kits.

I wouldn't expect you to see the A-tailing in actual Nextera reads. I'm curious to see what I find in my Illumina kit reads. I'm going to ask the sequencing core for the untrimmed fastqs and I'll let you know what I find.

@FelixKrueger
Copy link
Owner

Excellent, thanks for the extra details

@pcantalupo
Copy link
Author

Here are my results for the first 100K sequences of the raw R1 and R2 fastq for a sample using the Illumina kit. The elusive A-tailing exists!

sed -n '2~4p' raw1_100K.fastq | perl -ne 'if (/(.)CTGTCTCTTATACACATCT/) { print $1,$/ }  ' | sort | uniq -c | awk '{a[NR]=$0;x+=(b[NR]=$1)}END{while(++i<=NR)print a[i]"\t"100*b[i]/x"%"}' | awk '{print $2"\t"$1"\t"$3}'
A	14508	85.5828%
C	17	0.100283%
G	54	0.318546%
T	2373	13.9983%

sed -n '2~4p' raw2_100K.fastq | perl -ne 'if (/(.)CTGTCTCTTATACACATCT/) { print $1,$/ }  ' | sort | uniq -c | awk '{a[NR]=$0;x+=(b[NR]=$1)}END{while(++i<=NR)print a[i]"\t"100*b[i]/x"%"}' | awk '{print $2"\t"$1"\t"$3}'
A	13885	97.8851%
C	262	1.84702%
G	13	0.0916461%
T	25	0.176243%

I hope Illumia will advertise this by updating their website of adapters to use for trimming https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html. I’ve stressed this to my field applications specialist several times now.

To be precise, maybe TrimGalore does need updated with an option for Illumia kits by adding an A to the 5’ end of the Nextera adapter CTGTCTCTTATACACATCT --> ACTGTCTCTTATACACATCT?

@FelixKrueger
Copy link
Owner

That's fascinating, thanks for posting!

I think you are right that this warrants an additional option for this type of RNA-seq kits. I am not sure it needs to be included in the auto-detection,but there should be option one can set manually.

(For more context, I am a little worried that this would interfere with the Nextera detection/removal because if this adapter with a starting A would by chance be found more often than the Nextera sequence, all other sequence contexts would evade the Nextera trimming (unless this counts towards the error rate which is 0.1 by default)). Would you agree?

@FelixKrueger
Copy link
Owner

Hi Paul,

I have just added an option --stranded_illumina to allow trimming of the additional elusive A-tail. Can you checkout the current development version and see if it treats your samples well? If possible at all, I'd be interested in getting a small subset (100K sequences?) of one of your libraries (paired-end if possible) so I could add them here to the test_files folder? But don't worry if you are not at liberty to share.

Please note that this stranded Illumina kit is not part of the auto-detection, as I fear it would potentially cause more harm than do good...

@pcantalupo
Copy link
Author

(For more context, I am a little worried that this would interfere with the Nextera detection/removal because if this adapter with a starting A would by chance be found more often than the Nextera sequence, all other sequence contexts would evade the Nextera trimming (unless this counts towards the error rate which is 0.1 by default))

To your previous comment above, I'm sorry but can't offer a reasonable argument since I don't truly understand how TG works.

Great news about adding the illumina stranded option. Sorry again but won't be able to share the sequences but I might be able to get an example dataset from Illumina.

Here is the test that I performed on the same set of 100K sequences above #127 (comment)

TrimGalore/trim_galore --stranded_illumina --paired raw1_100K.fastq raw2_100K.fastq > raw.txt 2>&1

Here is the output file from TG raw.txt. I'm surprised to see that ~45% of the sequences have the adapter. I think it is due to the aggressive default setting of looking for partial matches.

Thank you for adding the option

@FelixKrueger
Copy link
Owner

That's right, it is the agressive trimming Trim Galore performs by default (more than half of the sequence that had any kind of adapter trimmed were trimmed by 1-3bp):

Overview of removed sequences
length	count	expect	max.err	error counts
1	20751	25000.0	0	20751
2	4773	6250.0	0	4773
3	1420	1562.5	0	1420
...

I'm glad it seems to be working as you expected, and don't worry about a test set of files, I am sure some point one will become available.

@senidr
Copy link

senidr commented May 24, 2022

Hi @FelixKrueger thanks for the very nice tool and adding this option. I was just running through a stranded mrna illumina prep through. And looks like the tool is picking up some of the t overhang but not all cases? when i run fast qc there is still som bias of base at first cycle. Then when i look in IGV there is some stack of reads showing up as a T or A and igv is calling it as a variant, but im pretty sure its just this T overhang that didnt get trimmed off? What do you think? Im new to data analysis though so i could be missing something.
I ran your tool with the following command.
~/TrimGalore/trim_galore --stranded_illumina --paired -o /home/uqsidris/DATA_RUNS/illumina_data/AGRF_CAGRF220510556_KFTTH/trimmed /home/uqsidris/DATA_RUNS/illumina_data/AGRF_CAGRF220510556_KFTTH/GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R1.fastq.gz /home/uqsidris/DATA_RUNS/illumina_data/AGRF_CAGRF220510556_KFTTH/GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R2.fastq.gz

Maybe i should have also used what you commented at the start of this thread?
--clip_r1 1 --clip_r2 1
Ill try adding that to the command.

GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R1.fastq.gz_trimming_report.txt
GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R2_val_2_fastqc.zip
GFP_mRNA_cap_rep1_KFTTH_ATCCAGGTAT-CAACGTCAGC_L001_R2_fastqc.zip
image

@FelixKrueger
Copy link
Owner

Thanks for your report. At first glance it would appear that Read 2 always starts with a T, which is currently not spotted or removed. This is something that hadn't been spotted before as the original poster's data was single-end.... The --clip_ option is probably the right thing to specify, and this should probably also happen automatically. Would you be able to share the two untrimmed files with me so I can take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants