Alternative trimming step for polyA/T removal #663

Reda94 · 2021-06-25T10:44:42Z

Hello, it looks like TrimGalore does not automatically perform polyA/T removal (see post-trimming fastqc screenshot below)
). They have an experimental option to do so (https://github.com/FelixKrueger/TrimGalore/blob/e9b8fd847f4da01fa3b886d134bc2ecd447a8068/trim_galore#L3230-L3257) but it would require running the nf-core/rnaseq pipeline twice (https://github.com/FelixKrueger/TrimGalore/blob/e9b8fd847f4da01fa3b886d134bc2ecd447a8068/trim_galore#L3248-L3254) (@drpatelh).

Would be great to have e.g. bbduk (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/) as it allows simultaneous base quality, adapter and polyA/T trimming.

Thanks!

FelixKrueger · 2021-06-25T11:12:39Z

To just quickly chip in here. It is correct that the option --polyA in Trim Galore is designed to work with an RNA kit which specifically produces reads from polyA labelled transcripts (Thermo Fisher, Collibri) https://www.thermofisher.com/uk/en/home/life-science/sequencing/next-generation-sequencing/ngs-library-preparation-illumina-systems/collibri-stranded-rna-library-prep-kits.html?gclid=EAIaIQobChMIv7mB-tXn5QIVibHtCh3dTg7gEAAYASAAEgI7KPD_BwE&ef_id=EAIaIQobChMIv7mB-tXn5QIVibHtCh3dTg7gEAAYASAAEgI7KPD_BwE:G:s&s_kwcid=AL!3652!3!303512728519!p!!g!!collibri.

Standard removal of PolyA sequences could be accomplished using e.g. -a {A}10.

Trim Galore is does indeed not trim several things at the same time, and intentionally so. One could probably argue that you for the sequences you showed above, PolyA/PolyT sequence would remove all full homo-polymers entirely, and would truncate the PolyT sequences to a few nucleotides (certainly <15), which are either removed by a size-selection step, or will not map (uniquely). So in other words: these sequences are useless from a library point of view.

In your case, mapping will have the exact same effect though: PolyX sequences will probably either not map properly in the first place, or be soft-clipped (and then not map properly). The result should be pretty much the same: These sequences will not result in useful RNA-seq counts.

By all means, add another trimming option for the pipeline, but I don't think that it'll have a noteworthy advantage over ignoring PolyX sequences (and let the aligner take care of it), or even running an adapter trimming process first, and then using that output as the input for PolyA/T trimming (which should be just a matter of copy paste with DSL2).

It is still good to know if you have these contaminants in your library, as they may dramatically impact the mapping efficiency. You could add these sequences to a custom contaminants.txt file, and then run:

fastqc -a contaminants.txt *fastq.gz

to see the extent of PolyX sequences in your library.

Happy to discuss :)
Felix

Illumina Universal Adapter			AGATCGGAAGAG
Illumina Small RNA Adapter			ATGGAATTCTCG
Nextera Transposase Sequence		        CTGTCTCTTAT
#SOLiD Small RNA Adapter			CGCCTTGGCCGTA
Illumina Paired End PCR Primer 2		CTACACGACGC
PolyA						AAAAAAAAAAAA
PolyT						TTTTTTTTTTTT
PolyG						GGGGGGGGGGGG
PolyC						CCCCCCCCCCCC

drpatelh · 2021-06-25T11:29:50Z

Thank you for the detailed info @FelixKrueger ! I agree that most of these sequences won't align but I would personally prefer for these to removed at the "adapter" trimming step because it gives you an obvious assessment as to the fact that you may have such artifacts in the library. Trying to figure out why things haven't mapped can be quite painful.

I didn't know you could use -a {A}10 like that. Neat! In which case, you may be able to just tweak the pipeline options and run it once. Can you create a file called custom.config with the contents below @Reda94 and add -c custom.config to the command you are running?

params {
  modules {
    'trimgalore' {
      args = '--fastqc -a {A}10'
    }
  }
}

So would you need an additional -a {T}10 for polyT's @FelixKrueger or will the reverse complement be auto-trimmed?

FelixKrueger · 2021-06-25T11:36:51Z

Hi Harshil,

yes, args = '--fastqc -a {A}10' should work straight out of the box, but again, this will not be on top the adapter (auto-detection) and trimming, but this will be used as the adapter sequence itself. And if you look for -a AAAAAAAAAA it will trim - drumroll - PolyA! We are deliberately trying to keep things simple, and not do any obscure and undocumented reverse complement trimming as well...

drpatelh · 2021-06-25T11:39:57Z

Ah, I see! So it won't trim the conventional adapters if you use args = '--fastqc -a {A}10' in which case you would have to perform the adapter trimming twice anyway. Ok, maybe not the appropriate solution then unless you run the pipeline twice or if you are prepared to let the mapping take care of this.

FelixKrueger · 2021-06-25T11:44:47Z

It wouldn't necessarily be the entire pipeline twice, but just add an extra trimming step. So something like the nf-core equivalent of:

TRIM_GALORE                     (file_ch, params.outdir, params.trim_galore_args, params.verbose)
if (PolyX_trimming_will_make_my_day){
     params.trim_galore += " -a {10} "
     TRIM_GALORE                     (TRIM_GALORE.out.reads, params.outdir, params.trim_galore_args, params.verbose)
}

I would also split out FastQC as it can then be run in parallel rather then as part of Trim Galore:

FASTQC2                         (TRIM_GALORE.out.reads, params.outdir, params.fastqc_args, params.verbose)

FelixKrueger · 2021-06-25T11:47:02Z

but yea, in the end the results will be pretty much the same however you implement it :P

drpatelh · 2021-06-25T11:49:31Z

Sorry, I meant running only the trimming (and other mandatory) steps first time around and then running the pipeline full blown with either --polyA or -a {10} the second time around. Running the trimming twice isn't ideal 😅 Just out of interest why do you need to run the trimming twice for the polyA stuff?

FelixKrueger · 2021-06-25T12:02:52Z

For --polyA with the Collibri kit, it was actually a little complicated. But the bottom line was that you need to remove poor qualities and sequencing adapters first, and then as a second step find and remove poly-T sequences (I believe it was different for R1 and R2 as well, and write the found PolyT bases into the read IDs of both reads, and then use only one read for mapping (and use the readID for downstream processing). This protocol was specifically designed to identify alternative transcription end and PolyA initiation sites, and yea sequences with tons of AAAAA or TTTT are not exactly great for anything really...

Regarding your earlier quote:

Trying to figure out why things haven't mapped can be quite painful.

This is certainly true, but if you are running pipelines that will magically do lots of things at the same time you might just look at the mapping efficiency, see that it's great and move on - and potentially miss that you had 50% of sequences lost because of a technical issue that results in homo-polymers (which you should probably try and address on the wet lab side...)

drpatelh · 2021-06-25T12:44:29Z

you might just look at the mapping efficiency, see that it's great and move on

Good point! We also do have the cutadapt logs in the MultiQC report where this sort of info is easily accessible. But I guess it will come down to what is actually looked at further e.g. users may not even look at the mapping rates 🤷🏽 Another advantage of trimming beforehand is that there will most likely be a diversity of sequences with a variable count of these polyA bases - assuming that the aligner will handle these appropriately could be a little dangerous and even if it may be minimal this could actually impact the quantification.

tomsing1 · 2023-01-17T01:12:27Z

Just because this came up in a thread on slack: STAR can clip 3' polyA sequences, too, e.g. by adding the following STAR argument: --clip3pAdapterSeq AAAAAAAA

drpatelh · 2023-03-22T16:27:49Z

Added fastp support in #970 which will allow you to use --trimmer fastp --extra_fastp_args '--trim_poly_x' to trim the polyA tails by default along with any adapter sequences.

Reda94 added the enhancement label Jun 25, 2021

drpatelh added this to the 3.11 milestone Mar 17, 2023

drpatelh mentioned this issue Mar 22, 2023

Add fastp as an alternative route to adapter and quality trim reads #970

Merged

drpatelh closed this as completed Mar 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative trimming step for polyA/T removal #663

Alternative trimming step for polyA/T removal #663

Reda94 commented Jun 25, 2021

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021 •

edited

Loading

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021 •

edited

Loading

FelixKrueger commented Jun 25, 2021 •

edited

Loading

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021

tomsing1 commented Jan 17, 2023

drpatelh commented Mar 22, 2023

Alternative trimming step for polyA/T removal #663

Alternative trimming step for polyA/T removal #663

Comments

Reda94 commented Jun 25, 2021

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021 • edited Loading

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021 • edited Loading

FelixKrueger commented Jun 25, 2021 • edited Loading

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021

FelixKrueger commented Jun 25, 2021

drpatelh commented Jun 25, 2021

tomsing1 commented Jan 17, 2023

drpatelh commented Mar 22, 2023

drpatelh commented Jun 25, 2021 •

edited

Loading

drpatelh commented Jun 25, 2021 •

edited

Loading

FelixKrueger commented Jun 25, 2021 •

edited

Loading