Releases: FelixKrueger/TrimGalore
Version 0.6.0 - Multi-core support, NextSeq/NovaSeq quality support & more
- Added option --hardtrim3 INT, which hard-trims sequences to INT bp from their 3' end (i.e. bases are clipped off the 5' end). This option processes one or more files (plain FastQ or gzip compressed files) and produces hard-trimmed FastQ files ending in .{INT}bp_3prime.fq(.gz). We found this quite useful in a number of scenarios where we wanted to remove biased residues from the start of sequences. Here is an example:
before: CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT
--hardtrim3 20: TTTTTAAGAAAATGGAAAAT
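A hypothetical invocation (the file name is made up; the output name follows the .{INT}bp_3prime.fq(.gz) pattern described above) might look like this:
trim_galore --hardtrim3 20 sample.fastq.gz
which would be expected to produce sample.20bp_3prime.fq.gz.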
- Added new option --basename <PREFERRED_NAME> to use PREFERRED_NAME as the basename for output files, instead of deriving the filenames from the input files. Single-end data would be called PREFERRED_NAME_trimmed.fq(.gz), or PREFERRED_NAME_val_1.fq(.gz) and PREFERRED_NAME_val_2.fq(.gz) for paired-end data. --basename only works when 1 file (single-end) or 2 files (paired-end) are specified, but not for longer lists (see #43).
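As a sketch for paired-end data (input file names are hypothetical):
trim_galore --paired --basename my_sample sample_R1.fastq.gz sample_R2.fastq.gz
which should yield my_sample_val_1.fq.gz and my_sample_val_2.fq.gz.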
- Added option --2colour/--nextseq INT, whereby INT sets the quality cutoff that is normally specified with -q, with the difference that qualities of G bases are ignored (on 2-colour instruments such as the NextSeq or NovaSeq, G is called from the absence of signal and can therefore carry inflated quality scores). -q and --2colour/--nextseq INT are mutually exclusive (see #41).
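For example, to apply a 2-colour-aware quality cutoff of 20 instead of -q 20 (file name hypothetical):
trim_galore --2colour 20 sample_novaseq.fastq.gz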
- Added a check to see if Read 1 and Read 2 files were given as the very same file.
- If an output directory specified with -o output_directory does not exist, it will now be created for you.
- The option --max_n INT now also works in single-end RRBS mode.
- Added multi-threading support with the new option -j/--cores INT; many thanks to Frankie James (@fjames003) for initiating this. Multi-threading works effectively if Cutadapt is run with Python 3 and parallel gzip (pigz) is installed:
For Cutadapt to work with multiple cores, it requires Python 3 as well as parallel gzip (pigz) installed on the system. The version of Python used is detected from the shebang line of the Cutadapt executable (either 'cutadapt', or a specified path). If Python 2 is detected, --cores is set to 1 and multi-core processing is disabled. If pigz cannot be detected on your system, Trim Galore reverts to using gzip compression. Please note, however, that gzip compression slows down multi-core processes so much that it is hardly worthwhile (see here for more info).
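A quick way to check whether a setup qualifies for multi-core mode could look like this (a rough sketch; the exact detection logic inside Trim Galore may differ):
head -n 1 "$(command -v cutadapt)"    # the shebang line shows whether Cutadapt runs under Python 2 or 3
command -v pigz || echo "pigz not found - Trim Galore would fall back to gzip"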
Actual core usage: It should be mentioned that the actual number of cores used is a little convoluted. Assuming that Python 3 is used and pigz is installed, --cores 2 would use:
- 2 cores to read the input (probably not at a high usage though)
- 2 cores to write to the output (at moderately high usage)
- 2 cores for Cutadapt itself
- 2 additional cores for Cutadapt (not sure what they are used for)
- 1 core for Trim Galore itself
So this can add up to 9 cores, even though most of them won't be used at 100% most of the time. Paired-end processing uses twice as many cores for the validation (= writing out) step, since Trim Galore reads from and writes to two files at the same time.
--cores 4 would then be: 4 (read) + 4 (write) + 4 (Cutadapt) + 2 (extra Cutadapt) + 1 (Trim Galore) = ~15 cores in total.
From the graph above it seems that --cores 4 could be a sweet spot; anything above appears to have diminishing returns.
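A typical multi-core invocation might then look like this (file names hypothetical):
trim_galore --cores 4 --paired sample_R1.fastq.gz sample_R2.fastq.gz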
v0.5.0 - Mouse Epigenetic Clock pre-processing
- Adapters can now be specified as single bases with a multiplier in squiggly brackets, e.g. -a "A{10}" to trim poly-A tails.
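A poly-A trimming run might therefore look like this (file name hypothetical):
trim_galore -a "A{10}" sample.fastq.gz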
- Added option --hardtrim5 INT, which hard-trims sequences to INT bp from their 5' end (i.e. bases are clipped off the 3' end). This option processes one or more files (plain FastQ or gzip compressed files) and produces hard-trimmed FastQ files ending in .{INT}bp.fq(.gz).
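By analogy to --hardtrim3 above, a hypothetical run could be:
trim_galore --hardtrim5 100 sample.fastq.gz
which would be expected to produce sample.100bp.fq.gz.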
- Added option --clock to trim reads in a specific way that is currently used for the Mouse Epigenetic Clock (see here: Multi-tissue DNA methylation age predictor in mouse, Stubbs et al., Genome Biology, 2017 18:68). Following this trimming, Trim Galore exits.
In its current implementation, the dual-UMI RRBS reads come in the following format:
Read 1 5' UUUUUUUU CAGTA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF TACTG UUUUUUUU 3'
Read 2 3' UUUUUUUU GTCAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ATGAC UUUUUUUU 5'
Where UUUUUUUU is a random 8-mer unique molecular identifier (UMI), CAGTA is a constant region, and FFFFFFF... is the actual RRBS fragment to be sequenced. The UMIs for Read 1 (R1) and Read 2 (R2), as well as the fixed sequences (F1 or F2), are written into the read ID and removed from the actual sequence. Here is an example:
R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT
ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT
CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
These reads are converted to:
R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT:R1:ATCTAGTT:R2:CAATTTTG:F1:CAGT:F2:CAGT
CGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT:R1:ATCTAGTT:R2:CAATTTTG:F1:CAGT:F2:CAGT
CAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
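To produce such reads, the clock trimming step itself might be run like this (a sketch; the exact invocation, in particular whether --paired is required, is an assumption):
trim_galore --clock --paired sample_R1.fastq.gz sample_R2.fastq.gz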
Following clock trimming, the resulting files (.clock_UMI.R1.fq(.gz) and .clock_UMI.R2.fq(.gz)) should be adapter- and quality-trimmed with Trim Galore as usual. In addition, reads need to be trimmed by 15 bp from their 3' end to get rid of potential UMI and fixed sequences. The command is:
trim_galore --paired --three_prime_clip_R1 15 --three_prime_clip_R2 15 *.clock_UMI.R1.fq.gz *.clock_UMI.R2.fq.gz
Following clock pre-processing, reads should be aligned with Bismark and deduplicated with UmiBam in --dual_index mode (see here: https://github.com/FelixKrueger/Umi-Grinder). UmiBam recognises the UMIs within this pattern: R1:(ATCTAGTT):R2:(CAATTTTG): as UMI R1 = ATCTAGTT and UMI R2 = CAATTTTG.
v0.4.5
- Trim Galore now dies during the validation step when it encounters paired-end files that are not equal in length
v0.4.4 - Essential --rrbs update for single-end files
- Reinstated the functionality of option --rrbs for single-end RRBS files, which had gone amiss in the previous release.
- Updated the User Guide and Readme documents, added installation instructions and Travis functionality - thanks @ewels!
v0.4.3
- Changed the option --rrbs for paired-end libraries: instead of removing 2 additional base pairs from the 3' end of both reads, it now trims 2 bp from the 3' end only for Read 1 and sets --clip_r2 2 for Read 2 instead. This is because Read 2 does not technically need 3' trimming, since the end of Read 2 is not affected by the artificial methylation states introduced by the end-repair fill-in reaction. Instead, the first couple of positions of Read 2 suffer from the same fill-in problems as standard paired-end libraries. Also see this issue.
- Updated the RRBS Guide to incorporate the recent changes to the --rrbs trimming mode for paired-end files.
- Added a closing statement for the REPORT filehandle since it occasionally swallowed the last line...
- Setting --length now takes priority over the smallRNA adapter (which would set the length cutoff to 18 bp).
v0.4.2
- Replaced all instances of zcat with gunzip -c so that older versions of Mac OSX do not append a .Z to the end of the file and subsequently fail because the file is not present. Dah...
- Added option --max_n COUNT to remove all reads (or read pairs) exceeding this limit of tolerated Ns. In a paired-end setting it is sufficient if one read exceeds this limit. Reads (or read pairs) are removed altogether and are not further trimmed or written to the unpaired output.
- Enabled option --trim-n to remove Ns from both ends of the reads. This does currently not work in RRBS mode.
- Added new option --max_length <INT>, which removes reads that are longer than <INT> bp after trimming. This is only advised for smallRNA sequencing to remove non-small RNA sequences (see the example after this list).
- Fixed a typo in the adapter auto-detection warning message.
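For illustration, N-content filtering and a maximum length cutoff for a small RNA library might be combined like this (file name and thresholds are hypothetical):
trim_galore --max_n 3 --max_length 30 sample_smallRNA.fastq.gz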