Releases: FelixKrueger/TrimGalore

Multi-core support, NextSeq/NovaSeq support & more

20 Mar 10:20
39e1ad7

Version 0.6.0 - Multi-core support, NextSeq/NovaSeq quality support & more

  • Added option --hardtrim3 INT, which allows you to hard-clip sequences from their 5' end so that only the last INT bp at the 3' end are kept. This option processes one or more files (plain FastQ or gzip compressed files) and produces hard-trimmed FastQ files ending in .{INT}bp_3prime.fq(.gz). We found this quite useful in a number of scenarios where we wanted to remove biased residues from the start of sequences. Here is an example:
before:         CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT
--hardtrim3 20:                                                   TTTTTAAGAAAATGGAAAAT
  • Added new option --basename <PREFERRED_NAME> to use PREFERRED_NAME as the basename for output files, instead of deriving the filenames from the input files. Single-end data would be called PREFERRED_NAME_trimmed.fq(.gz), or PREFERRED_NAME_val_1.fq(.gz) and PREFERRED_NAME_val_2.fq(.gz) for paired-end data. --basename only works when 1 file (single-end) or 2 files (paired-end) are specified, but not for longer lists (see #43).

  • Added option --2colour/--nextseq INT, whereby INT selects the quality cutoff that is normally set with -q, except that the qualities of G bases are ignored. -q and --2colour/--nextseq INT are mutually exclusive (see #41).

  • Added check to see if Read 1 and Read 2 files were given as the very same file.

  • If an output directory specified with -o output_directory does not exist, it will now be created for you.

  • The option --max_n INT now also works in single-end RRBS mode.

  • Added multi-threading support with the new option -j/--cores INT; many thanks to Frankie James (@fjames003) for initiating this. Multi-threading support works effectively if Cutadapt is run with Python 3, and when parallel gzip (pigz) is installed:

Multi-threading

For Cutadapt to work with multiple cores, it requires Python 3 as well as parallel gzip (pigz) installed on the system. The version of Python used is detected from the shebang line of the Cutadapt executable (either 'cutadapt', or a specified path). If Python 2 is detected, --cores is set to 1 and multi-core processing is disabled. If pigz cannot be detected on your system, Trim Galore reverts to using gzip compression. Please note, however, that gzip compression will slow down multi-core processes so much that it is hardly worthwhile (please see here for more info).

Actual core usage: It should be mentioned that the actual number of cores used is a little convoluted. Assuming that Python 3 is used and pigz is installed, --cores 2 would use:

  • 2 cores to read the input (probably not at a high usage though)
  • 2 cores to write to the output (at moderately high usage)
  • 2 cores for Cutadapt itself
  • 2 additional cores for Cutadapt (not sure what they are used for)
  • 1 core for Trim Galore itself

So this can be up to 9 cores, even though most of them won't be used at 100% for most of the time. Paired-end processing uses twice as many cores for the validation (= writing out) step as Trim Galore reads and writes from and to two files at the same time, respectively.

--cores 4 would then be: 4 (read) + 4 (write) + 4 (Cutadapt) + 2 (extra Cutadapt) + 1 (Trim Galore) = ~15 cores in total.
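
The arithmetic above can be written down as a small helper. This is only a sketch of the breakdown given in these notes for single-end data: the function name is hypothetical, and keeping the "extra Cutadapt" share fixed at 2 is an assumption based on the two worked examples (--cores 2 and --cores 4), not something Trim Galore or Cutadapt guarantees.

```python
def estimated_cores(cores: int) -> dict:
    """Rough upper bound on cores in use for a given --cores value (single-end)."""
    usage = {
        "read input": cores,
        "write output": cores,
        "Cutadapt": cores,
        "Cutadapt (extra)": 2,  # assumption: 2 in both worked examples
        "Trim Galore": 1,
    }
    usage["total"] = sum(usage.values())
    return usage


for n in (2, 4):
    print(f"--cores {n}: up to {estimated_cores(n)['total']} cores")
```

For --cores 2 and --cores 4 this reproduces the totals of 9 and 15 quoted above. Most of these cores will not be busy at 100%, so treat the result as an upper bound for scheduling rather than a measurement.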

From the graph above it seems that --cores 4 could be a sweet spot; anything above that appears to have diminishing returns.
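
For reference, the effect of the --hardtrim3 option introduced in this release can be illustrated on a single read. This is a minimal sketch, not Trim Galore's own (Perl) implementation: the function name is hypothetical, file handling is left out, and the quality string is a made-up placeholder that is simply trimmed in step with the sequence.

```python
def hardtrim3(seq: str, qual: str, n: int) -> tuple:
    """Keep only the last n bases (the 3' end) of a read and its qualities."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Reads shorter than n are returned unchanged by the slice
    return seq[-n:], qual[-n:]


before = "CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT"
trimmed_seq, trimmed_qual = hardtrim3(before, "I" * len(before), 20)
print(trimmed_seq)
```

With the example read from the --hardtrim3 entry, this keeps the final 20 bases, TTTTTAAGAAAATGGAAAAT.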

v0.5.0 - Mouse Epigenetic Clock support

28 Jun 15:08
30edc81

v0.5.0 - Mouse Epigenetic Clock pre-processing


  • Adapters can now be specified as single bases with a multiplier in squiggly brackets, e.g. -a "A{10}" to trim poly-A tails

  • Added option --hardtrim5 INT to enable hard-clipping from the 3' end. This option processes one or more files (plain FastQ or gzip compressed files) and produces hard-trimmed FastQ files ending in .{INT}bp.fq(.gz).

  • Added option --clock to trim reads in a specific way that is currently used for the Mouse Epigenetic Clock (see here: Multi-tissue DNA methylation age predictor in mouse, Stubbs et al., Genome Biology, 2017 18:68). Following the trimming, Trim Galore exits.

In its current implementation, the dual-UMI RRBS reads come in the following format:

Read 1  5' UUUUUUUU CAGTA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF TACTG UUUUUUUU 3'
Read 2  3' UUUUUUUU GTCAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ATGAC UUUUUUUU 5'

Here, UUUUUUUU is a random 8-mer unique molecular identifier (UMI), CAGTA is a constant region, and FFFFFFF... is the actual RRBS fragment to be sequenced. The UMIs for Read 1 (R1) and Read 2 (R2), as well as the fixed sequences (F1 or F2), are written into the read ID and removed from the actual sequence. Here is an example, showing a read pair first before and then after clock trimming:

R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT
    ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT
    CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA

R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT:R1:ATCTAGTT:R2:CAATTTTG:F1:CAGT:F2:CAGT
                 CGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT:R1:ATCTAGTT:R2:CAATTTTG:F1:CAGT:F2:CAGT
                 CAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
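
The transformation in this example can be sketched as follows. This is an illustration inferred from the example reads only, not the actual Trim Galore code: the function name and constants are hypothetical. The example IDs record 4 bases for F1/F2 while 13 bases (8 bp UMI plus CAGTA) disappear from each read, so the sketch assumes exactly that; a real implementation would also clip the quality strings by the same amount.

```python
UMI_LEN = 8     # random 8-mer UMI at the start of each read
FIXED_LEN = 4   # bases written to the ID as F1/F2 in the example
CLIP_LEN = 13   # bases removed from the 5' end in the example (UMI + CAGTA)


def clock_trim(id1: str, seq1: str, id2: str, seq2: str):
    """Move UMIs and fixed sequences of a read pair into both read IDs."""
    umi1, umi2 = seq1[:UMI_LEN], seq2[:UMI_LEN]
    fix1 = seq1[UMI_LEN:UMI_LEN + FIXED_LEN]
    fix2 = seq2[UMI_LEN:UMI_LEN + FIXED_LEN]
    # Both reads receive the same tag so the pair can be matched up later
    tag = f":R1:{umi1}:R2:{umi2}:F1:{fix1}:F2:{fix2}"
    return (id1 + tag, seq1[CLIP_LEN:]), (id2 + tag, seq2[CLIP_LEN:])


# First 30 bases of the example reads above (shortened for readability)
r1, r2 = clock_trim(
    "@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT",
    "ATCTAGTTCAGTACGGTGTTTTCGAATTAG",
    "@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT",
    "CAATTTTGCAGTACAAAAATAATACCTCCT",
)
print(r1[0])
print(r1[1])
```

The printed ID ends in :R1:ATCTAGTT:R2:CAATTTTG:F1:CAGT:F2:CAGT, matching the trimmed example, and the sequence starts at CGGTG.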

Following clock trimming, the resulting files (.clock_UMI.R1.fq(.gz) and .clock_UMI.R2.fq(.gz)) should be adapter- and quality-trimmed with Trim Galore as usual. In addition, reads need to be trimmed by 15 bp from their 3' end to get rid of potential UMI and fixed sequences. The command is:

trim_galore --paired --three_prime_clip_R1 15 --three_prime_clip_R2 15 *.clock_UMI.R1.fq.gz *.clock_UMI.R2.fq.gz

Following clock pre-processing, reads should be aligned with Bismark and deduplicated with UmiBam in --dual_index mode (see here: https://github.com/FelixKrueger/Umi-Grinder). UmiBam recognises the UMIs within this pattern: R1:(ATCTAGTT):R2:(CAATTTTG): as (UMI R1=ATCTAGTT) and (UMI R2=CAATTTTG).

0.4.5

13 Nov 10:44
fef5f49
  • Trim Galore now dies during the validation step when it encounters paired-end files that are not equal in length
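
The kind of check described here can be sketched as below. This is a hypothetical illustration, not Trim Galore's implementation: the function name is invented, and it works on any two iterables of FastQ lines (for example two open file handles) by walking them in lockstep.

```python
from itertools import zip_longest


def count_read_pairs(r1_lines, r2_lines) -> int:
    """Return the number of read pairs; raise if the inputs differ in length."""
    n_lines = 0
    for line1, line2 in zip_longest(r1_lines, r2_lines):
        # zip_longest pads the shorter input with None once it runs out
        if line1 is None or line2 is None:
            shorter = "Read 1" if line1 is None else "Read 2"
            raise ValueError(
                f"{shorter} input ended after {n_lines} lines, but its mate has more"
            )
        n_lines += 1
    # FastQ stores each read as 4 lines (ID, sequence, '+', qualities)
    return n_lines // 4


r1 = ["@read1", "ACGT", "+", "IIII", "@read2", "TTGA", "+", "IIII"]
print(count_read_pairs(r1, r1))
```

Passing a truncated mate, for example count_read_pairs(r1, r1[:4]), raises a ValueError instead of silently producing mismatched pairs.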

v0.4.4 - Essential --rrbs update for single-end files

28 Mar 19:50
  • Reinstated functionality of option --rrbs for single-end RRBS files which had gone amiss in the previous release.

  • Updated User Guide and Readme documents, added installation instructions and Travis functionality - thanks @ewels!

v0.4.3

25 Jan 14:03
  • Changed the option --rrbs for paired-end libraries: instead of removing 2 additional base pairs from the 3' end of both reads, it now trims 2 bp from the 3' end of Read 1 only and sets --clip_r2 2 for Read 2. This is because Read 2 does not technically need 3' trimming, since the end of Read 2 is not affected by the artificial methylation states introduced by the [end-repair] fill-in reaction. Instead, the first couple of positions of Read 2 suffer from the same fill-in problems as standard paired-end libraries. Also see this issue.
  • Updated the RRBS Guide to incorporate the recent changes to the --rrbs trimming mode for paired-end files.
  • Added a closing statement for the REPORT filehandle since it occasionally swallowed the last line...
  • Setting --length now takes priority over the smallRNA adapter (which would set the length cutoff to 18 bp).

v0.4.2

07 Sep 15:42
  • Replaced all instances of zcat with gunzip -c so that older versions of Mac OSX do not append a .Z to the end of the file and subsequently fail because the file is not present. Dah...
  • Added option --max_n COUNT to remove all reads (or read pairs) exceeding this limit of tolerated Ns. In a paired-end setting it is sufficient if one read exceeds this limit. Reads (or read pairs) are removed altogether and are not further trimmed or written to the unpaired output.
  • Enabled option --trim-n to remove Ns from both ends of the reads. This currently does not work in RRBS mode.
  • Added new option --max_length <INT>, which discards reads that are longer than <INT> bp after trimming. This is only advised for smallRNA sequencing to remove non-small RNA sequences.
  • Fixed a typo in adapter auto-detection warning message.
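
The --max_n rule from the list above can be illustrated with a short sketch. The function name is hypothetical and this is not Trim Galore's own code; it only encodes the stated behaviour that a read is dropped when its N count exceeds the limit, and that a pair is dropped as soon as one of its reads does.

```python
def passes_max_n(seq1: str, max_n: int, seq2: str = None) -> bool:
    """True if the read (or every read of a pair) has at most max_n Ns."""
    reads = [seq1] if seq2 is None else [seq1, seq2]
    # One offending read is enough to reject the whole pair
    return all(seq.count("N") <= max_n for seq in reads)


print(passes_max_n("ACGTN", 1))         # single-end read within the limit
print(passes_max_n("ACGT", 0, "ANGT"))  # pair rejected because of Read 2
```

Reads or pairs that fail the check would be removed altogether, that is, neither trimmed further nor written to the unpaired output.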