Canu v1.7

@brianwalenz brianwalenz released this 27 Feb 15:53
· 1797 commits to master since this release

These are release notes for Canu version 1.7, which was released on February 27th, 2018. Canu is specialized for assembly of single-molecule high-noise sequences. Full documentation can be found at http://canu.readthedocs.org/.

This release provides a stable, tested, and documented version of the software. The binary distributions should work on any relatively recent version of the respective OS. The source code distribution contains everything you need to create a binary distribution for your own specific OS.

Citation

Minimum Requirements

  • Perl 5.12.0, or File::Path 2.08
  • Java SE 8
  • GCC 4.5 (for compilation only)
  • OS X 10.10 (for binaries only)
  • gnuplot (optional, for generating diagnostic graphs)
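Before building or running, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of Canu: the check_tool function name is hypothetical, and it only tests for presence, not for the minimum versions given above.

```shell
# Hypothetical helper (not shipped with Canu): report whether a tool is on PATH.
# It does NOT check version numbers; compare those by hand against the list above.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# gcc is needed only for compilation; gnuplot is optional.
for tool in perl java gcc gnuplot; do
  check_tool "$tool"
done
```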

Installation

Users can download Canu as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.

To install from source code (the file can be named either canu-v1.7.tar.gz or just v1.7.tar.gz, depending on how it is downloaded):

gunzip -dc canu-v1.7.tar.gz | tar -xf -
cd canu-1.7/src
make -j 8
cd ..

To install from a binary distribution:

xz -dc canu-1.7.*.tar.xz | tar -xf -

In both cases, canu is installed in directory canu-1.7/<OS>-<architecture>, for example, canu-1.7/Linux-amd64. You can run the assembler with:

canu-1.7/*/bin/canu

Changes

This release was originally planned to only include changes to read correction, but we opportunistically added: improved support for plasmids via read rescue; an initial implementation of trio binning; a 'fast mode' for Nanopore reads (though not automatic); and sneaked in some major changes to the gkpStore/tigStore read/contig database for future use. So much for the plan.

Assemblies started in Canu v1.6 ARE NOT compatible with Canu v1.7.

  • Ensure that every raw read is either corrected or used as evidence for correcting some other raw read. This serves to rescue short plasmids in high coverage datasets, and it is no longer necessary to set corOutCoverage to achieve the same result.
  • Initial support of TrioCanu (biorxiv) added.
  • Add a '-fast' option for using a faster (but still not rigorously validated) overlap method. Useful for long Nanopore reads.
  • In anticipation of future features, all reads - raw, corrected and trimmed versions - are stored in a single gkpStore in the root assembly directory.
  • Read correction was almost completely re-engineered.
    • Stability of the computation was increased by removing multiple processes communicating through a pipe.
    • Layouts of the raw reads used to correct a read are saved for future use (e.g., during consensus). With the gkpStore change above, it is now possible to track a raw read through to the final contig outputs.
    • Only a single corrected read is generated for each raw read. Previously, PacBio reads containing multiple sub-reads could create multiple (redundant) corrected reads.
  • Overlap Error Detection (RED and OEA) memory usage when configuring compute jobs has been reduced.
  • Overlap Error Detection (RED and OEA) job sizes were increased to reduce disk contention.
  • overlapInCore (OBTOVL and UTGOVL) job sizes were increased to reduce disk contention and to take advantage of generally larger memory sizes available.
  • The ovlRefBlockSize parameter was removed; use ovlRefBlockLength instead.
  • Update to Snappy v1.1.7.
  • Add basic support for RNA by translating input U bases to T bases. Output files are NOT translated back to U bases.
  • Restrict the parallel overlap store creation method to grid runs. ovsMethod=forceparallel was added to force the usage of the parallel method on non-grid runs.
  • Add the preExec option to allow a single command to run before any Canu program is run. Useful for, e.g., loading a Canu module.
  • Use more standard locations for installing binaries and perl modules.
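Regarding the RNA support noted in the list above: because output files are not translated back to U bases, you may want to restore them yourself after assembly. The following is a minimal sketch, assuming FASTA output whose header lines start with '>'. The fasta_t_to_u function is a hypothetical helper and is not part of Canu.

```shell
# Hypothetical helper (not part of Canu): convert T/t back to U/u in the
# sequence lines of a FASTA stream, leaving '>' header lines untouched.
fasta_t_to_u() {
  awk '/^>/ { print; next }                        # header line: copy as-is
       { gsub(/T/, "U"); gsub(/t/, "u"); print }'  # sequence line: T -> U
}
```

It reads standard input and writes standard output, so it can be used as fasta_t_to_u < input.fasta > output.fasta, where both file names are placeholders.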

Bug Fixes

  • In non-grid mode, Canu was running too many jobs concurrently and exhausting memory.
  • Memory needed for consensus jobs is now set based on the largest contig.
  • The VN tag in GFA outputs was set, incorrectly, to the name of the program creating the file. It now reflects the format version of the GFA file.
  • Numerous not-very-exciting pedantic coding errors resolved. Stuff like failing to close a single input file, failing to release a block of memory, failing to check if an operation successfully completed, et cetera, that were technically incorrect but not significantly so.

Known Issues

See the issues page for up-to-date open issues, or to report a problem.

  • The Overlap Error Adjustment step does not properly configure its memory usage; include redMemory=8 oeaMemory=8 as a workaround.
  • Large memory usage and runtime for long reads (e.g., Nanopore) when using the overlapper=ovl algorithm, and during Overlap Error Adjustment. The -fast option enables a significantly faster algorithm, but may produce slightly less contiguous assemblies on genomes larger than 1 Gbp in size. It is recommended for Nanopore genomes smaller than 1 Gbp.
  • TrioCanu is not yet optimized for memory usage. As a result, it requires higher than default memory for large genomes. The options gridOptionsExecutive="--mem=250g" gridOptionsMeryl='--partition=largemem --mem=1000g' (or the equivalent memory request on your grid) should be sufficient for a 3 Gbp genome.
  • Bubbles are not captured in the contig graph, but are included in the unitig graph. No attempt at marking bubbles is made.
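The redMemory=8 oeaMemory=8 workaround from the list above can be applied without retyping it on every run. The following is a minimal sketch, not part of Canu: the run_with_memory_workaround function and the CANU variable are hypothetical, the default path is the Linux-amd64 example from the Installation section, and it assumes canu accepts key=value parameters ahead of its other arguments.

```shell
# Path to the canu executable; the default is the example location from the
# Installation section. Adjust it for your platform.
CANU=${CANU:-canu-1.7/Linux-amd64/bin/canu}

# Hypothetical wrapper (not part of Canu): prepend the Overlap Error
# Adjustment memory workaround to whatever arguments are passed in.
run_with_memory_workaround() {
  "$CANU" redMemory=8 oeaMemory=8 "$@"
}
```

For example, run_with_memory_workaround -p asm -d asm-dir genomeSize=5m -nanopore-raw reads.fastq, where the prefix, directory, genome size, and read file are all placeholders.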

See the FAQ for many suggestions, including suggestions for specific data types, e.g., Nanopore r9 reads.

Legal

Canu is derived from Celera Assembler and includes code from many other projects. Most, but not all, of the code is GPL licensed. See the README.licenses file and individual source code files for details.