Salmon v1.0.0 Release Notes
This is a major stable release of salmon. It brings many new features, which are benchmarked extensively in the latest preprint.
This new version of salmon is based on a fundamentally different indexing data structure (pufferfish) than the previous version. It also adopts a different mapping algorithm: a variant of selective-alignment. The new indexing data structure makes it possible to index the transcriptome as well as a large amount of "decoy" sequence in small memory. It also makes it possible to index the entire transcriptome and the entire genome in "reasonable" memory (currently ~18G in dense mode and ~14G in sparse mode, though these sizes may improve in the future), which provides a much more comprehensive set of potential decoy sequences. In the new index, the transcriptome and genome are on "equal footing", which helps to avoid bias toward either the transcriptome or the genome during mapping.
Note: To construct the compacted colored de Bruijn graph (ccDBG) from the reference sequence, which is subsequently indexed with pufferfish, salmon makes use of (a very slightly modified version of) the TwoPaCo software. TwoPaCo implements a very efficient algorithm for building a ccDBG from a collection of reference sequences. One of the key parameters of TwoPaCo is the size of the Bloom filter used to record and filter possible junction k-mers. To ease the indexing procedure, salmon will attempt to automatically set a reasonable Bloom filter size, based on an estimate of the number of distinct k-mers in the reference and using a default FPR of 0.1% over TwoPaCo's default 5 filters. To quickly obtain an estimate of the number of distinct k-mers, salmon makes use of (a very slightly modified version of) the ntCard software; specifically, the nthll implementation.
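The relationship between the number of distinct k-mers, the target false-positive rate, and the Bloom filter size can be sketched with the standard Bloom filter approximation. This is a minimal illustration, not salmon's or TwoPaCo's actual code: the function names are hypothetical, and it assumes that the "5 filters" behave like 5 independent hash functions over a single bit array.

```python
import math


def bloom_filter_bits(num_kmers: int, fpr: float = 0.001, num_hashes: int = 5) -> int:
    """Bits needed so that a Bloom filter with `num_hashes` hash functions,
    holding `num_kmers` distinct items, has a false-positive rate of at most `fpr`.

    Hypothetical helper: uses the textbook approximation
    fpr = (1 - exp(-k * n / m)) ** k, solved for m.
    """
    if num_kmers <= 0:
        raise ValueError("num_kmers must be positive")
    if not 0.0 < fpr < 1.0:
        raise ValueError("fpr must be strictly between 0 and 1")
    per_hash = fpr ** (1.0 / num_hashes)
    return math.ceil(-num_hashes * num_kmers / math.log(1.0 - per_hash))


def expected_fpr(num_bits: int, num_kmers: int, num_hashes: int = 5) -> float:
    """False-positive rate predicted by the same approximation."""
    return (1.0 - math.exp(-num_hashes * num_kmers / num_bits)) ** num_hashes


if __name__ == "__main__":
    estimated_kmers = 1_000_000  # stand-in for an nthll-style estimate
    bits = bloom_filter_bits(estimated_kmers)
    print(bits, expected_fpr(bits, estimated_kmers))
```

Under these assumptions, a 0.1% FPR with 5 hash functions costs a little over 17 bits per distinct k-mer, which is why a quick cardinality estimate is enough to choose a filter size.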
Changes since v0.99.0 beta2
A bug related to alevin index parsing has been fixed. Specifically, if any decoy target was shorter than the k-mer length, alevin dumped gene counts for decoy targets. Thanks to @csoneson for reporting this.
Changes since v0.99.0 beta1
Allow passing of explicit filter size to the indexing command via the -f parameter (default is to estimate required filter size using nthll).
Fix bug that prevented dumping SAM output, if requested, in alevin mode.
Correctly enable strictFilter mode in alevin, improving single-cell mapping quality.
Changes since v0.14.1
The indexing methodology of salmon is now based on pufferfish. Thus, any previous indices need to be re-built. However, the new indexing methodology is considerably faster and more parallelizable than the previous approach, so providing multiple threads to the index command should make relatively short work of this task.
The new version of salmon adopts a new and modified selective-alignment algorithm that is, nonetheless, very similar to the selective-alignment algorithm described in "Alignment and mapping methodology influence transcript abundance estimation". In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.
As a consequence of the above, range factorization is enabled by default.
There is a new command-line flag, --softclipOverhangs, which allows reads that overhang the end of transcripts to be softclipped. The softclipped region will neither add to nor detract from the match score. This is more permissive than the default strategy, which requires the overhanging bases of the read to be scored as a deletion under the alignment.
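The difference between the two overhang strategies can be illustrated with a small sketch. It is not salmon's implementation: the function names are hypothetical, and the affine gap model (one gap-open cost plus a per-base extension cost) and the penalty values are assumptions chosen only for illustration.

```python
from typing import Tuple


def overhang(ref_pos: int, read_len: int, ref_len: int) -> Tuple[int, int]:
    """Count read bases hanging off the start and the end of a transcript.

    `ref_pos` is the 0-based transcript position of the read's first base;
    it is negative when the read starts before the transcript does.
    """
    left = min(read_len, max(0, -ref_pos))
    right = min(read_len - left, max(0, ref_pos + read_len - ref_len))
    return left, right


def overhang_score(num_bases: int, softclip: bool,
                   gap_open: int = 4, gap_extend: int = 2) -> int:
    """Score contribution of `num_bases` overhanging bases.

    Softclipped bases contribute nothing. Otherwise they are charged as a
    single deletion under an assumed affine model (penalty values are made up).
    """
    if num_bases == 0 or softclip:
        return 0
    return -(gap_open + gap_extend * num_bases)


if __name__ == "__main__":
    _, right = overhang(ref_pos=95, read_len=10, ref_len=100)
    print(right, overhang_score(right, softclip=True), overhang_score(right, softclip=False))
```

With softclipping, a read that runs past the transcript end keeps the score earned by its aligned portion, whereas the default strategy makes the same read pay a gap penalty that grows with the overhang.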
There is a new command-line flag, --hitFilterPolicy, which determines the policy by which hits or chains of hits are filtered in selective alignment, prior to alignment scoring. The options are BEFORE, AFTER, BOTH and NONE. Filtering hits after chaining (the default) is more sensitive, but more computationally intensive, because it performs the chaining dynamic program for all hits. Filtering before chaining is faster, but some true hits may be missed. The NONE option is not recommended, but is the most sensitive. It does not filter any chains based on score, though all methods only retain the highest-scoring chains per transcript for subsequent alignment scoring.
There is a new command-line flag --fullLengthAlignment, which performs selective-alignment over the full length of the read, beginning from the (approximate) initial mapping location and using extension alignment. This is in contrast with the default behavior which is to only perform alignment between the MEMs in the optimal chain (and before the first and after the last MEM if applicable). The default strategy forces the MEMs to belong to the alignment, but has the benefit that it can discover indels prior to the first hit shared between the read and reference.
The -d/--dumpEqWeights flag now dumps the information associated with whichever type of factorization is being used for quantification (the default now is range-factorized equivalence classes). The --dumpEq flag now always dumps simple equivalence classes. This means that no associated conditional probabilities are written to the file, and if range-factorization is being used, then all of the range-factorized equivalence classes that correspond to the same transcript set are collapsed into a simple equivalence class label and the corresponding counts are summed.
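The collapsing step described for --dumpEq can be sketched as follows. The in-memory representation is hypothetical (it is not salmon's file format or internal data structure): each range-factorized class is keyed by a transcript set plus a tuple of conditional-probability bin indices, and a simple class is keyed by the transcript set alone.

```python
from collections import defaultdict
from typing import Dict, Tuple

TranscriptSet = Tuple[str, ...]
RangeLabel = Tuple[TranscriptSet, Tuple[int, ...]]


def collapse_to_simple(range_factorized: Dict[RangeLabel, int]) -> Dict[TranscriptSet, int]:
    """Merge range-factorized classes that share a transcript set.

    The bin indices are dropped and the counts of all classes with the
    same transcript set are summed.
    """
    simple: Dict[TranscriptSet, int] = defaultdict(int)
    for (transcripts, _bins), count in range_factorized.items():
        simple[transcripts] += count
    return dict(simple)


if __name__ == "__main__":
    classes = {
        (("t1", "t2"), (0, 3)): 10,  # same transcripts, different bins
        (("t1", "t2"), (2, 1)): 5,
        (("t3",), (3,)): 7,
    }
    print(collapse_to_simple(classes))
```

In this toy input, the two classes over t1 and t2 merge into one simple class with a count of 15, and the single-transcript class is carried over unchanged.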
There has been a change to the default behavior of the VB prior. The default VB prior is now evaluated on a per-transcript rather than per-nucleotide basis. The previous behavior can be restored by passing the --perNucleotidePrior option to the quant command.
Considerable improvements have been made to fragment length modeling in the case of single-end samples.
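The two interpretations of the VB prior can be contrasted with a short sketch. The function name and the numeric values are hypothetical, and the sketch assumes that a per-nucleotide prior is turned into a transcript-level value by multiplying it by the transcript's length.

```python
from typing import List, Sequence


def vb_prior_alphas(lengths: Sequence[float], prior: float,
                    per_nucleotide: bool = False) -> List[float]:
    """Prior pseudo-count assigned to each transcript.

    per_nucleotide=False: every transcript receives `prior` directly.
    per_nucleotide=True:  every nucleotide receives `prior`, so a transcript's
                          total grows with its length (assumed: prior * length).
    """
    if per_nucleotide:
        return [prior * length for length in lengths]
    return [prior for _ in lengths]


if __name__ == "__main__":
    lengths = [100, 2000]  # made-up transcript lengths
    print(vb_prior_alphas(lengths, prior=0.5))
    print(vb_prior_alphas(lengths, prior=0.5, per_nucleotide=True))
```

The practical consequence is that a per-transcript prior treats short and long transcripts alike, while a per-nucleotide prior gives longer transcripts a proportionally larger prior mass.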
Alevin now contains a flag --quartzseq2 to support the Quartz-Seq2 protocol (thanks @dritoshi).
Bug fix: when alevin is run with the --dumpFeatures flag, it dumps featureDump.txt. The column header of this file was inconsistent with the values. This has been fixed: the ArborescenceCount field now appears as the last column.
Bug fix: the mtx format overflowed the gene-count boundary when the total number of genes was exactly a multiple of 8. This has been fixed in the latest release.
The following command-line flags have been removed (since, given the new index, they no longer serve a useful function): --allowOrphansFMD, --consistentHits, --quasiCoverage.