Vamb version 2

For this release, we have made significant changes to both the user-facing API and the internals.
The new API attempts to be more flexible and user-friendly, and the internal changes should make
Vamb perform better across a wider range of input datasets.

Highlights

Vamb should now be installed using pip.
The estimation of RPKM is now more accurate.
Vamb now allows encoding of single-sample datasets.
The clustering algorithm has fundamentally changed.
The benchmarking algorithm has fundamentally changed.

Now installable via pip

Binaries are no longer part of the package. These are instead automatically compiled using Cython upon installation.
Installing automatically installs Numpy, pysam and PyTorch if not installed already.
Installation enforces Python >= 3.5.
Installing via pip now adds the module to your $PATH so it can more easily be used from the command line.

Changes to RPKM estimation

A BAM file with no reads now returns a depth of all zeros instead of all NaNs.
The BAM parser no longer attempts to detect and filter away duplicate alignments erroneously produced by some aligners.
Reads without a mapping mate are now consistently counted as half a read when estimating RPKM.
Reads with N total subjects count as 1/N instead of 1 towards each subject.
Vamb no longer parses the XA-tags produced by BWA.
All alignments not marked as primary are now skipped.
Minimum alignment score now works as expected when set below 1. Set to None in order to disable the filtering.
Default minimum alignment score is now None instead of 50.
Exceptions raised in subprocesses now propagate properly to the parent process and end Vamb.
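
The 1/N weighting and the all-zeros behaviour for empty BAM files can be illustrated with a minimal sketch. This is not Vamb's BAM parser: the function names `fractional_counts` and `rpkm` and the pre-parsed `read_hits` input are hypothetical, and the formula is the standard definition of reads per kilobase per million mapped reads.

```python
def fractional_counts(read_hits, n_contigs):
    """Count reads per contig, where a read hitting N contigs adds 1/N to each.

    read_hits is a hypothetical pre-parsed input: one list of contig indices
    per read.
    """
    counts = [0.0] * n_contigs
    for hits in read_hits:
        if not hits:
            continue  # unmapped read, contributes nothing
        weight = 1.0 / len(hits)
        for contig in hits:
            counts[contig] += weight
    return counts


def rpkm(counts, lengths):
    """Reads per kilobase of contig per million mapped reads."""
    total = sum(counts)
    if total == 0:
        # No reads at all: return zeros instead of the NaN that 0/0 would give.
        return [0.0] * len(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths)]


counts = fractional_counts([[0], [0, 1], [1, 2]], n_contigs=3)
depths = rpkm(counts, lengths=[1000, 2000, 500])
```

With this weighting, each read contributes a total of one to the counts no matter how many contigs it hits.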

Vamb now allows encoding of single-sample datasets

When encoding a dataset with a single sample, the abundance loss uses the sum of squared errors and does not apply softmax.
Vamb now performs well on single-sample datasets compared to MetaBAT2.
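
One way to see why softmax is skipped for a single sample: softmax over a single value is always 1, so a softmax-normalised abundance would carry no information. The sketch below shows this alongside a plain sum of squared errors. It is an illustration only, not Vamb's loss code, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def sum_squared_error(pred, target):
    """Sum of squared differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


# With one sample, softmax returns [1.0] whatever the input is.
single_a = softmax([3.7])
single_b = softmax([-2.0])

# The squared error still distinguishes good from bad reconstructions.
loss = sum_squared_error([0.5], [2.0])
```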

New clustering algorithm

Now uses cosine distance instead of Pearson distance.
Clustering threshold is estimated for each cluster while running instead of beforehand.
The new approach is significantly slower, but more accurate.
cluster.read_clusters no longer crashes when reading an empty line.
Inputs are now shuffled in-place to reduce memory consumption.
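
The difference between the two distance measures can be sketched as follows. Here Pearson distance is taken to be one minus the Pearson correlation, which is an assumption about the earlier definition; under that assumption it equals the cosine distance of the mean-centred vectors. The function names are illustrative and are not Vamb's API.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pearson_distance(a, b):
    """One minus the Pearson correlation: cosine distance after centring."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])


a = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # a with a constant added to every element

d_cos = cosine_distance(a, shifted)
d_pearson = pearson_distance(a, shifted)
```

Adding a constant offset leaves the Pearson distance at zero but gives a nonzero cosine distance, so the two measures can disagree about whether two vectors are identical.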

New benchmarking algorithm

Now calculates recall and precision based on the total number of basepairs of the underlying genome covered by contigs in the bins.
Changed the interface and required input files.
Now accepts non-disjoint bins.
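
A basepair-based recall and precision could look roughly like the sketch below. It assumes each contig is annotated with a hypothetical (genome, start, end) position on its source genome, and it counts overlapping contigs only once by taking the union of their intervals. The exact definitions and input format used by the benchmarking module may differ.

```python
def covered_basepairs(intervals):
    """Length of the union of half-open (start, end) intervals."""
    total = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start  # new, non-overlapping stretch
            current_end = end
        elif end > current_end:
            total += end - current_end  # only the part extending past the union
            current_end = end
    return total


def recall_precision(bin_contigs, genome, genome_length):
    """Recall and precision of one bin against one genome, in basepairs.

    bin_contigs is a list of (genome_name, start, end) tuples, an assumed
    layout for this illustration.
    """
    by_genome = {}
    for name, start, end in bin_contigs:
        by_genome.setdefault(name, []).append((start, end))
    covered = {name: covered_basepairs(iv) for name, iv in by_genome.items()}
    true_positives = covered.get(genome, 0)
    total = sum(covered.values())
    recall = true_positives / genome_length
    precision = true_positives / total if total else 0.0
    return recall, precision


example_bin = [("A", 0, 100), ("A", 50, 150), ("B", 0, 50)]
recall, precision = recall_precision(example_bin, "A", genome_length=300)
```

Because the score is computed per bin from covered positions, a contig that appears in several bins poses no problem for this kind of calculation.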

Other changes

Vamb now optionally accepts a .npz file of TNF, sequence names and sequence lengths instead of a FASTA file.
Vamb can now parse depth information from BAM files, from a .npz file, or from the output of MetaBAT2's jgi_summarize_bam_contig_depths.
Contig parser now checks that the header contains no tab characters.
Vamb now transparently accepts FASTA files zipped with gzip, bzip2 and LZMA.
The mask is now saved in the output directory as mask.npz.
The order of the columns in the RPKM array is now printed in the log file.
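
Transparent handling of compressed FASTA files and the tab check on headers can be approximated with the standard library. This is a minimal sketch, not Vamb's reader: the names `open_maybe_compressed` and `check_header` are hypothetical, and the format is detected from well-known magic bytes (only the .xz container is detected for LZMA here).

```python
import bz2
import gzip
import lzma

# Leading bytes identifying each compressed format, paired with its opener.
_MAGIC = (
    (b"\x1f\x8b", gzip.open),
    (b"BZh", bz2.open),
    (b"\xfd7zXZ\x00", lzma.open),
)


def open_maybe_compressed(path):
    """Open path for binary reading, decompressing if a known magic is found."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, opener in _MAGIC:
        if head.startswith(magic):
            return opener(path, "rb")
    return open(path, "rb")


def check_header(header):
    """Reject FASTA headers (as bytes) that contain a tab character."""
    if b"\t" in header:
        raise ValueError("FASTA header contains a tab: %r" % header)
    return header
```

Detecting the format from the file contents rather than the extension means that a misnamed file is still read correctly.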
