Vamb version 2
For this release, we have made significant changes to both the user-facing API and the internals.
The new API attempts to be more flexible and user-friendly, and the internal changes should make
Vamb perform better across a wider range of input datasets.
Highlights
- Vamb should now be installed using pip.
- The estimation of RPKM is now more accurate.
- Vamb now allows encoding of single-sample datasets.
- The clustering algorithm has fundamentally changed.
- The benchmarking algorithm has fundamentally changed.
Now installable via pip
- Binaries are no longer part of the package. These are instead automatically compiled using Cython upon installation.
- Installation automatically installs NumPy, pysam and PyTorch if they are not already installed.
- Installation enforces Python >= 3.5.
- Installing via pip now adds the module to your $PATH so it can more easily be used from the command line.
Changes to RPKM estimation
- A BAM file with no reads now returns a depth of all zeros instead of all NaNs.
- The BAM parser no longer attempts to detect and filter away duplicate alignments erroneously produced by some aligners.
- Reads without a mapping mate are now consistently counted as half a read when estimating RPKM.
- Reads aligned to N subjects in total now count as 1/N towards each subject instead of 1.
- Vamb no longer parses the XA-tags produced by BWA.
- All alignments not marked as primary are now skipped.
- Minimum alignment score now works as expected when set below 1. Set it to None to disable the filtering.
- Default minimum alignment score is now None instead of 50.
- Exceptions raised in subprocesses now propagate properly to the parent process and end Vamb.
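
The fractional counting rules above can be sketched roughly as follows. This is a minimal illustration, not Vamb's actual BAM parser: the names `read_weight` and `rpkm` are hypothetical, and multiplying the two rules together when both apply is an assumption. RPKM is computed in the usual way, as reads per kilobase of sequence per million mapped reads.

```python
def read_weight(n_subjects, mate_mapped):
    """Hypothetical helper: weight one read adds to each subject it aligns to."""
    if n_subjects < 1:
        raise ValueError("a mapped read must have at least one subject")
    weight = 1.0 / n_subjects  # a read with N subjects counts 1/N towards each
    if not mate_mapped:
        weight *= 0.5  # assumption: the half-read rule combines multiplicatively
    return weight


def rpkm(weighted_counts, lengths):
    """Reads per kilobase per million mapped reads, one value per sequence."""
    if len(weighted_counts) != len(lengths):
        raise ValueError("need one length per count")
    total = sum(weighted_counts)
    if total == 0:
        # No reads at all: return zeros rather than dividing by zero (NaN).
        return [0.0] * len(weighted_counts)
    return [count * 1e9 / (length * total)
            for count, length in zip(weighted_counts, lengths)]
```

The zero-read branch mirrors the first item in the list: an empty BAM file yields a depth of all zeros instead of NaNs.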
Vamb now allows encoding of single-sample datasets
- When encoding a dataset with a single sample, the abundance loss uses the sum of squared errors and does not apply softmax.
- Vamb now performs well on single-sample datasets compared to MetaBAT2.
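
One reason to skip softmax for a single sample is that the softmax of a single value is always 1, so it carries no information about abundance. The sketch below illustrates the switch in plain Python. It is not Vamb's PyTorch code: the function names are hypothetical, and the cross-entropy used in the multi-sample branch is an assumption, since these notes only describe the single-sample case.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def abundance_loss(predicted, observed):
    """Hypothetical per-contig abundance loss; one entry per sample."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(observed) == 1:
        # Single sample: sum of squared errors on the raw output, no softmax.
        return sum((p - o) ** 2 for p, o in zip(predicted, observed))
    # Assumption: with several samples, compare softmaxed outputs by cross-entropy.
    probabilities = softmax(predicted)
    return -sum(o * math.log(p) for p, o in zip(probabilities, observed))
```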
New clustering algorithm
- Now uses cosine distance instead of Pearson distance.
- Clustering threshold is estimated for each cluster while running instead of beforehand.
- The new approach is significantly slower, but more accurate.
- cluster.read_clusters no longer crashes when reading an empty line.
- Inputs are now shuffled in-place to reduce memory consumption.
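
To make the change of distance measure concrete, the sketch below contrasts the two in plain Python. It assumes the common convention of distance = 1 - similarity (other conventions rescale this), and the function names are illustrative rather than Vamb's own. Pearson distance is simply cosine distance computed after subtracting each vector's mean.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def pearson_distance(a, b):
    """1 - Pearson correlation: cosine distance of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])
```

The two measures differ whenever the vectors have non-zero means: `[1, 2, 3]` and `[2, 3, 4]` are perfectly correlated, so their Pearson distance is zero, but they point in slightly different directions, so their cosine distance is small but positive.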
New benchmarking algorithm
- Now calculates recall and precision based on the total number of basepairs of the underlying genome covered by contigs in the bins.
- Changed the interface and required input files.
- Now accepts non-disjoint bins.
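
A basepair-based recall and precision could be computed along the following lines. This is a sketch under stated assumptions, not the actual benchmarking code: it assumes each contig is described by a hypothetical `(genome, start, end)` tuple with half-open coordinates, that overlapping contigs are counted once per covered position, and that precision is the fraction of a bin's covered basepairs that belong to the genome of interest.

```python
def covered_basepairs(intervals):
    """Number of positions in the union of half-open (start, end) intervals."""
    total = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from everything seen so far: count the whole interval.
            total += end - start
            current_end = end
        elif end > current_end:
            # Overlaps the previous stretch: count only the new part.
            total += end - current_end
            current_end = end
    return total


def recall_precision(bin_contigs, genome, genome_length):
    """Recall and precision of one bin with respect to one genome."""
    by_genome = {}
    for name, start, end in bin_contigs:
        by_genome.setdefault(name, []).append((start, end))
    covered = {name: covered_basepairs(iv) for name, iv in by_genome.items()}
    true_positives = covered.get(genome, 0)
    total_covered = sum(covered.values())
    recall = true_positives / genome_length
    precision = true_positives / total_covered if total_covered else 0.0
    return recall, precision
```

Because each bin is scored on its own, nothing in this scheme requires bins to be disjoint.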
Other changes
- Vamb now optionally accepts a .npz file of TNF, sequence names and sequence lengths instead of a FASTA file.
- Vamb can now parse depth information from BAM files, from a .npz file, or from the output of MetaBAT2's jgi_summarize_bam_contig_depths.
- Contig parser now checks that the header contains no tab characters.
- Vamb now transparently accepts FASTA files zipped with gzip, bzip2 and LZMA.
- The mask is now saved in the output directory as mask.npz.
- The order of the columns in the RPKM array is now printed in the log file.
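
Transparent handling of compressed FASTA files can be done by inspecting the first bytes of the file rather than trusting its extension. The sketch below shows one way to do this with the standard library. The function name `open_maybe_compressed` is hypothetical and this is not necessarily how Vamb implements it. Note that only the .xz container has a reliable signature, so legacy .lzma files are not detected here.

```python
import bz2
import gzip
import lzma


def open_maybe_compressed(path):
    """Open a file for binary reading, decompressing it if a known signature is found."""
    with open(path, "rb") as handle:
        signature = handle.read(6)
    if signature.startswith(b"\x1f\x8b"):  # gzip
        return gzip.open(path, "rb")
    if signature.startswith(b"BZh"):  # bzip2
        return bz2.open(path, "rb")
    if signature.startswith(b"\xfd7zXZ\x00"):  # xz container (LZMA2)
        return lzma.open(path, "rb")
    return open(path, "rb")  # assume plain, uncompressed input
```

The returned object can be used as a context manager in every case, so calling code does not need to know which format it was given.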