Version 2

@jakobnissen jakobnissen released this 11 Jul 12:33
· 459 commits to master since this release

Vamb version 2

For this release, we have made significant changes to both the user-facing API and the internals.
The new API attempts to be more flexible and user-friendly, and the internal changes should make
Vamb perform better across a wider range of input datasets.

Highlights

  • Vamb should now be installed using pip.
  • The estimation of RPKM is now more accurate.
  • Vamb now allows encoding of single-sample datasets.
  • The clustering algorithm has fundamentally changed.
  • The benchmarking algorithm has fundamentally changed.

Now installable via pip

  • Binaries are no longer part of the package. These are instead automatically compiled using Cython upon installation.
  • Installing Vamb automatically installs NumPy, pysam and PyTorch if they are not installed already.
  • Installation enforces Python >= 3.5.
  • Installing via pip now adds the module to your $PATH so it can more easily be used from the command line.

Changes to RPKM estimation

  • A BAM file with no reads now returns a depth of all zeros instead of all NaNs.
  • The BAM parser no longer attempts to detect and filter away duplicate alignments erroneously produced by some aligners.
  • Reads without a mapping mate are now consistently counted as half a read when estimating RPKM.
  • A read aligned to N subjects in total now counts as 1/N towards each subject instead of 1.
  • Vamb no longer parses the XA-tags produced by BWA.
  • All alignments not marked as primary are now skipped.
  • Minimum alignment score now works as expected when set below 1. Set to None in order to disable the filtering.
  • Default minimum alignment score is now None instead of 50.
  • Exceptions raised in subprocesses now propagate properly to the parent process and terminate Vamb.
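
The counting rules above can be sketched in a few lines of plain Python. This is not Vamb's BAM parser: the `Fragment` record, its fields and both helper functions are hypothetical names introduced for illustration. The sketch assumes that a read with a mapping mate has weight 1 and that RPKM follows the usual definition (reads per kilobase of contig per million counted reads).

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Fragment(NamedTuple):
    """Hypothetical, simplified alignment record (not a pysam or Vamb type)."""
    subjects: tuple      # names of the contigs the read aligns to
    mate_mapped: bool    # False if the read has no mapping mate
    score: float         # alignment score


def weighted_counts(fragments, min_score: Optional[float] = None):
    """Sum fractional read counts per contig.

    min_score=None disables score filtering, mirroring the new default.
    """
    counts = defaultdict(float)
    for frag in fragments:
        if not frag.subjects:
            continue
        if min_score is not None and frag.score < min_score:
            continue
        # Assumption: a read without a mapping mate counts as half a read.
        weight = 1.0 if frag.mate_mapped else 0.5
        # A read with N subjects contributes 1/N of its weight to each.
        share = weight / len(frag.subjects)
        for subject in frag.subjects:
            counts[subject] += share
    return dict(counts)


def rpkm(counts, lengths):
    """Reads per kilobase per million counted reads; all zeros if no reads."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in lengths}
    return {
        name: counts.get(name, 0.0) / (lengths[name] / 1e3) / (total / 1e6)
        for name in lengths
    }
```

Note that an input with no reads yields zeros rather than NaNs, because the division by the total read count is skipped in that case.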

Vamb now allows encoding of single-sample datasets

  • When encoding a dataset with a single sample, the abundance loss uses the sum of squared errors, and no softmax is applied.
  • Vamb now performs well on single-sample datasets compared to MetaBAT2.
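
One way to see why a softmax is unhelpful with a single sample: a softmax over one value always returns 1, so it would discard the abundance entirely. The sketch below illustrates this and shows a sum-of-squared-errors function. It is plain Python written for illustration, not Vamb's PyTorch loss, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sum_squared_error(predicted, observed):
    """Sum of squared differences between two equal-length sequences."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


# With several samples, softmax gives a distribution across samples.
print(softmax([2.0, 1.0, 0.1]))

# With one sample, the result is [1.0] regardless of the input value.
print(softmax([2.0]))
print(softmax([-5.0]))

# Sum of squared errors keeps the magnitude of the single abundance value.
print(sum_squared_error([0.5], [2.0]))
```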

New clustering algorithm

  • Now uses cosine distance instead of Pearson distance.
  • Clustering threshold is estimated for each cluster while running instead of beforehand.
  • The new approach is significantly slower, but more accurate.
  • cluster.read_clusters no longer crashes when reading an empty line.
  • Inputs are now shuffled in-place to reduce memory consumption.
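
To make the change of distance measure concrete, here is a minimal sketch of the two distances in plain Python. It assumes the common definitions "1 minus cosine similarity" and "1 minus Pearson correlation"; the exact scaling used inside Vamb is not stated in these notes, and the function names are hypothetical. The Pearson correlation equals the cosine similarity of the mean-centered vectors, which is the only difference between the two.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def pearson_distance(a, b):
    """1 - Pearson correlation: cosine distance after mean-centering."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_distance(centered_a, centered_b)


# Two vectors that differ only by a constant offset: Pearson distance is
# (numerically) zero, while cosine distance is small but positive.
u, v = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
print(pearson_distance(u, v), cosine_distance(u, v))
```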

New benchmarking algorithm

  • Now calculates recall and precision based on the total number of basepairs of the underlying genome covered by contigs in the bins.
  • Changed the interface and required input files.
  • Now accepts non-disjoint bins.
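
The idea of scoring by covered basepairs can be sketched as follows. This is an illustration under stated assumptions, not Vamb's benchmarking code: it assumes each contig's true location is known as a (genome, start, end) triple with half-open coordinates, that recall is divided by the genome length, and that precision is divided by the total basepairs the bin covers across all genomes. All names are hypothetical.

```python
def covered_basepairs(intervals):
    """Length of the union of half-open (start, end) intervals."""
    total = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from everything seen so far: count it in full.
            total += end - start
            current_end = end
        elif end > current_end:
            # Overlaps the previous stretch: count only the new part.
            total += end - current_end
            current_end = end
    return total


def recall_precision(bin_contigs, genome, genome_length):
    """Score one bin against one genome by covered basepairs.

    bin_contigs: iterable of (genome_name, start, end) triples.
    """
    by_genome = {}
    for name, start, end in bin_contigs:
        by_genome.setdefault(name, []).append((start, end))
    covered = {name: covered_basepairs(iv) for name, iv in by_genome.items()}
    hit = covered.get(genome, 0)
    total = sum(covered.values())
    recall = hit / genome_length
    precision = hit / total if total else 0.0
    return recall, precision
```

Because overlapping contigs are merged before counting, a stretch of genome covered twice is only counted once. Each bin is scored independently, so a contig that appears in more than one bin poses no problem for this kind of calculation.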

Other changes

  • Vamb now optionally accepts a .npz file of TNF, sequence names and sequence lengths instead of a FASTA file.
  • Vamb can now parse depth information from BAM files, from a .npz file, or from the output of MetaBAT2's jgi_summarize_bam_contig_depths.
  • Contig parser now checks that the header contains no tab characters.
  • Vamb now transparently accepts FASTA files zipped with gzip, bzip2 and LZMA.
  • The mask is now saved in the output directory as mask.npz.
  • The order of the columns in the RPKM array is now printed in the log file.
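
Transparent handling of compressed FASTA files, together with the tab check on headers, could look roughly like the sketch below. It is not Vamb's reader: the function names are hypothetical, and the compression format is detected from the file's leading magic bytes, which for LZMA only covers the .xz container. It uses only the standard-library gzip, bz2 and lzma modules.

```python
import bz2
import gzip
import lzma

# Leading magic bytes of each supported format and the matching opener.
_OPENERS = (
    (b"\x1f\x8b", gzip.open),
    (b"BZh", bz2.open),
    (b"\xfd7zXZ\x00", lzma.open),
)


def open_maybe_compressed(path):
    """Open a text file, decompressing it if a known magic number is found."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, opener in _OPENERS:
        if head.startswith(magic):
            return opener(path, "rt")
    return open(path, "rt")


def read_headers(path):
    """Return FASTA headers, rejecting any header that contains a tab."""
    headers = []
    with open_maybe_compressed(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].rstrip("\r\n")
                if "\t" in header:
                    raise ValueError(f"tab character in header: {header!r}")
                headers.append(header)
    return headers
```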