Releases: RasmussenLab/vamb
v3.0.5
This release contains only bugfixes.
Vamb 3.0.3
Bugfix release, but with changes to dependency compat.
- Specify dependencies more strictly, only accepting compatible versions instead of minimum versions.
- Check refhash even if RPKM is loaded from JGI file
- Improved logic for loading JGI input
- Fix some simple bugs
Snakemake update
Updates:
Snakemake workflow added
The default mini-batch size is now 256 instead of 64. This speeds up encoding by up to 4x.
Minor bugfixes
Vamb 3.0.1
This resolves an install issue in Vamb 3.0
Vamb 3.0
Highlights
- New `--minfasta` command line option, which, when set, causes Vamb to output any bins with more nucleotides than this number.
- TNFs are now projected to 103 linearly independent dimensions, saving memory and time, as well as improving VAE training.
- Clustering hyperparameters have been optimized further, improving binning accuracy.
- Benchmarking of higher clades has changed and is now simpler to understand.
Changes in behaviour:
- TNFs are now automatically projected down to 103 linearly independent axes as done in Kislyuk et al. 2009, BMC Bioinformatics. This saves RAM and increases VAE training speed. As a result, all TNFs are now a Nx103 matrix down from a Nx136 matrix.
- Clustering is now done by a `ClusterGenerator` object, which contains the clustering state. This object can be queried for its state and manipulated much more transparently than the previous generator.
- A `ClusterGenerator` object returns `Cluster` objects containing the cluster itself as well as some metadata. `ClusterGenerator` has different hyperparameters than the old cluster generators, and is slightly more accurate. As a result, the clusters generated have changed.
- Benchmarking of higher clades has been changed. Previously, the precision of a higher clade was defined as the recall of covered nucleotides belonging to the genomes of that clade. Now, a clade is recovered at a given threshold iff any member of it is.
- Cluster names in `clusters.tsv` are now more legible.
New scripts:
- src/concatenate.py: Concatenates and renames FASTA files to FASTA catalogue
- src/create_kernel: Recreates kernel.npz. To be used as documentation.
New functions or classes:
- `vambtools.PushArray`: Data structure that allows efficient appending to and extending of a 1D NumPy array. Intended to strike a balance between not resizing too often (which is slow), and not allocating too much at a time (which is memory inefficient).
- `cluster.ClusterGenerator` and `cluster.Cluster`, see "Changes in behaviour" for details.
- You can now close a `Reader` with `reader.close()` and iterate it like other file handles.
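The resizing trade-off described for `vambtools.PushArray` can be illustrated with a small growable buffer. This is a minimal sketch of the general technique, not Vamb's actual implementation: the class name `GrowableArray`, its methods, the starting capacity and the 1.5x growth factor are all assumptions made for illustration.

```python
import numpy as np


class GrowableArray:
    """Hypothetical push-style array; illustrative only, not Vamb's code."""

    def __init__(self, dtype, start_capacity=16):
        self.data = np.empty(start_capacity, dtype=dtype)
        self.length = 0  # number of slots actually filled

    def _grow(self, minimum):
        # Grow geometrically so appends are amortized O(1), but never
        # allocate less than what is needed right now.
        new_capacity = max(minimum, int(len(self.data) * 1.5) + 1)
        new_data = np.empty(new_capacity, dtype=self.data.dtype)
        new_data[:self.length] = self.data[:self.length]
        self.data = new_data

    def append(self, value):
        if self.length == len(self.data):
            self._grow(self.length + 1)
        self.data[self.length] = value
        self.length += 1

    def extend(self, values):
        values = np.asarray(values, dtype=self.data.dtype)
        needed = self.length + len(values)
        if needed > len(self.data):
            self._grow(needed)
        self.data[self.length:needed] = values
        self.length = needed

    def take(self):
        """Return a copy of the filled part of the buffer."""
        return self.data[:self.length].copy()
```

A smaller growth factor wastes less memory but resizes more often; a larger one does the opposite. The factor used here is arbitrary.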
Removed functions:
- `vambtools.zeros`
- `vambtools.FastaEntry.fourmer_freq`
Moved functions:
- `read_clusters` from `cluster` to `vambtools`.
- `write_clusters` from `cluster` to `vambtools`.
Misc:
- Minimum sequence length (`-m` option in command line) is now printed in log file.
- Optimizations have made the `preallocate` keyword in function `parsecontigs.read_contigs` obsolete, and it has been removed. It is now always both fast and memory efficient.
- No longer warns when used on fewer than 50,000 contigs.
- Various small bug fixes.
- Various optimizations have improved running time and memory use.
Vamb 2.1
Release of Vamb 2.1
New features:
Support for PyTorch 1.2
Support for CUDA-accelerated clustering
Now writes lengths.npz, a vector of contig lengths to output directory.
Now checks that each BAM file's sort order is either not present, unsorted or sorted by readname
Now checks that the references in BAM file are same as references in FASTA file
Bugfixes:
Correctly compensate for float rounding error in cluster histogram function
Fixed bug where RPKM were consistently half their true value
Fixed a bug where all-zero rows could not be clustered
make_dataloader now no longer crashes when passed destroy=True
Fixed a NameError bug in vambtools.filterfasta that made it unusable
If more than 8 cores are specified for BAM file parsing, Vamb now actually uses them.
Fixed wrong error message if input contiglength npz array was not of integer type
Correctly print batchsteps in VAE training if they are None
Now raises a legible error if running Vamb on command line and length of RPKM and TNF does not match, even if refhashing is disabled.
Now also creates a rpkm.npz file if depths input is a JGI file.
Internal (interpreter-use) behaviour change
New function: vambtools.concatenate_fasta, which creates a concatenated sequence catalogue from multiple FASTA files, renames the sequences to conform to binsplitting, and makes sure the headers are unique.
New function: parsebam.count_reads, which counts mapping reads in a BAM file
New function: parsebam.calc_rpkm that calculates RPKM from read counts and contig lengths
Removed function: Pearson correlation based clustering is no longer available
When labels is None, vamb.cluster.cluster now returns an iterator of indices, not indices + 1
cluster.write_clusters now has a "rename" keyword. It will not rename clusters if set to False.
vambtools.loadfasta and vambtools.write_bins now have keywords "compress" and "compressed" (False by default), which cause sequences to be stored in compressed form.
parsebam.read_bamfiles now takes argument refhash, which should be None or the md5 hash that the references must hash to.
Argument minlength can now consistently be passed as None in functions of module parsebam, similar to the other filtering parameters.
benchmark.Contig is no longer immutable, and can now be deserialized using pickle.
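The refhash check above amounts to comparing a digest of the reference names against an expected value. The sketch below shows one way such a digest could be computed with the standard library. The function name `hash_refnames` and the exact hashing scheme (names joined with a newline separator) are assumptions for illustration and may not match what Vamb does internally.

```python
import hashlib


def hash_refnames(names):
    """Return an md5 digest over an ordered collection of reference names.

    Hypothetical helper, not Vamb's API. A separator byte is added after
    each name so that ["ab", "c"] and ["a", "bc"] hash differently.
    """
    hasher = hashlib.md5()
    for name in names:
        hasher.update(name.encode("utf-8") + b"\n")
    return hasher.digest()


def check_refhash(names, expected):
    """Raise if the names do not hash to `expected`; None disables the check."""
    if expected is not None and hash_refnames(names) != expected:
        raise ValueError("Reference names do not match the expected hash")
```

Because the digest depends on both the names and their order, a mismatch indicates that the BAM and FASTA files do not list the same references in the same order.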
Misc:
Improved validation of command line options
Now handles whitespace in FASTA sequences (tabs, and spaces only)
Increased numerical stability in RPKM estimation
Numerous documentation and performance improvements
Now warns user that VAE may overfit if fewer than 50,000 contigs are kept.
Output cluster names are now more legible and contain information about the pre-binsplit cluster.
Version 2.0.1
This version consists of a bugfix for a silly syntax error in version 2.0.0.
Version 2
Vamb version 2
For this release, we have made significant changes to both the user-facing API and the internals.
The new API attempts to be more flexible and user-friendly, and the internal changes should make
Vamb perform better across a wider range of input datasets.
Highlights
- Vamb should now be installed using pip.
- The estimation of RPKM is now more accurate.
- Vamb now allows encoding of single-sample datasets.
- Clustering algorithm fundamentally changed.
- Benchmarking algorithm fundamentally changed.
Now installable via pip
- Binaries are no longer part of the package. These are instead automatically compiled using Cython upon installation.
- Installing automatically installs Numpy, pysam and PyTorch if not installed already.
- Installation enforces Python >= 3.5.
- Installing via pip now adds the module to your $PATH so it can more easily be used from command line.
Changes to RPKM estimation
- A BAM file with no reads now returns a depth of all zeros instead of all NaNs.
- The BAM parser no longer attempts to detect and filter away duplicate alignments erroneously produced by some aligners.
- Reads without a mapping mate are now consistently counted as half a read when estimating RPKM.
- Reads with N total subjects count as 1/N instead of 1 towards each subject.
- Vamb no longer parses the XA-tags produced by BWA.
- All alignments not marked as primary are now skipped.
- Minimum alignment score now works as expected when set below 1. Set to None in order to disable the filtering.
- Default minimum alignment score is now None instead of 50.
- Exceptions raised in subprocesses now propagate properly to the parent process and end Vamb.
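The counting rules above can be sketched in a few lines. This is one possible reading of the notes, not Vamb's parser: the input format (pairs of `has_mate` and a list of reference indices) and both function names are made up for illustration. The RPKM formula itself is the standard one, reads per kilobase of reference per million counted reads.

```python
def fractional_counts(records, n_refs):
    """Accumulate per-reference read counts.

    `records` is an iterable of (has_mate, refs) pairs, a hypothetical
    input format: `has_mate` says whether the read's mate also mapped,
    and `refs` lists the indices of the references the read aligned to.
    """
    counts = [0.0] * n_refs
    for has_mate, refs in records:
        if not refs:
            continue
        weight = 1.0 if has_mate else 0.5  # lone reads count as half a read
        for ref in refs:
            counts[ref] += weight / len(refs)  # N subjects get 1/N each
    return counts


def rpkm(counts, lengths):
    """Reads per kilobase per million counted reads, for each reference."""
    total = sum(counts)
    if total == 0:
        # No reads at all: report all zeros rather than NaNs.
        return [0.0] * len(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths)]
```

The early return for an empty input mirrors the change noted above, where a BAM file with no reads yields a depth of all zeros.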
Vamb now allows encoding of single-sample datasets
- When encoding a dataset with a single sample, the abundance loss uses sum of squared error, and does not softmax.
- Vamb now performs well on single-sample datasets compared to MetaBAT2.
New clustering algorithm
- Now uses cosine distance instead of Pearson distance.
- Clustering threshold is estimated for each cluster while running instead of beforehand.
- The new approach is significantly slower, but more accurate.
- cluster.read_clusters now doesn’t crash when reading an empty line.
- Inputs are now shuffled in-place to reduce memory consumption.
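For readers unfamiliar with the two measures, the sketch below contrasts them using the common textbook definitions (distance = 1 - similarity). Vamb may scale or implement the distance differently, and the function names are illustrative. Pearson correlation is the cosine similarity of mean-centered vectors, so the two differ only in the centering step.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine distance is undefined for all-zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def pearson_distance(a, b):
    """1 - Pearson correlation: cosine distance after mean-centering."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])
```

Two vectors that differ by a constant offset, such as `[1, 2, 3]` and `[11, 12, 13]`, are perfectly correlated but not parallel, so their Pearson distance is zero while their cosine distance is not.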
New benchmarking algorithm
- Now calculates recall and precision based on the total number of basepairs of the underlying genome covered by contigs in the bins.
- Changed the interface and required input files.
- Now accepts non-disjoint bins
Other changes
- Vamb now optionally accepts a .npz file of TNF, sequence names and sequence lengths instead of a FASTA file.
- Vamb can now parse depths information from BAM files, from a .npz file, or from the output of MetaBAT2's jgi_summarize_bam_contig_depths.
- Contig parser now checks that the header contains no tab characters.
- Vamb now transparently accepts FASTA files zipped with gzip, bzip2 and LZMA.
- The mask is now saved in the output directory as mask.npz.
- The order of the columns in the RPKM array is now printed in the log file.
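Transparent handling of compressed input is commonly done by inspecting the first bytes of the file. The sketch below shows that technique with the standard library. The function name `open_maybe_compressed` is hypothetical, this is not Vamb's reader, and only the `.xz` container is detected for LZMA (legacy `.lzma` files have no fixed magic number).

```python
import bz2
import gzip
import lzma


def open_maybe_compressed(path):
    """Open `path` for binary reading, decompressing if it looks compressed.

    Hypothetical helper: the format is guessed from the leading magic bytes.
    """
    with open(path, "rb") as file:
        magic = file.read(6)
    if magic[:2] == b"\x1f\x8b":      # gzip
        return gzip.open(path, "rb")
    if magic[:3] == b"BZh":           # bzip2
        return bz2.open(path, "rb")
    if magic == b"\xfd7zXZ\x00":      # xz container (LZMA2)
        return lzma.open(path, "rb")
    return open(path, "rb")           # assume plain, uncompressed text
```

Sniffing the content rather than trusting the file extension means a gzipped file named `contigs.fna` is still read correctly.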
Version 1.0.1
This release includes a bugfix for a critical bug in the clustering code, as well as improved documentation in the tutorial.
Official release
Version 1.0. This release is the first non-development release of Vamb. The version of Vamb in this release is exactly the version used in the paper on bioRxiv, except with several bugfixes, better error messages and improved cosmetics.
Development of Vamb will almost surely continue in the coming year as we experiment with new ways of improving or applying the method. However, it may take a long time before these changes are merged into the master branch.