ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65
Hi Jolespin, This is strange. What happens if you use `-m` to set the minimum contig length to 2000? Can you also paste the log file? Second, are you running on a single sample? If you have more samples, you should assemble them one by one and then run Vamb on all samples simultaneously (multi-split mode). Best, Simon
After looking into the documentation a bit more, I'm thinking this tool might not be the best for my current purpose, since these are subsets of reads that I've assembled and there are far fewer than 50k contigs (~20k, with about 500 long contigs). However, the fact that you can assemble genomes separately and bin them with multi-split mode is absolutely amazing, and I will DEFINITELY be using this in the future for larger projects. Is there a tl;dr for the multi-split mode? For example, does it make consensus bins from all the samples and, if so, how does it reduce the redundancy? Thanks again, and I'm looking forward to integrating your package into the workflows at my institute.
Thanks for using Vamb. Multi-split is really dirt simple. After assembling individual samples, they are binned together. We then simply split each bin by sample: literally, we take all the contigs in bin 1 from sample 1 and put them in bin_1_1, the contigs from bin 1 in sample 2 in bin_1_2, etc. So there is no reduction of redundancy; you get the same genomes duplicated if they are present in multiple samples.
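That splitting step can be sketched in a few lines. This is a hypothetical helper, not Vamb's actual implementation; it assumes each contig name carries a sample prefix followed by a separator such as `C`, as in Vamb's documentation examples:

```python
from collections import defaultdict

def split_bins(clusters, separator="C"):
    """Split each joint bin by the sample prefix of its contigs.

    clusters: dict mapping bin name -> iterable of contig names, where
    each contig is named "<sample><separator><id>".
    Returns a dict mapping "<bin>_<sample>" -> list of contig names.
    """
    split = defaultdict(list)
    for bin_name, contigs in clusters.items():
        for contig in contigs:
            # Everything before the first separator is the sample name
            sample = contig.split(separator, 1)[0]
            split[f"{bin_name}_{sample}"].append(contig)
    return dict(split)

# Example: joint bin 1 contains contigs from two samples
clusters = {"1": ["sample1C42", "sample1C99", "sample2C7"]}
print(split_bins(clusters))
# {'1_sample1': ['sample1C42', 'sample1C99'], '1_sample2': ['sample2C7']}
```

Because the split is purely by name prefix, the joint clustering is untouched; the same genome present in two samples simply ends up in two per-sample bins, matching the "no reduction of redundancy" point above.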
Ok sounds good! I think Vamb could work with fewer samples and contigs (even though we only tested with 50,000 contigs). If you are interested, we could have a go at your data. Best, Simon
Ah, that makes even more sense now that I think about it. For a bit I was doing coassemblies of all my samples, but what I realized was that the genome bins that were pulled out were not necessarily biologically true and were more representative of the population. The fact that the bins are maintained throughout and binned together is a good idea, because you can 1) have something to cross-reference between samples although the actual bins are different; 2) have bins that are more true to the real biological sequence; and 3) if the user desired to create a coassembly (let's say they want to create an abundance table of contigs instead of OTUs), they could simply map to the multi-sample bin and then reassemble. Are the intermediate bin assignments stored in multi-sample mode?
Sort of. It's not stored, but the final bin is named e.g.
I reran this on another dataset and it worked fine:

```shell
COV=../Assembly/883.abundance.metabat2.tsv
FASTA=../Assembly/883.fasta
OUT=883/vamb_output
qsub -cwd -P 0594 -N job_883_vamb -j y -o job_883_vamb.o "source activate binning_env && vamb --outdir $OUT --fasta $FASTA --jgi $COV --minfasta 200000"
```

Not sure if this will help anyone, but I lost access to my raw reads and BAM files b/c I'm resurrecting an old project. To run this, I made my own JGI-formatted file from the metaSPAdes coverage info:
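For anyone else without BAM files, a rough sketch of how such a file could be built. It assumes the k-mer coverage parsed from metaSPAdes headers (e.g. `NODE_1_length_5000_cov_12.5`) stands in for a single sample's depth; the column layout follows the `jgi_summarize_bam_contig_depths` output, and the variance column here is just a zero placeholder. The helper name and sample label are hypothetical:

```python
import csv

def jgi_from_spades_headers(headers, sample="sample1.bam",
                            out_path="jgi.abundance.dat"):
    """Write a jgi_summarize_bam_contig_depths-style TSV from metaSPAdes
    headers such as 'NODE_1_length_5000_cov_12.5'.

    The header's k-mer coverage is used as a stand-in depth for both
    totalAvgDepth and the single sample column; variance is set to 0.
    """
    with open(out_path, "w", newline="") as fh:
        w = csv.writer(fh, delimiter="\t")
        w.writerow(["contigName", "contigLen", "totalAvgDepth",
                    sample, sample + "-var"])
        for name in headers:
            parts = name.split("_")
            length = int(parts[parts.index("length") + 1])
            cov = float(parts[parts.index("cov") + 1])
            w.writerow([name, length, cov, cov, 0.0])

jgi_from_spades_headers(
    ["NODE_1_length_5000_cov_12.5", "NODE_2_length_800_cov_3.0"]
)
```

Note that k-mer coverage is not the same quantity as read-mapping depth, so treat this strictly as a workaround when the reads are gone.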
Closing this issue. |
Hi, I found your VAMB algorithm very interesting and I wanted to evaluate the algorithm on a current project of mine.
I'm evaluating previous Canopy results vs. Vamb, and I want to use the same input data for both algorithms to minimize differences, so I've created my own JGI-formatted file from the previous analysis. I'm running on a small subset of the data, only 10 samples, to evaluate resource allocation for the larger run of ~1,200 samples. I'm also using a filtered version of my JGI file where I have removed low-abundance genes. The totalAvgDepth is identical to the first sample in my JGI file. This is the command I'm running via the Snakemake script:
This is the current VAMB version that I'm using:

This is how I've formatted the JGI input:
Here is what the --fasta input contigs.flt.fna.gz looks like:
This is the log/vamb/vamb.log:
Best,
Dear @HaraldBrolin The error comes because each contig needs both a TNF (obtained from the FASTA file) and an RPKM (obtained from the JGI input file). To fix the problem, you need to remove the sequences in the FASTA file for which you don't have entries in the JGI depths file. The JGI file does not seem to be correctly formatted, either. It should look like this.
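That filtering step can be sketched as follows. This is a hypothetical plain-Python helper, not part of Vamb; in practice Biopython or `seqkit grep` would do the same job:

```python
def filter_fasta(fasta_lines, jgi_lines):
    """Keep only FASTA records whose name appears in the contigName
    column (first column) of the JGI depths file."""
    # Skip the JGI header line, then collect contig names
    keep = {line.split("\t")[0] for line in jgi_lines[1:]}
    out, write = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            # FASTA record name is the first word after '>'
            write = line[1:].split()[0] in keep
        if write:
            out.append(line)
    return out

fasta = [">c1", "ACGT", ">c2", "GGCC"]
jgi = ["contigName\tcontigLen\ttotalAvgDepth", "c1\t4\t2.0"]
print(filter_fasta(fasta, jgi))  # ['>c1', 'ACGT']
```

After filtering, the FASTA and the JGI file describe exactly the same contigs, so the TNF and RPKM arrays have matching lengths.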
Thanks for the quick reply @jakobnissen. I'll try to fix the issues and write an update. |
Hi @jakobnissen, I've matched the genes/contigs in both the jgi.abundance.dat and the FASTA file (contigs.flt.fna.gz). md5sum of genes/contig-names after removing the initial ">".
md5sum of contigNames after dropping the header.
Regarding the format of the jgi.abundance.dat, the only difference I can detect between my JGI file and the file from this link: here, is the contig length and the variance column. I can't do much about the lengths, but if I understood correctly, Vamb can run on genes. I've added the variance column and, as a test, I've only kept one sample.
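The same check the md5sums perform can also be done directly in Python. A minimal sketch, assuming a tab-separated JGI file with one header line; the helper name is hypothetical:

```python
def names_match(fasta_lines, jgi_lines):
    """Check that the FASTA record names and the JGI contigName column
    agree in both content and order (what the md5sum comparison tests)."""
    fasta_names = [l[1:].split()[0] for l in fasta_lines if l.startswith(">")]
    jgi_names = [l.split("\t")[0] for l in jgi_lines[1:]]  # skip header
    return fasta_names == jgi_names

fasta = [">g1", "ACGT", ">g2", "TTAA"]
jgi = ["contigName\tcontigLen\ttotalAvgDepth", "g1\t4\t1.0", "g2\t4\t2.0"]
print(names_match(fasta, jgi))  # True
```

Comparing the lists directly (rather than hashes) also tells you *which* names differ when the check fails, which is handy for debugging this error.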
I'm running the same command as previously:

And I'm getting the same error message as previously:
Hi @HaraldBrolin and @jakobnissen, chipping in on this.
Best, Simon
Hi @simonrasmu I tried changing the settings in the Snakemake file, but I still end up with the same error message, so I think it's time for me to try another approach. I wanted to avoid re-mapping my samples, but I think I have to do it anyway. Thanks for your support! Best,
Complete side note, but I ended up running this on an ocean metagenome using the following hyperparameter grid:
I also used

Not sure if this will help anyone in the future.
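A sweep like that can be scripted by generating one command per grid point. A minimal sketch in which the flag names and values are illustrative assumptions, not the grid actually used above:

```python
from itertools import product

# Illustrative grid: latent dimension and --minfasta cutoff.
# The -l flag and these values are assumptions for the sketch.
latent_sizes = [16, 32, 64]
min_fasta = [100000, 200000]

commands = [
    f"vamb --outdir out_l{l}_m{m} --fasta contigs.fna "
    f"--jgi depths.tsv -l {l} --minfasta {m}"
    for l, m in product(latent_sizes, min_fasta)
]
for cmd in commands:
    print(cmd)
```

Each command writes to its own `--outdir`, so the runs can be submitted independently (e.g. one qsub job per command, as in the earlier snippet) and compared afterwards.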
Sounds great - it is very nice that the methods are complementary and can be used to generate additional MAGs. |
Here's what the output of `jgi_summarize_bam_contig_depths` looks like:

It's the right number of rows too (n-1 for the headers):

```shell
(vamb_env) -bash-4.1$ grep -c "^>" mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta
25728
(vamb_env) -bash-4.1$ wc -l coverage_output/coverage_metabat2.tsv
25729 coverage_output/coverage_metabat2.tsv
```
Here's the version: