ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65

jolespin · 2021-03-29T00:56:47Z

(vamb_env) -bash-4.1$ vamb --fasta mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta --jgi coverage_output/coverage_metabat2.tsv  --outdir vamb_output
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 528, in main
    logfile=logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 247, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 121, in calc_rpkm
    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Here's what the output of jgi_summarize_bam_contig_depths looks like:

(vamb_env) -bash-4.1$ head coverage_output/coverage_metabat2.tsv
contigName	contigLen	totalAvgDepth	sorted.bam	sorted.bam-var
NODE_1_length_581408_cov_10.671907	581408	17.0497	17.0497	24.4313
NODE_2_length_212490_cov_11.140151	212490	17.7493	17.7493	26.3056
NODE_3_length_56611_cov_10.039571	56611	16.0747	16.0747	24.7309
NODE_4_length_52215_cov_10.245380	52215	16.4059	16.4059	20.9325
NODE_5_length_49788_cov_11.464963	49788	18.376	18.376	28.3959
NODE_6_length_44487_cov_9.390124	44487	15.069	15.069	20.5564
NODE_7_length_41442_cov_10.399425	41442	16.6383	16.6384	22.4833
NODE_8_length_37801_cov_9.536534	37801	15.3226	15.3226	25.3435
NODE_9_length_28654_cov_10.767824	28654	17.234	17.234	22.4427

It's the right number of rows too (n-1 for the headers)

(vamb_env) -bash-4.1$ grep -c "^>" mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta
25728
(vamb_env) -bash-4.1$ wc -l coverage_output/coverage_metabat2.tsv
25729 coverage_output/coverage_metabat2.tsv

Here's the version:

(mage_env) -bash-4.1$ conda list | grep "vamb"
vamb                      3.0.2            py36hc5360cc_1    bioconda

The text was updated successfully, but these errors were encountered:

simonrasmu · 2021-03-29T14:24:15Z

Hi Jolespin,

This is strange, what happens if you use use -m to set minimum contig length to 2000? Can you also paste the log-file?

Second, you are running on a single sample? If you have more samples you should assemble them one-by-one and then run Vamb all samples simultaneously (multi-split mode).

Best,

Simon

jolespin · 2021-03-29T18:00:15Z

After looking into the documentation a bit more I'm thinking this tool might not be the best for my current purpose since these are subsets of reads that I've assembled and there are far fewer contigs than 50k (~20k w/ about 500 long contigs).

However, the fact that you can assemble genomes separately and bin them w/ multi-split mode is absolutely amazing and will DEFINITELY be using this in the future for larger projects. Is there a tl;dr for the multi-split mode? For example, does it make consensus bins from all the samples and, if so, how does it reduce the redundancy?

Thanks again and I'm looking forward to integrating your package into the workflows at my institute.

jakobnissen · 2021-03-30T08:45:04Z

Thanks for using Vamb.

Multi-split is really dirt simple. After assembling individual samples, they are binned together. We then simply split each bin by sample - literally we just take all the contigs in bin 1 from sample 1 and put it in bin_1_1, contigs from bin 1 in sample 2 in bin_1_2, etc. So there is no reduction of redundancy, you get the same genomes duplicated if they are present in multiple samples.

simonrasmu · 2021-03-30T11:45:11Z

Ok sounds good!

I think Vamb could work with fewer samples and contigs (even though we only tested with 50,000 contigs). If you are interested we could have a go at your data.

Best,

Simon

jolespin · 2021-03-30T16:32:03Z

Thanks for using Vamb.

Multi-split is really dirt simple. After assembling individual samples, they are binned together. We then simply split each bin by sample - literally we just take all the contigs in bin 1 from sample 1 and put it in bin_1_1, contigs from bin 1 in sample 2 in bin_1_2, etc. So there is no reduction of redundancy, you get the same genomes duplicated if they are present in multiple samples.

Ah that makes even more sense now that I think about it. For a bit I was doing coassemblies of all my samples but what I realized was that the genome bins that are pulled out were not necessarily biologically true and more a representative of the population. The fact that the bins are maintained through out and binned together is a good idea because you can 1) have something to cross reference between samples although the actual bins are different; 2) have bins that are more true to the real biological sequence; and 3) if the user desired to create a coassembly (let's say they want to create an abundance table of contigs instead of otus), they could simply map to the multisample bin and then reassemble.

Are the intermediate bin assignments stored in multi sample mode?

jakobnissen · 2021-03-30T17:59:11Z

Sort of. It's not stored, but the final bin name is named e.g. sample1_1 - depending on the names of your contigs - e.g, given the name of an output bin, you can always get the sample and the original bin.

jolespin · 2021-04-22T19:44:32Z

I reran this on another dataset and it worked fine:

COV=../Assembly/883.abundance.metabat2.tsv
FASTA=../Assembly/883.fasta
OUT=883/vamb_output
qsub -cwd -P 0594 -N job_883_vamb -j y -o job_883_vamb.o "source activate binning_env && vamb --outdir $OUT --fasta $FASTA --jgi $COV  --minfasta 200000"

Not sure if this will help anyone but I lost access to my raw reads and bam files b/c I'm ressurecting an old project. To run this, I made my own JGI formatted file from the metaSPAdes coverage info:

(binning_env) -bash-4.1$ head ../Assembly/883.fasta
>NODE_1_length_850338_cov_9.99141_ID_1
TCTTGTTATTTTTCTTTACAACAGATGTGCTTATTTCACCCATCATTAGGGTAATTAACT
CCTTATCCTTTACAGATTCTGTTTTAAGGTCTTCAGCAACCATTTGACCATTTTTCATAA
CAAAGATCGTTTCAGTAAATTCTTTTATTTCTTTTAATTTGTGAGTGATAAATACGATTG
TTTTTTCATCTTCTTTTGAAAGAGTCTCAAGTGAATTTTTTAGCTGGTTTGTTTCTTGAG
GTGTTAAAACAGCAGTAGGCTCATCAAGCAACATAATGCTGTTGTTATTGAATATTAATT
TCATGATTTCTATTTTTTGTTTTTCTCCTACTGAAAGATCAGAGATAATAGAGTCATGAT
TTACTGATAAAGAAAAAATTTCAGAATATTTTTTATATGCATCTTCAAAATCATTAATGT
AGATATTTGCCTTTGATAGGGTCAAGTTATCTTTAATAGTGAAATTTTCAATTAATCGAA
ATTCCTGATGAACCATACCAATTCCAAGTTTTGCTGCTAATTCAGGATTAAGCTTTTTTT
(binning_env) -bash-4.1$ head ../Assembly/883.abundance.metabat2.tsv
contigName	contigLen	totalAvgDepth	metagenomics	metagenomics-var
NODE_1_length_850338_cov_9.99141_ID_1	850338	9.99141	9.99141	0
NODE_2_length_696952_cov_6.08719_ID_3	696952	6.08719	6.08719	0
NODE_3_length_645813_cov_1.38757_ID_5	645813	1.38757	1.38757	0
NODE_4_length_618535_cov_4.37412_ID_7	618535	4.37412	4.37412	0
NODE_5_length_518114_cov_4.40507_ID_9	518114	4.40507	4.40507	0
NODE_6_length_481955_cov_1.4344_ID_11	481955	1.4344	1.4344	0
NODE_7_length_435255_cov_2.77715_ID_13	435255	2.77715	2.77715	0
NODE_8_length_428820_cov_3.7035_ID_15	428820	3.7035	3.7035	0
NODE_9_length_425291_cov_2.13358_ID_17	425291	2.13358	2.13358	0

Closing this issue.

HaraldBrolin · 2021-05-14T07:07:51Z

Hi, I found your VAMB algorithm very interesting and I wanted to evaluate the algorithm on a current project of mine.
But I'm running into the same error as jolespin, in his original post. The failure occurs roughly ~15 mins into the run.

    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

I'm evaluating previous Canopy results vs Vamb and I want to use the same input data for the algorithm to minimize differences, thus I've created my own JGI formatted file from the previous analysis. I'm running on a small subset of the data, only 10 samples, to evaluate resource allocation for the larger run of ~1,200 samples. I'm also using a filtered version of my JGI-file where I have removed low abundant genes. The totalAvgDepth is identical to the first sample in my JGI-file.

The command that I'm running using the snakemake script:

vamb --outdir vamb --fasta contigs.flt.fna.gz --jgi jgi_matrix/jgi.abundance.dat -o C -m 2000 --minfasta 500000 --outdir vamb 2>log/vamb/vamb.log

zgrep "^>" contigs.flt.fna.gz | wc -l
15186403

wc -l jgi_matrix/jgi.abundance.dat
14143390

This is the current VAMB version that I'm using:
vamb 3.0.2 py37h73a75cf_1

his is how I've formatted the JGI input:

head jgi_matrix/jgi.abundance.dat 
contigName	contigLen	totalAvgDepth	s_4026	s_4027	s_4023	R0921	R0920	R0923	R0922	R0925	R0924	R0927
R0592_scaffold6566_3_gene27949:length_1518:complete	1518	21.26	21.26	290.98	177.74	32.04	26.76	72.38	6.18	65.84	131.28	13.58
R0592_scaffold6566_3_gene27957:length_489:complete	489	2.84	2.84	66.5	38.76	12.46	9.96	17.86	1.68	12.06	28.26	0.5
R0592_scaffold6566_3_gene27958:length_1560:complete	1560	14.74	14.74	190.18	73.28	41.24	20.78	36.44	7.4	37.98	81.96	9.24
R0592_scaffold6566_3_gene27959:length_609:complete	609	8.1	8.1	103.7	45.44	7.62	5.34	22.26	2.34	15.04	29.36	1.0

Here is what the --fasta input contigs.flt.fna.gz looks like:

zless contigs.flt.fna.gz | head -n 50
>R0592_scaffold6566_3_gene27949:length_1518:complete
ATGAAAAAATCGAAGGAAAAAATAAAAAAGATCCACAGTCTCCGCCATCAGATCACAGCA
TACTTTATCGGTCTGCTGTTTTTATCCATTCTGACTATTACCGTGATCAACGGAGCCTTT
CTTGAAAAATATTATGTGACAAAAAAGATCGATGTACTTCTGAATCTCCGAACAACTTTG
GAAAATACAGATATTGACAAGATGATGAACAGCAACAGTTCGAAAGGTATGGAGAGTATT
CCCGATGAGATCCAGAGAGCCTGCTCCAGAAACAATCTTTCCTGGGTTATCATTGACAGC
AGCAATACCAGCTGGCTTTCCTGGGGCGAAAATGAAAAAATGCTTCAGAGCAAGCTGTTC
GGATATGTCTATGATCTGGATGAGGACGGAGATAAGAGCCGGACCTTGAAGCAGGGTGAC
AATTATACCATCCAGCAGTCGCATGACCGTTTTGCGGGAATGGATTATGTGGAATGCTGG
GGACAGCTGGATGAGGAACATTATTTTATCATAAGGACACCCCTTGAGAGCATCCGGGAG
AGCGCTGATATTTCCAATAAGTTTTACTTTGCGGTAGGTCTGATGATCATAATCTTAAGT
GGTCTGGTCATTATGGTCGTGACCAGGAGGATCACCCGTCCCATCAGTGAGCTTACGGAA
CTTTCCAGAAAAATGTCAGATCTGGATTTTGAGGCAAAATACGAGAGCAAGGTCGGCAAT
GAGATCGATGTTCTGGGAGATAATTTCAATCGGATGTCCTCACAGCTGGAGACTACCATC
AGTGAACTGAAATCTGCAAATAATGAACTGCAGCGTGACATTGAGGACAAGATCAAGATC
GACAAGATGCGAAAGGAATTTCTGGATAACGTGTCCCATGAACTGAAAACTCCCATAGCC
CTGATCCAGGGCTATGCAGAAGGCTTAAAAGAAAATATTTCTGATGATCCTGAGAGCAGA
GAGTTCTATTGCGATGTCATTATGGATGAAGCCTCCAAGATGAATAAACTGGTAAAAAAT
CTTCTCACACTGAATCAGCTGGAATCCGGCAAAGATGCAGTTGTAATGGAGCGGTTTGAT
ATCGTATCCCTGATCCGCGGCGTGCTTCAGACCATGAATATCATGATCGGGCAGAAGAAT
GCAAAAGTGATCTTTGAGGCGGAGAAGCCGGTTTATGTATGGGCAGATGAGTTTAAGATC
GAAGAGGTTGTGACAAACTATGTAAGCAATGCCCTGAACCATCTGGAGGGAGAGAAAGAA
ATCGAGATCAAGCTGCAGGATGAGGGTAACCGTGTGAAGATCAGTGTGTTTAACACAGGG
ACTCCGATCCCTGAGGCAGATGTACCGAATCTCTGGAATAAATTTTATAAAGTTGATAAA
GCGAGAACCAGGGAATACGGCGGAAGCGGTATCGGTTTATCCATTGTAAAGGCCATCATG
GAAAGCATGAACCAGGAATATGGCGTACAGAACTTTGATAATGGCGTGGAATTCTGGTTT
ACACTGGACCGGAAATAA
>R0592_scaffold6566_3_gene27957:length_489:complete
ATGCTGGATCTGGAAGAACGAAACCGTGCAAGAAAGAAAGCCATGCGGCTTCTGGAACAT
ATGGACCGCACAGAGAAAGGCCTGACGGATAAACTCCGTCAGGCTGAATTTTCACCGGAG
GCAGTAGAAGATGCCATTGCTTATGTGAAGTCCTATGGGTATATCAACGATGCCCGTTAT
GCCAGAACCTATATCTCCTTCCGTATGGAATCCAGAAGTCGTCAGAAGATCCTCTCGGAG
CTTCAGTTAAAGGGTGTGGACCGTCAGACTGCCCTGGATGCCTGGAATGAGATGGAAGAG
CTTATGGAACCGGACGAGCGGGAGGTATTACGCAGGACGATCGAAAAGAAATATGCGCCG
GACACAGAGCTTGATGAAGGACAGATGCGAAGGCTGTATGGATATCTGGCCAGACGAGCA
TTCCGGTATGAGGATATTTCCCATACCCTGGAAGAAATGAACATCCGGGTGAACCGTGAT
ATGGAATAA

This is the log/vamb/vamb.log:

Traceback (most recent call last):
  File "path/bin/vamb/vamb_16_04_21/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "path/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 528, in main
    logfile=logfile)
  File "path/b2014011/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 247, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "path/b2014011/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 121, in calc_rpkm
    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Best,
Harald

jakobnissen · 2021-05-14T08:50:53Z

Dear @HaraldBrolin

The error comes because each contig needs both a TNF (which is obtained from the FASTA file), and an RPKM (which is obtained from the JGI input file). To fix the problem, you need to remove the sequences in the FASTA file for which you don't have entries in the JGI depths file.

The JGI file does not seem to be correctly formatted, either. It should look like this.

HaraldBrolin · 2021-05-14T09:13:15Z

Thanks for the quick reply @jakobnissen. I'll try to fix the issues and write an update.

HaraldBrolin · 2021-05-18T08:25:35Z

Hi @jakobnissen,

I've matched the genes/contigs in both the jgi.abundance.dat and the FASTA file (contigs.flt.fna.gz).

md5sum of genes/contig-names after removing the initial ">".

zgrep '^>' contigs.flt.fna.gz |  awk '{print substr($1,2); }'  | md5sum
29180c62cc3bea4879a62634c244b236  -

md5sum of contigNames after dropping the header.

tail -n +2 jgi.abundance.dat | awk {'print $1'} | md5sum
29180c62cc3bea4879a62634c244b236  -

Regarding the format of the jgi.abundance.dat, the only difference I can detect between my jgi-file and the file from this link: here, is the contig length and the variance-column. I can't do much about the lengths, but if I understood correctly Vamb can run on genes. I've added the variance column and as a test I've only kept one sampe.

contigName	contigLen	totalAvgDepth	sorted.bam	sorted.bam-var
R0592_scaffold6566_3_gene27949:length_1518:complete	1518	37	37	0
R0592_scaffold6566_3_gene27957:length_489:complete	489	5	5	0
R0592_scaffold6566_3_gene27958:length_1560:complete	1560	25	25	0

I'm running the same command as previously:
vamb --outdir vamb --fasta contigs.flt.fna.gz --jgi jgi_matrix/jgi.abundance.dat -o C -m 2000 --minfasta 500000 --outdir vamb 2>log/vamb/vamb.log

And I'm getting the same error message as previously:
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

simonrasmu · 2021-05-18T10:46:41Z

Hi @HaraldBrolin and @jakobnissen

Chipping in on this.

It is perfectly fine to run VAMB on genes - we did this in the preprint but replaced this with running on MAGs for the published paper. However, when you use the -o C flag you make it split the genes (or contigs for that matter) as if it was a combined assembly (which I assume it is not). E.g. -o C assumes that the contignames are as "SampleIDCcontigID", ie. it splits Samples and contigs (using the character "C") for when running multi-split mode. If your contig/gene names are not set up like this it will not work.
You should run without -m 2000 as this removes contigs/genes below 2000bp (most of your genes will be).

Best,

Simon

HaraldBrolin · 2021-05-18T13:42:17Z

Hi @simonrasmu

I tried changing the settings in the Snakemake but I still end up with the same error message, I think it's time for me to try another approach. I wanted to avoid re-mapping my samples but I think I have to do it anyways.

Thanks for your support!

Best,
Harald

jolespin · 2021-05-24T18:52:15Z

Complete side note, but I ended up running this on an ocean metagenome using the following hyper parameter grid:

For -N I used 128 512 1024
For -L I used 24 32 48

I also used maxbin2 and metabat2. I combined all of these results as input into DAS_Tool and got great results. vamb was able to recover bins that weren't found via maxbin2 and metabat2.

Not sure if this will help anyone in the future.

simonrasmu · 2021-05-25T07:32:19Z

Sounds great - it is very nice that the methods are complementary and can be used to generate additional MAGs.

jolespin closed this as completed Apr 22, 2021

cifuj mentioned this issue Mar 15, 2022

aviary recover with short reads rhysnewell/aviary#36

Closed

AroArz mentioned this issue Aug 31, 2023

Error in rule run_vamb metagenome-atlas/atlas#693

Closed

3 tasks

shaman-narayanasamy mentioned this issue Aug 7, 2024

ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #352

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65

ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65

jolespin commented Mar 29, 2021 •

edited

Loading

simonrasmu commented Mar 29, 2021

jolespin commented Mar 29, 2021

jakobnissen commented Mar 30, 2021

simonrasmu commented Mar 30, 2021

jolespin commented Mar 30, 2021

jakobnissen commented Mar 30, 2021

jolespin commented Apr 22, 2021

HaraldBrolin commented May 14, 2021 •

edited

Loading

jakobnissen commented May 14, 2021

HaraldBrolin commented May 14, 2021

HaraldBrolin commented May 18, 2021

simonrasmu commented May 18, 2021

HaraldBrolin commented May 18, 2021

jolespin commented May 24, 2021 •

edited

Loading

simonrasmu commented May 25, 2021

ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65

ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65

Comments

jolespin commented Mar 29, 2021 • edited Loading

simonrasmu commented Mar 29, 2021

jolespin commented Mar 29, 2021

jakobnissen commented Mar 30, 2021

simonrasmu commented Mar 30, 2021

jolespin commented Mar 30, 2021

jakobnissen commented Mar 30, 2021

jolespin commented Apr 22, 2021

HaraldBrolin commented May 14, 2021 • edited Loading

jakobnissen commented May 14, 2021

HaraldBrolin commented May 14, 2021

HaraldBrolin commented May 18, 2021

simonrasmu commented May 18, 2021

HaraldBrolin commented May 18, 2021

jolespin commented May 24, 2021 • edited Loading

simonrasmu commented May 25, 2021

jolespin commented Mar 29, 2021 •

edited

Loading

HaraldBrolin commented May 14, 2021 •

edited

Loading

jolespin commented May 24, 2021 •

edited

Loading