Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65

Closed
jolespin opened this issue Mar 29, 2021 · 15 comments

Comments

@jolespin
Copy link

jolespin commented Mar 29, 2021

(vamb_env) -bash-4.1$ vamb --fasta mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta --jgi coverage_output/coverage_metabat2.tsv  --outdir vamb_output
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 528, in main
    logfile=logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 247, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 121, in calc_rpkm
    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Here's what the output of jgi_summarize_bam_contig_depths looks like:

(vamb_env) -bash-4.1$ head coverage_output/coverage_metabat2.tsv
contigName	contigLen	totalAvgDepth	sorted.bam	sorted.bam-var
NODE_1_length_581408_cov_10.671907	581408	17.0497	17.0497	24.4313
NODE_2_length_212490_cov_11.140151	212490	17.7493	17.7493	26.3056
NODE_3_length_56611_cov_10.039571	56611	16.0747	16.0747	24.7309
NODE_4_length_52215_cov_10.245380	52215	16.4059	16.4059	20.9325
NODE_5_length_49788_cov_11.464963	49788	18.376	18.376	28.3959
NODE_6_length_44487_cov_9.390124	44487	15.069	15.069	20.5564
NODE_7_length_41442_cov_10.399425	41442	16.6383	16.6384	22.4833
NODE_8_length_37801_cov_9.536534	37801	15.3226	15.3226	25.3435
NODE_9_length_28654_cov_10.767824	28654	17.234	17.234	22.4427

It's the right number of rows too (n-1 for the headers)

(vamb_env) -bash-4.1$ grep -c "^>" mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta
25728
(vamb_env) -bash-4.1$ wc -l coverage_output/coverage_metabat2.tsv
25729 coverage_output/coverage_metabat2.tsv

Here's the version:

(mage_env) -bash-4.1$ conda list | grep "vamb"
vamb                      3.0.2            py36hc5360cc_1    bioconda
@simonrasmu
Copy link
Collaborator

Hi Jolespin,

This is strange, what happens if you use use -m to set minimum contig length to 2000? Can you also paste the log-file?

Second, you are running on a single sample? If you have more samples you should assemble them one-by-one and then run Vamb all samples simultaneously (multi-split mode).

Best,

Simon

@jolespin
Copy link
Author

After looking into the documentation a bit more I'm thinking this tool might not be the best for my current purpose since these are subsets of reads that I've assembled and there are far fewer contigs than 50k (~20k w/ about 500 long contigs).

However, the fact that you can assemble genomes separately and bin them w/ multi-split mode is absolutely amazing and will DEFINITELY be using this in the future for larger projects. Is there a tl;dr for the multi-split mode? For example, does it make consensus bins from all the samples and, if so, how does it reduce the redundancy?

Thanks again and I'm looking forward to integrating your package into the workflows at my institute.

@jakobnissen
Copy link
Member

Thanks for using Vamb.

Multi-split is really dirt simple. After assembling individual samples, they are binned together. We then simply split each bin by sample - literally we just take all the contigs in bin 1 from sample 1 and put it in bin_1_1, contigs from bin 1 in sample 2 in bin_1_2, etc. So there is no reduction of redundancy, you get the same genomes duplicated if they are present in multiple samples.

@simonrasmu
Copy link
Collaborator

Ok sounds good!

I think Vamb could work with fewer samples and contigs (even though we only tested with 50,000 contigs). If you are interested we could have a go at your data.

Best,

Simon

@jolespin
Copy link
Author

Thanks for using Vamb.

Multi-split is really dirt simple. After assembling individual samples, they are binned together. We then simply split each bin by sample - literally we just take all the contigs in bin 1 from sample 1 and put it in bin_1_1, contigs from bin 1 in sample 2 in bin_1_2, etc. So there is no reduction of redundancy, you get the same genomes duplicated if they are present in multiple samples.

Ah that makes even more sense now that I think about it. For a bit I was doing coassemblies of all my samples but what I realized was that the genome bins that are pulled out were not necessarily biologically true and more a representative of the population. The fact that the bins are maintained through out and binned together is a good idea because you can 1) have something to cross reference between samples although the actual bins are different; 2) have bins that are more true to the real biological sequence; and 3) if the user desired to create a coassembly (let's say they want to create an abundance table of contigs instead of otus), they could simply map to the multisample bin and then reassemble.

Are the intermediate bin assignments stored in multi sample mode?

@jakobnissen
Copy link
Member

Sort of. It's not stored, but the final bin name is named e.g. sample1_1 - depending on the names of your contigs - e.g, given the name of an output bin, you can always get the sample and the original bin.

@jolespin
Copy link
Author

I reran this on another dataset and it worked fine:

COV=../Assembly/883.abundance.metabat2.tsv
FASTA=../Assembly/883.fasta
OUT=883/vamb_output
qsub -cwd -P 0594 -N job_883_vamb -j y -o job_883_vamb.o "source activate binning_env && vamb --outdir $OUT --fasta $FASTA --jgi $COV  --minfasta 200000"

Not sure if this will help anyone but I lost access to my raw reads and bam files b/c I'm ressurecting an old project. To run this, I made my own JGI formatted file from the metaSPAdes coverage info:

(binning_env) -bash-4.1$ head ../Assembly/883.fasta
>NODE_1_length_850338_cov_9.99141_ID_1
TCTTGTTATTTTTCTTTACAACAGATGTGCTTATTTCACCCATCATTAGGGTAATTAACT
CCTTATCCTTTACAGATTCTGTTTTAAGGTCTTCAGCAACCATTTGACCATTTTTCATAA
CAAAGATCGTTTCAGTAAATTCTTTTATTTCTTTTAATTTGTGAGTGATAAATACGATTG
TTTTTTCATCTTCTTTTGAAAGAGTCTCAAGTGAATTTTTTAGCTGGTTTGTTTCTTGAG
GTGTTAAAACAGCAGTAGGCTCATCAAGCAACATAATGCTGTTGTTATTGAATATTAATT
TCATGATTTCTATTTTTTGTTTTTCTCCTACTGAAAGATCAGAGATAATAGAGTCATGAT
TTACTGATAAAGAAAAAATTTCAGAATATTTTTTATATGCATCTTCAAAATCATTAATGT
AGATATTTGCCTTTGATAGGGTCAAGTTATCTTTAATAGTGAAATTTTCAATTAATCGAA
ATTCCTGATGAACCATACCAATTCCAAGTTTTGCTGCTAATTCAGGATTAAGCTTTTTTT
(binning_env) -bash-4.1$ head ../Assembly/883.abundance.metabat2.tsv
contigName	contigLen	totalAvgDepth	metagenomics	metagenomics-var
NODE_1_length_850338_cov_9.99141_ID_1	850338	9.99141	9.99141	0
NODE_2_length_696952_cov_6.08719_ID_3	696952	6.08719	6.08719	0
NODE_3_length_645813_cov_1.38757_ID_5	645813	1.38757	1.38757	0
NODE_4_length_618535_cov_4.37412_ID_7	618535	4.37412	4.37412	0
NODE_5_length_518114_cov_4.40507_ID_9	518114	4.40507	4.40507	0
NODE_6_length_481955_cov_1.4344_ID_11	481955	1.4344	1.4344	0
NODE_7_length_435255_cov_2.77715_ID_13	435255	2.77715	2.77715	0
NODE_8_length_428820_cov_3.7035_ID_15	428820	3.7035	3.7035	0
NODE_9_length_425291_cov_2.13358_ID_17	425291	2.13358	2.13358	0

Closing this issue.

@HaraldBrolin
Copy link

HaraldBrolin commented May 14, 2021

Hi, I found your VAMB algorithm very interesting and I wanted to evaluate the algorithm on a current project of mine.
But I'm running into the same error as jolespin, in his original post. The failure occurs roughly ~15 mins into the run.

    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

I'm evaluating previous Canopy results vs Vamb and I want to use the same input data for the algorithm to minimize differences, thus I've created my own JGI formatted file from the previous analysis. I'm running on a small subset of the data, only 10 samples, to evaluate resource allocation for the larger run of ~1,200 samples. I'm also using a filtered version of my JGI-file where I have removed low abundant genes. The totalAvgDepth is identical to the first sample in my JGI-file.

The command that I'm running using the snakemake script:

vamb --outdir vamb --fasta contigs.flt.fna.gz --jgi jgi_matrix/jgi.abundance.dat -o C -m 2000 --minfasta 500000 --outdir vamb 2>log/vamb/vamb.log

zgrep "^>" contigs.flt.fna.gz | wc -l
15186403

wc -l jgi_matrix/jgi.abundance.dat
14143390 

This is the current VAMB version that I'm using:
vamb 3.0.2 py37h73a75cf_1

his is how I've formatted the JGI input:

head jgi_matrix/jgi.abundance.dat 
contigName	contigLen	totalAvgDepth	s_4026	s_4027	s_4023	R0921	R0920	R0923	R0922	R0925	R0924	R0927
R0592_scaffold6566_3_gene27949:length_1518:complete	1518	21.26	21.26	290.98	177.74	32.04	26.76	72.38	6.18	65.84	131.28	13.58
R0592_scaffold6566_3_gene27957:length_489:complete	489	2.84	2.84	66.5	38.76	12.46	9.96	17.86	1.68	12.06	28.26	0.5
R0592_scaffold6566_3_gene27958:length_1560:complete	1560	14.74	14.74	190.18	73.28	41.24	20.78	36.44	7.4	37.98	81.96	9.24
R0592_scaffold6566_3_gene27959:length_609:complete	609	8.1	8.1	103.7	45.44	7.62	5.34	22.26	2.34	15.04	29.36	1.0

Here is what the --fasta input contigs.flt.fna.gz looks like:

zless contigs.flt.fna.gz | head -n 50
>R0592_scaffold6566_3_gene27949:length_1518:complete
ATGAAAAAATCGAAGGAAAAAATAAAAAAGATCCACAGTCTCCGCCATCAGATCACAGCA
TACTTTATCGGTCTGCTGTTTTTATCCATTCTGACTATTACCGTGATCAACGGAGCCTTT
CTTGAAAAATATTATGTGACAAAAAAGATCGATGTACTTCTGAATCTCCGAACAACTTTG
GAAAATACAGATATTGACAAGATGATGAACAGCAACAGTTCGAAAGGTATGGAGAGTATT
CCCGATGAGATCCAGAGAGCCTGCTCCAGAAACAATCTTTCCTGGGTTATCATTGACAGC
AGCAATACCAGCTGGCTTTCCTGGGGCGAAAATGAAAAAATGCTTCAGAGCAAGCTGTTC
GGATATGTCTATGATCTGGATGAGGACGGAGATAAGAGCCGGACCTTGAAGCAGGGTGAC
AATTATACCATCCAGCAGTCGCATGACCGTTTTGCGGGAATGGATTATGTGGAATGCTGG
GGACAGCTGGATGAGGAACATTATTTTATCATAAGGACACCCCTTGAGAGCATCCGGGAG
AGCGCTGATATTTCCAATAAGTTTTACTTTGCGGTAGGTCTGATGATCATAATCTTAAGT
GGTCTGGTCATTATGGTCGTGACCAGGAGGATCACCCGTCCCATCAGTGAGCTTACGGAA
CTTTCCAGAAAAATGTCAGATCTGGATTTTGAGGCAAAATACGAGAGCAAGGTCGGCAAT
GAGATCGATGTTCTGGGAGATAATTTCAATCGGATGTCCTCACAGCTGGAGACTACCATC
AGTGAACTGAAATCTGCAAATAATGAACTGCAGCGTGACATTGAGGACAAGATCAAGATC
GACAAGATGCGAAAGGAATTTCTGGATAACGTGTCCCATGAACTGAAAACTCCCATAGCC
CTGATCCAGGGCTATGCAGAAGGCTTAAAAGAAAATATTTCTGATGATCCTGAGAGCAGA
GAGTTCTATTGCGATGTCATTATGGATGAAGCCTCCAAGATGAATAAACTGGTAAAAAAT
CTTCTCACACTGAATCAGCTGGAATCCGGCAAAGATGCAGTTGTAATGGAGCGGTTTGAT
ATCGTATCCCTGATCCGCGGCGTGCTTCAGACCATGAATATCATGATCGGGCAGAAGAAT
GCAAAAGTGATCTTTGAGGCGGAGAAGCCGGTTTATGTATGGGCAGATGAGTTTAAGATC
GAAGAGGTTGTGACAAACTATGTAAGCAATGCCCTGAACCATCTGGAGGGAGAGAAAGAA
ATCGAGATCAAGCTGCAGGATGAGGGTAACCGTGTGAAGATCAGTGTGTTTAACACAGGG
ACTCCGATCCCTGAGGCAGATGTACCGAATCTCTGGAATAAATTTTATAAAGTTGATAAA
GCGAGAACCAGGGAATACGGCGGAAGCGGTATCGGTTTATCCATTGTAAAGGCCATCATG
GAAAGCATGAACCAGGAATATGGCGTACAGAACTTTGATAATGGCGTGGAATTCTGGTTT
ACACTGGACCGGAAATAA
>R0592_scaffold6566_3_gene27957:length_489:complete
ATGCTGGATCTGGAAGAACGAAACCGTGCAAGAAAGAAAGCCATGCGGCTTCTGGAACAT
ATGGACCGCACAGAGAAAGGCCTGACGGATAAACTCCGTCAGGCTGAATTTTCACCGGAG
GCAGTAGAAGATGCCATTGCTTATGTGAAGTCCTATGGGTATATCAACGATGCCCGTTAT
GCCAGAACCTATATCTCCTTCCGTATGGAATCCAGAAGTCGTCAGAAGATCCTCTCGGAG
CTTCAGTTAAAGGGTGTGGACCGTCAGACTGCCCTGGATGCCTGGAATGAGATGGAAGAG
CTTATGGAACCGGACGAGCGGGAGGTATTACGCAGGACGATCGAAAAGAAATATGCGCCG
GACACAGAGCTTGATGAAGGACAGATGCGAAGGCTGTATGGATATCTGGCCAGACGAGCA
TTCCGGTATGAGGATATTTCCCATACCCTGGAAGAAATGAACATCCGGGTGAACCGTGAT
ATGGAATAA

This is the log/vamb/vamb.log:

Traceback (most recent call last):
  File "path/bin/vamb/vamb_16_04_21/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "path/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 528, in main
    logfile=logfile)
  File "path/b2014011/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 247, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "path/b2014011/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 121, in calc_rpkm
    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Best,
Harald

@jakobnissen
Copy link
Member

Dear @HaraldBrolin

The error comes because each contig needs both a TNF (which is obtained from the FASTA file), and an RPKM (which is obtained from the JGI input file). To fix the problem, you need to remove the sequences in the FASTA file for which you don't have entries in the JGI depths file.

The JGI file does not seem to be correctly formatted, either. It should look like this.

@HaraldBrolin
Copy link

Thanks for the quick reply @jakobnissen. I'll try to fix the issues and write an update.

@HaraldBrolin
Copy link

Hi @jakobnissen,

I've matched the genes/contigs in both the jgi.abundance.dat and the FASTA file (contigs.flt.fna.gz).

md5sum of genes/contig-names after removing the initial ">".

zgrep '^>' contigs.flt.fna.gz |  awk '{print substr($1,2); }'  | md5sum
29180c62cc3bea4879a62634c244b236  -

md5sum of contigNames after dropping the header.

tail -n +2 jgi.abundance.dat | awk {'print $1'} | md5sum
29180c62cc3bea4879a62634c244b236  -

Regarding the format of the jgi.abundance.dat, the only difference I can detect between my jgi-file and the file from this link: here, is the contig length and the variance-column. I can't do much about the lengths, but if I understood correctly Vamb can run on genes. I've added the variance column and as a test I've only kept one sampe.

contigName	contigLen	totalAvgDepth	sorted.bam	sorted.bam-var
R0592_scaffold6566_3_gene27949:length_1518:complete	1518	37	37	0
R0592_scaffold6566_3_gene27957:length_489:complete	489	5	5	0
R0592_scaffold6566_3_gene27958:length_1560:complete	1560	25	25	0

I'm running the same command as previously:
vamb --outdir vamb --fasta contigs.flt.fna.gz --jgi jgi_matrix/jgi.abundance.dat -o C -m 2000 --minfasta 500000 --outdir vamb 2>log/vamb/vamb.log

And I'm getting the same error message as previously:
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

@simonrasmu
Copy link
Collaborator

Hi @HaraldBrolin and @jakobnissen

Chipping in on this.

  1. It is perfectly fine to run VAMB on genes - we did this in the preprint but replaced this with running on MAGs for the published paper. However, when you use the -o C flag you make it split the genes (or contigs for that matter) as if it was a combined assembly (which I assume it is not). E.g. -o C assumes that the contignames are as "SampleIDCcontigID", ie. it splits Samples and contigs (using the character "C") for when running multi-split mode. If your contig/gene names are not set up like this it will not work.

  2. You should run without -m 2000 as this removes contigs/genes below 2000bp (most of your genes will be).

Best,

Simon

@HaraldBrolin
Copy link

Hi @simonrasmu

I tried changing the settings in the Snakemake but I still end up with the same error message, I think it's time for me to try another approach. I wanted to avoid re-mapping my samples but I think I have to do it anyways.

Thanks for your support!

Best,
Harald

@jolespin
Copy link
Author

jolespin commented May 24, 2021

Complete side note, but I ended up running this on an ocean metagenome using the following hyper parameter grid:

  • For -N I used 128 512 1024
  • For -L I used 24 32 48

I also used maxbin2 and metabat2. I combined all of these results as input into DAS_Tool and got great results. vamb was able to recover bins that weren't found via maxbin2 and metabat2.

Not sure if this will help anyone in the future.

@simonrasmu
Copy link
Collaborator

Sounds great - it is very nice that the methods are complementary and can be used to generate additional MAGs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants