postprocess_variants.py ValueError: ptrue must be between zero and one: nan #849
Hi @karoliinas , from your log I'm unable to tell why this has occurred. In this command:
If you can share the files, that would be ideal. If you can't share the files, we can think about what we can do here to help identify which example caused the issue.
Hi @pichuan, thank you for getting back so quickly! I'm working on patient data, so unfortunately it's not something I can share. About the files you requested, I now see that there is no file called:
That's probably why the above command for postprocess_variants doesn't work, right? I am using 19 threads. I copied the command from
Thank you so much; I'd be very happy if it turns out I merely had a faulty command!
Oh, and since we're here, I mentioned problems with the GPU. It seems that DeepVariant is using CUDA version 11.3.1, but the NVIDIA driver (version 555.42.02) on the server reports CUDA version 12.5.
Should I update the driver, CUDA, or both? Also, CUDA version 12.3 is installed on the server, so I'm a bit confused as to where the 12.5 comes from.
It's a right mess! Sometimes call_variants uses the GPU and at other times it stalls. That's why I'm now running the commands separately; I have a lot of outputs from make_examples. Our IT support say they're happy the GPU works part of the time :) It's adding a lot of extra work, and I'm trying to come up with a solution. Since it's completely unrelated to the topic, perhaps I should create a new issue instead?
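A note on where the 12.5 comes from: nvidia-smi reports the highest CUDA runtime the installed driver supports, not the toolkit installed on disk, so a 555-series driver shows CUDA 12.5 even alongside a 12.3 toolkit. What matters for DeepVariant is the CUDA version inside the container. A minimal sketch to check it, assuming TensorFlow is importable inside the deepvariant:1.6.1-gpu image (which bundles it):

```python
# Run inside the container. tf.sysconfig.get_build_info() returns the
# build-time configuration TensorFlow was compiled with.
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TF built with CUDA :", build.get("cuda_version"))    # ~11.x for DV 1.6
print("TF built with cuDNN:", build.get("cudnn_version"))
print("GPUs visible to TF :", tf.config.list_physical_devices("GPU"))
```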
Hi @karoliinas , regarding the format: I'm out of office now. I'll give you some examples to debug the call_variants output next week!
Hi @pichuan, thanks for clearing this up! When you get the chance, please let me know what to look for in the call_variants output. Also, I'm not sure I understand the format: using zcat I get many very short lines. I see AD, DP and VAF, but I'm not sure how to read variant positions / probabilities. Many thanks!
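Side note on the format: zcat shows mostly binary fragments because call_variants_output.tfrecord.gz is a gzip-compressed TFRecord file of serialized CallVariantsOutput protos, not text. Here is a hedged sketch of how one might dump positions and genotype probabilities; it assumes it runs inside the DeepVariant container, where deepvariant.protos and TensorFlow are available, and uses the path from this issue:

```python
import math

import tensorflow as tf
from deepvariant.protos import deepvariant_pb2  # shipped in the container

path = "/data/variants/sample1.intermediate/call_variants_output.tfrecord.gz"
dataset = tf.data.TFRecordDataset(path, compression_type="GZIP")

for i, record in enumerate(dataset):
    cvo = deepvariant_pb2.CallVariantsOutput.FromString(record.numpy())
    v = cvo.variant
    probs = list(cvo.genotype_probabilities)  # P(hom-ref), P(het), P(hom-alt)
    # Healthy records have three finite probabilities summing to ~1; NaNs here
    # would explain the "ptrue must be between zero and one: nan" crash.
    if i < 5 or any(math.isnan(p) for p in probs):
        print(f"{v.reference_name}:{v.start + 1} "
              f"{v.reference_bases}>{','.join(v.alternate_bases)} probs={probs}")
```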
One thing I noticed is that you have multiple outputs: one generated with
Nice catch, thank you! It does seem that there are a different number of outputs, though to my knowledge I've only used the commands from the full run with 19 CPUs and --dry_run=true. I've cleared the output directories and am running the full command from the beginning. Which NVIDIA driver / CUDA combination do you run DeepVariant with? I'm looking into the GPU problem, and I'm thinking I need to install an older NVIDIA driver and CUDA. Currently the driver is 555.42.02, but looking at https://docs.nvidia.com/deploy/cuda-compatibility/#id1 , it's not compatible with the CUDA 11.3.1 that deepvariant:1.6.1-gpu is using.
Hello @pichuan, I ran the full DeepVariant pipeline after deleting all output directories from the previous run. It seems call_variants outputs only 16 files to the intermediate dir, whereas make_examples outputs 19 (with --num_shards 19). Here's the full command:
Adding the LD_LIBRARY_PATH argument gets rid of the error messages about libnvinfer; however, I still get the CUDA error:
call_variants did use the GPU, though, and ran in about half an hour. Then postprocess_variants halts with the same error (full error log in the first message). I'll try to play around with --num_shards next.
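One quick check for the 16-vs-19 mismatch is to expand the sharded file spec and look for missing shards. A small standalone sketch, standard library only, assuming the usual name-00000-of-000NN naming that DeepVariant's @N specs expand to:

```python
import os
import re

def expand_sharded_spec(spec):
    """Expand 'name@N.ext' into ['name-00000-of-000NN.ext', ...]."""
    m = re.match(r"(.*)@(\d+)(\..*)?$", spec)
    if not m:
        return [spec]  # not a sharded spec
    base, num, ext = m.group(1), int(m.group(2)), m.group(3) or ""
    return [f"{base}-{i:05d}-of-{num:05d}{ext}" for i in range(num)]

# Path taken from this issue's postprocess_variants command.
spec = "/data/variants/sample1.intermediate/gvcf.tfrecord@19.gz"
missing = [p for p in expand_sharded_spec(spec) if not os.path.exists(p)]
print(f"{len(missing)} missing shard(s):", missing)
```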
Hi @karoliinas, given that you're getting weird numerical prediction values from the call_variants output, and that you mentioned your GPU driver version is newer than what we used in DeepVariant 1.6, I strongly suspect your GPU + DeepVariant setup is producing unexpected output. Would it be possible for you to:
Hi @pichuan, thanks for looking into this! And you're right: although changing --num_shards to 16 did result in the same number of files from make_examples and call_variants, the error remains. Which driver / CUDA versions are supported? The server I have is RHEL9, and the oldest available driver begins with 515 and CUDA with 11.7. Would they do? I checked the available drivers from the NVIDIA repo: https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/ Do you have any estimate as to when the new DeepVariant version will be out? It might be easier to wait than to set up a new server (I'm not sure we have RHEL8 available). I'll try to run it with the CPU for now and will let you know how it goes. The funny thing is, I already processed 19 samples with this setup.
Hi again, indeed the CPU version works. Boy am I glad there's no problem with the data! We'll have to wait for the DeepVariant update to get the GPU going. Will it be using the same model, so that samples processed with the new version will be compatible with samples processed with the current one? I'm getting new samples in every week, and we will have ~150 in a couple of months, so I'd like to avoid running them twice. The other option would be to upgrade the VM to one without a GPU and with more CPUs, and continue with the current CPU version.
Hi @karoliinas , for stability and reproducibility, using the CPU version is likely the better way to go. In terms of GPU updates, in our development branch (https://github.com/google/deepvariant/tree/dev) we actually update the GPU version (and Ubuntu version). For example, you can see: https://github.com/google/deepvariant/blob/dev/Dockerfile I'll close this now. Please let us know if you have more questions.
Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6.1/docs/FAQ.md:
Yes
Describe the issue:
I have processed around 30 samples, albeit with some issues with the GPU, possibly due to the NVIDIA driver / CUDA version. However, postprocess_variants has recently started stalling with the same error. Any help troubleshooting this would be greatly appreciated!
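For context on the error itself: postprocess_variants converts the probability that a call is a real variant ("ptrue") into a Phred-scaled QUAL, and NaN probabilities coming out of call_variants trip its range check. A hedged reconstruction of the failing logic, not DeepVariant's exact code:

```python
import math

def ptrue_to_qual(ptrue):
    # The guard behind this issue's title: NaN fails the range test
    # because every comparison with NaN is false.
    if not 0.0 <= ptrue <= 1.0:
        raise ValueError(f"ptrue must be between zero and one: {ptrue}")
    # Phred scale: QUAL = -10 * log10(P(call is wrong)), floored to stay finite.
    return -10.0 * math.log10(max(1.0 - ptrue, 1e-10))

print(ptrue_to_qual(0.999))   # ~30
ptrue_to_qual(float("nan"))   # raises: ptrue must be between zero and one: nan
```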
Setup
NAME=Red Hat Enterprise Linux
VERSION=9.4 (Plow)
Steps to reproduce:
podman run -it --rm --security-opt=label=disable \
  --hooks-dir=/usr/share/containers/oci/hooks.d/ \
  --gpus 1 -v /data:/data --device nvidia.com/gpu=all \
  google/deepvariant:1.6.1-gpu \
  /opt/deepvariant/bin/postprocess_variants \
  --ref "/data/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" \
  --infile "/data/variants/sample1.intermediate/call_variants_output.tfrecord.gz" \
  --outfile "/data/variants/sample1.vcf.gz" \
  --cpus "19" \
  --gvcf_outfile "/data/variants/sample1.g.vcf.gz" \
  --nonvariant_site_tfrecord_path "/data/variants/sample1.intermediate/gvcf.tfrecord@19.gz"
Does the quick start test work on your system? Yes
Please test with https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-quick-start.md.
Is there any way to reproduce the issue by using the quick start?
Any additional context: