NaN Values in Gradients Cause Calculation Abortion Using Mixed Precision with GPU Offload #5015

Open
romanfanta4 opened this issue May 31, 2024 · 2 comments

romanfanta4 commented May 31, 2024

Describe the bug
When running a periodic calculation with 9 twists, 724 electrons, and 44 atoms using the mixed precision version of QMCPACK with GPU offload, the calculation was aborted with the following error:

NaNguard::checkOneParticleGradients error message: TWF::calcRatioGrad at particle 687
  grads[0] = (-nan,0.0418255)
  grads[1] = (-nan,-0.0806002)
  grads[2] = (-nan,0.0412396)
Unexpected exception thrown in threaded section
Fatal Error. Aborting at Unhandled Exception

This issue appears to be related to NaN values in the gradients of the wave function for a specific particle.
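
To make the error message easier to read, here is a minimal Python sketch of the kind of check it implies: each component of a complex per-particle gradient is tested for NaN in its real and imaginary parts. This is not QMCPACK's implementation; the names `find_nan_components` and `check_one_particle_gradient` are hypothetical, and the gradient values are copied from the log above.

```python
import math


def find_nan_components(grad):
    """Return the indices of gradient components whose real or imaginary part is NaN."""
    return [i for i, g in enumerate(grad) if math.isnan(g.real) or math.isnan(g.imag)]


def check_one_particle_gradient(grad, particle_index, where):
    """Raise if any component of the gradient contains a NaN (hypothetical guard)."""
    bad = find_nan_components(grad)
    if bad:
        raise FloatingPointError(
            f"NaN gradient from {where} at particle {particle_index}, components {bad}"
        )


# Values taken from the reported log: the real parts are NaN, the imaginary parts are finite.
reported_grad = [
    complex(math.nan, 0.0418255),
    complex(math.nan, -0.0806002),
    complex(math.nan, 0.0412396),
]

try:
    check_one_particle_gradient(reported_grad, 687, "TWF::calcRatioGrad")
except FloatingPointError as err:
    print(err)
```

In the reported values only the real parts are NaN, which is why the sketch checks the two parts separately rather than the complex number as a whole.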

The same calculation with full precision ran smoothly without any problems.
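
As a generic illustration of how a quantity can be finite in double precision but NaN in single precision, the sketch below rounds two very small numbers to float32, where both underflow to zero, and then forms their ratio under IEEE 754 rules. This is an assumption-laden toy example, not a diagnosis of this calculation; `to_float32` and `ieee_divide` are hypothetical helpers, and the input values are made up.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ieee_divide(num, den):
    """Divide following IEEE 754 rules for a zero denominator instead of raising."""
    if den == 0.0:
        if num == 0.0 or math.isnan(num):
            return math.nan  # 0/0 is NaN under IEEE 754
        return math.copysign(math.inf, num) * math.copysign(1.0, den)
    return num / den


# Made-up magnitudes: representable in float64, below the float32 subnormal range.
derivative = 3e-50
value = 1e-50

ratio_double = ieee_divide(derivative, value)
ratio_single = ieee_divide(to_float32(derivative), to_float32(value))

print("float64 ratio:", ratio_double)
print("float32 ratio:", ratio_single)
```

Whether anything like this happens in the mixed precision code path here is not established by the report; the input and output files would need to be examined to say.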

To Reproduce
Input and output files below:
dmc_2x2_single_prec-test.zip

Expected behavior
The calculation should complete successfully without encountering NaN values in the wave function gradients, resulting in accurate and stable output data.

System:
System name: Perlmutter
Modules loaded:
module use /global/common/software/nersc/n9/llvm/modules
module load craype cray-mpich
module load llvm/17.0.6-gpu
Other systems where this is reproducible: Not tested on other systems.

Additional context
The calculation was performed using the complex version of QMCPACK with NVIDIA GPU and OpenMP offload.
No other context or error messages were in the output files.

Contributor

prckent commented May 31, 2024

Thanks for the report Roman. Is this the first run you have tried or are other runs either working or failing for you? Any issues with other runs? I see the full precision run of this system was fine.

Author

romanfanta4 commented Jun 2, 2024

I first tried a larger system and ended up with the same error as for this smaller system. I did not investigate any further. With full precision, I did not run into any issues, as you wrote.
