NaN Values in Gradients Cause Calculation Abortion Using Mixed Precision with GPU Offload #5015

romanfanta4 · 2024-05-31T00:15:45Z

Describe the bug
When running a periodic calculation with 9 twists, 724 electrons, and 44 atoms using the mixed precision version of QMCPACK with GPU offload, the calculation was aborted with the following error:

NaNguard::checkOneParticleGradients error message: TWF::calcRatioGrad at particle 687
  grads[0] = (-nan,0.0418255)
  grads[1] = (-nan,-0.0806002)
  grads[2] = (-nan,0.0412396)
Unexpected exception thrown in threaded section
Fatal Error. Aborting at Unhandled Exception
This issue appears to be related to NaN values in the gradients of the wave function for a specific particle.
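
For readers unfamiliar with this kind of guard, the check that fired can be illustrated with a minimal sketch. This is not QMCPACK's implementation; the function name `find_nan_components` is hypothetical, and the sample values are copied from the error message above. It only shows that a complex gradient component counts as bad when either its real or its imaginary part is NaN, which matches the output here (NaN real parts, finite imaginary parts).

```python
import math


def find_nan_components(grads):
    """Return indices of complex components whose real or imaginary part is NaN.

    Hypothetical helper for illustration only; not part of QMCPACK.
    """
    return [
        i
        for i, g in enumerate(grads)
        if math.isnan(g.real) or math.isnan(g.imag)
    ]


# Values modeled on the reported error: NaN real part, finite imaginary part.
grads = [
    complex(math.nan, 0.0418255),
    complex(math.nan, -0.0806002),
    complex(math.nan, 0.0412396),
]

bad = find_nan_components(grads)
if bad:
    print(f"NaN detected in gradient components {bad}")
```

Checking both parts separately matters because a comparison such as `g == g` on the whole complex value would also catch NaN, but would not say which part is affected.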

The same calculation with full precision ran smoothly without any problems.

To Reproduce
Input and output files below:
dmc_2x2_single_prec-test.zip

Expected behavior
The calculation should complete successfully without encountering NaN values in the wave function gradients, resulting in accurate and stable output data.

System:
System name: Perlmutter
Modules loaded:
module use /global/common/software/nersc/n9/llvm/modules
module load craype cray-mpich
module load llvm/17.0.6-gpu
Other systems where this is reproducible: Not tested on other systems.

Additional context
The calculation was performed using the complex version of QMCPACK with NVIDIA GPU and OpenMP offload.
No other context or error messages were in the output files.


prckent · 2024-05-31T13:38:38Z

Thanks for the report Roman. ~~Is this the first run you have tried or are other runs either working or failing for you?~~ Any issues with other runs? I see the full precision run of this system was fine.

romanfanta4 · 2024-06-02T02:19:02Z

I tried it first for the larger system and ended up with the same error as for this smaller system. I did not investigate any further. For full precision, I did not run into any issues, as you wrote.
