LAMMPS return nan for potential energy, tot energy, pressure, etc #43
Open
mhsiron opened this issue Sep 15, 2022 · 22 comments

@mhsiron commented Sep 15, 2022

I have a Flare++ generated LAMMPS potential with the following header:

DATE: Thu Sep 15 08:15:18 2022 CONTRIBUTOR: Martin Siron
2
chebyshev
2 15 3 1730730
quadratic
7.00 7.00 7.00 7.00
 -3.7081202824163029e-01 -7.5121626026607249e+00  2.0285553143110349e+00 -2.5931799006379350e+00 -6.7339221940738980e-01
 -4.1086598196837585e+00  1.3867461307437310e+00 -1.3400453595520432e+00 -6.9342900823929199e-01  2.4547893054698768e+00
 -3.2365345306921256e-01  1.0697579523633678e+00 -1.0009252464915903e+00  5.5103596436798528e+00 -2.1893434014683351e+00
  2.1787352297543379e+00 -1.1271605000247291e+00  2.9668920730647503e+00 -2.6111849900449071e+00  1.2615246000666112e+00
 -2.9663719425271040e-01 -9.3284008733906543e-01 -5.8615111890141236e-01 -7.4564621228041972e-02  1.4580132256288607e+00
 -1.4538317320828966e+00  2.7242482436344915e+00 -2.8533072739625709e-01  2.8989992114976531e+00  8.1221376518358834e-01
  4.3954511805509924e+00 -1.3443434289366962e-01  2.6440157329586711e+00  1.3811351812136543e+00  2.5036757080927146e+00
 -1.4827050791523391e+00  6.5440934762730762e-01 -1.6827116757154350e+00 -1.5447686540050540e+00 -3.7877171837065617e+00

I have previously compiled LAMMPS with Flare++ and successfully ran an MD calculation with a potential that has the following header:

DATE: Tue Sep  6 13:37:36 2022 CONTRIBUTOR: Martin Siron
2
chebyshev
2 12 3 720600
quadratic
3.70 3.70 3.70 3.70 
 -6.8142479285372162e-02 -1.7471659356174324e-01  8.9928444575977290e-02  9.5382460755559270e-02 -1.0573593939254869e-01
 -1.2702611633853572e-01  6.1525379261817802e-02  8.4021357499511784e-02 -2.8557872564111516e-02 -1.0541056145460515e-02
 -7.8123686721719710e-03  4.8391611702859487e-02  5.8877126903733215e-02  1.0994480866102428e-01 -7.9615695655338978e-02
 -1.0038966687721338e-02  1.1600338166083181e-01  1.6778484081738967e-01 -1.1450035780444216e-01 -7.5884153313299924e-02

I cannot get LAMMPS to run properly with the new potential -- it does not seem to be able to calculate the energies:

LAMMPS (29 Sep 2021 - Update 3)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
  will use up to 1 GPU(s) per node
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.002 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.003 seconds
Reading potential file TiO2.txt with DATE: Thu
FLARE will use up to 90.00 GB of device memory, controlled by MAXMEM environment variable
Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 9
  ghost atom cutoff = 9
  binsize = 9, bins = 4 4 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair flare/kk, perpetual
      attributes: full, newton on, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 4.539 | 4.539 | 4.539 Mbytes
Step PotEng KinEng TotEng Temp Press Density 
       0         -nan    750.22588         -nan         1000         -nan    3.7529171 
     100            0         -nan         -nan         -nan         -nan    3.7529171 
     200            0         -nan         -nan         -nan         -nan    3.7529171 

I have a system with a V100 and 100GB of memory.

My input file is as follows:

units           metal
boundary        p p p

atom_style      atomic

read_data       data.meam


replicate 3 3 3

newton on
pair_style      flare
pair_coeff      * * flare.txt



timestep 0.005
velocity all create 1000.0 454883 mom yes rot yes dist gaussian

thermo_style custom step pe ke etotal temp press density
thermo 100

fix 2 all nvt temp 1000.00 2400.00 0.1


dump           1 all atom 10000 dump.meam

run             10000000

I am unsure how to troubleshoot.

@YuuuXie (Collaborator) commented Sep 15, 2022

Hi @mhsiron
So you have the same code/executable, but the old potential runs well while the new potential does not? Have you tried running with the CPU code, just to make sure whether it is a problem with Kokkos?
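
For example, running the same input without any Kokkos flags (a minimal sketch, assuming the same lmp executable and input script) would use the plain CPU pair style:

lmp -in in.script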

@mhsiron (Author) commented Sep 15, 2022

Hi @YuuuXie, thanks for your reply. It does appear to be related to Kokkos -- running without newton/Kokkos appears to output computed energies.

What might make the first potential work with Kokkos while the second one does not?

Thanks for your help!

@anjohan (Collaborator) commented Sep 15, 2022

Hi Martin,

This is puzzling. So with the same executable, it works with one potential file, but not the other?

What command do you use to run LAMMPS?

@mhsiron (Author) commented Sep 15, 2022

Exactly! I am using:
lmp -k on g 1 -sf kk -pk kokkos newton on neigh full -in in.script

Happy to provide the potentials as well (attached).

The one that gives me the problem is the one where I used 7 Å as the cutoff -- my only guess is that it could have to do with memory, since the 7 Å potential must be larger to process. But that's just my guess...
TiO2_3A.zip
TiO2_7A

@anjohan (Collaborator) commented Sep 15, 2022

That command looks good.

Do you happen to have your input structure as well? Does it have any weird features like isolated atoms?

Note that we are in the process of merging this FLARE++ code into the main FLARE repository; it is currently on the development branch: https://github.com/mir-group/flare/tree/development

There's a chance there are some bugfixes in there that were done after migrating from flare_pp, so you could try to use the files in https://github.com/mir-group/flare/tree/development/lammps_plugins

@mhsiron (Author) commented Sep 15, 2022

Here's the input structure. It does not have any isolated atoms.

I will work on building with the development branch to see if that removes this error!
data.zip

@mhsiron (Author) commented Sep 15, 2022

I am unable to compile LAMMPS with the development branch of Flare:

3 errors detected in the compilation of "/lammps/src/compute_flare_std_atom.cpp".
CMakeFiles/lammps.dir/build.make:789: recipe for target 'CMakeFiles/lammps.dir/lammps/src/compute_flare_std_atom.cpp.o' failed
gmake[2]: *** [CMakeFiles/lammps.dir/ammps/l/lammps/src/compute_flare_std_atom.cpp.o] Error 1

@anjohan (Collaborator) commented Sep 16, 2022

I see now that your output contains

FLARE will use up to 90.00 GB of device memory, controlled by MAXMEM environment variable

which indicates that you've set the environment variable MAXMEM=90. Does your GPU actually have 90 GB of memory?
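
For reference, that cap is just an environment variable read in GB (per the log line above), so you can lower it before launching, e.g.:

export MAXMEM=10    # limit FLARE's device-memory budget to 10 GB
lmp -k on g 1 -sf kk -pk kokkos newton on neigh full -in in.script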

As for the build errors on the development branch, you may have to first uninstall the old patch and then reinstall from lammps_plugins/ (commands sketched below), since there are some changes to our CMake patch (the development branch should run quite a bit faster because it uses KokkosKernels/cuBLAS).
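
Roughly, with /path/to/lammps standing in for your LAMMPS source directory:

./uninstall.sh /path/to/lammps   # run from lammps_plugins/ to remove the old flare_pp patch
./install.sh /path/to/lammps     # then apply the development-branch patch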

@mhsiron (Author) commented Sep 16, 2022

Hi @anjohan, even with MAXMEM=10 I get the following output:

LAMMPS (29 Sep 2021 - Update 3)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
  will use up to 1 GPU(s) per node
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.004 seconds
Reading potential file TiO2.txt with DATE: Thu
FLARE will use up to 12.00 GB of device memory, controlled by MAXMEM environment variable
Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 9
  ghost atom cutoff = 9
  binsize = 9, bins = 2 2 3
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair flare/kk, perpetual
      attributes: full, newton on, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 2.219 | 2.219 | 2.219 Mbytes
Step PotEng KinEng TotEng Temp Press Density 
       0         -nan    27.661671         -nan         1000         -nan    3.7529171 
     100            0         -nan         -nan         -nan         -nan    3.7529171 
     200            0         -nan         -nan         -nan         -nan    3.7529171 
     300            0         -nan         -nan         -nan         -nan    3.7529171

It seems FLARE ignores MAXMEM < 12?

As for compiling, this was on a fresh LAMMPS download, but of 29Sep2021 Update 3. I just tried it on 17Feb2022 and it appears to fail with the same error:

3 errors detected in the compilation of "lammps/l2/lammps/src/compute_flare_std_atom.cpp".
CMakeFiles/lammps.dir/build.make:789: recipe for target 'CMakeFiles/lammps.dir/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o' failed
gmake[2]: *** [CMakeFiles/lammps.dir/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o] Error 1
CMakeFiles/Makefile2:769: recipe for target 'CMakeFiles/lammps.dir/all' failed
gmake[1]: *** [CMakeFiles/lammps.dir/all] Error 2
Makefile:135: recipe for target 'all' failed
gmake: *** [all] Error 2

@mhsiron (Author) commented Sep 16, 2022

Here is the full error output in VERBOSE=1 mode:

[ 18%] Building CXX object CMakeFiles/lammps.dir/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o
/nfs/site/disks/msironml/lammps/l2/lammps/lib/kokkos/bin/nvcc_wrapper -DFFT_KISS -DKOKKOS_DEPENDENCE -DLAMMPS_GZIP -DLAMMPS_JPEG -DLAMMPS_MEMALIGN=64 -DLAMMPS_OMP_COMPAT=4 -DLAMMPS_PNG -DLAMMPS_SMALLBIG -DLMP_KOKKOS -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX -D_MPICC_H -I/nfs/site/disks/msironml/lammps/l2/lammps/src -I/nfs/site/disks/msironml/lammps/l2/lammps/src/KSPACE -I/nfs/site/disks/msironml/lammps/l2/lammps/src/MACHDYN -I/nfs/site/disks/msironml/lammps/l2/lammps/src/MANYBODY -I/nfs/site/disks/msironml/lammps/l2/lammps/src/MOLECULE -I/nfs/site/disks/msironml/lammps/l2/lammps/src/RIGID -I/nfs/site/disks/msironml/lammps/l2/lammps/lib/kokkos/core/src -I/nfs/site/disks/msironml/lammps/l2/lammps/lib/kokkos/containers/src -I/nfs/site/disks/msironml/lammps/l2/lammps/lib/kokkos/algorithms/src -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/lib/kokkos -I/nfs/site/disks/msironml/lammps/l2/lammps/src/KOKKOS -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/styles -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/lib/kokkos/core/src -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/lib/kokkos/containers/src -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/lib/kokkos/algorithms/src -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-build/src -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/impl -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-build/src/impl -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/impl/tpls -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/blas -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/blas/impl -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/sparse -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/sparse/impl -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/graph -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/graph/impl -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/batched -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/batched/dense -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/batched/dense/impl -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/batched/sparse -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/batched/sparse/impl -I/nfs/site/disks/msironml/lammps/l2/lammps/build1/_deps/kokkoskernels-src/src/common -isystem /nfs/site/itools/em64t_SLES12SP5/pkgs/ics/2019.4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/include -isystem /nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build -O2 -g -DNDEBUG -fPIC -fopenmp -expt-extended-lambda -Wext-lambda-captures-this -arch=sm_70 -Xcompiler -std=c++14 -MD -MT CMakeFiles/lammps.dir/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o -MF CMakeFiles/lammps.dir/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o.d -o CMakeFiles/lammps.dir/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o -c /nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp
/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(223): warning: __host__ annotation is ignored on a function("DenseStorage") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(223): warning: __device__ annotation is ignored on a function("DenseStorage") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(233): warning: __host__ annotation is ignored on a function("operator=") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(233): warning: __device__ annotation is ignored on a function("operator=") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(248): warning: __host__ annotation is ignored on a function("DenseStorage") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(248): warning: __device__ annotation is ignored on a function("DenseStorage") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(249): warning: __host__ annotation is ignored on a function("operator=") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/build1/Eigen3_build-prefix/src/Eigen3_build/Eigen/src/Core/DenseStorage.h(249): warning: __device__ annotation is ignored on a function("operator=") that is explicitly defaulted on its first declaration

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(84): error: class "LAMMPS_NS::Neighbor" has no member "add_request"

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(84): error: namespace "LAMMPS_NS::NeighConst" has no member "REQ_FULL"

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(84): error: namespace "LAMMPS_NS::NeighConst" has no member "REQ_OCCASIONAL"

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(129): warning: variable "beta_init" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(129): warning: variable "beta_counter" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(130): warning: variable "B2_val_1" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(130): warning: variable "B2_val_2" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(106): warning: variable "nall" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(107): warning: variable "newton_pair" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(435): warning: variable "n_size" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(437): warning: variable "beta_val" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(471): warning: variable "tmp" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(471): warning: variable "nwords" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(571): warning: variable "radial_string" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(571): warning: variable "cutoff_string" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(572): warning: variable "radial_string_length" was declared but never referenced

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(572): warning: variable "cutoff_string_length" was declared but never referenced

3 errors detected in the compilation of "/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp".
CMakeFiles/lammps.dir/build.make:789: recipe for target 'CMakeFiles/lammps.dir/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o' failed
make[2]: *** [CMakeFiles/lammps.dir/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp.o] Error 1
make[2]: Leaving directory '/nfs/site/disks/msironml/lammps/l2/lammps/build1'
CMakeFiles/Makefile2:769: recipe for target 'CMakeFiles/lammps.dir/all' failed
make[1]: *** [CMakeFiles/lammps.dir/all] Error 2
make[1]: Leaving directory '/nfs/site/disks/msironml/lammps/l2/lammps/build1'
Makefile:135: recipe for target 'all' failed
make: *** [all] Error 2

@mhsiron (Author) commented Sep 16, 2022

On LAMMPS stable_23Jun2022, it appears not to have this issue, so far. Will update if the compile finishes.

Perhaps only a newer version of LAMMPS provides the following:

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(84): error: class "LAMMPS_NS::Neighbor" has no member "add_request"

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(84): error: namespace "LAMMPS_NS::NeighConst" has no member "REQ_FULL"

/nfs/site/disks/msironml/lammps/l2/lammps/src/compute_flare_std_atom.cpp(84): error: namespace "LAMMPS_NS::NeighConst" has no member "REQ_OCCASIONAL"

@mhsiron (Author) commented Sep 16, 2022

Confirmed that LAMMPS stable_23Jun2022 compiles with the development branch of the Flare LAMMPS plugin with Kokkos. With the same CMake compiler flags, it did not work with 17Feb2022 or 29Sep2021 Update 3.

As for my original problem, the "7A" Flare++ potential, the Flare development branch did not fix it:

LAMMPS (23 Jun 2022)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:105)
  will use up to 1 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0 0 0) to (11.40813 11.40813 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.002 seconds
Reading potential file TiO2.txt with DATE: Thu
FLARE will use up to 12.00 GB of device memory, controlled by MAXMEM environment variable
Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 9
  ghost atom cutoff = 9
  binsize = 9, bins = 2 2 3
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair flare/kk, perpetual
      attributes: full, newton on, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 2.219 | 2.219 | 2.219 Mbytes
   Step         PotEng         KinEng         TotEng          Temp          Press         Density    
         0  -nan            27.661671     -nan            1000          -nan            3.7529171    
       100   0             -nan           -nan           -nan           -nan            3.7529171    
       200   0             -nan           -nan           -nan           -nan            3.7529171  

@anjohan (Collaborator) commented Sep 21, 2022

@mhsiron Have you tried running it with Kokkos in either Serial or OpenMP mode?

Also: Your 7 Å potential link has expired, so I'm not able to download it.

@mhsiron (Author) commented Sep 22, 2022

I'm not too familiar with Kokkos -- would you mind letting me know the command for Serial or OpenMP mode?
From this, my understanding is that I would have to compile for each mode, correct?

Here's the 7A potential again!

@anjohan (Collaborator) commented Oct 4, 2022

Hi @mhsiron ,

Sorry that I didn't get around to following up on this.

  1. Yes, you can compile it with either -DKokkos_ENABLE_OPENMP=ON (for OpenMP) or nothing (for Serial), then replace the g 1 with t 1 (1 thread or Serial) or t 4 (4 threads); see the example commands after this list.

  2. Your potential link expired again.
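
To illustrate point 1, a minimal sketch of the build and run commands (keep whatever other CMake options you already use; paths and thread counts are just examples):

cmake ../cmake -DPKG_KOKKOS=ON -DKokkos_ENABLE_OPENMP=ON    # omit the OpenMP flag for a Serial-only build
lmp -k on t 4 -sf kk -pk kokkos newton on neigh full -in in.script    # t 4 = 4 threads; use t 1 for Serial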

@mhsiron (Author) commented Oct 10, 2022

Hi @anjohan ,

No worries -- completely understand! Let's try round 3 with the potential link! Here it is (it looks like this service keeps links for 30 days): link.

I will attempt the recompile -- I am working with my cluster to upgrade our CUDA to be on par with later PyTorch versions first to see if this may fix some issues. Will update soon.

@mhsiron (Author) commented Oct 21, 2022

OK, some good news -- after recompiling in Kokkos OpenMP mode, I am able to run the 7A potential with Kokkos with decent performance (though not as good as the 4A one with 2x the number of atoms on Kokkos+GPU). So it does appear to be related more specifically to Kokkos with GPU, though I'm still not sure why the 4A worked and the 7A didn't.

@mhsiron (Author) commented Oct 21, 2022

I used the following compile command:
cmake -C ../cmake/presets/kokkos-serial.cmake -C ../cmake/presets/kokkos-openmp.cmake ../cmake -DPKG_KOKKOS=ON -DKokkos_ENABLE_OPENMP=ON -DBUILD_OMP=yes

This was with GCC 12.1 and CMake 3.25.

@mhsiron (Author) commented Oct 21, 2022

I am still waiting on the cluster to upgrade CUDA to either 11.3 or 11.6 to see if this resolves the problem for Kokkos with CUDA support.

@mhsiron (Author) commented Oct 27, 2022

After upgrading to CUDA 11.6 and recompiling, I am still seeing the same error with Kokkos+GPU. Only Kokkos+OpenMP seems to work for the 7A potential. The 4A potential has no issue with either compilation.

@anjohan (Collaborator) commented Nov 9, 2022

Hi @mhsiron ,

Thank you for investigating further!

I finally got around to running this on my own setup, and it does indeed produce nan (both energies and forces) for me as well with the large cutoff. I'm looking into it.

@anjohan (Collaborator) commented Nov 9, 2022

Hm, it appears that when the cutoff (and thus the number of neighbors) grows too large, CUDA simply doesn't launch the radial and spherical harmonics basis calculation, giving no warning whatsoever.

For now, you can get it to run by replacing the second max_neighs with something like std::min(max_neighs, 32) on line 276 of pair_flare_kokkos.cpp.
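
As a purely illustrative sketch (not the actual pair_flare_kokkos.cpp source; the helper name below is made up), the idea is simply to cap that second launch dimension so it never exceeds what CUDA will accept:

#include <algorithm>

// Hypothetical helper: cap the width used when launching the radial and
// spherical harmonics kernels, instead of passing max_neighs through directly.
int capped_launch_width(int max_neighs) {
  return std::min(max_neighs, 32);  // previously just max_neighs
}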
