Error when running LAMMPS in the devel branch #4161

wujing81 · 2024-09-24T08:02:44Z

Summary

I created a container node from the image registry.dp.tech/dptech/deepmd-kit:3.0.0b3-cuda12.1 on the Bohrium platform, then installed the devel branch of DeePMD-kit with:
conda create -n deepmd-dev python=3.10
source activate deepmd-dev
pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
rsync -a --ignore-existing /opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/ /opt/deepmd-kit-3.0.0b3/
The command `/opt/deepmd-kit-3.0.0b3/bin/dp --version` then reports: DeePMD-kit v3.0.0b4.dev56+g0b72dae3.
I trained a model with this version of `dp` (the training input file is attached) and froze it with `dp --pt freeze` to get a .pth file. I then ran an MD simulation with this model using `/opt/deepmd-kit-3.0.0b3/bin/lmp -i lammps.in`. The input.lammps and conf.lmp files are attached.
The run fails with the following error:
[bohrium-11849-1195151:01982] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
DeePMD-kit: Successfully load libcudart.so.11.0
2024-09-24 15:37:29.837816: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 15:37:29.837871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 15:37:29.837882: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Loaded 1 plugins from /opt/deepmd-kit-3.0.0b3/lib/deepmd_lmp
Reading data file ...
triclinic box = (0 0 0) to (12.4447 12.4447 12.4447) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms ...
192 atoms
read_data CPU = 0.003 seconds
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Summary of lammps deepmd module ...

Info of deepmd-kit:
installed to: /opt/deepmd-kit-3.0.0b3
source:
source branch: HEAD
source commit: cbf2de6
source commit at: 2024-07-27 05:11:58 +0000
support model ver.: 1.1
build variant: cuda
build with tf inc: /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/include;/opt/deepmd-kit-3.0.0b3/include
build with tf lib: /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10.so;/usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
Info of lammps module:
use deepmd-kit at: /opt/deepmd-kit-3.0.0b3load model from: model.pth to cpu
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Info of model(s):
using 1 model(s): model.pth
rcut in model: 4.5
ntypes in model: 118

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

USER-DEEPMD package:
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 10 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6.5
ghost atom cutoff = 6.5
binsize = 3.25, bins = 4 4 4
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend JIT error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 56, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_5 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_6 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 213, in forward_common_lower
cc_ext, _36, fp, ap, input_prec, = _35
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_37 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 50, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 93, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
fitting_net = self.fitting_net
File "code/torch/deepmd/pt/model/descriptor/dpa2.py", line 98, in forward
repformers1 = self.repformers
_17 = nlist_dict[_1(_16, (repformers1).get_nsel(), )]
_18 = (repformers).forward(_17, extended_coord, extended_atype, g13, mapping0, comm_dict0, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
g14, g2, h2, rot_mat, sw, = _18
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repformers.py", line 364, in forward
_65 = "border_op is not available since customized PyTorch OP library is not built when freezing the model."
_66 = uninitialized(Tensor)
ops.prim.RaiseException(_65, "builtins.NotImplementedError")

return _66

Traceback of TorchScript, original code (most recent call last):
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/ener_model.py", line 109, in forward_lower
      comm_dict: Optional[Dict[str, torch.Tensor]] = None,
  ):
      model_ret = self.forward_common_lower(
                  ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/make_model.py", line 261, in forward_common_lower
          )
          del extended_coord, fparam, aparam
          atomic_ret = self.atomic_model.forward_common_atomic(
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
              cc_ext,
              extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 242, in forward_common_atomic
  
      ext_atom_mask = self.make_atom_mask(extended_atype)
      ret_dict = self.forward_atomic(
                 ~~~~~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          torch.where(ext_atom_mask, extended_atype, 0),
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 189, in forward_atomic
      if self.do_grad_r() or self.do_grad_c():
          extended_coord.requires_grad_(True)
      descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                        ~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 799, in forward
          g1 = g1_ext
      # repformer
      g1, g2, h2, rot_mat, sw = self.repformers(
                                ~~~~~~~~~~~~~~~ <--- HERE
          nlist_dict[
              get_multiple_nlist_key(
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/repformers.py", line 62, in forward
      argument8,
  ) -> torch.Tensor:
      raise NotImplementedError(
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
          "border_op is not available since customized PyTorch OP library is not built when freezing the model."
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
      )
builtins.NotImplementedError: border_op is not available since customized PyTorch OP library is not built when freezing the model.
(/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run             1000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
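
In the output above, the line that aborts the run appears to be the `NotImplementedError` about `border_op`; the `mca_btl_openib` notice is marked "(ignored)" and the run continues past the cuDNN/cuFFT/cuBLAS registration messages. As a minimal sketch (the function names are hypothetical and not part of DeePMD-kit or LAMMPS), a saved copy of such output could be scanned for this specific failure like so:

```python
# Marker copied verbatim from the NotImplementedError in the traceback above.
BORDER_OP_MARKER = "border_op is not available"


def has_border_op_error(log_text: str) -> bool:
    """Return True if the LAMMPS output contains the border_op failure."""
    return BORDER_OP_MARKER in log_text


def first_error_line(log_text: str):
    """Return the first line starting with 'ERROR', or None if there is none."""
    for line in log_text.splitlines():
        if line.startswith("ERROR"):
            return line
    return None


# A few lines taken from the output above, used as a tiny sample.
sample = (
    "Setting up Verlet run ...\n"
    "ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: "
    "DeePMD-kit PyTorch backend JIT error: The following operation failed "
    "in the TorchScript interpreter.\n"
    "builtins.NotImplementedError: border_op is not available since "
    "customized PyTorch OP library is not built when freezing the model.\n"
)

if has_border_op_error(sample):
    print("border_op failure found; first error line:")
    print(first_error_line(sample))
```

A plain substring match is used on purpose: the message text is the only stable identifier visible in this log, and the surrounding TorchScript frames may differ between versions.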


### DeePMD-kit Version

DeePMD-kit v3.0.0b4.dev56+g0b72dae3

### Backend and its version

PyTorch v2.4.1+cu121-g38b96d3399a

### Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

_No response_

### Details

[input.zip](https://github.com/user-attachments/files/17110378/input.zip)


iProzd · 2024-09-24T14:22:09Z

@wujing81 Apologies for the confusion during installation; I faced the same issue while debugging.

The problem arises because DPA-2 requires the `border_op` operator from the customized PyTorch OP library, which is only built when PyTorch support is enabled at installation time. You can enable it by installing with:
DP_VARIANT=cuda DP_ENABLE_PYTORCH=1 pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
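
For scripted setups, the same install can be driven from Python. The following is only a sketch: the helper names `build_install` and `reinstall` are hypothetical, and using `sys.executable -m pip` instead of a bare `pip` is a choice made here so that the package lands in the environment of the running interpreter. The two environment variables and the repository URL are the ones from the command above.

```python
import os
import subprocess
import sys

# Repository URL taken from the install command in this thread.
REPO_URL = "git+https://github.com/deepmodeling/deepmd-kit.git@devel"


def build_install(base_env=None):
    """Return (env, cmd) for a CUDA build with PyTorch support enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["DP_VARIANT"] = "cuda"
    env["DP_ENABLE_PYTORCH"] = "1"
    cmd = [sys.executable, "-m", "pip", "install", REPO_URL]
    return env, cmd


def reinstall(dry_run=True):
    """Print the command (dry run) or actually run pip with the build flags."""
    env, cmd = build_install()
    if dry_run:
        print("DP_VARIANT=cuda DP_ENABLE_PYTORCH=1 " + " ".join(cmd))
        return 0
    return subprocess.run(cmd, env=env, check=True).returncode


if __name__ == "__main__":
    # Dry run by default, so nothing is installed unless explicitly requested.
    reinstall(dry_run=True)
```

Since the error message refers to the OP library not being built "when freezing the model", the model presumably has to be frozen again with `dp --pt freeze` after the reinstall before it is handed to LAMMPS.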

But why is this option False by default? @njzjz @CaRoLZhangxy To my understanding, anyone who wants to run a DPA-2 model with LAMMPS needs this option. By the way, the documentation for this option may not be clear enough: https://docs.deepmodeling.com/projects/deepmd/en/latest/install/install-from-source.html#envvar-DP_ENABLE_PYTORCH
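
One way to confirm that the option took effect before freezing is to ask the installed package whether the customized OP library was loaded. The sketch below assumes that the module `deepmd.pt.cxx_op` exposes a boolean `ENABLE_CUSTOMIZED_OP`; that name is an assumption about the package layout and is not confirmed anywhere in this thread, so the helper (itself a hypothetical name) returns `None` when the module or attribute cannot be found.

```python
import importlib


def customized_op_status():
    """Return True/False for the flag, or None if it cannot be determined."""
    try:
        # Assumed module path; importing it also requires torch to be present.
        cxx_op = importlib.import_module("deepmd.pt.cxx_op")
    except ImportError:
        return None
    # Assumed attribute name; fall back to None if it does not exist.
    flag = getattr(cxx_op, "ENABLE_CUSTOMIZED_OP", None)
    return None if flag is None else bool(flag)


if __name__ == "__main__":
    status = customized_op_status()
    if status is None:
        print("Could not determine whether the customized OP library is loaded.")
    elif status:
        print("Customized PyTorch OP library is loaded.")
    else:
        print("Customized PyTorch OP library is NOT loaded; "
              "reinstall with DP_ENABLE_PYTORCH=1 before freezing.")
```

Checking a package-level flag rather than the mere existence of a `border_op` name is deliberate: the traceback above shows a Python stub named for `border_op` that only raises `NotImplementedError`, so the presence of the name alone may not indicate a working operator.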

njzjz · 2024-09-24T19:00:38Z

> But why is this option False by default?

xref: #3891 (comment)

I am not going to change the default option to True until PyTorch fixes pytorch/pytorch#78530.


Fix #4161.

## Summary by CodeRabbit

- **New Features**
  - Added installation requirements for the DPA-2 model in the documentation, including customized OP library instructions.
- **Improvements**
  - Enhanced error messaging in the `border_op` function for better user guidance.
  - Clarified parameter handling and documentation in the `DescrptBlockRepformers` class.
  - Improved logic for processing input tensors and neighbor lists in the `forward` method.
  - Strengthened input statistics handling in the `compute_input_stats` method.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

wujing81 added the wontfix label Sep 24, 2024

njzjz added Docs and removed wontfix labels Sep 24, 2024

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 1, 2024

docs: add documentation for installation requirements of DPA-2

9e47e3f

Fix deepmodeling#4161. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

njzjz mentioned this issue Oct 1, 2024

docs: add documentation for installation requirements of DPA-2 #4178

Merged

njzjz linked a pull request Oct 1, 2024 that will close this issue

docs: add documentation for installation requirements of DPA-2 #4178

Merged

njzjz closed this as completed Oct 7, 2024
