[BUG] dipole model (pt backend) does not work for systems with different number of atoms #4368

Closed
ChiahsinChu opened this issue Nov 16, 2024 · 3 comments · Fixed by #4370

@ChiahsinChu
Contributor

Bug summary

When training a dipole model on multiple systems with different numbers of atoms, an error is thrown while merging arrays of different shapes. Training on a single system, or on multiple systems with the same number of atoms, works fine.

DeePMD-kit Version

DeePMD-kit v3.0.0b5.dev52+g13e247ec

Backend and its version

PyTorch v2.2.1-g6c8c5ad5eaf

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Input file (which is adapted from the official example of the water dipole model):

{
  "_comment1": " model parameters",
  "model": {
    "type_map": ["O", "H"],
    "atom_exclude_types": [1],
    "descriptor": {
      "type": "se_e2_a",
      "sel": [46, 92],
      "rcut_smth": 3.8,
      "rcut": 4.0,
      "neuron": [25, 50, 100],
      "resnet_dt": false,
      "axis_neuron": 6,
      "type_one_side": true,
      "precision": "float64",
      "seed": 1,
      "_comment2": " that's all"
    },
    "fitting_net": {
      "type": "dipole",
      "neuron": [100, 100, 100],
      "resnet_dt": true,
      "precision": "float64",
      "seed": 1,
      "_comment3": " that's all"
    },
    "_comment4": " that's all"
  },
  "learning_rate": {
    "type": "exp",
    "start_lr": 0.01,
    "decay_steps": 5000,
    "_comment5": "that's all"
  },
  "loss": {
    "type": "tensor",
    "pref": 0.0,
    "pref_atomic": 1.0,
    "_comment6": " that's all"
  },
  "_comment7": " training controls",
  "training": {
    "training_data": {
      "systems": "./test-data",
      "batch_size": "auto",
      "_comment8": "that's all"
    },
    "numb_steps": 2000,
    "seed": 10,
    "disp_file": "lcurve.out",
    "disp_freq": 100,
    "save_freq": 1000,
    "_comment10": "that's all"
  },
  "_comment11": "that's all"
}

Running command:

dp --pt train input.json

Error log:

To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-11-16 14:35:09,752] DEEPMD INFO    DeePMD version: 3.0.0b5.dev52+g13e247ec
[2024-11-16 14:35:09,752] DEEPMD INFO    Configuration path: mix_data.json
[2024-11-16 14:35:09,763] DEEPMD INFO     _____               _____   __  __  _____           _     _  _   
[2024-11-16 14:35:09,763] DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |  
[2024-11-16 14:35:09,763] DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_ 
[2024-11-16 14:35:09,763] DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[2024-11-16 14:35:09,763] DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_ 
[2024-11-16 14:35:09,763] DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[2024-11-16 14:35:09,763] DEEPMD INFO    Please read and cite:
[2024-11-16 14:35:09,763] DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2024-11-16 14:35:09,763] DEEPMD INFO    Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2024-11-16 14:35:09,763] DEEPMD INFO    See https://deepmd.rtfd.io/credits/ for details.
[2024-11-16 14:35:09,763] DEEPMD INFO    --------------------------------------------------------------------------------------------------------
[2024-11-16 14:35:09,763] DEEPMD INFO    installed to:          /home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.10/site-packages/deepmd
[2024-11-16 14:35:09,763] DEEPMD INFO                           /home/jxzhu/apps/deepmd/devel/deepmd
[2024-11-16 14:35:09,763] DEEPMD INFO    source:                v3.0.0b4-52-g13e247ec
[2024-11-16 14:35:09,763] DEEPMD INFO    source branch:         devel
[2024-11-16 14:35:09,763] DEEPMD INFO    source commit:         13e247ec
[2024-11-16 14:35:09,763] DEEPMD INFO    source commit at:      2024-10-26 18:25:18 +0000
[2024-11-16 14:35:09,763] DEEPMD INFO    use float prec:        double
[2024-11-16 14:35:09,763] DEEPMD INFO    build variant:         cuda
[2024-11-16 14:35:09,763] DEEPMD INFO    Backend:               PyTorch
[2024-11-16 14:35:09,763] DEEPMD INFO    PT ver:                v2.2.1-g6c8c5ad5eaf
[2024-11-16 14:35:09,763] DEEPMD INFO    Enable custom OP:      False
[2024-11-16 14:35:09,763] DEEPMD INFO    running on:            jxzhu
[2024-11-16 14:35:09,763] DEEPMD INFO    computing device:      cuda:0
[2024-11-16 14:35:09,763] DEEPMD INFO    CUDA_VISIBLE_DEVICES:  unset
[2024-11-16 14:35:09,763] DEEPMD INFO    Count of visible GPUs: 1
[2024-11-16 14:35:09,763] DEEPMD INFO    num_intra_threads:     0
[2024-11-16 14:35:09,763] DEEPMD INFO    num_inter_threads:     0
[2024-11-16 14:35:09,763] DEEPMD INFO    --------------------------------------------------------------------------------------------------------
[2024-11-16 14:35:09,803] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2024-11-16 14:35:10,054] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2024-11-16 14:35:10,145] DEEPMD INFO    Adjust batch size from 2048 to 4096
[2024-11-16 14:35:10,232] DEEPMD INFO    Adjust batch size from 4096 to 8192
[2024-11-16 14:35:10,444] DEEPMD INFO    Adjust batch size from 8192 to 16384
[2024-11-16 14:35:10,653] DEEPMD INFO    Adjust batch size from 16384 to 32768
[2024-11-16 14:35:10,866] DEEPMD INFO    Adjust batch size from 32768 to 16384
[2024-11-16 14:35:11,127] DEEPMD INFO    training data with min nbor dist: 0.999890527057838
[2024-11-16 14:35:11,127] DEEPMD INFO    training data with max nbor size: [20 36]
[2024-11-16 14:35:11,146] DEEPMD INFO    Packing data for statistics from 3 systems
Traceback (most recent call last):
  File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/jxzhu/apps/deepmd/devel/deepmd/main.py", line 927, in main
    deepmd_main(args)
  File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 527, in main
    train(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 339, in train
    trainer = get_trainer(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 191, in get_trainer
    trainer = training.Trainer(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/train/training.py", line 293, in __init__
    self.get_sample_func = single_model_stat(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/train/training.py", line 233, in single_model_stat
    _model.compute_or_load_stat(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/model/make_model.py", line 573, in compute_or_load_stat
    return self.atomic_model.compute_or_load_stat(sampled_func, stat_file_path)
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 294, in compute_or_load_stat
    self.compute_or_load_out_stat(wrapped_sampler, stat_file_path)
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/atomic_model/base_atomic_model.py", line 396, in compute_or_load_out_stat
    self.change_out_bias(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/model/atomic_model/base_atomic_model.py", line 463, in change_out_bias
    bias_out, std_out = compute_output_stats(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/utils/stat.py", line 367, in compute_output_stats
    bias_atom_a, std_atom_a = compute_output_stats_atomic(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/utils/stat.py", line 550, in compute_output_stats_atomic
    merged_output = {
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/utils/stat.py", line 551, in <dictcomp>
    kk: to_numpy_array(torch.cat(outputs[kk]))
  File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 702 but got size 864 for tensor number 1 in the list.

Steps to Reproduce

mkdir test && cd test
wget -c https://github.com/user-attachments/files/17783779/test-data.tar.gz
tar -zxvf test-data.tar.gz

# save the input file above as input.json

dp --pt train input.json
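
Before running the training command, it may help to confirm that the extracted systems really differ in atom count. The sketch below is only an illustration: `count_atoms_per_system` is a hypothetical helper, and it assumes each system directory under `./test-data` follows the DeePMD-kit raw/npy layout with a `type.raw` file holding one integer atom type per atom. The archive contents were not inspected here, so that layout is an assumption.

```python
from pathlib import Path


def count_atoms_per_system(root):
    """Map each system directory under `root` to its atom count.

    Assumption: every system directory contains a `type.raw` file with
    one whitespace-separated integer atom type per atom.
    """
    counts = {}
    for type_file in sorted(Path(root).rglob("type.raw")):
        # Counting whitespace-separated tokens works for both the
        # one-type-per-line and the single-line variants of the file.
        counts[str(type_file.parent)] = len(type_file.read_text().split())
    return counts
```

Calling `count_atoms_per_system("./test-data")` and printing the result should show whether the systems have unequal atom counts, which is the condition that triggers the error.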

Further Information, Files, and Links

test-data.tar.gz

@anyangml
Collaborator

anyangml commented Nov 16, 2024

My guess is that at line 550 of deepmd/pt/utils/stat.py we are concatenating label tensors of shape [nframe, nloc * ndim]. This fails because the first system has 234 atoms, each with a dipole label of 3 components (hence 702), while the second system has 288 atoms (hence 864).

merged_output = {
    kk: to_numpy_array(torch.cat(outputs[kk]))
    for kk in keys
    if len(outputs[kk]) > 0
}

To fix this, I think we can simply reshape the label tensors into [nframe * nloc, 1, ndim]. I did a quick test, and this seems to work.

I will create a PR to fix this soon.
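
The shape mismatch and the proposed reshape can be reproduced in isolation. The sketch below is an illustration only, not the code from the fix: it uses NumPy's `np.concatenate` as a stand-in for `torch.cat` (both require matching sizes in every dimension except the one being concatenated), and `make_labels`, `merge_flat`, and `merge_per_atom` are hypothetical names. The frame counts are made up; the atom counts 234 and 288 come from the error message.

```python
import numpy as np

NDIM = 3  # components of one atomic dipole label


def make_labels(nframe, nloc, ndim=NDIM):
    """Hypothetical stand-in for one system's labels, shape [nframe, nloc * ndim]."""
    return np.zeros((nframe, nloc * ndim))


def merge_flat(labels):
    """Concatenate along dimension 0; requires equal nloc * ndim everywhere."""
    return np.concatenate(labels)


def merge_per_atom(labels, ndim=NDIM):
    """Reshape each system to [nframe * nloc, 1, ndim] before concatenating."""
    return np.concatenate([x.reshape(-1, 1, ndim) for x in labels])


# Two systems with 234 and 288 atoms (second dimension 702 and 864).
systems = [make_labels(2, 234), make_labels(3, 288)]

try:
    merge_flat(systems)
    flat_ok = True
except ValueError:
    # Sizes differ in dimension 1, so the flat merge is rejected.
    flat_ok = False

merged = merge_per_atom(systems)  # one row per atom per frame
```

After the reshape the only dimension that varies between systems is dimension 0, which is the one being concatenated, so the number of atoms per system no longer matters.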

anyangml linked a pull request on Nov 17, 2024 that will close this issue
@anyangml
Collaborator

@ChiahsinChu can we add your test data into UT?

@ChiahsinChu
Contributor Author

> @ChiahsinChu can we add your test data into UT?

Sure.
