
speedup scan_nlist kernel #1028

Merged — 2 commits merged into deepmodeling:devel on Aug 25, 2021
Conversation

denghuilu (Member)

Our profiling of the example water benchmark system shows that the scan_nlist kernel in $deepmd_source_dir/source/lib/src/cuda/neighbor_list.cu consumes more than 7% of the kernel execution time during the dp train process, and more than 20% during the dp init-frz-model process.

The original scan_nlist kernel uses one thread to scan the neighbor list of a central atom, which is inefficient during training. Since the training nloc is usually smaller than the number of threads per CUDA block, scan_nlist typically launches only a single CUDA thread block at each training step, wasting most of the GPU's computing resources.

In the new implementation, I use the NVIDIA CUB header library to parallelize the scan kernel, which speeds up the scan_nlist kernel by more than 50 times. The total training speed-up is about 3%, mainly because a considerable part of the training process still does not run in GPU kernel functions. We still need to perform more detailed profiling of the training process.
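The operation being parallelized is a device-wide exclusive prefix sum over per-atom neighbor counts. A minimal host-side sketch of the same semantics (the function name and data here are illustrative, not from the PR; on the GPU the equivalent call would be cub::DeviceScan::ExclusiveSum):

```cpp
#include <numeric>   // std::exclusive_scan (C++17)
#include <vector>

// Illustrative host-side equivalent of the scan: turn per-atom neighbor
// counts into each atom's starting offset in the flat neighbor array.
// On the device, cub::DeviceScan::ExclusiveSum performs this same exclusive
// prefix sum across many threads instead of one thread walking the list.
std::vector<int> neighbor_offsets(const std::vector<int>& numneigh) {
    std::vector<int> offsets(numneigh.size());
    std::exclusive_scan(numneigh.begin(), numneigh.end(), offsets.begin(), 0);
    return offsets;
}
```

For counts {3, 0, 5, 2, 4} this yields offsets {0, 3, 3, 8, 10}. A parallel scan computes the same result in logarithmic depth rather than a serial pass, which is where the kernel-level speedup comes from.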

The lcurve.out files from CPU training (lcurve-cpu.out), GPU training (lcurve-gpu.out), and the new parallel-scan training (lcurve-parallel.out) show the same results:

root se_e2_a $ head lcurve-cpu.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.65e+01    2.76e+01      6.77e-01    6.79e-01      8.38e-01    8.71e-01    1.0e-03
      1      2.64e+01    2.78e+01      4.32e-01    4.31e-01      8.35e-01    8.78e-01    1.0e-03
      2      2.53e+01    2.52e+01      1.33e-01    1.28e-01      7.99e-01    7.98e-01    1.0e-03
      3      2.44e+01    2.31e+01      9.82e-02    9.90e-02      7.72e-01    7.30e-01    1.0e-03
      4      2.78e+01    2.57e+01      2.39e-01    2.38e-01      8.80e-01    8.12e-01    1.0e-03
      5      2.54e+01    2.54e+01      3.01e-01    3.04e-01      8.04e-01    8.04e-01    1.0e-03
      6      2.58e+01    2.49e+01      2.98e-01    3.03e-01      8.16e-01    7.88e-01    1.0e-03
      7      2.62e+01    2.36e+01      2.37e-01    2.40e-01      8.29e-01    7.46e-01    1.0e-03
      8      2.52e+01    2.58e+01      1.80e-01    1.79e-01      7.97e-01    8.15e-01    1.0e-03

root se_e2_a $ head lcurve-gpu.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.65e+01    2.76e+01      6.77e-01    6.79e-01      8.38e-01    8.71e-01    1.0e-03
      1      2.64e+01    2.78e+01      4.32e-01    4.31e-01      8.35e-01    8.78e-01    1.0e-03
      2      2.53e+01    2.52e+01      1.33e-01    1.28e-01      7.99e-01    7.98e-01    1.0e-03
      3      2.44e+01    2.31e+01      9.82e-02    9.90e-02      7.72e-01    7.30e-01    1.0e-03
      4      2.78e+01    2.57e+01      2.39e-01    2.38e-01      8.80e-01    8.12e-01    1.0e-03
      5      2.54e+01    2.54e+01      3.01e-01    3.04e-01      8.04e-01    8.04e-01    1.0e-03
      6      2.58e+01    2.49e+01      2.98e-01    3.03e-01      8.16e-01    7.88e-01    1.0e-03
      7      2.62e+01    2.36e+01      2.37e-01    2.40e-01      8.29e-01    7.46e-01    1.0e-03
      8      2.52e+01    2.58e+01      1.80e-01    1.79e-01      7.97e-01    8.15e-01    1.0e-03

root se_e2_a $ head lcurve-parallel.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.65e+01    2.76e+01      6.77e-01    6.79e-01      8.38e-01    8.71e-01    1.0e-03
      1      2.64e+01    2.78e+01      4.32e-01    4.31e-01      8.35e-01    8.78e-01    1.0e-03
      2      2.53e+01    2.52e+01      1.33e-01    1.28e-01      7.99e-01    7.98e-01    1.0e-03
      3      2.44e+01    2.31e+01      9.81e-02    9.90e-02      7.72e-01    7.30e-01    1.0e-03
      4      2.78e+01    2.57e+01      2.39e-01    2.38e-01      8.80e-01    8.12e-01    1.0e-03
      5      2.54e+01    2.54e+01      3.01e-01    3.04e-01      8.04e-01    8.04e-01    1.0e-03
      6      2.58e+01    2.49e+01      2.98e-01    3.03e-01      8.16e-01    7.88e-01    1.0e-03
      7      2.62e+01    2.36e+01      2.37e-01    2.40e-01      8.29e-01    7.46e-01    1.0e-03
      8      2.52e+01    2.58e+01      1.80e-01    1.79e-01      7.97e-01    8.15e-01    1.0e-03

This confirms the correctness of the new implementation. All UTs have also passed on my local V100 workstation.

codecov-commenter commented Aug 25, 2021

Codecov Report

Merging #1028 (28e16e1) into devel (c0874f0) will decrease coverage by 7.86%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##            devel    #1028      +/-   ##
==========================================
- Coverage   82.86%   75.01%   -7.86%     
==========================================
  Files         119       86      -33     
  Lines       10110     6924    -3186     
==========================================
- Hits         8378     5194    -3184     
+ Misses       1732     1730       -2     
Impacted Files Coverage Δ
deepmd/fit/ener.py 94.63% <0.00%> (ø)
source/lib/tests/test_simulation_region.cc
source/lib/tests/test_fmt_nlist.cc
source/api_cc/tests/test_deeppot_model_devi.cc
source/lib/tests/test_tabulate.cc
...ource/lib/tests/test_soft_min_switch_force_grad.cc
source/lib/tests/test_coord.cc
source/lib/tests/test_env_mat_a.cc
source/lib/tests/test_main.cc
source/lib/tests/test_soft_min_switch_virial.cc
... and 24 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

denghuilu changed the title from "Speedup scan" to "speedup scan_nlist kernel" on Aug 25, 2021
amcadmus (Member)

@pkulzy could you please check whether this PR works fine on ROCm? Thanks!

galeselee (Contributor)

OK. These changes have passed the UTs. As for the specific speedup, I need to wait until the cluster environment is in better shape before I can measure it.

@amcadmus amcadmus merged commit 602760e into deepmodeling:devel Aug 25, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021
* speedup cuda kernel scan_nlist

* fix no-pbc error
4 participants