Our profiling of the example water benchmark system shows that the `scan_nlist` kernel in `$deepmd_source_dir/source/lib/src/cuda/neighbor_list.cu` consumes more than 7% of the kernel execution time during the `dp train` process, and more than 20% of the kernel execution time during the `dp init-frz-model` process.
The original `scan_nlist` kernel uses a single thread to scan the neighbor list of each central atom. This is inefficient during training: since the training `nloc` is usually smaller than the number of threads per CUDA block, `scan_nlist` typically launches only one CUDA thread block at each training step, wasting most of the available compute resources. In the new implementation, I use the NVIDIA cub header library to parallelize the scan kernel, which speeds up `scan_nlist` by more than 50 times.
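For reference, below is a minimal sketch of the block-parallel scan idea, not the exact kernel in this PR: one CUDA thread block cooperatively computes the exclusive prefix sum over one central atom's neighbor counts using `cub::BlockScan`. The kernel name, template parameters, and buffer layout (`counts_in`, `scan_out`, `row_len`) are illustrative assumptions, not the identifiers used in `neighbor_list.cu`.

```cuda
#include <cub/block/block_scan.cuh>

// Illustrative sketch only: one thread block scans one central atom's
// neighbor counts cooperatively, instead of a single thread scanning the
// whole row serially as in the old scan_nlist kernel.
// BLOCK_THREADS, counts_in, scan_out, and row_len are assumed names.
template <typename T, int BLOCK_THREADS>
__global__ void block_parallel_scan_sketch(T* scan_out,
                                           const T* counts_in,
                                           const int row_len) {
  using BlockScan = cub::BlockScan<T, BLOCK_THREADS>;
  __shared__ typename BlockScan::TempStorage temp_storage;

  const int row = blockIdx.x;   // one block per central atom
  const int tid = threadIdx.x;

  // Load one element per thread; assumes row_len <= BLOCK_THREADS.
  T val = (tid < row_len) ? counts_in[row * row_len + tid] : T(0);

  // Cooperative exclusive prefix sum across the whole block.
  BlockScan(temp_storage).ExclusiveSum(val, val);

  if (tid < row_len) {
    scan_out[row * row_len + tid] = val;
  }
}

// Possible launch (hypothetical): one block per central atom, e.g.
//   block_parallel_scan_sketch<int, 128><<<nloc, 128>>>(scan_out, counts_in, row_len);
```

Launching one block per central atom keeps the GPU busy even when `nloc` is small relative to the block size, which is exactly the situation the original one-thread-per-atom kernel handled poorly.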
The overall training speed-up is about 3%, mainly because a considerable part of the training process still does not run in GPU kernel functions; more detailed profiling of the training process is still needed. The `lcurve.out` files from CPU training (lcurve-cpu.out), GPU training (lcurve-gpu.out), and the new parallel-scan training (lcurve-parallel.out) show the same results, which confirms the correctness of the new implementation. All UTs pass on my local V100 workstation.