Add Errcheck after every kernel function run and merge redundant code #855
Conversation
Fixes bug 824 in ROCM, which was caused by an array going out of bounds.
Codecov Report

```diff
@@            Coverage Diff            @@
##           devel    #855      +/-   ##
==========================================
- Coverage   73.88%  64.28%    -9.61%
==========================================
  Files          85       5       -80
  Lines        6805      14     -6791
==========================================
- Hits         5028       9     -5019
+ Misses       1777       5     -1772
```

Continue to review the full report at Codecov.
```diff
@@ -27,7 +28,8 @@ inline void cudaAssert(cudaError_t code, const char *file, int line, bool abort=
 }

 #define nborErrcheck(res) {nborAssert((res), __FILE__, __LINE__);}
-inline void nborAssert(cudaError_t code, const char *file, int line, bool abort=true) {
+inline void nborAssert(cudaError_t code, const char *file, int line, bool abort=true)
```
Only `nborErrcheck` differs. Should we output the same information as `DPErrcheck` whenever any kernel hits an error, or should we output specific information when a specific kernel hits an error (such as "illegal nbor list sorting" and so on)?
I think we should use specific information for the more important kernels, but not for all kernels.
@denghuilu After adding Errcheck for every kernel, I used the water model for performance tests; the results show no loss in performance.
source/api_cc/src/DeepPot.cc (outdated)

```cpp
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(hipError_t code, const char *file, int line, bool abort=true)
```
Duplicated implementation.
```diff
@@ -218,32 +191,18 @@ init (const std::string & model, const int & gpu_rank, const std::string & file_
   else
     graph_def.ParseFromString(file_content);
   int gpu_num = -1;
-  #if GOOGLE_CUDA
+  #if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
   cudaGetDeviceCount(&gpu_num); // check current device environment
```
I feel that we could define `GOOGLE_CUDA || TENSORFLOW_USE_ROCM` in another place (for example, `USE_DEVICE`), so when we add more devices, we do not need to modify these conditions.
> I feel that we could define `GOOGLE_CUDA || TENSORFLOW_USE_ROCM` in another place (for example, `USE_DEVICE`), so when we add more devices, we do not need to modify these conditions.
That may not be a good idea, because we are supporting more accelerators (devices) whose names may not be "gpu".
source/api_cc/src/DeepPot.cc (outdated)

```cpp
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(cudaError_t code, const char *file, int line, bool abort=true)
```
Duplicated definition with those defined in `lib/gpu_cuda.h`.
deepmd-kit/source/lib/include/gpu_cuda.h, lines 8 to 28 in eea2ab1:

```cpp
#define DPErrcheck(res) {DPAssert((res), __FILE__, __LINE__);}
inline void DPAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
  if (code != cudaSuccess) {
    fprintf(stderr, "cuda assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (code == 2) {
      // out of memory
      // TODO: I have no idea how to throw errors back to the Python interface
      fprintf(stderr, "Your memory is not enough, thus an error has been raised "
          "above. You need to take the following actions:\n"
          "1. Check if the network size of the model is too large.\n"
          "2. Check if the batch size of training or testing is too large. "
          "You can set the training batch size to `auto`.\n"
          "3. Check if the number of atoms is too large.\n"
          "4. Check if another program is using the same GPU by executing `nvidia-smi`. "
          "The usage of GPUs is controlled by the `CUDA_VISIBLE_DEVICES` "
          "environment variable.\n");
    }
    if (abort) exit(code);
  }
}
```
deepmodeling#855:

* Synchronize CUDA _r modifications to ROCM
* Fix bug 824 and synchronize updates to CUDA code (fixed in ROCM because of a bug caused by an array going out of bounds)
* Update prod_env_mat.hip.cu
* Add Errcheck after every kernel function runs and merge redundant code
* Get rid of duplicate definitions of DPErrcheck

Co-authored-by: 李泽宇 <li_zeyu@pku.edu.cn>
1. cudaErrcheck / hipErrcheck -> DPErrcheck
2. hipGetDeviceCount / cudaGetDeviceCount -> DPGetDeviceCount
3. Encapsulate hipSetDevice and cudaSetDevice with DPSetDevice
4. convert_nlist_gpu_cuda / convert_nlist_gpu_rocm -> convert_nlist_gpu_device (also free_nlist_gpu_cuda and free_nlist_gpu_rocm)
5. Merge redundant code