
Add Errcheck after every kernel function runs and merge redundant code #855

Merged: 7 commits merged into deepmodeling:devel on Jul 22, 2021

Conversation

galeselee (Contributor)

1. cudaErrcheck, hipErrcheck -> DPErrcheck (a sketch of the unified spelling follows this list)
2. hipGetDeviceCount, cudaGetDeviceCount -> DPGetDeviceCount
3. Encapsulate hipSetDevice and cudaSetDevice with DPSetDevice
4. convert_nlist_gpu_cuda, convert_nlist_gpu_rocm -> convert_nlist_gpu_device (also free_nlist_gpu_cuda and free_nlist_gpu_rocm)
5. Merge redundant code
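
For illustration, here is a minimal sketch of the device-neutral spelling these renames aim at, assuming a compile-time switch between the CUDA and ROCm runtimes. The PR itself keeps per-backend headers (such as lib/gpu_cuda.h) that define the same names, so this single-header dispatch, the DPError_t/DP_SUCCESS aliases, the *_impl helpers, and the wrapper signatures are illustrative, not the merged code.

// Sketch only: one device-neutral error check plus thin device helpers.
// DPError_t, DP_SUCCESS, and the *_impl helpers are made-up names for this
// example; only DPErrcheck, DPAssert, DPGetDeviceCount, and DPSetDevice
// correspond to names mentioned in the PR description.
#include <cstdio>
#include <cstdlib>

#if GOOGLE_CUDA
#include <cuda_runtime.h>
typedef cudaError_t DPError_t;
#define DP_SUCCESS cudaSuccess
inline const char *DPGetErrorString(DPError_t code) { return cudaGetErrorString(code); }
inline DPError_t getDeviceCount_impl(int *count) { return cudaGetDeviceCount(count); }
inline DPError_t setDevice_impl(int id) { return cudaSetDevice(id); }
#elif TENSORFLOW_USE_ROCM
#include <hip/hip_runtime.h>
typedef hipError_t DPError_t;
#define DP_SUCCESS hipSuccess
inline const char *DPGetErrorString(DPError_t code) { return hipGetErrorString(code); }
inline DPError_t getDeviceCount_impl(int *count) { return hipGetDeviceCount(count); }
inline DPError_t setDevice_impl(int id) { return hipSetDevice(id); }
#endif

// Unified error check: the same spelling works for both backends.
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(DPError_t code, const char *file, int line, bool abort = true) {
  if (code != DP_SUCCESS) {
    fprintf(stderr, "DP assert: %s %s %d\n", DPGetErrorString(code), file, line);
    if (abort) exit(code);
  }
}

// Wrappers with the error check built in (items 2 and 3 above).
inline void DPGetDeviceCount(int &gpu_num) { DPErrcheck(getDeviceCount_impl(&gpu_num)); }
inline void DPSetDevice(int device_id) { DPErrcheck(setDevice_impl(device_id)); }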

@codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #855 (7f2fe56) into devel (94635ba) will decrease coverage by 9.60%.
The diff coverage is n/a.

❗ Current head 7f2fe56 differs from pull request most recent head c0f57f6. Consider uploading reports for the commit c0f57f6 to get more accurate results

@@            Coverage Diff             @@
##            devel     #855      +/-   ##
==========================================
- Coverage   73.88%   64.28%   -9.61%     
==========================================
  Files          85        5      -80     
  Lines        6805       14    -6791     
==========================================
- Hits         5028        9    -5019     
+ Misses       1777        5    -1772     
Impacted Files Coverage Δ
source/lib/include/neighbor_list.h 100.00% <ø> (ø)
deepmd/utils/data.py
deepmd/train/run_options.py
deepmd/utils/argcheck.py
deepmd/loggers/loggers.py
deepmd/op/__init__.py
deepmd/utils/data_system.py
deepmd/env.py
deepmd/infer/data_modifier.py
deepmd/descriptor/se_a.py
... and 71 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 94635ba...c0f57f6.

@amcadmus requested review from denghuilu and iProzd on July 15, 2021 00:25
@galeselee (Contributor Author)

When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
[screenshot attachment]

@denghuilu (Member)

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
> [screenshot attachment]

There is no problem with api_cc/tests in the CUDA environment.

@galeselee (Contributor Author)

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
> [screenshot attachment]

I can't reproduce this problem on another machine. My preliminary conclusion is that it is a machine-specific problem.

@njzjz (Member) commented Jul 16, 2021

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
>
> [screenshot attachment]
>
> I can't reproduce this problem on another machine. My preliminary conclusion is that it is a machine-specific problem.

Maybe it's out of memory?

@galeselee (Contributor Author)

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
>
> [screenshot attachment]
>
> I can't reproduce this problem on another machine. My preliminary conclusion is that it is a machine-specific problem.
>
> Maybe it's out of memory?

But both machines have the same amount of memory, about 12 GB per control processor.

@galeselee requested a review from denghuilu on July 17, 2021 08:58
@@ -27,7 +28,8 @@ inline void cudaAssert(cudaError_t code, const char *file, int line, bool abort=
 }

 #define nborErrcheck(res) {nborAssert((res), __FILE__, __LINE__);}
-inline void nborAssert(cudaError_t code, const char *file, int line, bool abort=true) {
+inline void nborAssert(cudaError_t code, const char *file, int line, bool abort=true)
Collaborator:

Only 'nborErrcheck' differs now. Shall we output the same information as in 'DPErrcheck' when any kernel hits an error, or output specific information when a specific kernel hits an error (such as 'illegal nbor list sorting')?

Contributor Author:

I think we should use specific information for the more important kernels, but not for all kernels.
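
As an illustration of that direction, here is a hedged sketch of a kernel-specific check layered on the generic one: DPErrcheck stays the default everywhere, and only the neighbor-list kernels get their own message. The extra message wording is invented for this example and may not match the merged code.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Specific check for the neighbor-list kernels; other kernels keep DPErrcheck.
#define nborErrcheck(res) { nborAssert((res), __FILE__, __LINE__); }
inline void nborAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    // Same information as DPErrcheck would print ...
    fprintf(stderr, "cuda assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    // ... plus a kernel-specific hint (illustrative wording).
    fprintf(stderr, "neighbor list kernel failed, e.g. illegal nbor list sorting; "
                    "please check the input neighbor list.\n");
    if (abort) exit(code);
  }
}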

@galeselee (Contributor Author)

@denghuilu After adding Errcheck for every kernel, I used the water model for performance tests; the results show no loss in performance.
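
For readers skimming the diff, the per-kernel pattern being timed here usually looks like the sketch below: a check immediately after each launch. The kernel, its arguments, and the local DPErrcheck stand-in are placeholders; cudaGetLastError itself adds negligible overhead, and whether the merged code also synchronizes after every launch is not shown by this snippet.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal stand-in for the project's DPErrcheck macro, so the sketch is
// self-contained; the real macro lives in the lib headers touched by this PR.
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(cudaError_t code, const char *file, int line) {
  if (code != cudaSuccess) {
    fprintf(stderr, "cuda assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}

// Placeholder kernel standing in for any real deepmd-kit kernel.
__global__ void some_kernel(float *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * in[i];
}

void launch_with_checks(float *out, const float *in, int n) {
  const int block = 256;
  const int grid = (n + block - 1) / block;
  some_kernel<<<grid, block>>>(out, in, n);
  DPErrcheck(cudaGetLastError());        // catch launch/configuration errors
  DPErrcheck(cudaDeviceSynchronize());   // surface asynchronous execution errors
}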

@galeselee requested a review from iProzd on July 20, 2021 07:45
Comment on lines 29 to 30
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(hipError_t code, const char *file, int line, bool abort=true)
Member:

Duplicated implementation.

@galeselee requested a review from amcadmus on July 20, 2021 19:09
@@ -218,32 +191,18 @@ init (const std::string & model, const int & gpu_rank, const std::string & file_
 else
 graph_def.ParseFromString(file_content);
 int gpu_num = -1;
-#if GOOGLE_CUDA
-cudaGetDeviceCount(&gpu_num); // check current device environment
+#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
Member:

I feel that we could define GOOGLE_CUDA || TENSORFLOW_USE_ROCM in another place (for example USE_DEVICE), so when we add more devices, we do not need to modify these conditions.

Member:

> I feel that we could define GOOGLE_CUDA || TENSORFLOW_USE_ROCM in another place (for example USE_DEVICE), so when we add more devices, we do not need to modify these conditions.

This may not be a good idea, because we are going to support more accelerators (devices) whose names may not be "gpu".
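
For concreteness, a small sketch of the suggestion; USE_DEVICE is the reviewer's example name and not something this PR introduces, and DPGetDeviceCount is assumed here to take a reference to the count, as in the earlier sketch (the merged wrapper may differ).

// In one central header: define the compound condition once.
#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
#define USE_DEVICE 1
#endif

// A call site such as DeepPot::init then needs only one guard:
int gpu_num = -1;
#ifdef USE_DEVICE
DPGetDeviceCount(gpu_num);  // check the current device environment
#endif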

Comment on lines 11 to 12
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(cudaError_t code, const char *file, int line, bool abort=true)
Member:

Duplicated definition with the one defined in lib/gpu_cuda.h:

#define DPErrcheck(res) {DPAssert((res), __FILE__, __LINE__);}
inline void DPAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
  if (code != cudaSuccess) {
    fprintf(stderr,"cuda assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (code == 2) {
      // out of memory
      // TODO: I have no idea how to throw errors back to the Python interface
      fprintf(stderr, "Your memory is not enough, thus an error has been raised " \
          "above. You need to take the following actions:\n" \
          "1. Check if the network size of the model is too large.\n" \
          "2. Check if the batch size of training or testing is too large. " \
          "You can set the training batch size to `auto`.\n" \
          "3. Check if the number of atoms is too large.\n" \
          "4. Check if another program is using the same GPU by executing `nvidia-smi`. " \
          "The usage of GPUs is controlled by the `CUDA_VISIBLE_DEVICES` " \
          "environment variable.\n");
    }
    if (abort) exit(code);
  }
}

Contributor Author:

I have modified source/api_cc/src/DeepPot.cc accordingly.
[screenshot attachment]
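
For reference, a hedged sketch of the deduplication discussed in this thread: instead of keeping a local copy of DPErrcheck/DPAssert in api_cc, the source file includes the single definition from the lib headers. lib/gpu_cuda.h is named in the comment above; its ROCm counterpart and the exact include arrangement in the merged commit are assumptions here.

// DeepPot.cc (sketch): reuse the one DPErrcheck/DPAssert definition rather
// than redefining it locally.
#if GOOGLE_CUDA
#include "gpu_cuda.h"   // provides DPErrcheck / DPAssert for CUDA builds
#elif TENSORFLOW_USE_ROCM
#include "gpu_rocm.h"   // assumed ROCm counterpart providing the same names
#endif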


@galeselee requested a review from amcadmus on July 21, 2021 03:20
@amcadmus merged commit 4985932 into deepmodeling:devel on Jul 22, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request on Sep 2, 2021

Add Errcheck after every kernel function runs And merge redundant code (deepmodeling#855)

* Synchronize CUDA _r modifications to ROCM

* Fix bug 824 and synchronize updates to CUDA code

  Bug 824 was fixed in ROCM; it was caused by an array going out of bounds.

* Update prod_env_mat.hip.cu

* Add Errcheck after every kernel function runs And merge redundant code

* Get rid of duplicate definitions of DPErrcheck

Co-authored-by: 李泽宇 <li_zeyu@pku.edu.cn>