
Add Errcheck after every kernel function runs and merge redundant code #855

Merged: 7 commits merged into deepmodeling:devel on Jul 22, 2021

Conversation

galeselee (Contributor)

1. cudaErrcheck, hipErrcheck -> DPErrcheck (a sketch of the unified spelling follows this list)
2. hipGetDeviceCount, cudaGetDeviceCount -> DPGetDeviceCount
3. Encapsulate hipSetDevice and cudaSetDevice with DPSetDevice
4. convert_nlist_gpu_cuda, convert_nlist_gpu_rocm -> convert_nlist_gpu_device (also free_nlist_gpu_cuda and free_nlist_gpu_rocm)
5. Merge redundant code
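
For illustration, here is a minimal sketch of the device-neutral spelling these renames aim at, assuming a compile-time switch between the CUDA and ROCm runtimes. The PR itself keeps per-backend headers (such as lib/gpu_cuda.h) that define the same names, so this single-header dispatch, the DPError_t/DP_SUCCESS aliases, the *_impl helpers, and the wrapper signatures are illustrative, not the merged code.

// Sketch only: one device-neutral error check plus thin device helpers.
// DPError_t, DP_SUCCESS, and the *_impl helpers are made-up names for this
// example; only DPErrcheck, DPAssert, DPGetDeviceCount, and DPSetDevice
// correspond to names mentioned in the PR description.
#include <cstdio>
#include <cstdlib>

#if GOOGLE_CUDA
#include <cuda_runtime.h>
typedef cudaError_t DPError_t;
#define DP_SUCCESS cudaSuccess
inline const char *DPGetErrorString(DPError_t code) { return cudaGetErrorString(code); }
inline DPError_t getDeviceCount_impl(int *count) { return cudaGetDeviceCount(count); }
inline DPError_t setDevice_impl(int id) { return cudaSetDevice(id); }
#elif TENSORFLOW_USE_ROCM
#include <hip/hip_runtime.h>
typedef hipError_t DPError_t;
#define DP_SUCCESS hipSuccess
inline const char *DPGetErrorString(DPError_t code) { return hipGetErrorString(code); }
inline DPError_t getDeviceCount_impl(int *count) { return hipGetDeviceCount(count); }
inline DPError_t setDevice_impl(int id) { return hipSetDevice(id); }
#endif

// Unified error check: the same spelling works for both backends.
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(DPError_t code, const char *file, int line, bool abort = true) {
  if (code != DP_SUCCESS) {
    fprintf(stderr, "DP assert: %s %s %d\n", DPGetErrorString(code), file, line);
    if (abort) exit(code);
  }
}

// Wrappers with the error check built in (items 2 and 3 above).
inline void DPGetDeviceCount(int &gpu_num) { DPErrcheck(getDeviceCount_impl(&gpu_num)); }
inline void DPSetDevice(int device_id) { DPErrcheck(setDevice_impl(device_id)); }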

@codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #855 (7f2fe56) into devel (94635ba) will decrease coverage by 9.60%.
The diff coverage is n/a.

❗ Current head 7f2fe56 differs from pull request most recent head c0f57f6. Consider uploading reports for the commit c0f57f6 to get more accurate results

@@            Coverage Diff             @@
##            devel     #855      +/-   ##
==========================================
- Coverage   73.88%   64.28%   -9.61%     
==========================================
  Files          85        5      -80     
  Lines        6805       14    -6791     
==========================================
- Hits         5028        9    -5019     
+ Misses       1777        5    -1772     
Impacted Files Coverage Δ
source/lib/include/neighbor_list.h 100.00% <ø> (ø)
deepmd/utils/data.py
deepmd/train/run_options.py
deepmd/utils/argcheck.py
deepmd/loggers/loggers.py
deepmd/op/__init__.py
deepmd/utils/data_system.py
deepmd/env.py
deepmd/infer/data_modifier.py
deepmd/descriptor/se_a.py
... and 71 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 94635ba...c0f57f6.

@amcadmus requested review from denghuilu and iProzd on July 15, 2021 00:25
@galeselee (Contributor Author)

When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
[screenshot attachment]

@denghuilu (Member)

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
> [screenshot attachment]

There is no problem with api_cc/tests in the CUDA environment.

@galeselee (Contributor Author)

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
> [screenshot attachment]

I can't reproduce this problem on another machine. My preliminary conclusion is that it is a machine-specific problem.

@njzjz (Member) commented Jul 16, 2021

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
>
> [screenshot attachment]
>
> I can't reproduce this problem on another machine. My preliminary conclusion is that it is a machine-specific problem.

Maybe it's out of memory?

@galeselee (Contributor Author)

> When I repeat the api_cc tests, the program is sometimes killed. I will find the reason for this bug soon.
>
> [screenshot attachment]
>
> I can't reproduce this problem on another machine. My preliminary conclusion is that it is a machine-specific problem.
>
> Maybe it's out of memory?

But both machines have the same amount of memory, about 12 GB per control processor.

@galeselee requested a review from denghuilu on July 17, 2021 08:58
@@ -27,7 +28,8 @@ inline void cudaAssert(cudaError_t code, const char *file, int line, bool abort=
 }

 #define nborErrcheck(res) {nborAssert((res), __FILE__, __LINE__);}
-inline void nborAssert(cudaError_t code, const char *file, int line, bool abort=true) {
+inline void nborAssert(cudaError_t code, const char *file, int line, bool abort=true)
Collaborator:

Only 'nborErrcheck' differs now. Shall we output the same information as in 'DPErrcheck' when any kernel hits an error, or output specific information when a specific kernel hits an error (such as 'illegal nbor list sorting')?

Contributor Author:

I think we should use specific information for the more important kernels, but not for all kernels.
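
As an illustration of that direction, here is a hedged sketch of a kernel-specific check layered on the generic one: DPErrcheck stays the default everywhere, and only the neighbor-list kernels get their own message. The extra message wording is invented for this example and may not match the merged code.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Specific check for the neighbor-list kernels; other kernels keep DPErrcheck.
#define nborErrcheck(res) { nborAssert((res), __FILE__, __LINE__); }
inline void nborAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    // Same information as DPErrcheck would print ...
    fprintf(stderr, "cuda assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    // ... plus a kernel-specific hint (illustrative wording).
    fprintf(stderr, "neighbor list kernel failed, e.g. illegal nbor list sorting; "
                    "please check the input neighbor list.\n");
    if (abort) exit(code);
  }
}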

@galeselee (Contributor Author)

@denghuilu After adding Errcheck for every kernel, I used the water model for performance tests; the results show no loss in performance.
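
For readers skimming the diff, the per-kernel pattern being timed here usually looks like the sketch below: a check immediately after each launch. The kernel, its arguments, and the local DPErrcheck stand-in are placeholders; cudaGetLastError itself adds negligible overhead, and whether the merged code also synchronizes after every launch is not shown by this snippet.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal stand-in for the project's DPErrcheck macro, so the sketch is
// self-contained; the real macro lives in the lib headers touched by this PR.
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(cudaError_t code, const char *file, int line) {
  if (code != cudaSuccess) {
    fprintf(stderr, "cuda assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}

// Placeholder kernel standing in for any real deepmd-kit kernel.
__global__ void some_kernel(float *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * in[i];
}

void launch_with_checks(float *out, const float *in, int n) {
  const int block = 256;
  const int grid = (n + block - 1) / block;
  some_kernel<<<grid, block>>>(out, in, n);
  DPErrcheck(cudaGetLastError());        // catch launch/configuration errors
  DPErrcheck(cudaDeviceSynchronize());   // surface asynchronous execution errors
}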

@galeselee requested a review from iProzd on July 20, 2021 07:45
Comment on lines 29 to 30
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(hipError_t code, const char *file, int line, bool abort=true)
Member:

Duplicated implementation.

@galeselee requested a review from amcadmus on July 20, 2021 19:09
@@ -218,32 +191,18 @@ init (const std::string & model, const int & gpu_rank, const std::string & file_
 else
 graph_def.ParseFromString(file_content);
 int gpu_num = -1;
-#if GOOGLE_CUDA
-cudaGetDeviceCount(&gpu_num); // check current device environment
+#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
Member:

I feel that we could define GOOGLE_CUDA || TENSORFLOW_USE_ROCM in another place (for example USE_DEVICE), so when we add more devices, we do not need to modify these conditions.

Member:

> I feel that we could define GOOGLE_CUDA || TENSORFLOW_USE_ROCM in another place (for example USE_DEVICE), so when we add more devices, we do not need to modify these conditions.

This may not be a good idea, because we are going to support more accelerators (devices) whose names may not be "gpu".
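
For concreteness, a small sketch of the suggestion; USE_DEVICE is the reviewer's example name and not something this PR introduces, and DPGetDeviceCount is assumed here to take a reference to the count, as in the earlier sketch (the merged wrapper may differ).

// In one central header: define the compound condition once.
#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
#define USE_DEVICE 1
#endif

// A call site such as DeepPot::init then needs only one guard:
int gpu_num = -1;
#ifdef USE_DEVICE
DPGetDeviceCount(gpu_num);  // check the current device environment
#endif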

Comment on lines 11 to 12
#define DPErrcheck(res) { DPAssert((res), __FILE__, __LINE__); }
inline void DPAssert(cudaError_t code, const char *file, int line, bool abort=true)
Member:

Duplicated definition with the one defined in lib/gpu_cuda.h:

#define DPErrcheck(res) {DPAssert((res), __FILE__, __LINE__);}
inline void DPAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
  if (code != cudaSuccess) {
    fprintf(stderr,"cuda assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (code == 2) {
      // out of memory
      // TODO: I have no idea how to throw errors back to the Python interface
      fprintf(stderr, "Your memory is not enough, thus an error has been raised " \
          "above. You need to take the following actions:\n" \
          "1. Check if the network size of the model is too large.\n" \
          "2. Check if the batch size of training or testing is too large. " \
          "You can set the training batch size to `auto`.\n" \
          "3. Check if the number of atoms is too large.\n" \
          "4. Check if another program is using the same GPU by executing `nvidia-smi`. " \
          "The usage of GPUs is controlled by the `CUDA_VISIBLE_DEVICES` " \
          "environment variable.\n");
    }
    if (abort) exit(code);
  }
}

Contributor Author:

I have modified source/api_cc/src/DeepPot.cc accordingly.
[screenshot attachment]
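
For reference, a hedged sketch of the deduplication discussed in this thread: instead of keeping a local copy of DPErrcheck/DPAssert in api_cc, the source file includes the single definition from the lib headers. lib/gpu_cuda.h is named in the comment above; its ROCm counterpart and the exact include arrangement in the merged commit are assumptions here.

// DeepPot.cc (sketch): reuse the one DPErrcheck/DPAssert definition rather
// than redefining it locally.
#if GOOGLE_CUDA
#include "gpu_cuda.h"   // provides DPErrcheck / DPAssert for CUDA builds
#elif TENSORFLOW_USE_ROCM
#include "gpu_rocm.h"   // assumed ROCm counterpart providing the same names
#endif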


@galeselee requested a review from amcadmus on July 21, 2021 03:20
@amcadmus merged commit 4985932 into deepmodeling:devel on Jul 22, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request on Sep 2, 2021

Add Errcheck after every kernel function runs And merge redundant code (deepmodeling#855)

* Synchronize CUDA _r modifications to ROCM

* Fix bug 824 and synchronize updates to CUDA code

  Bug 824 was fixed in ROCM; it was caused by an array going out of bounds.

* Update prod_env_mat.hip.cu

* Add Errcheck after every kernel function runs And merge redundant code

* Get rid of duplicate definitions of DPErrcheck

Co-authored-by: 李泽宇 <li_zeyu@pku.edu.cn>