CUDA: multi GPUs issue #3450
@Brian0906 Thanks a lot for trying the experimental CUDA implementation! I observe the same error even with 1 GPU executing
Hi @StrikerRUS, what's the size of the train set in simple_example.py? I found that this issue happens only when the dataset is large.
The dataset in the example is very small: 7000x28.
Hi, @StrikerRUS and @Brian0906... is this the current issue for this problem? I'm one of the members of the team at IBM that ported this CUDA code, and I'm ready to try to reproduce this problem in my environment, if someone can teach me how, preferably with the simplest possible dataset. My plan would be to fix this problem with the simplest possible dataset and then see if that fixes it in the original environment.
Hello @austinpagan! Please refer to #3428 (comment) for the self-contained repro via Docker. Please let me know if you need any additional details.
Let me apologize, @StrikerRUS, if my questions are seen as somehow inappropriate, as I'm rather new to the open source environment... OK, so three things I'd like to understand, please:
@austinpagan No need to apologize! Let me try to be more precise and do my best to answer your questions.
No, this error can be reproduced w/ and w/o Docker. But I believe Docker is the easiest way to reproduce the error on your side as it ensures we are using the same environment.
We don't test Power systems, so we can only be 100% sure that x86 systems are affected.
This will run simple_example.py inside NVIDIA Docker and let you reproduce the error. Please feel free to ping me if something is still not clear to you or if you face any errors while preparing the repro.
So, since we're not conveniently set up with X86 boxes here, I decided to at least try to see if I could reproduce the problem on a Power system (since, after all, we did this exercise largely to allow folks on Power to access the GPUs, and did not contemplate that X86 folks would experiment with moving from OpenCL to direct CUDA). INSIDE my docker container on my power box, I just ran the sample and the output looked like this:
That message about "double precision calculations" is telling me we are using our code. Is this a good result, or is there an error here? I also wanted to try a raw run on a lightgbm repository completely outside of the Docker universe, so on a different Power box, I cloned the repository and did the following commands:
That all seemed to work, so I went into the directory with the program and ran it. It gave me the following fundamental error:
I naively went back to the LightGBM directory and tried "make install", but that was a non-starter. Not being a Python expert, I figured I'd stop here and report my status, so maybe you could give me some pointers...
@austinpagan Am I right that you got a successful run of the
What do you mean by "our code"? The CUDA implementation your team contributed to the LightGBM repository, or some internal code of yours from a fork?
Easy answer first: yes, I ran "simple_example.py" following your guide, but skipping both steps 0 and 1, because we already have some Power boxes with functional Docker containers which already contained relatively recent clones of LightGBM, so I just went into one of them and executed the "simple_example.py" program. So, again, if you could help us figure out how to get the not-inside-a-container version running, we can hope to see the error there, and I can work on it. Failing that, my backup suggestion COULD be that I provide you with a debug version of one source file from our LightGBM, and you compile that into your favorite local branch of LightGBM and see what interesting debug data it prints out. I could imagine this becoming an iterative process, and after a few iterations we can determine why it's not working in your environment.
Thanks for your prompt response!
Could you be more precise and tell me which commit your local LightGBM version was compiled from? You can check it by running
inside your local clone of the repo. Before taking any further steps we should agree on the version we will debug with, because by continuing with different versions of the source files we make the whole debug process pointless.
Fortunately for both of us, I'm a morning person. With the nine-hour time difference between Moscow and Austin, me being at my computer at 3PM your time will improve our productivity. To the extent that you can work a bit into your evening, that helps as well!
Now, if you want me to clone a fresh version of your choosing and try there, that will be fine, but you'll have to walk me through the process of building it to the point where my attempt to run the Python test doesn't fail as I indicated above on my other box. (My strengths are algorithms, debugging, and C coding, not building and installing.)
I hope it's OK that we're more used to doing our work inside the docker container rather than issuing commands to the container from outside...
No thanks, I believe that 5d79ff2 is a good candidate for debugging! Let's continue with this commit. Given that the simple code runs OK on a POWER machine but fails on many x86 ones, it is starting to look like the bug affects only the x86 architecture. However, it is quite strange because we are speaking about CUDA code executing on NVIDIA cards here... I think we can follow your suggestion
Let me compile LightGBM with the commit we agreed on and run the most verbose version of logs. Then I think you can suggest some debug code injections, and I'll recompile with them and get back with more info. I guess it will be the most efficient form of collaboration, given that we do not have easy access to POWER machines and you do not have easy access to x86 ones. Please let me know WDYT.
I am happy with this plan! I have a recommendation: if you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved... Also, I will just let you know that my plan would be to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file, but let's see what your log reports have to say.
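(A rough sketch of the kind of instrumentation described above, with illustrative names only, not the actual LightGBM code: log every pinned-host allocation request and its outcome so that a too-small or failing allocation shows up directly in the debug output. cudaHostAlloc with cudaHostAllocPortable is the call that appears in LightGBM's CUDA allocator; the wrapper itself is hypothetical.)

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Log the requested size and result of every pinned-host allocation.
// The wrapper and its name are purely illustrative.
void* AllocPinnedWithLog(std::size_t n_bytes) {
  void* ptr = nullptr;
  cudaError_t ret = cudaHostAlloc(&ptr, n_bytes, cudaHostAllocPortable);
  std::fprintf(stderr, "cudaHostAlloc(%zu bytes) -> %s\n",
               n_bytes, cudaGetErrorString(ret));
  return (ret == cudaSuccess) ? ptr : nullptr;
}
```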
Two more things.
OK, I have set up a fresh and minimal environment to start the debugging process.
What variable do you mean? I run a bash script inside Docker. It's common practice to ask Docker to run something; it can't be the problem. More proof comes from other reports of the same error: I believe the users who reported them use quite different scripts and maybe do not use Docker at all, and they certainly do not use any of the variables that I use.
Yeah, that's why I asked you to set up a clean Docker environment. I was suspecting that you have some other version of LightGBM that works fine on your side, and now I'm quite confident about that. The thing is that the commit you told me your version of LightGBM was compiled from simply cannot be compiled. CMake reports the following error.
This happens due to the following recent changes in the LightGBM codebase: fcfd413 (but those changes came before the commit we agreed on). However, I went ahead and fixed the error which prevented the library from compiling. These fixes allowed me to successfully compile the library with the commit you mentioned (5d79ff2). Then I specified
So I will really appreciate your suggestions for
Speaking about how to re-compile and reinstall LightGBM, it is quite simple. Commands to compile the dynamic library:
LightGBM/.github/workflows/cuda.yml, lines 76 to 80 in 5eee55c
Command to install the Python package with the just-compiled library:
LightGBM/.github/workflows/cuda.yml, line 81 in 5eee55c
Here is the full script that is used to install and set up Docker, clone the repository, install CMake, Python, and so on:
Thanks! I set up the same Python version (
Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in src/treelearner/cuda_tree_learner.cpp...
Just for "synchronization", here's the checksum of my cuda_tree_learner.cpp before I add debug to it:
Thank you very much!
Have you applied those two fixes?
This may or may not end up being a "fix" if it helps, but it's useful information to have, and it's an easy change. Please replace line 414 of src/treelearner/cuda_tree_learner.cpp with a different line, as follows. Current line:
Suggested new line:
Here is what I'm getting after the patch:
Sorry, I don't know how to "apply" a fix.
Oh, never mind. I see now. Give me a couple of minutes.
Still claiming the problem is in line 414, right?
"this code" = code with these fixes #3450 (comment)? Maybe you don't have all source files? Could you please try to re-clone the repo and only after that apply a fix?
Yes, absolutely right.
I guess so. At least the error comes from line 414... |
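(For readers following along: the error is attributed to a specific source line because CUDA calls in this code are wrapped in an error-checking macro, CUDASUCCESS_OR_FATAL, visible in the diffs later in this thread. The sketch below is a guess at the general shape of such a macro, not LightGBM's actual definition; it shows why the report points at the call site rather than at the root cause.)

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __FILE__ and __LINE__ expand where the macro is used, so a failing
// cudaMemcpyAsync wrapped at line 414 of cuda_tree_learner.cpp is reported
// "at" that line even if the real problem is an earlier, too-small allocation.
#define CUDA_CHECK_OR_DIE(call)                                        \
  do {                                                                 \
    cudaError_t status_ = (call);                                      \
    if (status_ != cudaSuccess) {                                      \
      std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                   cudaGetErrorString(status_), __FILE__, __LINE__);   \
      std::abort();                                                    \
    }                                                                  \
  } while (0)
```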
So when I try to build, it's trying to get files from the "external_libs" directory, but in my clone that directory just contains two empty sub-directories... any idea whether I'm missing some piece of the build that populates those directories? It looks like there's a "setup.py" file that mentions this directory, but I don't know who is supposed to execute that setup command...
We are investigating, but I figured it wouldn't hurt to ask you if you just know the answer off the top of your head...
Please make sure you don't forget
I've tried and can confirm that we can reproduce the error with a simple command-line program. I simplified the reproducible example so that it no longer requires a Python installation. I believe it will help to sync our environments. Fortunately, the error is still the same, but we no longer need the Python layer as a proxy: now we run the simple regression example from the repository directly via the CLI version of LightGBM, whereas previously we ran it via our Python package. Please take a look at the greatly simplified script (no Python, no env. variables) we run inside Docker to reproduce the error:
LightGBM/.github/workflows/cuda.yml, lines 43 to 62 in bcc3f29
This script
And here are more verbose logs from the run after applying your proposed change in line 414 of
Hope they will help somehow. Please let me know how I can modify the source code of the CUDA tree learner further to get useful info that will help narrow down the problem.
So, sorry for the delay in response. My colleague seems to be close to figuring out how we can reproduce this problem on Power systems. You can rest easy for now, because if he is successful, we can handle it from here on out...
Oh, great news! Thank you very much!
And, it is confirmed. On my Power system, I now get this:
(base) [root@58814263a195 python-guide]# python simple_example.py
Traceback (most recent call last):
(base) [root@58814263a195 python-guide]#
So, again, I can pursue this now, without pestering you. Wish me luck!
Hah, in any other situation people shouldn't be happy when someone else gets errors from software, but right now I'm happy! 😄 Again, if you are not comfortable using Python, please check this message of mine above where I show how to reproduce the same error with LightGBM's executable binary from the command-line interface. Feel free to ask for any details if something is not clear.
@StrikerRUS The problem is that the non-CUDA vector allocators were changed to use kAlignedSize with VirtualFileWriter::AlignedSize between 3.0 and 3.1. Therefore the CUDA vector allocator wasn't allocating enough space in some instances. Here is a proposed change to fix the CUDA vector allocator. simple_example.py and advanced_example.py work with this change.
diff --git a/include/LightGBM/cuda/vector_cudahost.h b/include/LightGBM/cuda/vector_cudahost.h
index 03db338..46698d0 100644
--- a/include/LightGBM/cuda/vector_cudahost.h
+++ b/include/LightGBM/cuda/vector_cudahost.h
@@ -42,6 +42,7 @@ struct CHAllocator {
T* allocate(std::size_t n) {
T* ptr;
if (n == 0) return NULL;
+ n = (n + kAlignedSize - 1) & -kAlignedSize;
#ifdef USE_CUDA
if (LGBM_config_::current_device == lgbm_device_cuda) {
cudaError_t ret = cudaHostAlloc(&ptr, n*sizeof(T), cudaHostAllocPortable);
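(A minimal, self-contained sketch of what the added line computes; kAlignedSize = 8 is used here purely for illustration, the real constant is defined elsewhere in LightGBM. The idea is to round the element count up to the next multiple of a power-of-two alignment so the CUDA allocator reserves at least as much space as the aligned non-CUDA allocators expect.)

```cpp
#include <cstddef>
#include <iostream>

// (n + alignment - 1) & ~(alignment - 1) rounds n up to the next multiple of
// a power-of-two alignment; "& -kAlignedSize" in the patch is the same mask
// written via two's complement.
constexpr std::size_t AlignUp(std::size_t n, std::size_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

int main() {
  const std::size_t kAlignedSize = 8;  // illustrative value only
  for (std::size_t n : {1, 7, 8, 9, 13}) {
    std::cout << n << " -> " << AlignUp(n, kAlignedSize) << "\n";
  }
  // Prints: 1 -> 8, 7 -> 8, 8 -> 8, 9 -> 16, 13 -> 16
  return 0;
}
```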
@austinpagan @ChipKerchner Awesome! I can confirm that this fix helps to get rid of the errors on x86 machines as well. Many thanks for the research you've done and for providing the fix! Would you like to contribute this fix from your account so that GitHub will associate the fixing commit with you? Or, if it's not very important to you, would you prefer to let someone from the LightGBM maintainers do it to save your time?
Fixed via #3748.
@StrikerRUS This should fix the remaining CUDA failures. Let me know if you see any issues.
diff --git a/src/treelearner/cuda_tree_learner.cpp b/src/treelearner/cuda_tree_learner.cpp
index 16569ee..4495578 100644
--- a/src/treelearner/cuda_tree_learner.cpp
+++ b/src/treelearner/cuda_tree_learner.cpp
@@ -408,7 +408,7 @@ void CUDATreeLearner::copyDenseFeature() {
// looking for dword_features_ non-sparse feature-groups
if (!train_data_->IsMultiGroup(i)) {
dense_feature_group_map_.push_back(i);
- auto sizes_in_byte = train_data_->FeatureGroupSizesInByte(i);
+ auto sizes_in_byte = std::min(train_data_->FeatureGroupSizesInByte(i), static_cast<size_t>(num_data_));
void* tmp_data = train_data_->FeatureGroupData(i);
Log::Debug("Started copying dense features from CPU to GPU - 2");
CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cu
@@ -534,8 +534,8 @@ void CUDATreeLearner::InitGPU(int num_gpu) {
copyDenseFeature();
}
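(A small sketch of the idea behind that std::min guard; the names here are illustrative, not LightGBM's internals. The destination appears to be laid out with a fixed stride of num_data_ bytes per dense feature group, while the host-side size can now be padded up by the alignment fix above, so the copy length has to be clamped to the stride or the async copy would spill into the next group's slot.)

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy each group's bytes into a flat buffer with a fixed per-group stride,
// clamping the copy length so a padded source size cannot overrun the slot.
// Purely an illustration of the guard above, not the real copyDenseFeature().
void CopyGroups(const std::vector<std::vector<unsigned char>>& groups,
                std::size_t num_data,
                std::vector<unsigned char>* flat_buffer) {
  flat_buffer->assign(groups.size() * num_data, 0);
  for (std::size_t i = 0; i < groups.size(); ++i) {
    const std::size_t bytes = std::min(groups[i].size(), num_data);
    std::memcpy(flat_buffer->data() + i * num_data, groups[i].data(), bytes);
  }
}
```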
@ChipKerchner After applying this fix, all but two tests pass! Very nice indeed! The failures in the two remaining plotting tests are not related to the CUDA implementation. I believe it is the same
In my branch, test_plotting passes all tests.
Woo! Thanks @ChipKerchner. Like @StrikerRUS mentioned, I think it's very, very unlikely that the two failing plotting tests are related to your changes. I found in #3672 (comment) that there might be some issues with the conda-forge recipe for
Yeah, thanks for the info about the tests, @ChipKerchner! I'm 100% sure that the 2 failing plotting tests on our side are related to our environment, and I'll fix that environment issue while working on making the CUDA builds run on a regular basis.
@StrikerRUS: Look at you, making all our dreams come true!!! Thank you!
@austinpagan Thanks a lot for all your hard work!
I'm trying to use multiple GPUs to train the model. When I increase the amount of data, this issue happens.
Everything goes well if the size of the train set is less than 10000.
Operating System: Linux
CPU/GPU model: GPU