CUDA: multi GPUs issue #3450
@Brian0906 Thanks a lot for trying the experimental CUDA implementation! I observe the same error even with 1 GPU executing
Hi @StrikerRUS, what's the size of the train set in simple_example.py? I found that this issue happens only when the dataset is large.
The dataset in the example is very small: 7000x28.
Hi, @StrikerRUS and @Brian0906... is this the current issue for this problem? I'm one of the members of the team at IBM that ported this CUDA code, and I'm ready to try to reproduce this problem in my environment, if someone can teach me how, preferably with the simplest possible dataset. My plan would be to fix this problem with the simplest possible dataset and then see if that fixes it in the original environment.
Hello @austinpagan! Please refer to #3428 (comment) for the self-contained repro via Docker. Please let me know if you need any additional details.
Let me apologize, @StrikerRUS, if my questions are seen as somehow inappropriate, as I'm rather new to the open source environment... OK, so three things I'd like to understand, please:
@austinpagan No need to apologize! Let me try to be more precise and do my best to answer your questions.
No, this error can be reproduced w/ and w/o Docker. But I believe Docker is the easiest way to reproduce the error on your side as it ensures we are using the same environment.
We don't test Power systems, so we can only be 100% sure that x86 systems are affected.
This will run simple_example.py inside NVIDIA Docker and let you reproduce the error. Please feel free to ping me if something is still not clear to you or if you face any errors while preparing the repro.
So, since we're not conveniently set up with X86 boxes here, I decided to at least try to see if I could reproduce the problem on a Power system (since, after all, we did this exercise largely to allow folks on Power to access the GPUs, and did not contemplate that X86 folks would experiment with moving from OpenCL to direct CUDA). INSIDE my docker container on my power box, I just ran the sample and the output looked like this:
That message about "double precision calculations" is telling me we are using our code. Is this a good result, or is there an error here? I also wanted to try a raw run on a lightgbm repository completely outside of the Docker universe, so on a different Power box, I cloned the repository and did the following commands:
That all seemed to work, so I went into the directory with the program and ran it. It gave me the following fundamental error:
I naively went back to the LightGBM directory and tried "make install", but that was a non-starter. Not being a Python expert, I figured I'd stop here and report my status, so maybe you could give me some pointers...
@austinpagan Am I right that you got a successful run of the
What do you mean by "our code"? The CUDA implementation your team contributed to the LightGBM repository, or some internal code of yours from a fork?
Easy answer first: yes, I ran "simple_example.py" following your guide, but skipping both steps 0 and 1, because we already have some Power boxes with functional Docker containers which already contained relatively recent clones of LightGBM, so I just went into one of them and executed the "simple_example.py" program. So, again, if you could help us figure out how to get the not-inside-a-container version running, we can hope to see the error there, and I can work on it. Failing that, my backup suggestion COULD be that I provide you with a debug version of one source file from our LightGBM, and you compile that into your favorite local branch of LightGBM and see what interesting debug data it prints out. I could imagine this becoming an iterative process, and after a few iterations we can determine why it's not working in your environment.
Thanks for your prompt response!
Could you be more precise and tell me which commit your local LightGBM version was compiled from? You can check it by running
inside your local clone of the repo. Before taking any further steps we should agree on the version we will debug with, because by continuing with different versions of the source files we make the whole debug process pointless.
Fortunately for both of us, I'm a morning person. With the nine-hour time difference between Moscow and Austin, me being at my computer at 3PM your time will improve our productivity. To the extent that you can work a bit into your evening, that helps as well!
Now, if you want me to clone a fresh version of your choosing and try there, that will be fine, but you'll have to walk me through the process of building it to the point where my attempt to run the Python test doesn't fail as I indicated above on my other box. (My strengths are algorithms, debugging, and C coding, not building and installing.)
I hope it's OK that we're more used to doing our work inside the docker container rather than issuing commands to the container from outside...
No thanks, I believe that 5d79ff2 is a good candidate for debugging! Let's continue with this commit. Given that the simple code runs OK on a POWER machine but fails on many x86 ones, it is starting to look like the bug affects only the x86 architecture. However, it is quite strange because we are speaking about CUDA code executing on NVIDIA cards here... I think we can follow your suggestion
Let me compile LightGBM with the commit we agreed on and run the most verbose version of logs. Then I think you can suggest some debug code injections, and I'll recompile with them and get back with more info. I guess it will be the most efficient form of collaboration, given that we do not have easy access to POWER machines and you do not have easy access to x86 ones. Please let me know WDYT.
I am happy with this plan! I have a recommendation: if you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved... Also, I will just let you know that my plan would be to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file, but let's see what your log reports have to say.
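(A rough sketch of the kind of instrumentation described above, with illustrative names only, not the actual LightGBM code: log every pinned-host allocation request and its outcome so that a too-small or failing allocation shows up directly in the debug output. cudaHostAlloc with cudaHostAllocPortable is the call that appears in LightGBM's CUDA allocator; the wrapper itself is hypothetical.)

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Log the requested size and result of every pinned-host allocation.
// The wrapper and its name are purely illustrative.
void* AllocPinnedWithLog(std::size_t n_bytes) {
  void* ptr = nullptr;
  cudaError_t ret = cudaHostAlloc(&ptr, n_bytes, cudaHostAllocPortable);
  std::fprintf(stderr, "cudaHostAlloc(%zu bytes) -> %s\n",
               n_bytes, cudaGetErrorString(ret));
  return (ret == cudaSuccess) ? ptr : nullptr;
}
```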
Two more things.
OK, I have set up a fresh and minimal environment to start the debugging process.
What variable do you mean? I run a bash script inside Docker. It's common practice to ask Docker to run something; it can't be the problem. More proof comes from other reports of the same error: I believe the users who reported them use quite different scripts and maybe do not use Docker at all, and they certainly do not use any of the variables that I use.
Yeah, that's why I asked you to set up a clean Docker environment. I was suspecting that you have some other version of LightGBM that works fine on your side, and now I'm quite confident about that. The thing is that the commit you told me your version of LightGBM was compiled from simply cannot be compiled. CMake reports the following error.
This happens due to the following recent changes in the LightGBM codebase: fcfd413 (but those changes came before the commit we agreed on). However, I went ahead and fixed the error which prevented the library from compiling. These fixes allowed me to successfully compile the library with the commit you mentioned (5d79ff2). Then I specified
So I will really appreciate your suggestions for
Speaking about how to re-compile and reinstall LightGBM, it is quite simple. Commands to compile the dynamic library:
LightGBM/.github/workflows/cuda.yml, lines 76 to 80 in 5eee55c
Command to install the Python package with the just-compiled library:
LightGBM/.github/workflows/cuda.yml, line 81 in 5eee55c
Here is the full script that is used to install and set up Docker, clone the repository, install CMake, Python, and so on:
Thanks! I set up the same Python version (
Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in src/treelearner/cuda_tree_learner.cpp...
Just for "synchronization", here's the checksum of my cuda_tree_learner.cpp before I add debug to it:
Thank you very much!
Have you applied those two fixes?
This may or may not end up being a "fix" if it helps, but it's useful information to have, and it's an easy change. Please replace line 414 of src/treelearner/cuda_tree_learner.cpp with a different line, as follows. Current line:
Suggested new line:
Here is what I'm getting after the patch:
Sorry, I don't know how to "apply" a fix.
Oh, never mind. I see now. Give me a couple of minutes.
Still claiming the problem is in line 414, right?
"this code" = code with these fixes #3450 (comment)? Maybe you don't have all source files? Could you please try to re-clone the repo and only after that apply a fix?
Yes, absolutely right.
I guess so. At least the error comes from line 414... |
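(For readers following along: the error is attributed to a specific source line because CUDA calls in this code are wrapped in an error-checking macro, CUDASUCCESS_OR_FATAL, visible in the diffs later in this thread. The sketch below is a guess at the general shape of such a macro, not LightGBM's actual definition; it shows why the report points at the call site rather than at the root cause.)

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __FILE__ and __LINE__ expand where the macro is used, so a failing
// cudaMemcpyAsync wrapped at line 414 of cuda_tree_learner.cpp is reported
// "at" that line even if the real problem is an earlier, too-small allocation.
#define CUDA_CHECK_OR_DIE(call)                                        \
  do {                                                                 \
    cudaError_t status_ = (call);                                      \
    if (status_ != cudaSuccess) {                                      \
      std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                   cudaGetErrorString(status_), __FILE__, __LINE__);   \
      std::abort();                                                    \
    }                                                                  \
  } while (0)
```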
So when I try to build, it's trying to get files from the "external_libs" directory, but in my clone that directory just contains two empty sub-directories... any idea whether I'm missing some piece of the build that populates those directories? It looks like there's a "setup.py" file that mentions this directory, but I don't know who is supposed to execute that setup command...
We are investigating, but I figured it wouldn't hurt to ask you if you just know the answer off the top of your head...
Please make sure you don't forget
I've tried and can confirm that we can reproduce the error with a simple command-line program. I simplified the reproducible example so that it no longer requires a Python installation. I believe it will help to sync our environments. Fortunately, the error is still the same, but we no longer need the Python layer as a proxy: now we run the simple regression example from the repository directly via the CLI version of LightGBM, whereas previously we ran it via our Python package. Please take a look at the greatly simplified script (no Python, no env. variables) we run inside Docker to reproduce the error:
LightGBM/.github/workflows/cuda.yml, lines 43 to 62 in bcc3f29
This script
And here are more verbose logs from the run after applying your proposed change in line 414 of
Hope they will help somehow. Please let me know how I can modify the source code of the CUDA tree learner further to get useful info that will help narrow down the problem.
So, sorry for the delay in response. My colleague seems to be close to figuring out how we can reproduce this problem on Power systems. You can rest easy for now, because if he is successful, we can handle it from here on out...
Oh, great news! Thank you very much!
And, it is confirmed. On my Power system, I now get this:
(base) [root@58814263a195 python-guide]# python simple_example.py
Traceback (most recent call last):
(base) [root@58814263a195 python-guide]#
So, again, I can pursue this now, without pestering you. Wish me luck!
Hah, in any other situation people shouldn't be happy when someone else gets errors from software, but right now I'm happy! 😄 Again, if you are not comfortable using Python, please check this message of mine above where I show how to reproduce the same error with LightGBM's executable binary from the command-line interface. Feel free to ask for any details if something is not clear.
@StrikerRUS The problem is that the non-CUDA vector allocators were changed to use kAlignedSize with VirtualFileWriter::AlignedSize between 3.0 and 3.1. Therefore the CUDA vector allocator wasn't allocating enough space in some instances. Here is a proposed change to fix the CUDA vector allocator. simple_example.py and advanced_example.py work with this change.
diff --git a/include/LightGBM/cuda/vector_cudahost.h b/include/LightGBM/cuda/vector_cudahost.h
index 03db338..46698d0 100644
--- a/include/LightGBM/cuda/vector_cudahost.h
+++ b/include/LightGBM/cuda/vector_cudahost.h
@@ -42,6 +42,7 @@ struct CHAllocator {
T* allocate(std::size_t n) {
T* ptr;
if (n == 0) return NULL;
+ n = (n + kAlignedSize - 1) & -kAlignedSize;
#ifdef USE_CUDA
if (LGBM_config_::current_device == lgbm_device_cuda) {
cudaError_t ret = cudaHostAlloc(&ptr, n*sizeof(T), cudaHostAllocPortable);
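(A minimal, self-contained sketch of what the added line computes; kAlignedSize = 8 is used here purely for illustration, the real constant is defined elsewhere in LightGBM. The idea is to round the element count up to the next multiple of a power-of-two alignment so the CUDA allocator reserves at least as much space as the aligned non-CUDA allocators expect.)

```cpp
#include <cstddef>
#include <iostream>

// (n + alignment - 1) & ~(alignment - 1) rounds n up to the next multiple of
// a power-of-two alignment; "& -kAlignedSize" in the patch is the same mask
// written via two's complement.
constexpr std::size_t AlignUp(std::size_t n, std::size_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

int main() {
  const std::size_t kAlignedSize = 8;  // illustrative value only
  for (std::size_t n : {1, 7, 8, 9, 13}) {
    std::cout << n << " -> " << AlignUp(n, kAlignedSize) << "\n";
  }
  // Prints: 1 -> 8, 7 -> 8, 8 -> 8, 9 -> 16, 13 -> 16
  return 0;
}
```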
@austinpagan @ChipKerchner Awesome! I can confirm that this fix helps to get rid of the errors on x86 machines as well. Many thanks for the research you've done and for providing the fix! Would you like to contribute this fix from your account so that GitHub will associate the fixing commit with you? Or, if it's not very important to you, would you prefer to let someone from the LightGBM maintainers do it to save your time?
Fixed via #3748.
@StrikerRUS This should fix the remaining CUDA failures. Let me know if you see any issues.
diff --git a/src/treelearner/cuda_tree_learner.cpp b/src/treelearner/cuda_tree_learner.cpp
index 16569ee..4495578 100644
--- a/src/treelearner/cuda_tree_learner.cpp
+++ b/src/treelearner/cuda_tree_learner.cpp
@@ -408,7 +408,7 @@ void CUDATreeLearner::copyDenseFeature() {
// looking for dword_features_ non-sparse feature-groups
if (!train_data_->IsMultiGroup(i)) {
dense_feature_group_map_.push_back(i);
- auto sizes_in_byte = train_data_->FeatureGroupSizesInByte(i);
+ auto sizes_in_byte = std::min(train_data_->FeatureGroupSizesInByte(i), static_cast<size_t>(num_data_));
void* tmp_data = train_data_->FeatureGroupData(i);
Log::Debug("Started copying dense features from CPU to GPU - 2");
CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cu
@@ -534,8 +534,8 @@ void CUDATreeLearner::InitGPU(int num_gpu) {
copyDenseFeature();
}
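(A small sketch of the idea behind that std::min guard; the names here are illustrative, not LightGBM's internals. The destination appears to be laid out with a fixed stride of num_data_ bytes per dense feature group, while the host-side size can now be padded up by the alignment fix above, so the copy length has to be clamped to the stride or the async copy would spill into the next group's slot.)

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy each group's bytes into a flat buffer with a fixed per-group stride,
// clamping the copy length so a padded source size cannot overrun the slot.
// Purely an illustration of the guard above, not the real copyDenseFeature().
void CopyGroups(const std::vector<std::vector<unsigned char>>& groups,
                std::size_t num_data,
                std::vector<unsigned char>* flat_buffer) {
  flat_buffer->assign(groups.size() * num_data, 0);
  for (std::size_t i = 0; i < groups.size(); ++i) {
    const std::size_t bytes = std::min(groups[i].size(), num_data);
    std::memcpy(flat_buffer->data() + i * num_data, groups[i].data(), bytes);
  }
}
```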
@ChipKerchner After applying this fix, all but two tests pass! Very nice indeed! The failures in the two remaining plotting tests are not related to the CUDA implementation. I believe it is the same
In my branch, test_plotting passes all tests.
Woo! Thanks @ChipKerchner. Like @StrikerRUS mentioned, I think it's very, very unlikely that the two failing plotting tests are related to your changes. I found in #3672 (comment) that there might be some issues with the conda-forge recipe for
Yeah, thanks for the info about the tests, @ChipKerchner! I'm 100% sure that the 2 failing plotting tests on our side are related to our environment, and I'll fix that environment issue while working on making the CUDA builds run on a regular basis.
@StrikerRUS: Look at you, making all our dreams come true!!! Thank you!
@austinpagan Thanks a lot for all your hard work!
I'm trying to use multiple GPUs to train the model. When I increase the amount of data, this issue happens.
Everything goes well if the size of the train set is less than 10000.
Operating System: Linux
CPU/GPU model: GPU