CUDA: multi GPUs issue #3450

Closed
Brian0906 opened this issue Oct 10, 2020 · 51 comments

Comments

@Brian0906

Brian0906 commented Oct 10, 2020

I'm trying to train the model on multiple GPUs. When I increase the amount of training data, this issue happens.

Everything works if the size of the training set is less than 10000.

Operating System: Linux

CPU/GPU model: GPU

image

@Brian0906 Brian0906 changed the title CUDA CUDA: multi GPUs Oct 10, 2020
@Brian0906 Brian0906 changed the title CUDA: multi GPUs CUDA: multi GPUs issue Oct 10, 2020
@StrikerRUS
Collaborator

@Brian0906 Thanks a lot for using the experimental CUDA implementation! I observe the same error even with 1 GPU when executing simple_example.py: #3424 (comment).

@Brian0906
Author

Hi @StrikerRUS, what's the size of the training set in simple_example.py? I found that this issue happens only when the dataset is large.

@StrikerRUS
Collaborator

This was referenced Oct 22, 2020
@austinpagan
Contributor

austinpagan commented Jan 4, 2021

Hi, @StrikerRUS and @Brian0906 ... is this the current issue for this problem? I'm one of the members of the team at IBM that ported this CUDA code, and I'm ready to try to reproduce this problem in my environment, if someone can teach me how, preferably with the most simple possible dataset.

My plan would be to fix this problem with the simplest possible dataset and then see if that fixes it in the original environment.

@StrikerRUS
Collaborator

Hello @austinpagan !

Please refer to #3428 (comment) for the self-contained repro via Docker. Please let me know if you need any additional details.

@austinpagan
Contributor

austinpagan commented Jan 5, 2021

Let me apologize, @StrikerRUS, if my questions are seen as somehow inappropriate, as I'm rather new to the open source environment...

OK, so three things I'd like to understand, please:
(1) I could not find, in your link, any guidance that I could understand, can you point me more specifically to the recreate scenario?
(2) Is this problem only seen within Docker containers?
(3) I cannot see in this documentation whether you found this problem on a Power system or on an X86 system. I know it's only happening when you run on CUDA, but we'd still like to understand the environment.

@StrikerRUS
Collaborator

StrikerRUS commented Jan 5, 2021

@austinpagan No need to apologize! Let me try to be more precise and do my best to answer your questions.

(2) Is this problem only seen within Docker containers?

No, this error can be reproduced w/ and w/o Docker. But I believe Docker is the easiest way to reproduce the error on your side as it ensures we are using the same environment.

(3) I cannot see in this documentation whether you found this problem on a Power system or on an X86 system.

We don't test on Power systems, so we can only be 100% sure that x86 systems are affected.

(1) I could not find, in your link, any guidance that I could understand, can you point me more specifically to the recreate scenario?

  0. Get a machine (preferably x86, because we cannot guarantee that the bug reproduces on Power machines) with an NVIDIA GPU (we've tested with Tesla M60 and Tesla P100, but I don't think it matters).
  1. Install Docker and NVIDIA Docker (nvidia-docker2) on your machine. https://docs.docker.com/engine/install/ubuntu/ and https://github.com/NVIDIA/nvidia-docker#getting-started can help with this.
  2. Run the following command in your console to get the latest sources of LightGBM:
    git clone --recursive https://github.com/microsoft/LightGBM
    
  3. Set an environment variable named GITHUB_WORKSPACE to the path where you downloaded the LightGBM repository at step #2. It will be something like export GITHUB_WORKSPACE=/home/yourUserName/Documents/LightGBM.
  4. Run the following commands in your console:
    export ROOT_DOCKER_FOLDER=/LightGBM
    cat > docker.env <<EOF
    TASK=cuda
    COMPILER=gcc
    GITHUB_ACTIONS=true
    OS_NAME=linux
    BUILD_DIRECTORY=$ROOT_DOCKER_FOLDER
    CONDA_ENV=test-env
    PYTHON_VERSION=3.8
    EOF
    cat > docker-script.sh <<EOF
    export CONDA=\$HOME/miniconda
    export PATH=\$CONDA/bin:\$PATH
    nvidia-smi
    $ROOT_DOCKER_FOLDER/.ci/setup.sh || exit -1
    $ROOT_DOCKER_FOLDER/.ci/test.sh
    source activate \$CONDA_ENV
    cd \$BUILD_DIRECTORY/examples/python-guide/
    python simple_example.py
    EOF
    sudo docker run --env-file docker.env -v "$GITHUB_WORKSPACE":"$ROOT_DOCKER_FOLDER" --rm --gpus all nvidia/cuda:11.0-devel-ubuntu20.04 /bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh
    

This will run simple_example.py inside NVIDIA Docker and let you reproduce the error.

Please feel free to ping me if something is still unclear or if you hit any errors while preparing the repro.

@austinpagan
Contributor

austinpagan commented Jan 5, 2021

So, since we're not conveniently set up with X86 boxes here, I decided to at least try to see if I could reproduce the problem on a Power system (since, after all, we did this exercise largely to allow folks on Power to access the GPUs, and did not contemplate that X86 folks would experiment with moving from OpenCL to direct CUDA).

INSIDE my docker container on my power box, I just ran the sample and the output looked like this:

(base) [root@58814263a195 python-guide]# python simple_example.py
Loading data...
Starting training...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[1]	valid_0's l2: 0.244076	valid_0's l1: 0.493018
Training until validation scores don't improve for 5 rounds
[2]	valid_0's l2: 0.240297	valid_0's l1: 0.489056
[3]	valid_0's l2: 0.235733	valid_0's l1: 0.484089
[4]	valid_0's l2: 0.231352	valid_0's l1: 0.479088
[5]	valid_0's l2: 0.228939	valid_0's l1: 0.476159
[6]	valid_0's l2: 0.22593	valid_0's l1: 0.472664
[7]	valid_0's l2: 0.222515	valid_0's l1: 0.468425
[8]	valid_0's l2: 0.219569	valid_0's l1: 0.464594
[9]	valid_0's l2: 0.2168	valid_0's l1: 0.460795
[10]	valid_0's l2: 0.214371	valid_0's l1: 0.457276
[11]	valid_0's l2: 0.211988	valid_0's l1: 0.453923
[12]	valid_0's l2: 0.210264	valid_0's l1: 0.451235
[13]	valid_0's l2: 0.208926	valid_0's l1: 0.448992
[14]	valid_0's l2: 0.207403	valid_0's l1: 0.44634
[15]	valid_0's l2: 0.20601	valid_0's l1: 0.444016
[16]	valid_0's l2: 0.204447	valid_0's l1: 0.441362
[17]	valid_0's l2: 0.202712	valid_0's l1: 0.43891
[18]	valid_0's l2: 0.201066	valid_0's l1: 0.436192
[19]	valid_0's l2: 0.1998	valid_0's l1: 0.433884
[20]	valid_0's l2: 0.198063	valid_0's l1: 0.431129
Did not meet early stopping. Best iteration is:
[20]	valid_0's l2: 0.198063	valid_0's l1: 0.431129
Saving model...
Starting predicting...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
The rmse of prediction is: 0.4450426449744025
(base) [root@58814263a195 python-guide]# 

That message about "double precision calculations" is telling me we are using our code. Is this a good result, or is there an error here?

I also wanted to try a raw run on a lightgbm repository completely outside of the Docker universe, so on a different Power box, I cloned the repository and did the following commands:

cd LightGBM
mkdir build ; cd build
cmake ..
make -j4

That all seemed to work, so I went into the directory with the program and ran it. It gave me the following fundamental error:

[fossum@rain6p1 python-guide]$ pwd
/home/fossum/LightGBM/examples/python-guide
[fossum@rain6p1 python-guide]$ python3.8 simple_example.py
Traceback (most recent call last):
  File "simple_example.py", line 2, in <module>
    import lightgbm as lgb
ModuleNotFoundError: No module named 'lightgbm'
[fossum@rain6p1 python-guide]$ 

I naively went back to the LightGBM and tried "make install" but that was a non-starter.

Not being a python expert, I figured I'd stop here and report my status, so maybe you could give me some pointers...

@StrikerRUS
Collaborator

@austinpagan Am I right that you got a successful run of the simple_example.py script by following my guide from #3450 (comment) but without step #0?

That message about "double precision calculations" is telling me we are using our code.

What do you mean by "our code"? The CUDA implementation your team contributed to the LightGBM repository, or some internal code from a fork?

@austinpagan
Contributor

austinpagan commented Jan 6, 2021

Easy answer first:
"our code" means CUDA implementation our team contributed to LightGBM repository.
These warnings are only printed out when you run the code requesting the "cuda" device (as opposed to the OpenCL "gpu" device).

Yes, I ran "simple_example.py" following your guide, but skipping both steps 0 and 1, because we already have some Power boxes with functional docker containers, which already contained relatively recent clones of LightGBM, so I just went into one of them, and executed the "simple_example.py" program.

So, again, if you could help us figure out how to get the not-inside-a-container version running, we can hope to see the error there, and I can work on it.

Failing that, my backup suggestion COULD be that I could provide you with a debug version of one source file from our LightGBM, and you could compile that into your favorite local branch of LightGBM, and see what interesting debug data it prints out. I could imagine this becoming an iterative process, and after a few iterations, we can determine why it's not working in your environment.

@StrikerRUS
Collaborator

Thanks for your prompt response!

which already contained relatively recent clones of LightGBM

Could you be more precise and tell me which commit your local LightGBM version was compiled from? You can check it by running

git rev-parse HEAD

inside your local clone of the repo. Before taking any further steps, we should agree on the version we will debug with, because continuing with different versions of the source files would make the whole debugging process pointless.

@austinpagan
Contributor

Fortunately for both of us, I'm a morning person. With the nine hour time difference between Москва and Austin, me being at my computer at 3PM your time will improve our productivity. To the extent that you can work a bit into your evening, that helps as well!

(base) [root@58814263a195 LightGBM]# pwd
/home/builder/fossum/LightGBM
(base) [root@58814263a195 LightGBM]# git rev-parse HEAD
5d79ff20d1b7ae226531e2445b17d747b253a637
(base) [root@58814263a195 LightGBM]# 

Now, if you want me to clone a fresh version of your choosing and try there, that will be fine, but you'll have to walk me through the process of building it to the point where my attempt to run the python test doesn't fail as I had indicated above on my other box. (My strengths are algorithms and debugging and c coding, not building and installing.)

@austinpagan
Contributor

I hope it's OK that we're more used to doing our work inside the docker container rather than issuing commands to the container from outside...

@StrikerRUS
Collaborator

Now, if you want me to clone a fresh version of your choosing and try there, that will be fine,

No thanks, I believe that 5d79ff2 is a good candidate for the debugging! Let's continue with this commit.

Given that the simple example runs OK on a POWER machine but fails on many x86 ones, it is starting to look like the bug affects only the x86 architecture. However, that is quite strange, because we are talking about CUDA code executing on NVIDIA cards here...

I think we can follow your suggestion

my backup suggestion COULD be that I could provide you with a debug version of one source file from our LightGBM, and you could compile that into your favorite local branch of LightGBM, and see what interesting debug data it prints out.

Let me compile LightGBM at the commit we agreed on and run it with the most verbose logging. Then you can suggest some debug code to inject, and I'll recompile with it and get back with more info. I guess this will be the most efficient form of collaboration, given that we do not have easy access to POWER machines and you do not have easy access to x86 ones. Please let me know what you think.

@austinpagan
Contributor

I am happy with this plan!

I have a recommendation. If you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved...

Also, I will just let you know that my plan would be to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file, but let's see what your log reports have to say.

@austinpagan
Contributor

austinpagan commented Jan 6, 2021

Two more things.
(1) can you teach me how to PROPERLY rebuild LightGBM and the examples so that I can be sure I'm not just running some old binary that HAPPENS to work?
(2) just FYI, when I type "python --version" it reports: "Python 3.6.9 :: Anaconda, Inc." Don't know if this matters...

@StrikerRUS
Collaborator

OK, I have set up a fresh, minimal environment to start the debugging process.

If you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved...

Which variable do you mean? I run a bash script inside a Docker container. It's common practice to ask Docker to run something, so that can't be the problem. Further evidence comes from other reports of the same error: I believe the users who reported them use quite different scripts and may not use Docker at all. And they certainly do not use any of the variables that I use.

(1) can you teach me how to PROPERLY rebuild LightGBM and the examples so that I can be sure I'm not just running some old binary that HAPPENS to work?

Yeah, that's why I asked you to set up a clean Docker environment. I suspected that you have some other version of LightGBM that works fine on your side, and now I'm quite confident of that. The thing is that the commit you told me your version of LightGBM is compiled from simply cannot be compiled. The build fails with the following error.

...
[ 77%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/data_parallel_tree_learner.cpp.o
/LightGBM/src/treelearner/cuda_tree_learner.cpp: In member function 'LightGBM::Tree* LightGBM::CUDATreeLearner::Train(const score_t*, const score_t*)':
/LightGBM/src/treelearner/cuda_tree_learner.cpp:538:59: error: no matching function for call to 'LightGBM::CUDATreeLearner::Train(const score_t*&, const score_t*&)'
  538 |   Tree *ret = SerialTreeLearner::Train(gradients, hessians);
      |                                                           ^
In file included from /LightGBM/src/treelearner/cuda_tree_learner.h:25,
                 from /LightGBM/src/treelearner/cuda_tree_learner.cpp:6:
/LightGBM/src/treelearner/serial_tree_learner.h:78:9: note: candidate: 'virtual LightGBM::Tree* LightGBM::SerialTreeLearner::Train(const score_t*, const score_t*, bool)'
   78 |   Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
      |         ^~~~~
/LightGBM/src/treelearner/serial_tree_learner.h:78:9: note:   candidate expects 3 arguments, 2 provided
[ 80%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/feature_parallel_tree_learner.cpp.o
make[3]: *** [CMakeFiles/_lightgbm.dir/build.make:407: CMakeFiles/_lightgbm.dir/src/treelearner/cuda_tree_learner.cpp.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [CMakeFiles/Makefile2:304: CMakeFiles/_lightgbm.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:311: CMakeFiles/_lightgbm.dir/rule] Error 2
make: *** [Makefile:274: _lightgbm] Error 2

This happens due to the following recent changes in LightGBM codebase: fcfd413 (but those changes came before the commit we agreed on).
So you should rebuild LightGBM to match the commit you've specified (and ensure that compilation fails), or tell me another (older) commit that your LightGBM version is really built from.

However, I went ahead and fixed the error that prevented the library from compiling.

  1. fcdeb10
  2. 5eee55c

These fixes allowed me to successfully compile the library with the commit you've mentioned (5d79ff2).

Then I specified verbose=4 in simple_example.py to get debug logs from the C++ code, but unfortunately this didn't help. The error is the same as before, with no additional info.

2021-01-07T15:06:02.5788235Z Loading data...
2021-01-07T15:06:02.5789446Z 
2021-01-07T15:06:02.5789792Z Starting training...
2021-01-07T15:06:02.5790650Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T15:06:02.5791552Z [LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
2021-01-07T15:06:02.5792427Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T15:06:02.5798769Z Traceback (most recent call last):
2021-01-07T15:06:02.5799483Z   File "simple_example.py", line 38, in <module>
2021-01-07T15:06:02.5799965Z     early_stopping_rounds=5)
2021-01-07T15:06:02.5801170Z   File "/root/.local/lib/python3.6/site-packages/lightgbm/engine.py", line 228, in train
2021-01-07T15:06:02.5801839Z     booster = Booster(params=params, train_set=train_set)
2021-01-07T15:06:02.5802709Z   File "/root/.local/lib/python3.6/site-packages/lightgbm/basic.py", line 2076, in __init__
2021-01-07T15:06:02.5803309Z     ctypes.byref(self.handle)))
2021-01-07T15:06:02.5804122Z   File "/root/.local/lib/python3.6/site-packages/lightgbm/basic.py", line 52, in _safe_call
2021-01-07T15:06:02.5805012Z     raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
2021-01-07T15:06:02.5811139Z lightgbm.basic.LightGBMError: [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T15:06:02.5811782Z 
2021-01-07T15:06:05.3524322Z ##[error]Process completed with exit code 1.

So I would really appreciate your suggestions on how

to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file

As for how to recompile and reinstall LightGBM, it is quite simple.

Commands to compile the dynamic library:

mkdir $BUILD_DIRECTORY/build && cd $BUILD_DIRECTORY/build
sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
cmake -DUSE_CUDA=ON ..
make _lightgbm -j4 || exit -1

Command to install the Python package with the just-compiled library:
cd $BUILD_DIRECTORY/python-package && python setup.py install --precompile --user || exit -1

Here is the full script that is used to install and setup Docker, clone repository, install CMake, Python and so on:
https://github.com/microsoft/LightGBM/blob/test_cuda/.github/workflows/cuda.yml

(2) just FYI, when I type "python --version" it reports: "Python 3.6.9 :: Anaconda, Inc." Don't know if this matters...

Thanks! I set up the same Python version (3.6) to mimic your environment.

@austinpagan
Contributor

Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in src/treelearner/cuda_tree_learner.cpp...

@austinpagan
Contributor

Just for "synchronization" here's the sum check on my cuda_tree_learner.cpp, before I add debug to it:

(base) [root@58814263a195 treelearner]# sum cuda_tree_learner.cpp
36657    40
(base) [root@58814263a195 treelearner]# 
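
A side note on comparing source files across machines: the `sum` utility prints a short checksum plus a block count, which is fine for a quick check but can collide. A minimal sketch of a stronger comparison using a SHA-256 digest, assuming Python 3 is available on both sides; `file_digest` is a hypothetical helper name, not something from the LightGBM repository.

```python
import hashlib
import sys
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Example: python digest.py src/treelearner/cuda_tree_learner.cpp
    for name in sys.argv[1:]:
        print(file_digest(name), name)
```

If both sides print the same digest for cuda_tree_learner.cpp, they are debugging the same file contents.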

@StrikerRUS
Collaborator

Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in

Thank you very much!

Just for "synchronization" here's the sum check on my cuda_tree_learner.cpp, before I add debug to it:

Have you applied those two fixes?

However, I went ahead and fixed the error that prevented the library from compiling.

  1. fcdeb10
  2. 5eee55c

@austinpagan
Contributor

This may or may not end up being a "fix" if it helps, but it's useful information to have, and it's an easy change.

Please replace line 414 of src/treelearner/cuda_tree_learner.cpp with a different line, as follows:

Current line:

CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice, stream_[device_id]));

Suggested new line:

CUDASUCCESS_OR_FATAL(cudaMemcpy(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice));

@StrikerRUS
Collaborator

Have you applied those two fixes?

Here is what I'm getting after the patch:

Check sum of cuda_tree_learner.cpp
15848    40

@austinpagan
Contributor

Sorry, I don't know how to "apply" a fix.

@austinpagan
Contributor

Oh, never mind. I see now. Give me a couple minutes.

@austinpagan
Contributor

still claiming the problem is in line 414, right?

@StrikerRUS
Collaborator

And, on my end, I can't get this code to build. So frustrating...

"this code" = code with these fixes #3450 (comment)?

Maybe you don't have all the source files? Could you please try to re-clone the repo and only then apply the fixes?

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
git checkout 5d79ff20d1b7ae226531e2445b17d747b253a637


<apply fixes to src/treelearner/cuda_tree_learner.h and src/treelearner/cuda_tree_learner.cpp>

so, when you say "unfortunately, no changes" you mean the error reported is exactly the same, even with the change I proposed?

Yes, absolutely right.

still claiming the problem is in line 414, right?

I guess so. At least the error comes from line 414...

@austinpagan
Contributor

so when I try to build, it's trying to get files from the "external_libs" directory, but in my clone, that directory just contains two empty sub-directories... any idea whether I'm missing some piece of the build that populates those directories? It looks like there's a "setup.py" file that mentions this directory, but I don't know who is supposed to execute that setup command...

@austinpagan
Contributor

We are investigating, but I figured it wouldn't hurt to ask you if you just know the answer off the top of your head...

@StrikerRUS
Collaborator

StrikerRUS commented Jan 7, 2021

that directory just contains two empty sub-directories...

Please make sure you don't forget the --recursive flag when cloning the repo.

git clone --recursive https://github.com/microsoft/LightGBM.git

@StrikerRUS
Collaborator

StrikerRUS commented Jan 7, 2021

I've tried and can confirm that we can reproduce the error with a simple command-line program. I simplified the reproducible example so that it no longer requires a Python installation. I believe this will help to sync environments.

Fortunately, the error is still the same, but we no longer need the Python layer as a proxy. We now run the simple regression example from the repository directly via the CLI version of LightGBM. Previously we ran it via our Python package.

Please take a look at the greatly simplified script (no Python, no environment variables) that we run inside Docker to reproduce the error:

- name: Test CUDA
  run: |
    cat > docker-script.sh <<EOF
    nvidia-smi
    apt-get update
    apt-get install --no-install-recommends -y curl
    curl -sL https://cmake.org/files/v3.19/cmake-3.19.2-Linux-x86_64.sh -o cmake.sh
    chmod +x cmake.sh
    ./cmake.sh --prefix=/usr/local --exclude-subdir
    sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' /LightGBM/include/LightGBM/config.h
    grep -q 'std::string device_type = "cuda"' /LightGBM/include/LightGBM/config.h || exit -1 # make sure that changes were really done
    echo "Check sum of cuda_tree_learner.cpp:"
    sum /LightGBM/src/treelearner/cuda_tree_learner.cpp
    mkdir /LightGBM/build && cd /LightGBM/build
    cmake -DUSE_CUDA=ON ..
    make lightgbm -j4 || exit -1
    cd ../examples/regression
    ../../lightgbm config=train.conf
    EOF
    sudo docker run -v "$(pwd)":"/LightGBM" --rm --gpus all nvidia/cuda:11.0-devel-ubuntu20.04 /bin/bash /LightGBM/docker-script.sh

This script

  • ensures we have an NVIDIA card available inside Docker via nvidia-smi
  • installs curl and CMake to compile LightGBM
  • changes the default device type from cpu to cuda in the source config file (the "[Warning] CUDA currently requires double precision calculations." message that appears later in the logs confirms the change took effect)
  • checks the sum of cuda_tree_learner.cpp
  • compiles the lightgbm executable
  • runs the regression example
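
The device-type step in the script is a plain text substitution followed by a check that it took effect. A sketch of the same two steps in Python, for readers who prefer not to rely on sed and grep; `patch_default_device` is a hypothetical helper, the two strings are copied from the sed expression in the script, and unlike `sed -i'.bak'` this sketch does not keep a backup copy.

```python
from pathlib import Path

# Strings copied from the sed expression in the script above.
OLD = 'std::string device_type = "cpu";'
NEW = 'std::string device_type = "cuda";'


def patch_default_device(config_path):
    """Switch the compiled-in default device from cpu to cuda.

    Raises RuntimeError if the patched text does not contain the cuda
    default, mirroring the `grep -q ... || exit -1` guard in the script.
    """
    path = Path(config_path)
    patched = path.read_text().replace(OLD, NEW)
    if NEW not in patched:
        raise RuntimeError(f"default device_type line not found in {path}")
    path.write_text(patched)
```

Calling `patch_default_device("/LightGBM/include/LightGBM/config.h")` before running CMake would play the role of the sed and grep lines.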

@StrikerRUS
Collaborator

And here are more verbose logs from the run after applying your proposed change to line 414 of src/treelearner/cuda_tree_learner.cpp (#3450 (comment)):

2021-01-07T18:54:57.1318861Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T18:54:57.1320390Z [LightGBM] [Info] Finished loading parameters
2021-01-07T18:54:57.1320991Z [LightGBM] [Debug] Loading train file...
2021-01-07T18:54:57.1405940Z [LightGBM] [Info] Loading initial scores...
2021-01-07T18:54:57.1597220Z [LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
2021-01-07T18:54:58.2787014Z [LightGBM] [Debug] Loading validation file #1...
2021-01-07T18:54:58.2879002Z [LightGBM] [Info] Loading initial scores...
2021-01-07T18:54:58.2964932Z [LightGBM] [Info] Finished loading data in 1.165807 seconds
2021-01-07T18:54:58.2965532Z [LightGBM] [Info] LightGBM using CUDA trainer with DP float!!
2021-01-07T18:54:58.2971585Z [LightGBM] [Info] Total Bins 6132
2021-01-07T18:54:58.2981032Z [LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
2021-01-07T18:54:58.2981689Z [LightGBM] [Debug] device_bin_size_ = 256
2021-01-07T18:54:58.2982161Z [LightGBM] [Debug] Resized feature masks
2021-01-07T18:54:58.2982684Z [LightGBM] [Debug] Memset pinned_feature_masks_
2021-01-07T18:54:58.2983679Z [LightGBM] [Debug] Allocated device_features_ addr=0x7ff5aaa00000 sz=196000
2021-01-07T18:54:58.2985727Z [LightGBM] [Debug] Memset device_data_indices_
2021-01-07T18:54:58.2991002Z [LightGBM] [Fatal] [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T18:54:58.2995493Z [LightGBM] [Debug] created device_subhistograms_: 0x7ff5ab000000
2021-01-07T18:54:58.3027139Z 
2021-01-07T18:54:58.3027684Z [LightGBM] [Debug] Started copying dense features from CPU to GPU
2021-01-07T18:54:58.3028247Z Met Exceptions:
2021-01-07T18:54:58.3028802Z [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T18:54:58.3029237Z 
2021-01-07T18:54:58.3030255Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 1
2021-01-07T18:54:58.3031103Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3031917Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3032773Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3033581Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3034408Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3035216Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3036038Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3036843Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3037660Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3038459Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3039263Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3040077Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3041108Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3041993Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3042794Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3043607Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3044405Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3045225Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3046029Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3046847Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3047646Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3048447Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3049264Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3050082Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3050902Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3051702Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3052521Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3053318Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3054138Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3054939Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3055754Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3056550Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3057347Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3058351Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3059161Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3059976Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3060773Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3061589Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3062382Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:59.6885338Z ##[error]Process completed with exit code 255.

I hope they help somehow. Please let me know how I can further modify the source code of the CUDA tree learner to get useful info that will help narrow down the problem.
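
With logs this long, it can help to pull out only the fatal lines and the source location they name. A sketch, assuming the `[LightGBM] [Fatal] [CUDA] <message> <file> <line>` shape seen in the log above; the regular expression and `find_cuda_fatals` are illustrative helpers, not part of LightGBM.

```python
import re

# Assumed line shape, taken from the log above:
#   <timestamp> [LightGBM] [Fatal] [CUDA] <message> <source file> <line number>
FATAL = re.compile(
    r"\[LightGBM\] \[Fatal\] \[CUDA\] (?P<message>.+) (?P<file>\S+) (?P<line>\d+)\s*$"
)


def find_cuda_fatals(log_text):
    """Return (message, source_file, line_number) for each CUDA fatal line."""
    results = []
    for raw in log_text.splitlines():
        match = FATAL.search(raw)
        if match:
            results.append((match["message"], match["file"], int(match["line"])))
    return results
```

Run over the log above, this would return a single entry pointing at cuda_tree_learner.cpp, line 414, with the message "invalid argument".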

@austinpagan
Contributor

So, sorry for the delay in response. My colleague seems to be close to figuring out how we can reproduce this problem on Power systems. You can rest easy for now, because if he is successful, we can handle it from here on out...

@StrikerRUS
Collaborator

Oh, great news! Thank you very much!

@austinpagan
Contributor

And it is confirmed. On my Power system, I now get this:

(base) [root@58814263a195 python-guide]# python simple_example.py
Loading data...
Starting training...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Fatal] [CUDA] invalid argument /home/builder/fossum/LightGBM/src/treelearner/cuda_tree_learner.cpp 414

Traceback (most recent call last):
  File "simple_example.py", line 39, in <module>
    early_stopping_rounds=5)
  File "/opt/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 228, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 2076, in __init__
    ctypes.byref(self.handle)))
  File "/opt/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: [CUDA] invalid argument /home/builder/fossum/LightGBM/src/treelearner/cuda_tree_learner.cpp 414

(base) [root@58814263a195 python-guide]#

@austinpagan
Contributor

So, again, I can pursue this now, without pestering you. Wish me luck!

@StrikerRUS
Collaborator

Hah, in any other situation people shouldn't be happy when someone else gets errors from software, but right now I'm happy! 😄
Hope it won't be hard for you to find the root cause.

Again, if you are not comfortable using Python, please check my earlier message, where I show how to reproduce the same error with LightGBM's executable binary from the command line. Feel free to ask for details if anything is unclear.

@ChipKerchner
Contributor

ChipKerchner commented Jan 8, 2021

@StrikerRUS The problem is that, between 3.0 and 3.1, the non-CUDA vector allocators were changed to use kAlignedSize with VirtualFileWriter::AlignedSize. As a result, the CUDA vector allocator wasn't allocating enough space in some instances. Here is a proposed change to fix the CUDA vector allocator. simple_example.py and advanced_example.py work with this change.

diff --git a/include/LightGBM/cuda/vector_cudahost.h b/include/LightGBM/cuda/vector_cudahost.h
index 03db338..46698d0 100644
--- a/include/LightGBM/cuda/vector_cudahost.h
+++ b/include/LightGBM/cuda/vector_cudahost.h
@@ -42,6 +42,7 @@ struct CHAllocator {
   T* allocate(std::size_t n) {
     T* ptr;
     if (n == 0) return NULL;
+    n = (n + kAlignedSize - 1) & -kAlignedSize;
     #ifdef USE_CUDA
       if (LGBM_config_::current_device == lgbm_device_cuda) {
         cudaError_t ret = cudaHostAlloc(&ptr, n*sizeof(T), cudaHostAllocPortable);
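
For readers unfamiliar with the bit trick in the added line, here is a minimal Python sketch of the same arithmetic. The function name `round_up_to_multiple` and the value `ALIGNMENT = 32` are assumptions chosen for illustration; the real value of kAlignedSize is defined in the LightGBM sources and is not stated in this thread. The trick only works when the alignment is a power of two.

```python
def round_up_to_multiple(n: int, alignment: int) -> int:
    """Round n up to the nearest multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Same arithmetic as the patch: add (alignment - 1), then clear the low bits.
    return (n + alignment - 1) & -alignment


ALIGNMENT = 32  # hypothetical value, for illustration only

for requested in (1, 31, 32, 33, 100):
    print(requested, "->", round_up_to_multiple(requested, ALIGNMENT))
```

A request that is already a multiple of the alignment is left unchanged; any other request grows to the next multiple, which is what keeps the CUDA host allocator in step with the padded sizes the rest of the code expects.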

@StrikerRUS
Collaborator

@austinpagan @ChipKerchner Awesome! I can confirm that this fix gets rid of the errors on x86 machines as well. Many thanks for the research you've done and for providing the fix!

Would you like to contribute this fix from your account so that GitHub associates the fixing commit with you? Or, if that's not important to you, would you prefer to let one of the LightGBM maintainers do it to save your time?

@StrikerRUS
Collaborator

Fixed via #3748.

@ChipKerchner
Contributor

ChipKerchner commented Jan 11, 2021

@StrikerRUS This should fix the remaining CUDA failures. Let me know if you see any issues.

diff --git a/src/treelearner/cuda_tree_learner.cpp b/src/treelearner/cuda_tree_learner.cpp
index 16569ee..4495578 100644
--- a/src/treelearner/cuda_tree_learner.cpp
+++ b/src/treelearner/cuda_tree_learner.cpp
@@ -408,7 +408,7 @@ void CUDATreeLearner::copyDenseFeature() {
     // looking for dword_features_ non-sparse feature-groups
     if (!train_data_->IsMultiGroup(i)) {
       dense_feature_group_map_.push_back(i);
-      auto sizes_in_byte = train_data_->FeatureGroupSizesInByte(i);
+      auto sizes_in_byte = std::min(train_data_->FeatureGroupSizesInByte(i), static_cast<size_t>(num_data_));
       void* tmp_data = train_data_->FeatureGroupData(i);
       Log::Debug("Started copying dense features from CPU to GPU - 2");
       CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cu
@@ -534,8 +534,8 @@ void CUDATreeLearner::InitGPU(int num_gpu) {
   copyDenseFeature();
 }
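
The idea behind the `std::min` clamp can be sketched in a few lines of Python. This is an illustrative model, not the LightGBM code: it assumes each dense feature group occupies exactly `num_data` bytes in the destination buffer (as the patch's use of num_data_ as a byte count suggests) and that the host-side buffer may be longer because of alignment padding. The function name `copy_dense_features` and the toy data are hypothetical.

```python
def copy_dense_features(host_groups, num_data):
    """Pack one num_data-byte slot per feature group into a flat buffer."""
    device = bytearray(len(host_groups) * num_data)
    for slot, group in enumerate(host_groups):
        # The host buffer may be padded; copy at most one slot's worth.
        size = min(len(group), num_data)
        start = slot * num_data
        device[start:start + size] = group[:size]
    return device


# Two groups of 5 rows each, padded on the host to 8 bytes.
padded_groups = [bytes([1] * 5 + [9] * 3), bytes([2] * 5 + [9] * 3)]
packed = copy_dense_features(padded_groups, num_data=5)
print(list(packed))
```

Without the clamp, each copy would write the padding bytes into the next group's slot, and the last copy would run past the end of the destination buffer, which is a plausible source of the "invalid argument" error seen earlier in the thread.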

@StrikerRUS
Collaborator

@ChipKerchner After applying this fix, all but two tests pass! Very nice indeed!

The failures in the two remaining plotting tests are not related to the CUDA implementation. I believe it is the same graphviz environment issue as in #3672 (comment).

============================= test session starts ==============================
platform linux -- Python 3.8.2, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /LightGBM
collected 238 items

../tests/c_api_test/test_.py ..                                          [  0%]
../tests/python_package_test/test_basic.py .............                 [  6%]
../tests/python_package_test/test_consistency.py ......                  [  8%]
../tests/python_package_test/test_dask.py ............................   [ 20%]
../tests/python_package_test/test_dual.py s                              [ 21%]
../tests/python_package_test/test_engine.py ............................ [ 32%]
.......................................                                  [ 49%]
../tests/python_package_test/test_plotting.py F...F                      [ 51%]
../tests/python_package_test/test_sklearn.py ........................... [ 62%]
......x.........................................x....................... [ 92%]
.................                                                        [100%]
= 2 failed, 233 passed, 1 skipped, 2 xfailed, 74 warnings in 195.32s (0:03:15) =

@ChipKerchner
Contributor

Failures in two remaining plotting tests are not related to CUDA implementation. I believe it is the same graphviz environment issue as in #3672 (comment).

============================= test session starts ==============================
platform linux -- Python 3.8.2, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /LightGBM
collected 238 items

../tests/python_package_test/test_plotting.py F...F                      [ 51%]

In my branch, test_plotting passes all tests.

python -m unittest tests/python_package_test/test_plotting.py
.../test_plotting.py:156: UserWarning: More than one metric available, picking one to plot.
  ax0 = lgb.plot_metric(evals_result0)
..s
----------------------------------------------------------------------
Ran 5 tests in 1.956s

OK (skipped=2)

@jameslamb
Collaborator

Woo! Thanks @ChipKerchner. Like @StrikerRUS mentioned, I think it's very unlikely that the two failing plotting tests are related to your changes. I found in #3672 (comment) that there might be some issues with the conda-forge recipe for graphviz.

@StrikerRUS
Collaborator

Yeah, thanks for the info about the tests, @ChipKerchner! I'm 100% sure that the 2 failing plotting tests on our side are related to our environment. I'll fix this environment issue while working on making CUDA builds run on a regular basis.

@austinpagan
Contributor

@StrikerRUS: Look at you, making all our dreams come true!!! Thank you!

@StrikerRUS
Collaborator

@austinpagan Thanks a lot for all your hard work!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023