
CI CUDA job #3402

Closed
StrikerRUS opened this issue Sep 21, 2020 · 17 comments · Fixed by #3424
@StrikerRUS
Collaborator

Opening a separate issue to discuss enabling the CUDA CI job on demand, as the original PR with initial CUDA support has 400+ comments. Refer to #3160 (comment).

@guolinke Will linux-gpu-pool be used exclusively for LightGBM (CUDA) CI jobs? Or is this machine used for other purposes as well?

@guolinke
Collaborator

It is exclusive, feel free to use it.

@StrikerRUS
Collaborator Author

I think we can proceed the following way (a rough sketch of the resulting pipeline definition is below the list).

  1. Create a separate pipeline for the CUDA job (https://sethreid.co.nz/using-multiple-yaml-build-definitions-azure-devops/).
  2. Mark it as non-required and disable auto-builds (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#run-pull-request-validation-only-when-authorized-by-your-team).
  3. Set up comment triggers (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#comment-triggers). Then collaborators will be able to run CUDA builds only when really needed, by commenting something like /azp run cuda-builds.
  4. Use NVIDIA Docker containers, similarly to how we are using an Ubuntu 14.04 container for compatibility purposes right now.
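
For illustration only (not something from the original discussion), a minimal sketch of such a standalone pipeline definition could look roughly like this; the file name, image tag, and volume layout are assumptions, and the comment-trigger gating from steps 2-3 is configured in the Azure DevOps UI rather than in the YAML:

```yaml
# .vsts-ci-cuda.yml -- hypothetical standalone pipeline for on-demand CUDA builds
trigger: none   # no CI builds on push
pr: none        # no automatic PR validation; runs queued on demand (e.g. /azp run cuda-builds)

jobs:
  - job: cuda_test
    pool: linux-gpu-pool
    steps:
      - script: |
          # build and test inside an NVIDIA CUDA container (image tag is an assumption)
          docker run --rm --gpus all \
            -v "$(Build.SourcesDirectory):/LightGBM" \
            nvidia/cuda:11.0-devel-ubuntu18.04 \
            bash /LightGBM/docker-script.sh
        displayName: CUDA build and test
```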

I've made some progress with this in .vsts-ci.yml of the test branch: https://github.com/microsoft/LightGBM/blob/test/.vsts-ci.yml, but it looks like there are some issues with NVIDIA drivers:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.

@guolinke Could you please help install NVIDIA drivers on the machine? I'm not sure, but this might help to automate the process: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux.
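
Side note (not from the thread): a pre-flight step along these lines would surface such failures earlier by checking the host driver and the NVIDIA container runtime before the actual build; the image tag is an assumption:

```yaml
# Hypothetical fail-fast check to run before the CUDA build step
- script: |
    nvidia-smi   # fails if the host driver is not loaded
    docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu18.04 nvidia-smi   # fails if the container runtime cannot reach the GPU
  displayName: Verify NVIDIA driver and container runtime
```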

@StrikerRUS
Collaborator Author

@guolinke

I think it will be enough to have 1 machine.


@guolinke
Collaborator

@StrikerRUS These VMs are allocated on the fly; I am not sure whether we can install the driver on them or not.

@guolinke
Collaborator

I just installed the GPU driver extension.


@guolinke
Collaborator

Let us set the max workers to 2, in case of concurrent jobs.

@StrikerRUS
Collaborator Author

StrikerRUS commented Sep 26, 2020

Looks like the driver extension didn't help: there is no nvidia-smi utility, which is normally installed with NVIDIA drivers.

Also, I found an experimental option that allows skipping driver installation on the host machine and using driver containers instead.

Alternatively, and as a technology preview, the NVIDIA driver can be deployed through a container.
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver

Alternatively, the NVIDIA driver can be deployed through a container.
https://github.com/NVIDIA/nvidia-docker/wiki#how-do-i-install-the-nvidia-driver

https://github.com/NVIDIA/nvidia-docker/wiki/Driver-containers#ubuntu-1804

Unfortunately, driver containers also require rebooting:

sudo reboot

So I have no idea how to configure CUDA jobs other than renting a regular permanent GPU Azure machine.

@guolinke
Collaborator

Thanks @StrikerRUS! Maybe we can use self-hosted GitHub Actions agents. I have used them before; they can use a permanent VM for CI jobs.
I will try to set one up next week.

@guolinke
Collaborator

Just created a runner.

You can have a try; the driver and Docker are installed. Also, I fixed setup-python according to https://github.com/actions/setup-python#linux.
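
For reference, a minimal workflow targeting such a self-hosted runner might look roughly like the sketch below; the file name, runner labels, and image tag are assumptions, and docker-script.sh refers to the build-and-test script from the test branch:

```yaml
# .github/workflows/cuda.yml -- hypothetical workflow for the self-hosted GPU runner
name: CUDA CI

on:
  pull_request:
    branches:
      - master

jobs:
  cuda-test:
    runs-on: [self-hosted, linux]   # labels must match the registered runner
    steps:
      - uses: actions/checkout@v2
      - name: Build and test inside an NVIDIA CUDA container
        run: |
          docker run --rm --gpus all \
            -v "$GITHUB_WORKSPACE:/LightGBM" \
            nvidia/cuda:11.0-devel-ubuntu18.04 \
            bash /LightGBM/docker-script.sh
```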

@StrikerRUS
Collaborator Author

Amazing! Just got it to work!

Will read more about GitHub Actions self-hosted runners and get back with new proposals in a few days.

https://github.com/microsoft/LightGBM/runs/1173237869?check_suite_focus=true

@StrikerRUS
Collaborator Author

Hmm, it seems that it is possible to use on-demand allocation of new VMs with each trigger:
AcademySoftwareFoundation/tac#156
https://github.com/jfpanisset/cloud_gpu_build_agent

Or probably it would be good (at least easier) to have one permanent VM with drivers and Docker installed, but turn it on and off automatically with new builds.

@StrikerRUS
Collaborator Author

@guolinke I'm afraid we cannot run tests with NVIDIA Tesla M60.

[LightGBM] [Fatal] [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

/LightGBM/docker-script.sh: line 12:  1861 Aborted                 (core dumped) python /LightGBM/examples/python-guide/simple_example.py

https://en.wikipedia.org/wiki/CUDA#GPUs_supported

The Tesla M60 is compute capability 5.2, which is below the 6.0 minimum targeted by the build:

CUDA_SELECT_NVCC_ARCH_FLAGS(CUDA_ARCH_FLAGS 6.0 6.1 6.2 7.0 7.5+PTX)

I'm adding all architectures from 6.0 onward. 6.0 is needed because of the way atomics are handled.
#3160 (comment)

@guolinke
Collaborator

guolinke commented Oct 1, 2020

I see. I will change it to p100 or p40

@guolinke
Collaborator

guolinke commented Oct 1, 2020

Now it is p100

@StrikerRUS
Collaborator Author

@guolinke

Now it is p100

Thank you!

Is there anything similar to AWS G4 machines in Azure? It would probably cost less:
dmlc/xgboost#4881 (comment)
dmlc/xgboost#4921 (comment)

@guolinke
Collaborator

guolinke commented Oct 1, 2020

The only other option is p40, which provides more GPU memory but is slightly slower. The cost is the same, so I chose p100.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023