
CI CUDA job #3402

Closed
StrikerRUS opened this issue Sep 21, 2020 · 17 comments · Fixed by #3424
@StrikerRUS
Collaborator

Opening a separate issue to discuss enabling the CUDA CI job on demand, as the original PR with initial CUDA support has 400+ comments. Refer to #3160 (comment).

@guolinke Will linux-gpu-pool be used exclusively for LightGBM (CUDA) CI jobs? Or is this machine used for other purposes as well?

@guolinke
Collaborator

It is exclusive, feel free to use it.

@StrikerRUS
Collaborator Author

I think we can proceed the following way (a rough sketch of the resulting pipeline definition is below the list).

  1. Create a separate pipeline for the CUDA job (https://sethreid.co.nz/using-multiple-yaml-build-definitions-azure-devops/).
  2. Mark it as non-required and disable auto-builds (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#run-pull-request-validation-only-when-authorized-by-your-team).
  3. Set up comment triggers (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#comment-triggers). Then collaborators will be able to run CUDA builds only when really needed, by commenting something like /azp run cuda-builds.
  4. Use NVIDIA Docker containers, similarly to how we are using an Ubuntu 14.04 container for compatibility purposes right now.
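
For illustration only (not something from the original discussion), a minimal sketch of such a standalone pipeline definition could look roughly like this; the file name, image tag, and volume layout are assumptions, and the comment-trigger gating from steps 2-3 is configured in the Azure DevOps UI rather than in the YAML:

```yaml
# .vsts-ci-cuda.yml -- hypothetical standalone pipeline for on-demand CUDA builds
trigger: none   # no CI builds on push
pr: none        # no automatic PR validation; runs queued on demand (e.g. /azp run cuda-builds)

jobs:
  - job: cuda_test
    pool: linux-gpu-pool
    steps:
      - script: |
          # build and test inside an NVIDIA CUDA container (image tag is an assumption)
          docker run --rm --gpus all \
            -v "$(Build.SourcesDirectory):/LightGBM" \
            nvidia/cuda:11.0-devel-ubuntu18.04 \
            bash /LightGBM/docker-script.sh
        displayName: CUDA build and test
```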

I've made some progress with this in .vsts-ci.yml of the test branch: https://github.com/microsoft/LightGBM/blob/test/.vsts-ci.yml, but it looks like there are some issues with NVIDIA drivers:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.

@guolinke Could you please help install NVIDIA drivers on the machine? I'm not sure, but this might help to automate the process: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux.
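
Side note (not from the thread): a pre-flight step along these lines would surface such failures earlier by checking the host driver and the NVIDIA container runtime before the actual build; the image tag is an assumption:

```yaml
# Hypothetical fail-fast check to run before the CUDA build step
- script: |
    nvidia-smi   # fails if the host driver is not loaded
    docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu18.04 nvidia-smi   # fails if the container runtime cannot reach the GPU
  displayName: Verify NVIDIA driver and container runtime
```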

@StrikerRUS
Collaborator Author

@guolinke

I think it will be enough to have 1 machine.


@guolinke
Collaborator

@StrikerRUS These VMs are allocated on the fly; I am not sure whether we can install the driver on them or not.

@guolinke
Collaborator

I just installed the GPU driver extension.


@guolinke
Collaborator

Let us set the max workers to 2, in case of concurrent jobs.

@StrikerRUS
Collaborator Author

StrikerRUS commented Sep 26, 2020

Looks like the driver extension didn't help: there is no nvidia-smi utility, which is normally installed with NVIDIA drivers.

Also, I found an experimental option that allows skipping driver installation on the host machine and using driver containers instead.

Alternatively, and as a technology preview, the NVIDIA driver can be deployed through a container.
https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver

Alternatively, the NVIDIA driver can be deployed through a container.
https://github.com/NVIDIA/nvidia-docker/wiki#how-do-i-install-the-nvidia-driver

https://github.com/NVIDIA/nvidia-docker/wiki/Driver-containers#ubuntu-1804

Unfortunately, driver containers also require rebooting:

sudo reboot

So I have no idea how to configure CUDA jobs other than renting a regular permanent GPU Azure machine.

@guolinke
Collaborator

Thanks @StrikerRUS! Maybe we can use self-hosted GitHub Actions agents. I have used them before; they can use a permanent VM for CI jobs.
I will try to set one up next week.

@guolinke
Collaborator

Just created a runner.

You can have a try; the driver and Docker are installed. Also, I fixed setup-python according to https://github.com/actions/setup-python#linux.
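
For reference, a minimal workflow targeting such a self-hosted runner might look roughly like the sketch below; the file name, runner labels, and image tag are assumptions, and docker-script.sh refers to the build-and-test script from the test branch:

```yaml
# .github/workflows/cuda.yml -- hypothetical workflow for the self-hosted GPU runner
name: CUDA CI

on:
  pull_request:
    branches:
      - master

jobs:
  cuda-test:
    runs-on: [self-hosted, linux]   # labels must match the registered runner
    steps:
      - uses: actions/checkout@v2
      - name: Build and test inside an NVIDIA CUDA container
        run: |
          docker run --rm --gpus all \
            -v "$GITHUB_WORKSPACE:/LightGBM" \
            nvidia/cuda:11.0-devel-ubuntu18.04 \
            bash /LightGBM/docker-script.sh
```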

@StrikerRUS
Collaborator Author

Amazing! Just got it to work!

Will read more about GitHub Actions self-hosted runners and get back with new proposals in a few days.

https://github.com/microsoft/LightGBM/runs/1173237869?check_suite_focus=true

@StrikerRUS
Collaborator Author

Hmm, it seems that it is possible to use on-demand allocation of new VMs with each trigger:
AcademySoftwareFoundation/tac#156
https://github.com/jfpanisset/cloud_gpu_build_agent

Or probably it would be good (at least easier) to have one permanent VM with drivers and Docker installed, but turn it on and off automatically with new builds.

@StrikerRUS
Collaborator Author

@guolinke I'm afraid we cannot run tests with NVIDIA Tesla M60.

[LightGBM] [Fatal] [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

/LightGBM/docker-script.sh: line 12:  1861 Aborted                 (core dumped) python /LightGBM/examples/python-guide/simple_example.py

https://en.wikipedia.org/wiki/CUDA#GPUs_supported

The Tesla M60 is compute capability 5.2, which is below the 6.0 minimum targeted by the build:

CUDA_SELECT_NVCC_ARCH_FLAGS(CUDA_ARCH_FLAGS 6.0 6.1 6.2 7.0 7.5+PTX)

I'm adding all architectures from 6.0 onward. 6.0 is needed because of the way atomics are handled.
#3160 (comment)

@guolinke
Collaborator

guolinke commented Oct 1, 2020

I see. I will change it to p100 or p40

@guolinke
Collaborator

guolinke commented Oct 1, 2020

Now it is p100

@StrikerRUS
Collaborator Author

@guolinke

Now it is p100

Thank you!

Is there anything similar to AWS G4 machines in Azure? It would probably cost less:
dmlc/xgboost#4881 (comment)
dmlc/xgboost#4921 (comment)

@guolinke
Collaborator

guolinke commented Oct 1, 2020

The only other option is p40, which provides more GPU memory but is slightly slower. The cost is the same, so I chose p100.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023