[ci] bump CUDA version from 11.6.2 to 11.7.0 at CI #5287

Merged: 1 commit merged into master from ci_cuda on Aug 25, 2022

Conversation

StrikerRUS
Collaborator

No description provided.

@jameslamb jameslamb left a comment
Collaborator

sounds good!

@StrikerRUS
Collaborator Author

@shiyu1994 The Ubuntu version currently used on the self-hosted CUDA runner is 18.04.

This is quite an old version. We use 20.04 as the latest version in this repo's CI and are in the process of migrating to 22.04.

Could you please update Ubuntu on the self-hosted CUDA machine to 22.04?

If that's hard or requires a lot of time, please at least update the NVIDIA drivers in the current OS, because a container with the latest CUDA fails to start with the following error:

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.7, please update your driver to a newer version, or use an earlier cuda container: unknown.
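
As a quick way to check whether the runner's driver satisfies that requirement, something like the sketch below could be run on the machine. It is only an illustration: it assumes nvidia-smi is on the PATH, and the 515.43.04 minimum driver version for CUDA 11.7 on Linux is taken from NVIDIA's release notes, not from this thread.

```python
# Sketch: compare the installed NVIDIA driver against the minimum version that
# CUDA 11.7 containers are documented to require on Linux (assumed 515.43.04).
import subprocess

MIN_DRIVER_FOR_CUDA_11_7 = (515, 43, 4)


def installed_driver_version():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # nvidia-smi prints one version per GPU, e.g. "510.47.03"; use the first one.
    return tuple(int(part) for part in out.splitlines()[0].strip().split("."))


if __name__ == "__main__":
    version = installed_driver_version()
    pretty = ".".join(map(str, version))
    if version < MIN_DRIVER_FOR_CUDA_11_7:
        print(f"Driver {pretty} is too old for CUDA 11.7 containers; please upgrade it.")
    else:
        print(f"Driver {pretty} satisfies cuda>=11.7.")
```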

@StrikerRUS
Collaborator Author

@guolinke Maybe you still have access to the CUDA runner, or know someone at Microsoft who can do this #5287 (comment)?

@guolinke
Collaborator

I am afraid that I don't have permission. Ping @shiyu1994 for help.

@shiyu1994
Collaborator

Sorry for the late response. Will handle this right now.

@shiyu1994
Collaborator

shiyu1994 commented Jul 12, 2022

@StrikerRUS Will a VM with Ubuntu 20.04 work? We created the self-hosted CUDA runner through the Azure Portal. When creating a VM, I checked the OS images available in our Azure subscription, and only 20.04 is available. Perhaps 22.04 is not supported yet.

If 20.04 does not work, maybe I can upgrade the NVIDIA driver for now.

@guolinke
Collaborator

@shiyu1994 you can update the driver first.

@StrikerRUS
Collaborator Author

@shiyu1994

Perhaps 22.04 has not been supported yet.

Ubuntu 22.04 is in public beta right now.
actions/runner-images#5490

It's probably not available via the GUI yet.

maybe I can upgrade the NVIDIA driver for now.

Yes, please do this. It should be a quick workaround.

@StrikerRUS
Collaborator Author

Ping @shiyu1994 regarding the following:

If 20.04 does not work, maybe I can upgrade the NVIDIA driver for now.

Yes, please do this. It should be a quick workaround.

@jameslamb
Collaborator

@shiyu1994 can you please help with this?

@shiyu1994
Collaborator

Sure, I'm working on it. Thanks for the reminder.

@shiyu1994
Collaborator

I've upgraded the NVIDIA driver on the self-hosted CUDA runner to nvidia-driver-515. Now the CUDA 11.7 container can run successfully.
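
For reference, a rough way to script that same check is sketched below. It assumes Docker with the NVIDIA container runtime is configured on the runner and that the nvidia/cuda:11.7.0-base-ubuntu20.04 image tag is available; neither is stated in this thread.

```python
# Sanity check: try to start a CUDA 11.7 container and run nvidia-smi inside it.
# Assumes Docker with the NVIDIA container runtime and the
# nvidia/cuda:11.7.0-base-ubuntu20.04 image tag; raises CalledProcessError if
# the container fails to start (e.g. with the driver requirement error above).
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "nvidia/cuda:11.7.0-base-ubuntu20.04", "nvidia-smi",
    ],
    check=True,
)
```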

@shiyu1994
Collaborator

Close and reopen to test for CUDA 11.7.

@shiyu1994 shiyu1994 closed this Aug 24, 2022
@shiyu1994 shiyu1994 reopened this Aug 24, 2022
@shiyu1994
Collaborator

Close and reopen due to HTTP error.

@shiyu1994 shiyu1994 closed this Aug 25, 2022
@shiyu1994 shiyu1994 reopened this Aug 25, 2022
@shiyu1994
Collaborator

shiyu1994 commented Aug 25, 2022

It seems that we've encountered a random failure in the Dask tests.
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=13306&view=logs&j=fb919173-48b9-522d-5342-9a59a13eb10b&t=7fcbeee5-d4c1-571c-8072-1d5874591254

            )
    
>           dask_model2.fit(dX, dy, group=dg)

../tests/python_package_test/test_dask.py:1596: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:1341: in fit
    return self._lgb_dask_fit(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:1050: in _lgb_dask_fit
    model = _train(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:789: in _train
    results = client.gather(futures_classifiers)
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/client.py:2174: in gather
    return self.sync(
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/utils.py:338: in sync
    return sync(
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/utils.py:378: in f
    result = yield future
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/tornado/gen.py:762: in run
    value = future.result()
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/client.py:2037: in _gather
    raise exception.with_traceback(traceback)
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:322: in _train_part
    model.fit(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/sklearn.py:993: in fit
    super().fit(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/sklearn.py:792: in fit
    self._Booster = train(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/engine.py:244: in train
    booster.update(fobj=fobj)
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:3129: in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
E   lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 104)
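
For context, the failing test trains a Dask ranker across workers; the sketch below is an illustrative stand-in for that setup (cluster size, data shapes, and parameters are assumptions, not the actual test_dask.py code). LightGBM's distributed training opens sockets between the workers, which is where a "Connection reset by peer" (errno 104) would surface if a worker connection drops mid-training.

```python
# Illustrative sketch (not the actual test): distributed ranker training with
# lightgbm.dask on a two-worker local cluster. Group sizes are chosen so that
# each data partition contains only whole query groups.
import dask.array as da
import numpy as np
from distributed import Client, LocalCluster
from lightgbm import DaskLGBMRanker

if __name__ == "__main__":
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster, Client(cluster) as client:
        rng = np.random.default_rng(42)
        X = rng.random((1_000, 10))
        y = rng.integers(0, 4, size=1_000)   # relevance labels
        g = np.full(100, 10)                 # 100 query groups of 10 rows each

        dX = da.from_array(X, chunks=(500, 10))
        dy = da.from_array(y, chunks=500)
        dg = da.from_array(g, chunks=50)     # 50 groups * 10 rows = one data chunk

        ranker = DaskLGBMRanker(client=client, n_estimators=10)
        # Training runs across both workers; the socket connections LightGBM opens
        # between them are where a "Connection reset by peer" would show up.
        ranker.fit(dX, dy, group=dg)
        print(ranker.predict(dX).compute()[:5])
```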

@shiyu1994 shiyu1994 closed this Aug 25, 2022
@shiyu1994 shiyu1994 reopened this Aug 25, 2022
@shiyu1994 shiyu1994 merged commit 504ff50 into master Aug 25, 2022
@shiyu1994 shiyu1994 deleted the ci_cuda branch August 25, 2022 13:20
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023