[ci] bump CUDA version from 11.6.2 to 11.7.0 at CI #5287
sounds good!
@shiyu1994 The Ubuntu version currently used on the self-hosted CUDA runner is 18.04, which is quite old. We use 20.04 as the [...]. Could you please update Ubuntu on the self-hosted CUDA machine to version 22.04? If that's hard or would take a lot of time, please at least update the NVIDIA drivers in the current OS, because the container with the latest CUDA fails to start with the following error:
@guolinke Maybe you still have access to the CUDA runner, or know someone at Microsoft who can do this #5287 (comment)?
I am afraid that I don't have permission. Ping @shiyu1994 for help.
Sorry for the late response. Will handle this right now.
@StrikerRUS Will using a VM with Ubuntu 20.04 work? We created the self-hosted CUDA runner with Azure Portal. I checked the OS images available in our Azure Portal subscription when creating a VM, and only 20.04 is available. Perhaps 22.04 is not supported yet. If 20.04 does not work, maybe I can upgrade the NVIDIA driver for now.
@shiyu1994 you can update the driver first.
Ubuntu 22.04 is in public beta right now. It's probably not available via the GUI.
Yes, please do this. It should be a quick workaround.
Ping @shiyu1994 for
@shiyu1994 can you please help with this?
Sure. I'm trying with that. Thanks for your reminder.
Close and reopen to test for CUDA 11.7.
Close and reopen due to HTTP error.
It seems that we encountered a random failure in the Dask tests.
> dask_model2.fit(dX, dy, group=dg)
../tests/python_package_test/test_dask.py:1596:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:1341: in fit
return self._lgb_dask_fit(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:1050: in _lgb_dask_fit
model = _train(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:789: in _train
results = client.gather(futures_classifiers)
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/client.py:2174: in gather
return self.sync(
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/utils.py:338: in sync
return sync(
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/utils.py:405: in sync
raise exc.with_traceback(tb)
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/utils.py:378: in f
result = yield future
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/tornado/gen.py:762: in run
value = future.result()
/opt/miniforge/envs/test-env/lib/python3.8/site-packages/distributed/client.py:2037: in _gather
raise exception.with_traceback(traceback)
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:322: in _train_part
model.fit(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/sklearn.py:993: in fit
super().fit(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/sklearn.py:792: in fit
self._Booster = train(
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/engine.py:244: in train
booster.update(fobj=fobj)
/home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:3129: in update
_safe_call(_LIB.LGBM_BoosterUpdateOneIter(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
E lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 104)
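For anyone triaging the failure above, the numeric code in the message can be mapped back to its OS-level name. This is only a rough sketch: `describe_socket_error` is a hypothetical helper, not part of LightGBM, and it assumes the message keeps the `(code: N)` shape seen in the traceback and that N is a plain OS errno value.

```python
import errno
import os
import re
from typing import Optional

# Hypothetical helper, not part of LightGBM. Assumes the error text
# contains "(code: N)" as in the traceback above.
_CODE_RE = re.compile(r"\(code:\s*(\d+)\)")


def describe_socket_error(message: str) -> Optional[str]:
    """Return "<code> <errno name>: <OS description>", or None if no code is found."""
    match = _CODE_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    # errno numbers are platform-specific, so look the name up at runtime.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


if __name__ == "__main__":
    msg = "Socket recv error, Connection reset by peer (code: 104)"
    print(describe_socket_error(msg))
```

On Linux, errno 104 is `ECONNRESET`, which is consistent with the "Connection reset by peer" text: the remote end closed the connection mid-transfer. On its own it does not say why the peer went away.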
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.