[dask] [gpu] Distributed training is VERY slow #4761

Closed
chixujohnny opened this issue Nov 1, 2021 · 10 comments

Comments

@chixujohnny

chixujohnny commented Nov 1, 2021

Description

I have many Linux machines, each with 8 A100 GPUs and 128 threads.
I recently ran into a problem with DaskLGB:

  1. The fastest setup is 1 machine with 1 GPU (single mode) and 8 threads. Running time = 2.5 min, using 1300 MB of GPU memory.
  2. A slower setup is 1 machine with 2 GPUs (local distributed mode) and 8 threads per machine. The more GPUs, the slower it gets. Running time = 4 min, using 1100 MB + 700 MB of memory across the 2 GPUs.
  3. The VERY SLOW setup is 2 machines with 1 GPU each and 8 threads per machine. It is too slow to be usable. Running time = 30 min, using 1100 MB + 700 MB of memory across the 2 GPUs.

I used the Dask command line to set up the distributed cluster, following this doc: https://docs.dask.org/en/latest/how-to/deploy-dask/cli.html

So I'm very confused about what the point of distributed DaskLGB is.

Faster? NO
Saving GPU memory? NO

Has anyone else run into this problem when using multi-machine distributed LGB?

Environment info

LightGBM version or commit hash: 3.2.1(gpu)

@chixujohnny
Author

```python
import datetime

import dask.array
import lightgbm as lgb
from dask.distributed import Client

# Assumption: X_train and y_train are in-memory arrays (e.g. NumPy) defined earlier.
client = Client(address='xxx.xxx.xxx.xxx:12345')  # this is the scheduler ip

params = {
    'n_estimators'       : 1500,
    'objective'          : 'rmse',
    'reg_lambda'         : 0.3731450715226679,
    'reg_alpha'          : 0.30424044277458473,
    'subsample'          : 0.6413363492779808,
    'learning_rate'      : 0.013556055988623302,
    'min_child_weight'   : 91,
    'max_bin'            : 63,
    'random_state'       : 1111,
    'device_type'        : 'gpu',
    'colsample_bytree'   : 0.4237786478391063,
    'gpu_use_dp'         : False,
    'metric'             : 'None',
    'min_data_in_leaf'   : 880,
    'first_metric_only'  : True,
    'num_leaves'         : 248,
    'max_depth'          : 9,
    'verbosity'          : -1,
}
model = lgb.DaskLGBMRegressor(client=client, **params)
print(f'X_train.shape={X_train.shape}   y_train.shape={y_train.shape}')

# Wrap the in-memory arrays as Dask arrays, 100,000 rows per chunk.
print('Process X to dask.array')
X_train = dask.array.from_array(X_train, chunks=(100000, 3095))
print(f'type(X_train)={type(X_train)}  X_train.shape={X_train.shape}')
y_train = dask.array.from_array(y_train, chunks=(100000,))
print(f'type(y_train)={type(y_train)}  y_train.shape={y_train.shape}')

# Time only the distributed fit.
st = datetime.datetime.now()
print(f'Training start time: {st}')
model.fit(X_train, y_train)
et = datetime.datetime.now()
print(f'Training end time: {et}')
print(f'Running cost: {et - st}')
```

@chixujohnny chixujohnny changed the title Distribution DaskLGB is VERY slow Distribution DaskLGB training is VERY slow Nov 1, 2021
@jmoralez
Collaborator

jmoralez commented Nov 1, 2021

@chixujohnny thanks for using LightGBM!

Right now the dask interface doesn't directly support distributed training using GPU, you can subscribe to #3776 if you're interested in that. Are you getting any warnings about this? I think it probably isn't using the GPU at all.

Furthermore, if your data fits on a single machine then it's probably best not to use distributed training at all. The dask interface is there to help you train a model on data that doesn't fit on a single machine by keeping partitions of the data on different machines that communicate with each other, which adds some overhead compared to single-node training.

If you want to use multiple GPUs on a single machine you can try the CUDA version and set num_gpu to a value greater than 1.
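
As a minimal sketch of that last suggestion, assuming a LightGBM build compiled with CUDA support, the parameters might look like the following. `device_type` and `num_gpu` are documented LightGBM parameters; the helper names `make_cuda_params` and `train` and the chosen values are hypothetical, and nothing in this thread confirms that this setup trains faster.

```python
def make_cuda_params(num_gpu=2):
    """Build LightGBM parameters for single-machine, multi-GPU training.

    Hypothetical helper for illustration; assumes a CUDA-enabled build.
    """
    if num_gpu < 1:
        raise ValueError("num_gpu must be at least 1")
    return {
        "objective": "regression",
        "device_type": "cuda",  # CUDA implementation, not the OpenCL 'gpu' one
        "num_gpu": num_gpu,     # only used by the CUDA implementation
        "max_bin": 63,
        "verbosity": -1,
    }


def train(X, y, num_gpu=2):
    # Imported lazily so the parameter helper works without LightGBM installed.
    import lightgbm as lgb

    model = lgb.LGBMRegressor(n_estimators=1500, **make_cuda_params(num_gpu))
    return model.fit(X, y)
```

This uses the plain `LGBMRegressor` rather than the Dask estimator, since the data is assumed to fit on one machine.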

@jameslamb jameslamb changed the title Distribution DaskLGB training is VERY slow [daks] Distribution DaskLGB training is VERY slow Nov 1, 2021
@jmoralez jmoralez changed the title [daks] Distribution DaskLGB training is VERY slow [dask] Distributed training is VERY slow Nov 1, 2021
@chixujohnny
Author

@jmoralez Thanks for your reply.
I have found the problem.

The GPU really is being used; it is not running in CPU mode.
I checked with this command: watch -n 1 nvidia-smi
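
The same check can be scripted. The sketch below polls `nvidia-smi` through its `--query-gpu` CSV interface and parses the result; the function names and the dictionary keys are hypothetical, and it assumes every queried field comes back as a plain integer.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, utilization (%), memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(output):
    """Parse nvidia-smi CSV output into a list of per-GPU dicts."""
    stats = []
    for line in output.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"gpu": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` in a loop with a one-second sleep while training runs gives roughly what `watch -n 1 nvidia-smi` shows, in a form that can be logged.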

To solve this problem, just use the dask-cuda package.

See this doc for how to deploy the workers: https://docs.rapids.ai/api/dask-cuda/nightly/api.html

Start each worker with: dask-cuda-worker xxx.xxx.xxx.xxx:9876
Instead of: dask-worker xxx.xxx.xxx.xxx:9876

That said, XGB with dask-cuda is still more usable than LGB, although the difference is not particularly obvious.

@jmoralez
Collaborator

jmoralez commented Nov 3, 2021

Thanks for the follow up @chixujohnny. Were you able to train faster using dask-cuda workers?

@chixujohnny
Author

> Thanks for the follow up @chixujohnny. Were you able to train faster using dask-cuda workers?

Faster than before, but still slower than single-GPU mode.
I am using 1 machine with 2 GPUs in LocalCUDACluster mode.
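
For readers unfamiliar with that mode, a setup along these lines would start one Dask worker per GPU on the local machine. This is a sketch under assumptions, not the exact configuration used here: `LocalCUDACluster` and its `CUDA_VISIBLE_DEVICES` argument come from dask-cuda, while `device_list` and `start_local_cuda_cluster` are hypothetical helper names.

```python
def device_list(n_gpus):
    """Return a CUDA_VISIBLE_DEVICES string such as '0,1' (hypothetical helper)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return ",".join(str(i) for i in range(n_gpus))


def start_local_cuda_cluster(n_gpus=2):
    # Imported lazily so device_list works without dask-cuda installed.
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # dask-cuda starts one worker process per visible GPU.
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=device_list(n_gpus))
    return Client(cluster)
```

The returned client can then be passed as `client=` to `lgb.DaskLGBMRegressor`, as in the training script earlier in this thread.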

@chixujohnny
Author

> Thanks for the follow up @chixujohnny. Were you able to train faster using dask-cuda workers?

By the way, under normal conditions, when I train a regression job on a single GPU, what GPU utilization should I expect? On an NVIDIA A100 or V100 GPU, the utilization is only 35%, unlike XGB, which reaches 100%. Is this normal?

@jameslamb
Collaborator

> On nvidia-A100 or nvidia-V100 GPU, the usage is just 35%. Is this normal?

LightGBM's existing CUDA-based implementation does some work on the GPU and some on CPU (#4082 (comment)), which is why you might not see high GPU utilization.

This is a known issue, and @shiyu1994 and others are working on it. I recommend subscribing to updates on the following PRs to track the progress on a new implementation that should better-utilize the GPU:

The reviews on those pull requests are going to get quite large, so if you have questions about the plans please open new issues here that reference them, instead of commenting on the PRs directly.

@jameslamb jameslamb changed the title [dask] Distributed training is VERY slow [dask] [gpu] Distributed training is VERY slow Nov 5, 2021
@chixujohnny
Author

Thank you very much~

@no-response

no-response bot commented Dec 10, 2021

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@no-response no-response bot closed this as completed Dec 10, 2021
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023