Submitting multiple dask.xgboost calls causes lower worker utilization #7544
Comments
Can you try letting dask decide the partitioning? Dask usually has a more fine-grained partitioning scheme and can saturate all the workers. XGBoost doesn't move the data; it takes whatever dask provides and runs on the corresponding workers.
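For reference, a minimal sketch of what "letting dask decide" could look like with a dask array. The cluster setup, parameter values, and the `chunks="auto"` choice are assumptions for illustration, not taken from the original report:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client()  # in practice, connect to the existing cluster instead

n_rows, n_cols = 100_000, 30

# Let dask pick the row chunking instead of forcing one chunk per worker.
X = da.random.random((n_rows, n_cols), chunks=("auto", n_cols))
y = da.random.randint(0, 2, size=(n_rows,), chunks=(X.chunks[0],))

dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = result["booster"]
```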
I tried with the default auto chunks of a dask array. Runtime varied from 117 sec to 235 sec to more than 5 minutes. From the dashboard, I have sometimes seen only 2 workers actually handling the dispatch_train call, and in those cases overall CPU utilization was very low.
Is there a way to make sure that dask uniformly partitions my data across the workers?
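One approach that is sometimes suggested for this (an assumption on my part, not something confirmed in this thread) is to persist the collections and then ask the scheduler to rebalance the stored chunks, assuming dask collections named `X` and `y` and an existing `client`:

```python
from dask.distributed import wait

# Materialize the chunks in cluster memory first ...
X, y = client.persist([X, y])
wait([X, y])

# ... then ask the scheduler to spread the stored data more evenly across workers.
client.rebalance()
```

Note that `rebalance` evens out memory usage; it does not guarantee exactly one partition per worker.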
There are some ongoing efforts on this. I will close this issue now since it's about dask data management rather than XGBoost.
I have created a random classification dataset with 100,000 rows and 30 columns, and I am training distributed xgboost on it. My system has 8 workers, so I created 8 partitions of the dataset. When running dask.xgboost only once, each of the 8 workers gets one part of the DaskDMatrix (8 parts in total). But when I submit multiple dask.xgboost calls, the 8 partitions get randomly divided among a subset of the workers. Only those workers end up utilizing CPU, and the runtime becomes very high.
Here is a reproducible example:
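The original snippet was not captured here, so the following is only a hedged reconstruction of the setup described above: a LocalCluster with 8 workers, a 100,000 x 30 random dataset in 8 row-partitions, and several xgb.dask.train calls issued at once. Submitting the calls concurrently from client-side threads is an assumption about how "multiple dask.xgboost calls" were made:

```python
from concurrent.futures import ThreadPoolExecutor

import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=1)  # assumed cluster setup
client = Client(cluster)

n_rows, n_cols, n_parts = 100_000, 30, 8

# 8 row-partitions, one per worker, as described in the report.
X = da.random.random((n_rows, n_cols), chunks=(n_rows // n_parts, n_cols))
y = da.random.randint(0, 2, size=(n_rows,), chunks=(n_rows // n_parts,))


def train_once(seed):
    """Run one distributed training job on the shared dask collections."""
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    return xgb.dask.train(
        client,
        {"objective": "binary:logistic", "tree_method": "hist", "seed": seed},
        dtrain,
        num_boost_round=100,
    )


# Submit several training calls at once from client-side threads
# (an assumption about how the multiple calls were issued).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_once, range(4)))
```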
Looking at this part of the source code of xgboost.dask.train, I was hoping to see each part distributed to a different worker. However, in my output only 3 unique workers receive all the partitions, so only 3 dispatched_train calls are made and, according to the dask dashboard, only 3 workers are utilizing CPU.
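One way to check this placement from the client side is to ask the scheduler which worker holds each chunk. This is a sketch assuming a persisted dask array named `X` and an existing `client`; it is not code from the original report:

```python
from distributed.client import futures_of

# Persist the collection, then ask the scheduler where each chunk ended up.
X = X.persist()
chunk_futures = futures_of(X)
placement = client.who_has(chunk_futures)  # maps each chunk key to worker addresses

unique_workers = {addr for addrs in placement.values() for addr in addrs}
print(f"{len(chunk_futures)} chunks are spread over {len(unique_workers)} workers")
```

If the number of unique workers is smaller than the number of chunks, some workers hold several partitions, and the per-worker training tasks concentrate there, which matches what the dashboard shows.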