Submitting multiple dask.xgboost calls leads to fewer workers being utilized #5644

Open
rudra0713 opened this issue Jan 7, 2022 · 3 comments

@rudra0713

I have created a random classification dataset with 100,000 rows and 30 columns and I am training distributed xgboost on it. My system has 8 workers, so I created 8 partitions of the dataset. When I run dask.xgboost only once on this dataset, each of the 8 workers gets one part of the DaskDMatrix (8 parts in total). But when I submit multiple dask.xgboost calls, the 8 partitions get randomly divided among a subset of the workers. As a result, only those workers end up using any CPU and the runtime becomes very high.

Here is a reproducible example:

from dask.distributed import Client
import xgboost as xgb
from sklearn.datasets import make_classification
import dask.array as da
from dask.distributed import get_client
import dask

n_samples, n_features, num_test_samples = 100000, 30, 100
dask.config.set({'distributed.worker.daemon': False})

def invoke_dis_xgboost(X, y, number_of_classes, number_of_estimators):
    client_xgboost = get_client()
    dtrain = xgb.dask.DaskDMatrix(client_xgboost, X, y)
    xgb_params = {
        'n_estimators': number_of_estimators,
        'num_class': number_of_classes,
        ## all other xgb parameters
    }
    output = xgb.dask.train(client_xgboost, xgb_params, dtrain, num_boost_round=100)
    return


def main(client):
    print(f'n_samples={n_samples}, n_features={n_features}')
    X_local, y_local = make_classification(n_samples=n_samples, n_features=n_features, random_state=12345)
    number_of_classes = len(set(y_local))
    X = da.from_array(X_local, chunks=(n_samples//8,n_features), name='train_feature')
    y = da.from_array(y_local, chunks=(n_samples//8,), name='train_label')

    futures = []
    results = []
    
    for i in range(100, 105):
        f1 = client.submit(invoke_dis_xgboost, X, y, number_of_classes, i)
        futures.append(f1)
    for i, f in enumerate(futures):
        results.append(f.result())

    return


if __name__ == '__main__':
    client = Client('127.0.0.1:8786')
    main(client)

Looking at this part of the source code of xgboost.dask.train:

key_to_partition = {part.key: part for part in parts}
who_has = await client.scheduler.who_has(keys=[part.key for part in parts])
worker_map: Dict[str, "distributed.Future"] = defaultdict(list)
for key, workers in who_has.items():
    worker_map[next(iter(workers))].append(key_to_partition[key])

I was expecting to see the 8 parts spread across the 8 workers, one part per worker. However, I am getting the following output:

('tuple-4d2a6701-77c1-41f6-9566-b5d360cde213', ('tcp://127.0.0.1:43719',), 
('tuple-f301e12c-b53d-4520-a764-383d3a5e1785', ('tcp://127.0.0.1:46133',), 
('tuple-4f5d4be3-4e51-4b49-87ea-ef4bf2460f21', ('tcp://127.0.0.1:32889',), 
('tuple-2ad89ca9-2e8a-4486-bfd4-5e36eed92576', ('tcp://127.0.0.1:32889',), 
('tuple-5856e611-f229-4c41-bf49-a39e4d3c6f9a', ('tcp://127.0.0.1:46133',), 
('tuple-d176cd30-4049-40c2-927e-49d06088dc8b', ('tcp://127.0.0.1:46133',), 
('tuple-aba3b2a6-d245-44a9-b3ec-9231b6637c88', ('tcp://127.0.0.1:43719',), 
('tuple-f82b4ee7-47f4-4ccf-a431-efe5b930bd34', ('tcp://127.0.0.1:32889',)

Since only 3 unique workers receive all the partitions, the number of dispatched_train calls is 3, and therefore only 3 workers show CPU utilization on the Dask dashboard.
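
To confirm the imbalance from the user side, without digging into xgboost internals, one could persist the inputs and query client.who_has before building the DaskDMatrix. This is only an illustrative sketch, not part of my script above; the check_placement helper is hypothetical:

from collections import Counter

from dask.distributed import wait


def check_placement(client, X, y):
    # Hypothetical helper: persist the inputs and report how many distinct
    # workers hold their chunks before DaskDMatrix/train sees them.
    X, y = client.persist([X, y])
    wait([X, y])
    owners = client.who_has(X)  # chunk key -> tuple of worker addresses
    workers = Counter(w for ws in owners.values() for w in ws)
    print(f'{len(owners)} chunks held by {len(workers)} distinct workers')
    return X, y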

@jrbourbeau
Member

Thanks for raising an issue @rudra0713. It's not clear to me if this behavior is due to Dask or XGBoost -- cc'ing @trivialfis for thoughts on the XGBoost side of things

@jrbourbeau jrbourbeau transferred this issue from dask/dask Jan 7, 2022
@trivialfis

Thanks for the ping. This came from dmlc/xgboost#7544. The background is that Dask distributes the data to only a limited set of workers (3 in this issue), and as a result XGBoost runs training only on those workers instead of utilizing all available resources. The solution would be something similar to client.rebalance that distributes data to workers in a more uniform way, but I think that function is still being worked on, so I redirected the discussion from xgboost to distributed.
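
As a rough sketch of that idea (not an official fix, and per the comment above rebalancing may not fully solve this yet), one could persist the training data and call client.rebalance() before constructing the DaskDMatrix, so the partitions are spread over more workers. The synthetic data and scheduler address below are assumptions standing in for the original setup:

import dask.array as da
import xgboost as xgb
from dask.distributed import Client, wait

client = Client('127.0.0.1:8786')  # assumes the same local scheduler as above

# Synthetic data standing in for the original arrays
X = da.random.random((100_000, 30), chunks=(100_000 // 8, 30))
y = da.random.randint(0, 2, size=100_000, chunks=100_000 // 8)

X, y = client.persist([X, y])  # materialize the chunks on the cluster
wait([X, y])                   # make sure they are in memory
client.rebalance()             # ask the scheduler to spread in-memory data

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(client, {'objective': 'binary:logistic'}, dtrain,
                        num_boost_round=10)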

@gjoseph92
Collaborator

Is it possible to reproduce this problem without xgboost? Can you create a similar situation where, by just submitting some dummy futures and then submitting more futures, the second batch doesn't get distributed to all workers as expected? It sounds to me like a more general issue with scheduling policy rather than something xgboost is doing, but I may be wrong.
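
For example, something along these lines (just a sketch, assuming the same local scheduler address as in the original report; the dummy task is made up) could show whether later batches of futures land on all workers:

import time
from collections import Counter

from dask.distributed import Client


def dummy(i):
    time.sleep(0.5)  # simulate a little work
    return i


if __name__ == '__main__':
    client = Client('127.0.0.1:8786')
    for batch in range(3):
        futures = [client.submit(dummy, i, pure=False) for i in range(8)]
        client.gather(futures)
        owners = client.who_has(futures)  # result key -> worker addresses
        used = Counter(w for ws in owners.values() for w in ws)
        print(f'batch {batch}: results on {len(used)} distinct workers')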
