[dask] Use a 1 line sample to infer output shape.
This is for inferring the output shape with direct prediction (without DaskDMatrix).
A few things require a known output shape before the actual prediction is carried out,
including dask metadata and output dataframe columns.

* Infer output shape based on local prediction.
* Remove setting parameters in the predict function, as it's neither thread safe nor
  necessary now that we let dask decide the parallelism.
* Simplify prediction on `DaskDMatrix`.
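
A minimal sketch of the shape-inference idea described above, using plain (non-dask)
XGBoost; the helper name and the zero-filled sample row are illustrative, not part of
the actual implementation:

    import numpy as np
    import xgboost as xgb

    def infer_output_shape(booster: xgb.Booster, n_features: int) -> tuple:
        # Predict on a single synthetic row so the output shape (and hence the
        # dask meta / output dataframe columns) is known before running the
        # full distributed prediction.
        sample = xgb.DMatrix(np.zeros((1, n_features)))
        return booster.predict(sample).shape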
trivialfis committed Jan 29, 2021
1 parent c3c8e66 commit 4a40f80
Showing 3 changed files with 295 additions and 220 deletions.
24 changes: 22 additions & 2 deletions doc/tutorials/dask.rst
@@ -108,8 +108,14 @@ computation a bit faster when meta information like ``base_margin`` is not needed
prediction = xgb.dask.inplace_predict(client, output, X)
Here ``prediction`` is a dask ``Array`` object containing predictions from the model if
the input is a ``DaskDMatrix`` or ``da.Array``. For ``dd.DataFrame`` input, the return
value of normal prediction is either a ``dd.Series`` or a ``dd.DataFrame``, depending on
the output shape of the prediction. When SHAP-based prediction is used, the return value
can exceed 2 dimensions; in such cases an ``Array`` is returned.

The performance of running prediction is sensitive to the number of partitions/blocks in
each input dataset. Internally, it's implemented by running local prediction on each data
partition.
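
A minimal sketch of the output types described above; the data, parameters, and variable
names are illustrative only, and for a single-output model the dataframe input is
expected to yield a ``dd.Series``:

.. code-block:: python

    import dask.array as da
    import dask.dataframe as dd
    import xgboost as xgb
    from dask.distributed import Client, LocalCluster

    client = Client(LocalCluster())

    # Synthetic single-output regression data as dask collections.
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.random((1000,), chunks=(100,))
    Xd = dd.from_dask_array(X)

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain, num_boost_round=10)

    # da.Array input -> da.Array output.
    pred_array = xgb.dask.predict(client, output, X)
    assert isinstance(pred_array, da.Array)

    # dd.DataFrame input with one output per row -> dd.Series.
    pred_series = xgb.dask.predict(client, output, Xd)
    assert isinstance(pred_series, dd.Series)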

Alternatively, XGBoost also implements the Scikit-Learn interface with ``DaskXGBClassifier``
and ``DaskXGBRegressor``. See ``xgboost/demo/dask`` for more examples.
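
A brief sketch of the Scikit-Learn style interface mentioned above, reusing ``client``,
``X`` and ``y`` from the earlier sketch; the parameters shown are illustrative:

.. code-block:: python

    reg = xgb.dask.DaskXGBRegressor(n_estimators=10, tree_method="hist")
    reg.client = client          # attach an existing dask.distributed.Client
    reg.fit(X, y)                # X, y are dask collections
    prediction = reg.predict(X)  # returns a dask collection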
@@ -147,6 +153,20 @@ Also for inplace prediction:
prediction = xgb.dask.inplace_predict(client, booster, X)
The performance of running prediction directly on a dask collection, either using
``predict`` or ``inplace_predict``, is particularly sensitive to the number of blocks.
Internally, it's implemented using ``da.map_blocks`` or ``dd.map_partitions``. When the
number of partitions is large and each of them has only a small amount of data, the
overhead of calling predict becomes significant. On the other hand, if not using GPU, the
number of threads used for prediction on each block matters. If the number of blocks on
each worker is small, the CPU workers might not be fully utilized. Right now, XGBoost
uses a single thread for each partition.
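
One way to act on this is to control the block count of the input before predicting, for
example by rechunking an array or repartitioning a dataframe. Continuing from the earlier
sketch, with block sizes chosen purely for illustration:

.. code-block:: python

    # Fewer, larger blocks reduce the per-block overhead of calling predict.
    X_coarse = X.rechunk((500, X.shape[1]))
    Xd_coarse = Xd.repartition(npartitions=2)

    booster = output["booster"]
    pred_arr = xgb.dask.inplace_predict(client, booster, X_coarse)
    pred_df = xgb.dask.predict(client, booster, Xd_coarse)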

When putting a dask collection directly into the ``predict`` function or using
``inplace_predict``, the output type depends on the input data. When the input is a
``da.Array`` object, the output is always a ``da.Array``. However, if the input type is
``dd.DataFrame``, the output can be a ``dd.Series`` or a ``dd.DataFrame``, depending on
the output shape.
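
As a further illustration, SHAP-style prediction is noted above as returning an
``Array`` when the output exceeds 2 dimensions. A small sketch continuing from the
earlier one, assuming your XGBoost version's ``xgb.dask.predict`` accepts
``pred_contribs``:

.. code-block:: python

    # Feature contributions (SHAP values) computed on the array input; per the
    # note above, array input always yields a dask Array, and SHAP-style
    # outputs may have more than 2 dimensions for multi-class models.
    contribs = xgb.dask.predict(client, booster, X, pred_contribs=True)
    assert isinstance(contribs, da.Array)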

***************************
Working with other clusters
***************************