[dask] Use a 1 line sample to infer output shape.
This is for inferring the output shape with direct prediction (without DaskDMatrix).
A few things require a known output shape before the actual prediction is carried out,
including dask metadata and output dataframe columns.

* Infer output shape based on local prediction.
* Remove setting parameters in the predict function, as it's neither thread safe nor
  necessary now that we let dask decide the parallelism.
* Simplify prediction on `DaskDMatrix`.
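
A minimal sketch of the shape-inference idea described above, using plain (non-dask)
XGBoost; the helper name and the zero-filled sample row are illustrative, not part of
the actual implementation:

    import numpy as np
    import xgboost as xgb

    def infer_output_shape(booster: xgb.Booster, n_features: int) -> tuple:
        # Predict on a single synthetic row so the output shape (and hence the
        # dask meta / output dataframe columns) is known before running the
        # full distributed prediction.
        sample = xgb.DMatrix(np.zeros((1, n_features)))
        return booster.predict(sample).shape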
trivialfis committed Jan 29, 2021
1 parent c3c8e66 commit 4a40f80
Showing 3 changed files with 295 additions and 220 deletions.
24 changes: 22 additions & 2 deletions doc/tutorials/dask.rst
@@ -108,8 +108,14 @@ computation a bit faster when meta information like ``base_margin`` is not needed
prediction = xgb.dask.inplace_predict(client, output, X)
Here ``prediction`` is a dask ``Array`` object containing predictions from the model if
the input is a ``DaskDMatrix`` or ``da.Array``. For ``dd.DataFrame`` input, the return
value of normal prediction is either a ``dd.Series`` or a ``dd.DataFrame``, depending on
the output shape of the prediction. When SHAP-based prediction is used, the return value
can exceed 2 dimensions; in such cases an ``Array`` is returned.

The performance of running prediction is sensitive to the number of partitions/blocks in
each input dataset. Internally, it's implemented by running local prediction on each data
partition.
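
A minimal sketch of the output types described above; the data, parameters, and variable
names are illustrative only, and for a single-output model the dataframe input is
expected to yield a ``dd.Series``:

.. code-block:: python

    import dask.array as da
    import dask.dataframe as dd
    import xgboost as xgb
    from dask.distributed import Client, LocalCluster

    client = Client(LocalCluster())

    # Synthetic single-output regression data as dask collections.
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.random((1000,), chunks=(100,))
    Xd = dd.from_dask_array(X)

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain, num_boost_round=10)

    # da.Array input -> da.Array output.
    pred_array = xgb.dask.predict(client, output, X)
    assert isinstance(pred_array, da.Array)

    # dd.DataFrame input with one output per row -> dd.Series.
    pred_series = xgb.dask.predict(client, output, Xd)
    assert isinstance(pred_series, dd.Series)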

Alternatively, XGBoost also implements the Scikit-Learn interface with ``DaskXGBClassifier``
and ``DaskXGBRegressor``. See ``xgboost/demo/dask`` for more examples.
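
A brief sketch of the Scikit-Learn style interface mentioned above, reusing ``client``,
``X`` and ``y`` from the earlier sketch; the parameters shown are illustrative:

.. code-block:: python

    reg = xgb.dask.DaskXGBRegressor(n_estimators=10, tree_method="hist")
    reg.client = client          # attach an existing dask.distributed.Client
    reg.fit(X, y)                # X, y are dask collections
    prediction = reg.predict(X)  # returns a dask collection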
@@ -147,6 +153,20 @@ Also for inplace prediction:
prediction = xgb.dask.inplace_predict(client, booster, X)
The performance of running prediction directly on a dask collection, either using
``predict`` or ``inplace_predict``, is particularly sensitive to the number of blocks.
Internally, it's implemented using ``da.map_blocks`` or ``dd.map_partitions``. When the
number of partitions is large and each of them has only a small amount of data, the
overhead of calling predict becomes significant. On the other hand, if not using GPU, the
number of threads used for prediction on each block matters. If the number of blocks on
each worker is small, the CPU workers might not be fully utilized. Right now, XGBoost
uses a single thread for each partition.
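
One way to act on this is to control the block count of the input before predicting, for
example by rechunking an array or repartitioning a dataframe. Continuing from the earlier
sketch, with block sizes chosen purely for illustration:

.. code-block:: python

    # Fewer, larger blocks reduce the per-block overhead of calling predict.
    X_coarse = X.rechunk((500, X.shape[1]))
    Xd_coarse = Xd.repartition(npartitions=2)

    booster = output["booster"]
    pred_arr = xgb.dask.inplace_predict(client, booster, X_coarse)
    pred_df = xgb.dask.predict(client, booster, Xd_coarse)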

When putting a dask collection directly into the ``predict`` function or using
``inplace_predict``, the output type depends on the input data. When the input is a
``da.Array`` object, the output is always a ``da.Array``. However, if the input type is
``dd.DataFrame``, the output can be a ``dd.Series`` or a ``dd.DataFrame``, depending on
the output shape.
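
As a further illustration, SHAP-style prediction is noted above as returning an
``Array`` when the output exceeds 2 dimensions. A small sketch continuing from the
earlier one, assuming your XGBoost version's ``xgb.dask.predict`` accepts
``pred_contribs``:

.. code-block:: python

    # Feature contributions (SHAP values) computed on the array input; per the
    # note above, array input always yields a dask Array, and SHAP-style
    # outputs may have more than 2 dimensions for multi-class models.
    contribs = xgb.dask.predict(client, booster, X, pred_contribs=True)
    assert isinstance(contribs, da.Array)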

***************************
Working with other clusters
***************************