[jvm-packages] update doc based on the latest changes (#10847)
wbo4958 authored Oct 11, 2024
1 parent f7cb8e0 commit 59e6c92
Showing 4 changed files with 164 additions and 258 deletions.
36 changes: 5 additions & 31 deletions doc/install.rst
@@ -159,7 +159,7 @@ R
JVM
---

* XGBoost4j/XGBoost4j-Spark
* XGBoost4j-Spark

.. code-block:: xml
:caption: Maven
@@ -172,11 +172,6 @@ JVM
<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j_${scala.binary.version}</artifactId>
<version>latest_version_num</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
@@ -188,11 +183,10 @@ JVM
:caption: sbt
libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j" % "latest_version_num",
"ml.dmlc" %% "xgboost4j-spark" % "latest_version_num"
)
* XGBoost4j-GPU/XGBoost4j-Spark-GPU
* XGBoost4j-Spark-GPU

.. code-block:: xml
:caption: Maven
@@ -205,11 +199,6 @@ JVM
<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark-gpu_${scala.binary.version}</artifactId>
@@ -221,15 +210,14 @@ JVM
:caption: sbt
libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num",
"ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num"
)
This will pull the latest stable version from Maven Central.

For the latest release version number, please check `release page <https://github.com/dmlc/xgboost/releases>`_.

To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-gpu_2.12`` and ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).
To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).
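For illustration only, a minimal sketch of selecting the GPU algorithm once the ``-gpu`` artifact is on the classpath; the parameter-map style is assumed from the JVM package tutorials rather than taken from this page:

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   // Sketch: the GPU algorithm is chosen via the "device" parameter.
   val params = Map("device" -> "cuda", "objective" -> "binary:logistic")
   val classifier = new XGBoostClassifier(params)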


.. note:: Windows not supported in the JVM package
@@ -292,7 +280,7 @@ JVM
resolvers += "XGBoost4J Snapshot Repo" at "https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/"
Then add XGBoost4J as a dependency:
Then add XGBoost4J-Spark as a dependency:

.. code-block:: xml
:caption: maven
@@ -304,12 +292,6 @@ Then add XGBoost4J as a dependency:
</properties>
<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j_${scala.binary.version}</artifactId>
<version>latest_version_num-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
@@ -321,11 +303,10 @@ Then add XGBoost4J as a dependency:
:caption: sbt
libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j" % "latest_version_num-SNAPSHOT",
"ml.dmlc" %% "xgboost4j-spark" % "latest_version_num-SNAPSHOT"
)
* XGBoost4j-GPU/XGBoost4j-Spark-GPU
* XGBoost4j-Spark-GPU

.. code-block:: xml
:caption: maven
@@ -337,12 +318,6 @@ Then add XGBoost4J as a dependency:
</properties>
<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark-gpu_${scala.binary.version}</artifactId>
@@ -354,7 +329,6 @@ Then add XGBoost4J as a dependency:
:caption: sbt
libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num-SNAPSHOT",
"ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num-SNAPSHOT"
)
32 changes: 14 additions & 18 deletions doc/jvm/xgboost4j_spark_gpu_tutorial.rst
@@ -1,6 +1,6 @@
#############################################
XGBoost4J-Spark-GPU Tutorial (version 1.6.1+)
#############################################
############################
XGBoost4J-Spark-GPU Tutorial
############################

**XGBoost4J-Spark-GPU** is an open source library aiming to accelerate distributed XGBoost training on an Apache Spark cluster from
end to end with GPUs by leveraging the `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_ product.
@@ -71,7 +71,7 @@ To make the Iris dataset recognizable to XGBoost, we need to encode the String-t
label, i.e. "class", to the Double-typed label.

One way to convert the String-typed label to Double is to use Spark's built-in feature transformer
`StringIndexer <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>`_.
`StringIndexer <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/StringIndexer.html>`_.
But this feature is not accelerated in RAPIDS Accelerator, which means it will fall back
to CPU. Instead, we use an alternative way to achieve the same goal with the following code:
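The referenced code is collapsed in this diff view. As a rough sketch of the window-function idea (not necessarily the exact tutorial code; ``rawInput`` and the column names are assumed from the Iris schema):

.. code-block:: scala

   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.dense_rank

   // Sketch: rank the distinct class names and use (rank - 1)
   // as the Double-typed label expected by XGBoost.
   val spec = Window.orderBy("class")
   val labeled = rawInput
     .withColumn("label", (dense_rank().over(spec) - 1).cast("double"))
     .drop("class")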

@@ -107,10 +107,10 @@ With window operations, we have mapped the string column of labels to label indi
Training
========

The GPU version of XGBoost-Spark supports both regression and classification
XGBoost4J-Spark-GPU supports regression, classification, and ranking
models. Although we use the Iris dataset in this tutorial to show how we use
``XGBoost/XGBoost4J-Spark-GPU`` to resolve a multi-classes classification problem, the
usage in Regression is very similar to classification.
``XGBoost4J-Spark-GPU`` to resolve a multi-class classification problem, the
usage in regression and ranking is very similar to classification.

To train a XGBoost model for classification, we need to define a XGBoostClassifier first:
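The definition itself is collapsed in this diff view; a minimal sketch of what it typically looks like, assuming the Iris feature column names and purely illustrative parameter values (check the rendered tutorial for the authoritative version):

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   val xgbParams = Map(
     "objective" -> "multi:softprob",
     "num_class" -> 3,
     "num_round" -> 100,
     "num_workers" -> 1,
     "device" -> "cuda" // train on the GPU
   )

   // The GPU pipeline reads numeric feature columns directly, so the feature
   // column names are passed as an array rather than an assembled vector.
   val featureNames = Array("sepal length", "sepal width", "petal length", "petal width")
   val xgbClassifier = new XGBoostClassifier(xgbParams)
     .setFeaturesCol(featureNames)
     .setLabelCol("label")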

@@ -168,12 +168,13 @@ model can then be used in other tasks like prediction.
Prediction
==========

When we get a model, either a XGBoostClassificationModel or a XGBoostRegressionModel, it takes a DataFrame as an input,
When we get a model, whether a XGBoostClassificationModel, a XGBoostRegressionModel, or a XGBoostRankerModel, it takes a DataFrame as an input,
reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
with the following columns by default:

* XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities (``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label.
* XGBoostRegressionModel will output a prediction label (``predictionCol``).
* XGBoostRankerModel will output a prediction label (``predictionCol``).

.. code-block:: scala
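
   // Hedged sketch: the original code here is collapsed in this diff view.
   // `train` and `test` are the DataFrames assumed from the earlier split,
   // and `xgbClassifier` is the estimator defined in the Training section.
   val xgbClassificationModel = xgbClassifier.fit(train)
   val results = xgbClassificationModel.transform(test)
   results.show()
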
@@ -226,25 +227,20 @@ would be ``"spark.task.resource.gpu.amount=1/spark.executor.cores"``. However, i
using a XGBoost version earlier than 2.1.0 or a Spark standalone cluster version below 3.4.0,
you still need to set ``"spark.task.resource.gpu.amount"`` equal to ``"spark.executor.resource.gpu.amount"``.
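As a concrete illustration of that rule (the numbers are hypothetical, and a Scala ``SparkConf`` is used only for readability; the same keys can equally be passed as ``--conf`` options to ``spark-submit``):

.. code-block:: scala

   import org.apache.spark.SparkConf

   // Illustration: 12 executor cores sharing 1 GPU -> each task requests 1/12
   // of the GPU, matching the 1/spark.executor.cores guidance above
   // (requires XGBoost 2.1.0+ and Spark standalone 3.4.0+).
   val conf = new SparkConf()
     .set("spark.executor.cores", "12")
     .set("spark.executor.resource.gpu.amount", "1")
     .set("spark.task.resource.gpu.amount", (1.0 / 12).toString)

   // On older XGBoost (< 2.1.0) or Spark standalone (< 3.4.0), keep the two amounts equal:
   // .set("spark.task.resource.gpu.amount", "1")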

.. note::

As of now, the stage-level scheduling feature in XGBoost is limited to the Spark standalone cluster mode.
However, we have plans to expand its compatibility to YARN and Kubernetes once Spark 3.5.1 is officially released.

Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",`
Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",
provided below is an example demonstrating how to submit the XGBoost application to an Apache
Spark Standalone cluster.

.. code-block:: bash
rapids_version=23.10.0
xgboost_version=2.0.1
rapids_version=24.08.0
xgboost_version=$LATEST_VERSION
main_class=Iris
app_jar=iris-1.0.0.jar
spark-submit \
--master $master \
--packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
--packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
--conf spark.executor.cores=12 \
--conf spark.task.cpus=1 \
--conf spark.executor.resource.gpu.amount=1 \
Expand All @@ -255,7 +251,7 @@ Spark Standalone cluster.
--class ${main_class} \
${app_jar}
* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
* First, we need to specify the ``RAPIDS Accelerator`` and ``xgboost4j-spark-gpu`` packages via ``--packages``
* Second, ``RAPIDS Accelerator`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``

For details about other ``RAPIDS Accelerator`` configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_.