[jvm-packages] XGBoost Spark supports ranking with group data. #3369

yanboliang · 2018-06-07T23:23:49Z

XGBoost Spark supports ranking with group data.

CodingCat · 2018-06-08T03:51:50Z

Eh... larger than I expected. Will look at it tomorrow afternoon (Friday afternoon :-) )

hcho3 · 2018-06-08T04:38:48Z

@CodingCat @yanboliang Sorry for hijacking this thread, but would #2749 be useful for ranking tasks on Spark?

yanboliang · 2018-06-08T05:01:08Z

@hcho3 Yep, it would be useful for ranking on xgboost-spark; we are heading in the same direction. This PR exposes a new group data API for xgboost-spark, and we can update the internal implementation if the backend C++ code changes. Thanks.

hcho3 · 2018-06-08T05:07:34Z

@yanboliang That's good to know. I will add some tests to #2749 and merge it. Thanks!

CodingCat · 2018-06-12T03:00:32Z

jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala

@@ -21,12 +21,10 @@ import java.nio.file.Files

 import scala.collection.mutable
 import scala.util.Random
-


can we keep these empty lines to separate xgboost4j and the other imports?

CodingCat · 2018-06-12T03:01:30Z

jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala

@@ -56,8 +54,8 @@ object XGBoost extends Serializable {
  private val logger = LogFactory.getLog("XGBoostSpark")

  private def removeMissingValues(
-      denseLabeledPoints: Iterator[XGBLabeledPoint],
-      missing: Float): Iterator[XGBLabeledPoint] = {
+      denseLabeledPoints: Seq[XGBLabeledPoint],


If you take a Seq, you load the entire partition into memory, which can potentially lead to OOM.
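To illustrate the concern, here is a minimal sketch of a filter that stays lazy by keeping the `Iterator` signature. `Point` and `LazyFilter.removeMissing` are hypothetical stand-ins for `XGBLabeledPoint` and `removeMissingValues`, not the code in this PR; the real class has more fields and the real method may differ.

```scala
// Hypothetical stand-in for XGBLabeledPoint (assumption: the real class has more fields).
final case class Point(label: Float, indices: Array[Int], values: Array[Float])

object LazyFilter {
  // Drops feature entries equal to `missing` from each point.
  // Iterator.map is lazy, so points are transformed one at a time as the
  // consumer pulls them; the partition is never materialized as a whole.
  def removeMissing(points: Iterator[Point], missing: Float): Iterator[Point] =
    points.map { p =>
      val kept = p.indices.zip(p.values).filter { case (_, v) =>
        // NaN never equals itself, so it needs an explicit check.
        if (missing.isNaN) !v.isNaN else v != missing
      }
      p.copy(indices = kept.map(_._1), values = kept.map(_._2))
    }
}
```

Calling `.toSeq` on the input before a function like this would defeat the laziness, because the source iterator would have to be fully consumed and stored first.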

CodingCat · 2018-06-12T03:02:33Z

jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala

@@ -129,7 +127,7 @@ object XGBoost extends Serializable {
      rabitEnv.put("DMLC_TASK_ID", taskId)
      Rabit.init(rabitEnv)
      val watches = Watches(params,
-        removeMissingValues(labeledPoints, missing),
+        removeMissingValues(labeledPoints.toSeq, missing),


Yeah, this toSeq risks an OOM.

CodingCat · 2018-06-12T03:07:33Z

jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala

@@ -308,9 +306,26 @@ private class Watches private(

 private object Watches {

+  def formatGroups(groups: Seq[Int]): Seq[Int] = {


buildGroups or formGroups as a better name?
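The diff shows only the signature of `formatGroups` (`Seq[Int] => Seq[Int]`), not its body. One plausible reading, given that XGBoost's ranking input is described by a list of group sizes, is that it collapses per-row group ids into run lengths. The sketch below assumes exactly that; the object name `GroupSizes` and the behavior are illustrative assumptions, not the PR's implementation.

```scala
import scala.collection.mutable.ArrayBuffer

object GroupSizes {
  // Collapses consecutive runs of equal group ids into run lengths,
  // e.g. ids (7, 7, 3, 3, 3) become sizes (2, 3).
  // Assumes rows of the same group are already adjacent.
  def buildGroups(groupIds: Seq[Int]): Seq[Int] = {
    val ids = groupIds.toIndexedSeq // constant-time indexing
    val sizes = ArrayBuffer.empty[Int]
    var i = 0
    while (i < ids.length) {
      var j = i
      // Advance j to the end of the current run.
      while (j < ids.length && ids(j) == ids(i)) j += 1
      sizes += (j - i)
      i = j
    }
    sizes.toSeq
  }
}
```

Under this reading, a group id that reappears after a different id starts a new group, so the input has to be sorted or otherwise grouped beforehand.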

codecov-io · 2018-06-13T01:56:47Z

Codecov Report

Merging #3369 into spark_dev_do_not_delete will increase coverage by 0.09%.
The diff coverage is 75%.

@@                      Coverage Diff                      @@
##             spark_dev_do_not_delete    #3369      +/-   ##
=============================================================
+ Coverage                      44.92%   45.02%   +0.09%     
- Complexity                       186      188       +2     
=============================================================
  Files                            163      163              
  Lines                          12932    12952      +20     
  Branches                         439      443       +4     
=============================================================
+ Hits                            5810     5831      +21     
+ Misses                          6921     6920       -1     
  Partials                         201      201

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| .../dmlc/xgboost4j/scala/spark/XGBoostRegressor.scala | `61.72% <50%> (+0.96%)` | `16 <0> (+2)` | ⬆️ |
| ...c/xgboost4j/scala/spark/params/GeneralParams.scala | `71.64% <50%> (-0.67%)` | `0 <0> (ø)` | |
| .../scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala | `73.59% <94.44%> (+3.47%)` | `0 <0> (ø)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4b3dafe...f156dbb.

CodingCat · 2018-06-15T03:08:12Z

LGTM

CodingCat · 2018-06-15T03:08:48Z

Can you file a PR from spark_dev_do_not_delete to master, so that the commits count as your contribution?

* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* [jvm-packages] XGBoost Spark integration refactor. (#3313)
* XGBoost Spark integration refactor.
* Make corresponding update for xgboost4j-example
* Address comments.
* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)
* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib
* Fix extra space.
* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)
* XGBoost Spark supports ranking with group data.
* Use Iterator.duplicate to prevent OOM.
* Update CheckpointManagerSuite.scala
* Resolve conflicts

XGBoost Spark supports ranking with group data.

9ee21f8

CodingCat reviewed Jun 12, 2018

Use Iterator.duplicate to prevent OOM.

f156dbb
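The commit above names `Iterator.duplicate` as the fix for the `toSeq` concern raised in review. As a rough sketch of the idea, and not the code from that commit, the example below derives two streams from a single source iterator without first copying it into a `Seq`. The row type `(Float, Int)` (label, group id) and the object name `DuplicateDemo` are assumptions made for illustration.

```scala
object DuplicateDemo {
  // Splits one pass over `rows` into a label stream and a group-id stream.
  // Iterator.duplicate returns two iterators over the same elements and
  // buffers only the elements that one side has read and the other has not.
  def labelsAndGroups(rows: Iterator[(Float, Int)]): (Iterator[Float], Iterator[Int]) = {
    val (forLabels, forGroups) = rows.duplicate
    (forLabels.map(_._1), forGroups.map(_._2))
  }
}
```

Memory stays bounded only when the two iterators are consumed roughly in step. If one side is drained completely before the other is touched, the internal buffer grows to hold every element, which is no better than `toSeq`.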

CodingCat merged commit 2903283 into dmlc:spark_dev_do_not_delete Jun 15, 2018

yanboliang deleted the spark_dev_do_not_delete branch June 15, 2018 19:14

CodingCat mentioned this pull request Jul 7, 2018

[jvm-packages] group data is only set for training set and is set incorrectly #3097

Closed

lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
