[jvm-packages] XGBoost Spark integration refactor #3387

yanboliang · 2018-06-15T19:19:17Z

Combines several PRs into one in order to merge the dev branch into master.

codecov-io · 2018-06-15T20:34:50Z

Codecov Report

Merging #3387 into master will decrease coverage by 0.5%.
The diff coverage is 63.22%.

@@             Coverage Diff              @@
##             master    #3387      +/-   ##
============================================
- Coverage     44.99%   44.49%   -0.51%     
+ Complexity      228      188      -40     
============================================
  Files           166      163       -3     
  Lines         12787    12769      -18     
  Branches        466      443      -23     
============================================
- Hits           5754     5681      -73     
- Misses         6841     6887      +46     
- Partials        192      201       +9

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...c/main/scala/ml/dmlc/xgboost4j/scala/Booster.scala | `35.71% <0%> (-1.33%)` | `8 <0> (ø)` |
| ...t4j/scala/example/spark/SparkModelTuningTool.scala | `0% <0%> (ø)` | `0 <0> (ø)` |
| ...ost4j/scala/example/spark/SparkWithDataFrame.scala | `0% <0%> (ø)` | `0 <0> (ø)` |
| ...dmlc/xgboost4j/scala/spark/CheckpointManager.scala | `72.91% <100%> (-4.17%)` | `13 <0> (ø)` |
| ...c/xgboost4j/scala/spark/params/BoosterParams.scala | `59.3% <51.78%> (-6.04%)` | `0 <0> (ø)` |
| .../dmlc/xgboost4j/scala/spark/XGBoostRegressor.scala | `61.72% <61.72%> (ø)` | `16 <16> (?)` |
| ...dmlc/xgboost4j/scala/spark/XGBoostClassifier.scala | `62.5% <62.5%> (ø)` | `18 <18> (?)` |
| ...c/xgboost4j/scala/spark/params/GeneralParams.scala | `71.64% <68.96%> (-20.96%)` | `0 <0> (ø)` |
| ...oost4j/scala/spark/params/LearningTaskParams.scala | `80% <76.47%> (-13.11%)` | `0 <0> (ø)` |
| .../scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala | `73.59% <88.88%> (-1.72%)` | `0 <0> (ø)` |

... and 7 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 578a0c7...c1ff626.

CodingCat · 2018-06-18T16:17:43Z

Several PRs were merged to master before this one. Would you check whether this PR contains those fixes?

yanboliang · 2018-06-18T21:00:18Z

@CodingCat Yes, I have rebased master.

CodingCat · 2018-06-18T21:58:37Z

@yanboliang sure, thanks. It looks like some newly added tests failed. @RAMitchell, would you mind sharing some insights on this?

a-johnston · 2018-06-25T23:50:18Z

jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala

+    var count = 1
+    var i = 1
+    while (i < groups.length) {
+      if (groups(i) != groups(i - 1)) {


Maybe related to #3406. Since this is a sequence made from an iterable, the apply happening twice in (groups(i) != groups(i - 1)) could be linear time, making this method quadratic (and thus blowing up for larger datasets)?

CodingCat · 2018-06-26

@yanboliang can you look at this? I think @a-johnston's point is valid.

Additionally, loading the whole group as a Seq still leaves it vulnerable to OOM.

a-johnston · 2018-06-26

Anecdotally, @ngoyal2707 and I were checking this out earlier. It worked for smaller datasets (tens of thousands) but not for larger ones (tens of millions). We never hit OOM, and we could train on data without groups on the same framework; instead it just hung forever processing, although we never connected a debugger to check which implementation is actually used. I think it might even be cleaner to just do something like:

    var lastGroup = -1
    for (group <- groups) {
      if (lastGroup != group) {
        lastGroup = group
        <etc>
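For reference, the single-pass idea suggested above could be fleshed out roughly as follows. This is only a sketch under assumptions: it assumes the goal is to turn a column of consecutive group ids into the length of each run, and `GroupSizesSketch` and `groupSizes` are hypothetical names, not the code from this PR or from any follow-up. It takes the first element as the starting group instead of a `-1` sentinel, so a group id of `-1` would not be miscounted.

```scala
import scala.collection.mutable.ArrayBuffer

object GroupSizesSketch {
  // Hypothetical helper, not the function from this PR: returns the length of
  // each run of equal consecutive group ids, reading the input exactly once.
  def groupSizes(groups: Iterator[Int]): Array[Int] = {
    val sizes = ArrayBuffer.empty[Int]
    if (groups.hasNext) {
      var lastGroup = groups.next()
      var count = 1
      // The for loop continues from the second element of the iterator.
      for (group <- groups) {
        if (group != lastGroup) {
          sizes += count // close the previous run
          lastGroup = group
          count = 1
        } else {
          count += 1
        }
      }
      sizes += count // close the final run
    }
    sizes.toArray
  }
}
```

With this sketch, `GroupSizesSketch.groupSizes(Iterator(1, 1, 2, 2, 2, 3))` yields the sizes 2, 3, 1, and no element is ever looked up by index.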

yanboliang · 2018-06-26

@a-johnston You have a good point, and I think your suggestion would be more efficient; Scala's apply may have some performance issues. But the current code only takes the time from O(N) to O(2N) at worst, so it's still linear, not quadratic (O(N^2)).
I will run some experiments over the next few days to check further. If you find more clues, please feel free to share them. Thanks.

a-johnston · 2018-06-26

@yanboliang looking a bit closer, I'm pretty sure that groups here is a scala.collection.immutable.Stream$Cons, which would lead to a linear-time apply and thus quadratic time overall (including the while loop). Linear time is unavoidable with this approach since it streams over the entire column. While I generally prefer this approach to the old groupData, this is definitely a degradation.

Also, if you're too busy to dig more into this, I can open a PR for it later today.
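To make the quadratic cost concrete, here is a small sketch that counts work instead of measuring time. It assumes a singly linked sequence whose `apply(i)` walks `i` links from the head, which is how `Stream` indexing behaves; `CountingLinkedSeq` and `hopsForIndexLoop` are hypothetical names used only for illustration, and the loop only mirrors the shape of the index-based loop in the diff hunk.

```scala
object ApplyCostSketch {
  // Hypothetical stand-in for a singly linked sequence: apply(i) walks i links
  // from the head, and every link followed is counted in `hops`.
  final class CountingLinkedSeq(items: List[Int]) {
    var hops: Long = 0L
    val length: Int = items.length
    def apply(i: Int): Int = {
      var rest = items
      var k = 0
      while (k < i) { rest = rest.tail; hops += 1; k += 1 }
      rest.head
    }
  }

  // Runs an index-based loop shaped like the one in the diff hunk over n
  // elements and returns the total number of links followed.
  def hopsForIndexLoop(n: Int): Long = {
    val groups = new CountingLinkedSeq(List.tabulate(n)(_ / 2))
    var i = 1
    while (i < groups.length) {
      if (groups(i) != groups(i - 1)) {
        // a group boundary would be recorded here
      }
      i += 1
    }
    groups.hops
  }
}
```

Each iteration follows `i + (i - 1)` links, so n elements cost (n - 1)^2 links in total: 998001 for n = 1000 and 3996001 for n = 2000. Doubling the input roughly quadruples the work, which is consistent with a job that finishes on tens of thousands of rows and appears to hang on tens of millions.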

yanboliang · 2018-06-26

@a-johnston I see, Seq.apply is linear time, so the whole loop will be quadratic. Your suggested code should solve the performance issue as well. What about changing the input format from Seq to Iterator to make it less vulnerable to OOM? Please feel free to open a PR for this.
BTW, the new approach is aligned with #2749; if that PR gets merged, we can switch the underlying implementation to leverage it. Thanks.

CodingCat and others added 9 commits June 15, 2018 14:37

add back train method but mark as deprecated

8d0e14d

fix scalastyle error

2b862e2

add back train method but mark as deprecated

1c4685f

fix scalastyle error

2fe34d2

[jvm-packages] XGBoost Spark integration refactor. (#3313)

7300f5e

* XGBoost Spark integration refactor.
* Make corresponding update for xgboost4j-example
* Address comments.

[jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)

af3b980

* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib
* Fix extra space.

[jvm-packages] XGBoost Spark supports ranking with group data. (#3369)

6697cfc

* XGBoost Spark supports ranking with group data.
* Use Iterator.duplicate to prevent OOM.

Update CheckpointManagerSuite.scala

404f60b

Resolve conflicts

c1ff626

CodingCat changed the title ~~XGBoost Spark integration refactor~~ [jvm-packages] XGBoost Spark integration refactor Jun 18, 2018

CodingCat merged commit 2c4359e into dmlc:master Jun 18, 2018

yanboliang deleted the spark_dev_do_not_delete branch June 18, 2018 22:49

a-johnston reviewed Jun 25, 2018

This was referenced Jun 26, 2018

[jvm-packages] buildGroups is unnecessarily expensive #3412

Closed

[jvm-packages] Avoid use of Seq.apply in buildGroups #3413

Merged

beautifulskylfsd mentioned this pull request Jun 27, 2018

[jvm-packages] Models saved using xgboost4j-spark cannot be loaded in Python xgboost #2480

Closed

a-johnston pushed a commit to a-johnston/xgboost that referenced this pull request Jun 28, 2018

update for compatibility with dmlc#3387

15ab47c

vincent-grosbois mentioned this pull request Aug 27, 2018

[jvm-packages] Confusion about param map keywords #3641

Closed

lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
