[jvm-packages] XGBoost Spark integration refactor #3387

Merged
merged 9 commits into from
Jun 18, 2018


yanboliang
Contributor

This combines several PRs into one in order to merge the dev branch into master.

@codecov-io

codecov-io commented Jun 15, 2018

Codecov Report

Merging #3387 into master will decrease coverage by 0.5%.
The diff coverage is 63.22%.


@@             Coverage Diff              @@
##             master    #3387      +/-   ##
============================================
- Coverage     44.99%   44.49%   -0.51%     
+ Complexity      228      188      -40     
============================================
  Files           166      163       -3     
  Lines         12787    12769      -18     
  Branches        466      443      -23     
============================================
- Hits           5754     5681      -73     
- Misses         6841     6887      +46     
- Partials        192      201       +9
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...c/main/scala/ml/dmlc/xgboost4j/scala/Booster.scala | 35.71% <0%> (-1.33%) | 8 <0> (ø) |
| ...t4j/scala/example/spark/SparkModelTuningTool.scala | 0% <0%> (ø) | 0 <0> (ø) ⬇️ |
| ...ost4j/scala/example/spark/SparkWithDataFrame.scala | 0% <0%> (ø) | 0 <0> (ø) ⬇️ |
| ...dmlc/xgboost4j/scala/spark/CheckpointManager.scala | 72.91% <100%> (-4.17%) | 13 <0> (ø) |
| ...c/xgboost4j/scala/spark/params/BoosterParams.scala | 59.3% <51.78%> (-6.04%) | 0 <0> (ø) |
| .../dmlc/xgboost4j/scala/spark/XGBoostRegressor.scala | 61.72% <61.72%> (ø) | 16 <16> (?) |
| ...dmlc/xgboost4j/scala/spark/XGBoostClassifier.scala | 62.5% <62.5%> (ø) | 18 <18> (?) |
| ...c/xgboost4j/scala/spark/params/GeneralParams.scala | 71.64% <68.96%> (-20.96%) | 0 <0> (ø) |
| ...oost4j/scala/spark/params/LearningTaskParams.scala | 80% <76.47%> (-13.11%) | 0 <0> (ø) |
| .../scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala | 73.59% <88.88%> (-1.72%) | 0 <0> (ø) |
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 578a0c7...c1ff626.

CodingCat and others added 9 commits June 15, 2018 14:37
* XGBoost Spark integration refactor.

* Make corresponding update for xgboost4j-example

* Address comments.
…th both XGBoost and Spark MLLib (#3326)

* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib

* Fix extra space.
* XGBoost Spark supports ranking with group data.

* Use Iterator.duplicate to prevent OOM.
@CodingCat changed the title from "XGBoost Spark integration refactor" to "[jvm-packages] XGBoost Spark integration refactor" on Jun 18, 2018
@CodingCat
Member

Several PRs were merged to master before this one. Would you check whether this PR contains those fixes?

@yanboliang
Contributor Author

@CodingCat Yes, I have rebased onto master.

@CodingCat
Member

@yanboliang sure, thanks. It looks like some newly added tests failed. @RAMitchell, would you mind sharing some insights on this?

@CodingCat CodingCat merged commit 2c4359e into dmlc:master Jun 18, 2018
@yanboliang yanboliang deleted the spark_dev_do_not_delete branch June 18, 2018 22:49
var count = 1
var i = 1
while (i < groups.length) {
  if (groups(i) != groups(i - 1)) {
Contributor

Maybe related to #3406. Since this is a sequence made from an iterable, each of the two apply calls in (groups(i) != groups(i - 1)) could be linear time, which would make this method quadratic (and thus blow up for larger datasets)?

Member

@yanboliang can you look at this? I think @a-johnston's point is valid.

Additionally, loading the whole group as a Seq still makes it vulnerable to OOM.

Contributor

Anecdotally, @ngoyal2707 and I were checking this out earlier. It worked for smaller datasets (tens of thousands) but not for larger ones (tens of millions), and it never hit OOM, although we could train on data without groups using the same framework. Instead, it just hung forever while processing, although we never attached a debugger to check which implementation is actually used. I think it might even be cleaner to just do something like:

var lastGroup = -1
for (group <- groups) {
    if (lastGroup != group) {
        lastGroup = group
<etc>
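
For reference, the single-pass idea above could be completed roughly as follows. This is only a sketch: the object and method name `GroupSizes.groupSizes` are hypothetical, and it assumes, as the loop under review does, that rows belonging to the same group are adjacent.

```scala
import scala.collection.mutable.ArrayBuffer

object GroupSizes {
  // Hypothetical helper (not from the PR): returns the size of each
  // consecutive run of equal group ids, visiting every element once.
  def groupSizes(groups: Iterable[Int]): Array[Int] = {
    val sizes = ArrayBuffer.empty[Int]
    var lastGroup = 0
    var count = 0 // 0 means no element has been seen yet
    for (group <- groups) {
      if (count > 0 && group != lastGroup) {
        sizes += count // close the previous run
        count = 0
      }
      lastGroup = group
      count += 1
    }
    if (count > 0) sizes += count // close the final run
    sizes.toArray
  }
}
```

Tracking "nothing seen yet" through `count` rather than a `-1` sentinel avoids misbehaving if `-1` ever appears as a real group id. Because the loop uses `for` instead of `groups(i)`, its cost does not depend on whether the collection has constant-time indexing.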

Contributor Author

@a-johnston You have a good point, and I think your suggestion would be more efficient, since Scala's apply may have performance issues. But the current code only takes the time from O(N) to O(2N) at worst, so it is still linear, not quadratic (O(N^2)).
I will run some experiments over the next few days to check further. If you find more clues, please feel free to share them. Thanks.

Contributor

@yanboliang looking a bit closer, I'm pretty sure that groups here is a scala.collection.immutable.Stream$Cons, which would lead to a linear-time apply and therefore quadratic time overall (including the while loop). Linear time is unavoidable with this approach since it streams over the entire column. While I generally prefer this approach to the old groupData, this is definitely a degradation.
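
As a rough cost model (an assumption for illustration, not a measurement): if `apply(i)` on a linked sequence walks $i$ cells from the head, then iteration $i$ of the while loop pays for both `groups(i)` and `groups(i - 1)`, and over $N$ elements the total is

```latex
\sum_{i=1}^{N-1} \bigl( i + (i-1) \bigr) \;=\; \sum_{i=1}^{N-1} (2i - 1) \;=\; (N-1)^2 \;=\; O(N^2)
```

A single traversal, by contrast, touches each of the $N$ cells once. Under this model, tens of millions of rows would mean on the order of $10^{14}$ cell visits, which would be consistent with a job that appears to hang rather than run out of memory.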

Also if you're too busy to dig more into this, I can open a PR for this later today.

Contributor Author

@a-johnston I see: Seq.apply is linear time here, so the loop is quadratic overall. Your suggested code would solve the performance issue as well. What about changing the input format from Seq to Iterator to make it less vulnerable to OOM? Please feel free to open a PR for this.
BTW, the new approach is aligned with #2749. If that PR gets merged, we can switch the underlying implementation to leverage it. Thanks.
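
One possible shape for the Iterator-based input mentioned above is a lazy run-length iterator, sketched below. The name `LazyGroupSizes.groupSizes` is hypothetical and is not the API that was eventually merged; the sketch again assumes that rows of the same group arrive consecutively.

```scala
object LazyGroupSizes {
  // Hypothetical helper: lazily yields the size of each consecutive run of
  // equal group ids without materializing the whole column in memory.
  def groupSizes(groups: Iterator[Int]): Iterator[Int] = {
    val it = groups.buffered // BufferedIterator lets us peek with `head`
    new Iterator[Int] {
      def hasNext: Boolean = it.hasNext
      def next(): Int = {
        val current = it.next() // throws NoSuchElementException when exhausted
        var count = 1
        while (it.hasNext && it.head == current) {
          it.next()
          count += 1
        }
        count
      }
    }
  }
}
```

Only the current element and a counter are held at any time, so memory use is independent of the number of rows; the caller decides whether to collect the sizes or consume them as a stream.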

4 participants