-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[jvm-packages] XGBoost Spark integration refactor #3387
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3387 +/- ##
============================================
- Coverage 44.99% 44.49% -0.51%
+ Complexity 228 188 -40
============================================
Files 166 163 -3
Lines 12787 12769 -18
Branches 466 443 -23
============================================
- Hits 5754 5681 -73
- Misses 6841 6887 +46
- Partials 192 201 +9
Continue to review full report at Codecov.
|
* XGBoost Spark integration refactor. * Make corresponding update for xgboost4j-example * Address comments.
…th both XGBoost and Spark MLLib (#3326) * Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib * Fix extra space.
* XGBoost Spark supports ranking with group data. * Use Iterator.duplicate to prevent OOM.
There are several PRs merged to master before this one, would you check if this PR contains those fixes |
@CodingCat Yes, I have rebased master. |
@yanboliang sure, thanks, looks like some newly added test were failed....@RAMitchell would you mind sharing some insights on this? |
var count = 1 | ||
var i = 1 | ||
while (i < groups.length) { | ||
if (groups(i) != groups(i - 1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe related to #3406. Since this is a sequence made from an iterable the apply
happening twice in (groups(i) != groups(i - 1))
could be linear time causing this method to be quadratic (thus blowing up for larger datasets)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanboliang can you look at this, I think @a-johnston 's point is valid
and additionally, loading the whole group as Seq still makes it vulnerable to OOM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anecdotally, @ngoyal2707 and I were checking this out earlier and it worked for smaller datasets (tens of thousands) but did not for larger (tens of millions) without ever hitting OOM although we could train data without groups on the same framework. Instead it just hung forever processing, although we never connected a debugger to check what impl is actually used. I think it might even be cleaner to just do something like
var lastGroup = -1
for (group <- groups) {
if (lastGroup != group) {
lastGroup = group
<etc>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-johnston You have a good point and I think your suggestion should be more optimized, Scala apply
may have some performance issue. But the current code only cause time from O(N) to O(2N) at worst, it's still linear not quadratic(O(N^2)).
I will do some experiments in the following days for further check, if you found more clues, please feel free to share them. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanboliang looking a bit closer, I'm pretty sure that groups
here is a scala.collection.immutable.Stream$Cons
which would lead to linear time apply and then quadratic overall (including the while
loop). Linear is necessary with this approach since it's streaming over the entire column. While I generally prefer this approach compared to the old groupData
, this is definitely a degradation.
Also if you're too busy to dig more into this, I can open a PR for this later today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-johnston I see, Seq.apply
is linear time, so overall will be quadratic. You suggested code could solve the performance issue as well. What about change the input format from Seq
to Iterator
to make it less vulnerable to OOM? Please feel free to open a PR for this.
BTW, the new approach is align with #2749 , if that PR get merged, we can switch the underlying implementation to leverage it. Thanks.
Combine a bunch of PRs into one, to merge dev branch to master.