-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters when loop finished and update code where should release unused broadcast/RDD in proper time #14333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
52afc03 to
c40f7f8
Compare
|
Test build #62768 has finished for PR 14333 at commit
|
|
Test build #62769 has finished for PR 14333 at commit
|
|
How about the same for Yeah, it seems like it's pretty rare to want to call I suppose it saves the overhead of new bookkeeping for a |
|
@srowen The second problem, what's the meaning of So that I think the safe place to use |
|
Test build #62778 has finished for PR 14333 at commit
|
f129a2b to
ecd15b2
Compare
|
Test build #62779 has finished for PR 14333 at commit
|
ecd15b2 to
01f4d3a
Compare
|
I don't understand the last change. As far as I can see it can be destroyed inside the loop iteration. It's also possible to reuse the broadcast (declare outside the loop), and unpersist each iteration, and destroy afterwards. But I don't see the need to hold on to all these broadcasts? Yes you make a fair point about recovery. I think your changes are safe in this respect, good. |
|
Test build #62793 has finished for PR 14333 at commit
|
|
@srowen |
|
Oh, it is indeed building up a lineage. I think it's easier to leave this broadcast as-is then unless we know that destroying them is essential for reclaiming driver resources. Here's another issue though: we now have a lineage where all of the RDDs are persisted (the cost RDDs), but I think only the last one is unpersisted. Ideally each would be persisted, materialized, and then unpersist the previous one. We had this problem and solution pattern in Word2Vec. Does that make sense? It might be worth tackling here, because, if that's implemented, then I think the broadcast can safely be disposed at each iteration too, rather than only at the end. |
|
@srowen The KMeans.initKMeansParallel already implements the pattern "persist current step RDD, and unpersist previous one", but I think an RDD persisted can also break down because of disk error or something? If we want to reach the goal "the broadcast can safely be disposed( |
|
Yeah, I think you're right, because the unpersisted RDD can still be recomputed but not a destroyed Broadcast. Hm, then isn't this also true of I suppose I think we should prefer to keep it simple and correct first, and only introduce complexity to optimize while preserving correctness. If some of the current unpersist calls can't be safely changed to destroy, maybe best to leave them rather than find a way to destroy them, if we don't know that it's a problem. I think we still have an RDD-related problem here in that the intermediate RDDs aren't unpersisted, and all of them remain persisted after the loop. Kind of a separate issue, I suppose. |
|
@srowen yeah, but the |
|
OK sounds good. So maybe we're back to this: for |
|
@srowen |
|
Yes, I suppose the issue is consistency ... there are loads of places where RDDs and broadcasts aren't really cleaned up properly in the code. Maybe it's fine to take extra steps here to at least ensure that all broadcasts we know of are handled correctly. OK I think it's pretty good if you're OK with the change as is. |
|
@srowen
|
|
Test build #62907 has finished for PR 14333 at commit
|
|
Test build #62908 has finished for PR 14333 at commit
|
| val indices = updateAssignments(assignments, divisibleIndices, newClusterCenters).keys | ||
| if (preIndices != null) preIndices.unpersist() | ||
| preIndices = indices | ||
| indices = updateAssignments(assignments, divisibleIndices, newClusterCenters).keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are probably ahead of me on this, but let me check something. To compute assignments in the next line, the current indices is needed, and that in turn needs the current value of assignments, which needs the previous copy of indices (preIndices). But that was unpersisted just above. Should preIndices be unpersisted later, after indices is materialized?
But ... is there even an action here? if this just creating a large lineage then the persisting isn't helping much.
|
@srowen yeah, the code logic here seems confusing, but I think it is right.
|
|
If the last problem is really pretty related to this code, then it should change here as well. However if you're not sure there's an easy fix, we can leave it for later. Are you comfortable that the current change is ready? |
|
I check other modifications in this PR again and it seems no problems. So I think the PR is OK now. Thanks! |
|
Merged to master |
What changes were proposed in this pull request?
update unused broadcast in KMeans/Word2Vec,
use destroy(false) to release memory in time.
and several place destroy() update to destroy(false) so that it will be async-called,
it will better than blocking called.
and update bcNewCenters in KMeans to make it destroy in correct time.
I use a list to store all historical
bcNewCentersgenerated in each loop iteration and delay them to release at the end of loop.fix TODO in
BisectingKMeans.run"unpersist old indices",Implements the pattern "persist current step RDD, and unpersist previous one" in the loop iteration.
How was this patch tested?
Existing tests.