[SPARK-16469] enhanced simulate multiply #14068
Conversation
Hi srowen,

I don't think this is trivial. You also need to explain the change in more detail.
Sure.

```scala
val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
  val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
  val partitions = rightCounterparts.map(b => partitioner.getPartition((rowIndex, b._2)))
  ((rowIndex, colIndex), partitions.toSet)
}.toMap
```

More clearly, this part:

```scala
val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
```

So if we were to cache this check, for example in a HashMap:

```scala
val rightCounterpartsHelper = rightMatrix.groupBy(_._1).map { case (rowIndex, arr) =>
  (rowIndex, arr.map(b => b._2))
}
```

we can omit the inner filter and just use it:

```scala
val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
  ((rowIndex, colIndex), rightCounterpartsHelper.getOrElse(colIndex, Array()).map(b =>
    partitioner.getPartition((rowIndex, b))).toSet)
}.toMap
```

And to put it all together:

```scala
val rightCounterpartsHelper = rightMatrix.groupBy(_._1).map { case (rowIndex, arr) =>
  (rowIndex, arr.map(b => b._2))
}
val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
  ((rowIndex, colIndex), rightCounterpartsHelper.getOrElse(colIndex, Array()).map(b =>
    partitioner.getPartition((rowIndex, b))).toSet)
}.toMap
```

And the same trick also works for the rightDestinations.
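To make the equivalence concrete, here is a minimal, self-contained sketch of the two approaches side by side. The `getPartition` function, the hash formula, and the tiny block lists are stand-ins for illustration only; they are not Spark's `GridPartitioner` or real block metadata.

```scala
// Hypothetical illustration of the simulateMultiply optimization discussed above.
// Block coordinates are (rowIndex, colIndex) pairs; the partitioner is a toy stand-in.
object SimulateMultiplySketch {
  // Assumed partitioner: the formula is made up, not Spark's GridPartitioner.
  def getPartition(coord: (Int, Int), numParts: Int = 4): Int =
    math.abs(coord._1 * 31 + coord._2) % numParts

  def main(args: Array[String]): Unit = {
    val leftMatrix  = Array((0, 0), (0, 1), (1, 1)) // (rowIndex, colIndex)
    val rightMatrix = Array((0, 0), (1, 0), (1, 1)) // (rowIndex, colIndex)

    // Old approach: scan all of rightMatrix once per left block.
    val slow = leftMatrix.map { case (rowIndex, colIndex) =>
      val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
      ((rowIndex, colIndex),
        rightCounterparts.map(b => getPartition((rowIndex, b._2))).toSet)
    }.toMap

    // New approach: pre-group rightMatrix by row index, then do constant-time lookups.
    val rightCounterpartsHelper =
      rightMatrix.groupBy(_._1).map { case (r, arr) => (r, arr.map(_._2)) }
    val fast = leftMatrix.map { case (rowIndex, colIndex) =>
      ((rowIndex, colIndex),
        rightCounterpartsHelper.getOrElse(colIndex, Array[Int]())
          .map(b => getPartition((rowIndex, b))).toSet)
    }.toMap

    // Both strategies must produce identical destination sets.
    assert(slow == fast)
    println(fast)
  }
}
```

The only behavioral requirement is that `rightCounterpartsHelper(colIndex)` returns exactly the column indices that the inner `filter` would have found, which the `groupBy(_._1)` guarantees.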
```scala
val leftMatrix = blockInfo.keys.collect() // blockInfo should already be cached
val rightMatrix = other.blocks.keys.collect()
```

```scala
val rightCounterpartsHelper = rightMatrix.groupBy(_._1).map { case (rowIndex, arr) =>
```
Nit: could this just be `rightMatrix.groupBy(_._1).mapValues(_.map(_._2))`?
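For reference, the reviewer's `mapValues` shorthand produces the same map as the explicit pattern-match version. The sketch below checks that on toy data (the tuples are made up; `.toSeq` is used because raw `Array` values would compare by reference):

```scala
// Demonstrates that groupBy + mapValues is equivalent to the explicit map over cases.
object MapValuesSketch {
  def main(args: Array[String]): Unit = {
    val rightMatrix = Array((0, 10), (0, 11), (1, 12)) // hypothetical (rowIndex, colIndex) pairs

    // Explicit form, as written in the PR.
    val explicit = rightMatrix.groupBy(_._1).map { case (rowIndex, arr) =>
      (rowIndex, arr.map(b => b._2).toSeq)
    }

    // Reviewer's suggested shorthand.
    val shorthand = rightMatrix.groupBy(_._1).mapValues(_.map(_._2).toSeq).toMap

    assert(explicit == shorthand)
    println(shorthand)
  }
}
```

Note that in Scala 2.13 `mapValues` returns a lazy `MapView` and is deprecated on `Map`, so the trailing `.toMap` (or `.view.mapValues(...).toMap`) keeps the result a strict map.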
It does make sense. Please make a JIRA and connect this though.

I have opened SPARK-16469.

LGTM but CC @brkyvz
```diff
- val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
- val partitions = rightCounterparts.map(b => partitioner.getPartition((rowIndex, b._2)))
- ((rowIndex, colIndex), partitions.toSet)
+ ((rowIndex, colIndex), rightCounterpartsHelper.getOrElse(colIndex, Array()).map(b =>
```
nit: for readability could you assign this to a variable instead of inlining it? In addition, for multi-line expressions,

```scala
.map { b =>
  blah
}
```

is preferred.
LGTM, just a minor nit! Thanks for this PR

Have done the requested nit changes.

Jenkins retest this please

Test build #62234 has finished for PR 14068 at commit
## What changes were proposed in this pull request?

We have a use case of multiplying very big sparse matrices. We have about 1000x1000 distributed block matrix multiplications, and the simulate multiply goes like O(n^4) (n being 1000); it takes about 1.5 hours. We modified it slightly with a classical hashmap and it now runs in about 30 seconds, O(n^2).

## How was this patch tested?

We have added a performance test and verified the reduced time.

Author: oraviv <oraviv@paypal.com>

Closes #14068 from uzadude/master.

(cherry picked from commit ea06e4e)
Signed-off-by: Sean Owen <sowen@cloudera.com>
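The O(n^4) vs O(n^2) claim can be sanity-checked by counting operations rather than timing anything. The sketch below (toy `n = 20`, dense block grids assumed) counts the comparisons the filter-based approach performs against the work the hashmap-based approach does; the exact counting model is an illustration, not a benchmark of the real Spark code.

```scala
// Rough operation-count sketch: why pre-grouping beats per-block filtering.
object ComplexitySketch {
  def main(args: Array[String]): Unit = {
    val n = 20 // the PR's use case has n = 1000
    val left  = for (i <- 0 until n; j <- 0 until n) yield (i, j) // n^2 left blocks
    val right = for (i <- 0 until n; j <- 0 until n) yield (i, j) // n^2 right blocks

    // Filter-based: every one of the n^2 left blocks scans all n^2 right blocks.
    var filterComparisons = 0L
    left.foreach { _ =>
      right.foreach { _ => filterComparisons += 1 }
    }

    // Hashmap-based: one pass to build the index, then one lookup per left block.
    val helper = right.groupBy(_._1)
    val lookups = left.map { case (_, colIndex) => helper.contains(colIndex) }
    val hashOps = right.size.toLong + lookups.size.toLong

    assert(filterComparisons == n.toLong * n * n * n) // O(n^4)
    assert(hashOps == 2L * n * n)                     // O(n^2)
    println(s"filter comparisons: $filterComparisons, hashmap operations: $hashOps")
  }
}
```

At n = 1000 that is on the order of 10^12 comparisons versus about 2×10^6 hashmap operations, which is consistent with the reported drop from 1.5 hours to about 30 seconds.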
@uzadude hey, I am looking to multiply VERY LARGE AND VERY SPARSE matrices using Spark. I would love some discussion over it. Can you give me a way to contact you?

Sure, what size are you talking about? We had to do some internal code fixes to do that. Right now the support for sparse matrices is pretty poor, mainly because Breeze doesn't support it.

@uzadude Looking forward to making it work for matrices bigger than 1M x 1M with sparsity of 0.001-0.002.