
Conversation

WeichenXu123 (Contributor) commented Nov 2, 2016

What changes were proposed in this pull request?

The problem in the current block matrix multiplication

As described in JIRA https://issues.apache.org/jira/browse/SPARK-18218, block matrix multiplication in Spark can run into problems. Suppose we multiply an M*N matrix A by an N*P matrix B; when N is much larger than M and P, the following problems may occur:

  • When the middle dimension N is too large, it can cause reducer OOM.
  • Even if OOM does not occur, parallelism will still be too low.
  • When N is much larger than M and P, and matrices A and B have many partitions, there may be too many partitions along the M and P dimensions, which leads to a much larger shuffled data size. (I will explain this in detail below.)

Key point of my improvement

In this PR, I introduce a numMidDimSplits parameter and improve the algorithm to resolve these problems.

To understand the improvement in this PR, let me first give a simple case that explains how the current multiplication works and what causes the problems above:

Suppose block matrix A contains 200 blocks (2 numRowBlocks * 100 numColBlocks), arranged in 2 rows and 100 columns:

A00 A01 A02 ... A0,99
A10 A11 A12 ... A1,99

and block matrix B also contains 200 blocks (100 numRowBlocks * 2 numColBlocks), arranged in 100 rows and 2 columns:

B00    B01
B10    B11
B20    B21
...
B99,0  B99,1

Suppose all blocks in the two matrices are dense for now.
Now we call A.multiply(B). Suppose the generated resultPartitioner contains 2 rowPartitions and 2 colPartitions (there cannot be more partitions because the result matrix contains only 2 * 2 blocks). The current algorithm then involves two shuffle steps:

step-1
Step-1 generates 4 reducers, which I tag as reducer-00, reducer-01, reducer-10, and reducer-11, and shuffles data as follows:

A00 A01 A02 ... A0,99
B00 B10 B20 ... B99,0    shuffled into reducer-00

A00 A01 A02 ... A0,99
B01 B11 B21 ... B99,1    shuffled into reducer-01

A10 A11 A12 ... A1,99
B00 B10 B20 ... B99,0    shuffled into reducer-10

A10 A11 A12 ... A1,99
B01 B11 B21 ... B99,1    shuffled into reducer-11

The shuffling above is a cogroup transform; note that each reducer contains only one group.
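
To make the routing concrete, here is a small sketch (plain Scala, not the actual MLlib code) of where step-1 sends each block in this example:

```scala
// Sketch of step-1 routing in the example above (not the actual MLlib code):
// block A(i, k) is needed by every result block C(i, j), and block B(k, j) is needed
// by every result block C(i, j), so with a 2 x 2 result grid each reducer (i, j)
// receives a whole block-row of A plus a whole block-column of B.
def reducersForABlock(i: Int, numResultColBlocks: Int): Seq[(Int, Int)] =
  (0 until numResultColBlocks).map(j => (i, j))

def reducersForBBlock(j: Int, numResultRowBlocks: Int): Seq[(Int, Int)] =
  (0 until numResultRowBlocks).map(i => (i, j))

// e.g. A0,37 goes to reducers (0,0) and (0,1); B37,1 goes to reducers (0,1) and (1,1);
// they meet in reducer (0,1) and contribute the partial product A0,37 * B37,1.
```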

step-2
Step-2 performs an aggregateByKey transform on the result of step-1, again with 4 reducers, and produces the final result RDD, which contains 4 partitions, each holding one block.

The main problems are in step-1. We have only 4 reducers, but matrices A and B have 400 blocks in total, so the reducer count is obviously too small.
Moreover, each reducer contains only one group (in the cogroup sense), and each group contains 200 blocks. This is bad because the cogroup transform loads each group into memory when computing, so the algorithm does not scale. Suppose matrix A has 10000 column blocks or more instead of 100: each reducer would then load 20000 blocks into memory, which easily causes reducer OOM.

This PR tries to resolve the problem described above.
When a matrix A with dimensions M * N multiplies a matrix B with dimensions N * P, the middle dimension N is the key point. If N is large, the current multiplication implementation works badly.
In this PR, I introduce a numMidDimSplits parameter that represents how many splits to cut on the middle dimension N.
Still using the example above, if we set numMidDimSplits = 10, we can generate 40 reducers in step-1:

Each reducer-ij above is now split into 10 reducers: reducer-ij0, reducer-ij1, ..., reducer-ij9, and each reducer receives 20 blocks.
The shuffle now works as follows:

reducer-000 to reducer-009

A0,0 A0,10 A0,20 ... A0,90
B0,0 B10,0 B20,0 ... B90,0    shuffled into reducer-000

A0,1 A0,11 A0,21 ... A0,91
B1,0 B11,0 B21,0 ... B91,0    shuffled into reducer-001

A0,2 A0,12 A0,22 ... A0,92
B2,0 B12,0 B22,0 ... B92,0    shuffled into reducer-002

...

A0,9 A0,19 A0,29 ... A0,99
B9,0 B19,0 B29,0 ... B99,0    shuffled into reducer-009

reducer-010 to reducer-019

A0,0 A0,10 A0,20 ... A0,90
B0,1 B10,1 B20,1 ... B90,1    shuffled into reducer-010

A0,1 A0,11 A0,21 ... A0,91
B1,1 B11,1 B21,1 ... B91,1    shuffled into reducer-011

A0,2 A0,12 A0,22 ... A0,92
B2,1 B12,1 B22,1 ... B92,1    shuffled into reducer-012

...

A0,9 A0,19 A0,29 ... A0,99
B9,1 B19,1 B29,1 ... B99,1    shuffled into reducer-019

reducer-100 to reducer-109 and reducer-110 to reducer-119 are similar to the above, so I omit them.
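
In other words, each block pair in this example is routed by the middle-dimension block index modulo numMidDimSplits; a small sketch (not the actual implementation) of that routing:

```scala
// Sketch (not the actual implementation) of the mid-dimension routing in the example:
// block A(i, k) and block B(k, j) are both sent to reducer (i, j, k % numMidDimSplits),
// so the 200 blocks that previously met in reducer-ij are spread over numMidDimSplits
// reducers, each receiving 2 * (numColBlocksOfA / numMidDimSplits) blocks.
def midDimReducer(i: Int, j: Int, k: Int, numMidDimSplits: Int): (Int, Int, Int) =
  (i, j, k % numMidDimSplits)

// e.g. with numMidDimSplits = 10: A0,37 and B37,1 both land in reducer (0, 1, 7),
// which holds only the 20 blocks whose middle index k satisfies k % 10 == 7.
```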

API for this optimized algorithm

I add a new API as follows:

  def multiply(
      other: BlockMatrix,
      numMidDimSplits: Int // middle dimension split number, explained above
  ): BlockMatrix
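
A usage sketch (assuming A and B are the BlockMatrix instances from the example above):

```scala
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Usage sketch: A and B are assumed to be the BlockMatrix instances from the example.
val C: BlockMatrix = A.multiply(B, 10) // numMidDimSplits = 10 => 40 step-1 reducers instead of 4
```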

Shuffled data size analysis (compared under the same parallelism)

The optimization has a subtle influence on the total shuffled data size. An appropriate numMidDimSplits significantly reduces the shuffled data size,
but a numMidDimSplits that is too large may increase it instead. For now I don't want to introduce a formula and make things too complex, so I only use a simple case to illustrate it here:

Suppose we have two square matrices X and Y of the same size, both with 16 numRowBlocks * 16 numColBlocks, and both dense. Now let me analyze the shuffled data size in the following cases:

case 1: X and Y both partitioned into 16 rowPartitions and 16 colPartitions, numMidDimSplits = 1
ShufflingDataSize = (16 * 16 * (16 + 16) + 16 * 16) blocks = 8448 blocks
parallelism = 16 * 16 * 1 = 256 // use the step-1 reducer count as the parallelism because it accounts for most of the computation time in this algorithm.

case 2: X and Y both partitioned into 8 rowPartitions and 8 colPartitions, numMidDimSplits = 4
ShufflingDataSize = (8 * 8 * (32 + 32) + 16 * 16 * 4) blocks = 5120 blocks
parallelism = 8 * 8 * 4 = 256 // use the step-1 reducer count as the parallelism because it accounts for most of the computation time in this algorithm.

Both cases above have parallelism = 256. Case 1 with numMidDimSplits = 1 is equivalent to the current implementation in mllib, but case 2 shuffles only 60.6% of the data in case 1. This shows that, under the same parallelism, a proper numMidDimSplits significantly reduces the shuffled data size.
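
The arithmetic behind those two numbers can be captured in a small helper (a sketch only, not part of the PR), for an n x n block matrix multiplied by a matrix of the same block dimensions:

```scala
// Sketch of the block-count arithmetic for the two cases above (n = 16 blocks per side).
// step-1: each of the rowParts * colParts cogroup groups receives all A block-rows and
//         B block-columns it covers; step-2: every result block is aggregated from
//         numMidDimSplits partial blocks.
def shuffledBlocks(n: Int, rowParts: Int, colParts: Int, numMidDimSplits: Int): Int = {
  val aBlocksPerGroup = (n / rowParts) * n // A blocks covered by one cogroup group
  val bBlocksPerGroup = (n / colParts) * n // B blocks covered by one cogroup group
  val step1 = rowParts * colParts * (aBlocksPerGroup + bBlocksPerGroup)
  val step2 = n * n * numMidDimSplits
  step1 + step2
}

// shuffledBlocks(16, 16, 16, 1) == 8448  (case 1)
// shuffledBlocks(16, 8, 8, 4)   == 5120  (case 2)
```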

How was this patch tested?

Test suites added.
Running result:
![blockmatrix](https://cloud.githubusercontent.com/assets/19235986/21600989/5e162cc2-d1bf-11e6-868c-0ec29190b605.png)

WeichenXu123 (Contributor, Author) commented Nov 2, 2016

@yanboliang @sethah I would be very pleased to hear your opinions, thanks! This optimization may also bring some inspiration to a series of algorithms based on distributed matrices.

WeichenXu123 changed the title [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplication and move it into ml package [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplication Nov 2, 2016
SparkQA commented Nov 2, 2016

Test build #67969 has finished for PR 15730 at commit a6bc12d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockMatrix @Since("2.1.0") (
    • trait DistributedMatrix extends Serializable

SparkQA commented Nov 2, 2016

Test build #67977 has finished for PR 15730 at commit 4324a52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 changed the title [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplication [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplication which may cause OOM and low parallelism usage problem in several cases Nov 2, 2016
sethah (Contributor) commented Nov 2, 2016

ping @brkyvz who wrote these libraries.

brkyvz (Contributor) commented Nov 18, 2016

Hi @WeichenXu123 Thank you for this PR. Sorry for taking so long to get back to you. Your optimization would be very helpful. I have a couple thoughts though. Your examples always take into account fully dense matrices, i.e. that all blocks exist all the time. How would sparsity affect shuffling? Would there ever be a case where sparsity of blocks and unlucky alignment of blocks could actually cause a lot more shuffling with your parameter?

Nevertheless, I can see fully dense matrix multiplications benefitting significantly from your optimization. I guess we will need to work on the APIs a bit and document it a bit more clearly.

WeichenXu123 (Contributor, Author):
@brkyvz

Good question about the shuffled data in the sparse case. Here is some simple (perhaps not very rigorous) analysis:
As discussed above, the shuffled data comes from step-1 and step-2.
If we keep parallelism the same, we need to increase numMidDimSplits and decrease the shuffle row and column partitions; in that case, step-1 shuffled data always decreases and step-2 shuffled data always increases.
Now consider the sparse case. The step-2 shuffled data grows because, when each pair of left-matrix row-sets and right-matrix col-sets is multiplied, we split the work into numMidDimSplits parts, multiply within each part, and then aggregate all parts together; the more parts to aggregate, the more shuffled data. BUT, in the sparse case, these parts (after multiplying) will be empty with high probability, and an empty part shuffles nothing. So we can expect that in the sparse case the step-2 shuffled data won't grow much compared with the numMidDimSplits = 1 case, because most split parts shuffle nothing.

Now I am considering improving the API to make it easier to use.
I would like to let the user specify the parallelism and have the algorithm automatically calculate the optimal numMidDimSplits. What do you think?

And thanks for the careful review!

Review comment (Contributor):

You shouldn't change the signature of a public method and still mark it @Since("1.3.0"). Maybe move the inside of this method to

private def multiplyImpl(
  other: BlockMatrix,
  shufflePartitioner: GridPartitioner,
  midDimPartNum: Int,
  resultPartitioner: GridPartitioner): BlockMatrix

but keep the

@Since("1.3.0")
def multiply(other: BlockMatrix): BlockMatrix = {
  multiplyImpl(other, ...)
}

Review comment (Contributor):

I guess this is not used?

brkyvz (Contributor) commented Nov 28, 2016

@WeichenXu123 How about if we only add:

def multiply(other: BlockMatrix, numMidDimSplits: Integer): BlockMatrix

as the public API. Users shouldn't have to define GridPartitioners themselves.

brkyvz (Contributor) commented Nov 28, 2016

This way, all that needs to change in the implementation of multiply is:

val intermediatePartitioner = new Partitioner {
  override def numPartitions: Int = resultPartitioner.numPartitions * numMidDimSplits
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}
val newBlocks = flatA.cogroup(flatB, intermediatePartitioner).flatMap { case (pId, (a, b)) =>
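
A hedged sketch of the kind of flat Int key this partitioner expects (a hypothetical helper for illustration, not code from the PR): each (result partition, mid-dim split) pair is encoded as a single Int so that getPartition can return the key directly.

```scala
// Hypothetical illustration: combine result-partition id p (from resultPartitioner)
// with mid-dim split index s into one Int key for the intermediate partitioner above.
def intermediateKey(p: Int, s: Int, numMidDimSplits: Int): Int =
  p * numMidDimSplits + s // 0 <= s < numMidDimSplits
```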

WeichenXu123 (Contributor, Author):
@brkyvz All right, I'll update the code ASAP. Thanks!

SparkQA commented Dec 25, 2016

Test build #70574 has finished for PR 15730 at commit 13ccfff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 (Contributor, Author):
@brkyvz I updated the code and attached a running-result screenshot; waiting for your review, thanks!

brkyvz (Contributor) commented Jan 3, 2017

@WeichenXu123 Thanks! Will take a look once I get back from vacation (in a week). Happy new year!

Review comment (Contributor):

2.2.0

Review comment (Contributor):

Could you please add documentation for this? This is the most important part of this PR: understanding how this parameter improves performance. You may copy most of your PR description.

Review comment (Contributor):

I would rather write a test function testMultiply which takes A, B, and the expected C, and makes numSplits configurable. Here you can simply have

testMultiply(largeA, largeB, largeC, 1)
testMultiply(largeA, largeB, largeC, 2)
testMultiply(largeA, largeB, largeC, 3)
testMultiply(largeA, largeB, largeC, 4)

brkyvz (Contributor) commented Jan 9, 2017

Looks pretty good overall. Left one major comment about documentation and one about tests.

WeichenXu123 changed the title [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplication which may cause OOM and low parallelism usage problem in several cases [SPARK-18218][ML][MLLib] Reduce shuffled data size of BlockMatrix multiplication and solve potential OOM and low parallelism usage problem By split middle dimension in matrix multiplication Jan 11, 2017
WeichenXu123 (Contributor, Author):
@brkyvz Done. Thanks!

SparkQA commented Jan 11, 2017

Test build #71219 has finished for PR 15730 at commit bf2a603.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 (Contributor, Author):
Jenkins, test this please.

SparkQA commented Jan 12, 2017

Test build #71242 has finished for PR 15730 at commit bf2a603.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 (Contributor, Author):
cc @yanboliang Thanks!

Review comment (Contributor) on the diff:

}

/**
* Left multiplies this [[BlockMatrix]] to `other`, another [[BlockMatrix]]. This method add

Ok, I feel this is very verbose in terms of documentation. Can we summarize it somehow?

/**
 * Left multiplies this [[BlockMatrix]] to `other`, another [[BlockMatrix]]. The `colsPerBlock`
 * of this matrix must equal the `rowsPerBlock` of `other`. If `other` contains
 * `SparseMatrix`, they will have to be converted to a `DenseMatrix`. The output
 * [[BlockMatrix]] will only consist of blocks of `DenseMatrix`. This may cause
 * some performance issues until support for multiplying two sparse matrices is added.
 * Blocks with duplicate indices will be added with each other.
 *
 * @param other Matrix `B` in `A * B = C`
 * @param numMidDimSplits Number of splits to cut on the middle dimension when doing multiplication.
 For example, when multiplying a Matrix `A` of size `m x n` with Matrix `B` of size `n x k`, this parameter
 configures the parallelism to use when grouping the matrices. The parallelism will increase from `m x k` to
 `m x k x numMidDimSplits`, which in some cases also reduces total shuffled data.

I wrote out a sketch; please put it in the proper format, i.e. I omitted the * on the last lines.

WeichenXu123 (Contributor, Author) replied:

All right, thanks!

SparkQA commented Jan 15, 2017

Test build #71390 has finished for PR 15730 at commit 55dabe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

brkyvz (Contributor) commented Jan 17, 2017

@WeichenXu123 This LGTM thanks. @mengxr Would you also like to take a look?

jkbradley (Member):
The API looks good to me. I have not reviewed the internals carefully.

One comment: Let's add a check to verify that numMidDimSplits is > 0.

SparkQA commented Jan 20, 2017

Test build #71691 has finished for PR 15730 at commit 3feca58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"of B must be equal. A.numCols: ${numCols()}, B.numRows: ${other.numRows()}. If you " +
"think they should be equal, try setting the dimensions of A and B explicitly while " +
"initializing them.")
require(numMidDimSplits > 0, "numMidDimSplits should be positive value.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ultra nit: "should be a positive value" or "should be a positive integer"

brkyvz (Contributor) commented Jan 26, 2017

@WeichenXu123 This LGTM! Thank you for providing this functionality. I left one final comment, then I'll merge this.

WeichenXu123 (Contributor, Author):
@brkyvz Also thanks for your careful code review! ^_^

SparkQA commented Jan 27, 2017

Test build #72063 has finished for PR 15730 at commit 4edcd74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

brkyvz (Contributor) commented Jan 27, 2017

Merging to master! Thanks!

asfgit closed this in 1191fe2 on Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request on Jan 27, 2017.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request on Feb 15, 2017.
WeichenXu123 deleted the optim_block_matrix branch on January 20, 2018.