@brkyvz (Contributor) commented Sep 18, 2014

Note: This is still a work in progress

This is the first of the pull requests to support multi-model training in MLlib. It batches examples and trains multiple models with different regularization parameters and step sizes all at once using matrix-matrix multiplication. It uses native BLAS when the data matrix is dense, and uses sparse matrices as much as possible for both better memory utilization and better performance (I will post performance results in the comments).

This is a huge pull request. It is not finished: the docs need to be updated, and the code can be cleaned up to make it easier to follow. I'm posting it now so that reviewers can comment and make suggestions along the way.

Most of the PR consists of additional local matrix operations for computing gradients and losses.
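
To make the idea concrete, here is a minimal sketch of a multi-model least-squares gradient computed with a single pair of matrix-matrix multiplies. It uses Breeze types for brevity; the function name, the numFeatures x numModels weight layout, and the replicated label matrix are illustrative assumptions, not the classes this PR actually adds.

```scala
import breeze.linalg.DenseMatrix

// Illustrative sketch, not the PR's API: compute the least-squares gradient
// for several models at once. Each column of `weights` is one model's
// weight vector (e.g. one per regularization parameter / step size).
def multiModelLeastSquaresGradient(
    data: DenseMatrix[Double],     // numExamples x numFeatures
    labels: DenseMatrix[Double],   // numExamples x numModels (labels replicated per column)
    weights: DenseMatrix[Double]   // numFeatures x numModels
  ): DenseMatrix[Double] = {
  // One matrix-matrix multiply yields the predictions of every model:
  val predictions = data * weights          // numExamples x numModels
  val errors = predictions - labels         // element-wise residuals
  // A second multiply accumulates every model's gradient, averaged over examples:
  (data.t * errors) * (1.0 / data.rows)     // numFeatures x numModels
}
```

Because the per-model loops are replaced with matrix products, the dense case can be backed by native BLAS, which is the performance point the description above makes.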

SparkQA commented Sep 18, 2014

QA tests have started for PR 2451 at commit 5e7d744.

  • This patch merges cleanly.

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2451 at commit 5e7d744.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sealed trait Matrix extends Serializable
    • class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix with Serializable
    • class SparseMatrix(
    • sealed trait Vector extends Serializable
    • abstract class MultiModelGradient extends Serializable
    • class MultiModelLogisticGradient extends MultiModelGradient
    • class MultiModelLeastSquaresGradient extends MultiModelGradient
    • class MultiModelHingeGradient extends MultiModelGradient
    • trait Optimizer[V] extends Serializable
    • abstract class MultiModelUpdater extends Serializable
    • class MultiModelSimpleUpdater extends MultiModelUpdater
    • class MultiModelL1Updater extends MultiModelUpdater
    • class MultiModelSquaredL2Updater extends MultiModelUpdater
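
For orientation, the DenseMatrix constructor shown in the listing above can be used as in the sketch below. The column-major interpretation of the values array and the example numbers are assumptions for illustration, not part of the QA output.

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

// Hypothetical usage of the listed constructor. Assuming the values array is
// column-major, this builds the 2 x 3 matrix:
//   1.0  3.0  5.0
//   2.0  4.0  6.0
val m: Matrix = new DenseMatrix(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
```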

@AtlasPilotPuppy (Contributor) commented:

With some guidance, I could help you with the docs.

Member:
I think this segment merits a one-line explanation.

Contributor Author:
These are related to #2294
I can add explanations there. I realize the math is hard to understand.

@brkyvz (Contributor Author) commented Sep 19, 2014

@anantasty: If you could look through the code and mark the places where you're like "what the heck is going on here?", it would be easier for me to write up proper comments. I'm going to add a lot today, and I can incorporate yours as well. Thanks!

Member:
Do we really need sparse versions of rand and randn? It should not be too much more expensive to use the dense versions and then convert to a sparse matrix (I figure less than 2x the cost). I cannot think of use cases for these either, except unit testing.

Contributor Author:
They're nice functions to have. They will be helpful for people who want to do random projections.
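
For illustration, a sparse random matrix with a given density can be sketched in plain Scala as below. The helper name and the triplet representation are assumptions made for the example; the PR's actual sparse rand/randn factory methods may be implemented differently.

```scala
import scala.util.Random

// Illustrative sketch only: produce (row, col, value) triplets for a random
// matrix in which each entry is nonzero with probability `density`,
// with Gaussian values as in randn. Not the PR's implementation.
def sparseRandomEntries(
    numRows: Int,
    numCols: Int,
    density: Double,
    rng: Random = new Random(42L)): Seq[(Int, Int, Double)] = {
  for {
    j <- 0 until numCols
    i <- 0 until numRows
    if rng.nextDouble() < density   // keep roughly density * numRows * numCols entries
  } yield (i, j, rng.nextGaussian())
}

// e.g. a 1000 x 50 random-projection-style matrix with roughly 10% nonzeros:
// val entries = sparseRandomEntries(1000, 50, 0.1)
```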

Member:
Sorry, I did not see the "density" argument. Sounds OK to me (but is there a use case?)

Contributor Author:
No use case in MLlib yet. Randomized SVD for big matrices (distributed) may make use of this.

Member:
OK, sounds fine then.

Member:
This line is a good reason to implement this in DenseMatrix: you could avoid the expensive index computation (a multiplication per element) and just iterate through the counts.
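
One possible reading of this suggestion, as a hedged sketch: with the matrix values stored in a flat column-major array, an element-wise update can walk the array linearly instead of recomputing a two-dimensional index per element. The helper below is illustrative only, not the PR's code.

```scala
// Illustrative only: apply `f` to every entry of a column-major values array,
// iterating over the flat array instead of computing i + j * numRows per element.
def mapInPlace(values: Array[Double], f: Double => Double): Unit = {
  var k = 0
  while (k < values.length) {
    values(k) = f(values(k))
    k += 1
  }
}
```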

Member:
Add doc: "We will concatenate results (weights) to finalWeights as we iterate." (or something like that)

@mengxr (Contributor) commented Sep 19, 2014

@brkyvz Let's try to split this PR into smaller ones. For example, functions like factory methods for sparse matrices should not be included in this PR. We want to keep the vector and matrix classes in MLlib simple and let users use Breeze for linear algebra operations. If Breeze has performance issues, maybe we should contribute the optimizations to Breeze to centralize the effort on single-machine linear algebra computation.

Member:
miniBatchSize is inexact. We could avoid the initial count() and instead aggregate the minibatch size during the treeAggregate.
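
A hedged sketch of that suggestion, with assumed names and a placeholder statistic standing in for the real gradient update: fold an example counter into the treeAggregate so the exact minibatch size comes out of the same pass, instead of calling count() beforehand.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Illustrative sketch only: the second element of the aggregate is the exact
// number of sampled examples, computed in the same treeAggregate pass.
def minibatchStats(data: RDD[LabeledPoint], fraction: Double, seed: Long): (Double, Long) = {
  data.sample(withReplacement = false, fraction, seed)
    .treeAggregate((0.0, 0L))(
      // seqOp: fold one example into (running statistic, examples seen so far);
      // the label sum stands in for the real gradient accumulation.
      { case ((stat, count), point) => (stat + point.label, count + 1L) },
      // combOp: merge partial statistics and exact counts across partitions.
      { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) })
}
```

The actual aggregate would carry the gradient and loss instead of a label sum; the point is only that the exact count rides along in the same pass.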

@jkbradley (Member) commented:

@brkyvz I've made a rough pass and have listed all of my comments. I can make future passes as needed. It's a lot of work, and it will be great to have!

@brkyvz (Contributor Author) commented Jan 29, 2015

Closing this PR, as a lot of the functionality has changed.

@brkyvz closed this on Jan 29, 2015
@brkyvz deleted the SPARK-1486 branch on February 3, 2019