
[SPARK-17163][ML] Unified LogisticRegression interface#14834

Closed
sethah wants to merge 24 commits into apache:master from sethah:SPARK-17163

Conversation

@sethah
Contributor

@sethah sethah commented Aug 26, 2016

What changes were proposed in this pull request?

Merge MultinomialLogisticRegression into LogisticRegression and remove MultinomialLogisticRegression.

Marked as WIP because we should discuss the coefficients API in the model. See discussion below.

JIRA: SPARK-17163

How was this patch tested?

Merged test suites and added some new unit tests.

Design

Switching between binomial and multinomial

We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called family which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.
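A minimal sketch of the family-resolution rule described above (the function name and structure are illustrative, not the actual Spark implementation):

```scala
// Sketch of the "family" resolution rule; returns true when multinomial
// LOR should be used. Names are illustrative, not Spark's actual code.
def useMultinomial(family: String, numClasses: Int): Boolean = family match {
  case "auto" => numClasses > 2 // 1 or 2 classes -> binomial with pivoting
  case "multinomial" => true
  case "binomial" =>
    require(numClasses <= 2,
      s"Binomial family only supports 1 or 2 outcome classes but found $numClasses.")
    false
}
```

So `useMultinomial("auto", 2)` runs the binomial path, while an explicit `"binomial"` with 3 classes fails fast.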

coefficients/intercept model API (TODO)

This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:

  • We must maintain compatibility with the old API, i.e. we must expose def coefficients: Vector and def intercept: Double
  • There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have numClasses sets of coefficients and numClasses intercepts.

Some options:

  1. Store the binomial coefficients as a 2 x numFeatures matrix. This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives 1 * numFeatures coefficients, but we would convert them to 2 x numFeatures coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less if (isMultinomial) ... else ...) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in def coefficients: Vector every time.
  2. Store the binomial coefficients as a 1 x numFeatures matrix. We still store the coefficients as a matrix and the intercepts as a vector. When users call coefficients we return them a Vector that is backed by the same underlying array as the coefficientMatrix, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.

If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.

Update: We have chosen option 2
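A tiny illustration of option 2's zero-copy view, with plain arrays standing in for Spark's Matrix/Vector types (all names here are hypothetical):

```scala
// Option 2 sketch: the binomial `coefficients` vector is a view over the
// same array that backs the 1 x numFeatures coefficient matrix, so no
// data is duplicated. Plain arrays stand in for Spark's linalg types.
val numFeatures = 3
val backing = Array(0.5, -1.2, 2.0)    // written once by the optimizer
val coefficientMatrixValues = backing  // conceptually a 1 x numFeatures matrix
val coefficientsVectorValues = backing // the Vector view reuses the array
assert(coefficientsVectorValues eq coefficientMatrixValues) // same object, no copy
```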

Threshold/thresholds (TODO)

Currently, when threshold is set we clear whatever value is in thresholds and when thresholds is set we clear whatever value is in threshold. SPARK-11543 was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.

Update: Let's leave it for a follow up PR

Follow up

Contributor Author

I did an offline test to make sure that we can successfully load old models into the new API

Member

How about 2.0.1?

Contributor Author

Will this patch make it into 2.0.1? If so, we'd need to change this to also check the "micro" version number. Otherwise, this check should still be valid.

Member

We're not backporting MLOR to 2.0.x. I get it now: since you do minor.toInt, even a minor version of 0.1 will be loaded the old way.

@sethah
Contributor Author

sethah commented Aug 26, 2016

cc @yanboliang @jkbradley @dbtsai

@SparkQA

SparkQA commented Aug 26, 2016

Test build #64494 has finished for PR 14834 at commit 7cfbcd3.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2016

Test build #64495 has finished for PR 14834 at commit 4048570.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor Author

sethah commented Sep 1, 2016

Also, I'm not sure I understand the MiMa failure. It's complaining about the constructor being different for LogisticRegressionModel, but that constructor has always been private[spark]. I appreciate any thoughts on this.

@yanboliang
Contributor

MiMa does binary compatibility checks on the model constructor even though it's private, so we should exclude it in MimaExcludes.scala.
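(For reference, an exclusion of the kind suggested here would look roughly like this in MimaExcludes.scala; the problem type and target below are assumptions for illustration, not the actual entry from this PR.)

```scala
// Hypothetical MimaExcludes.scala entry excluding the private[spark]
// LogisticRegressionModel constructor from MiMa's binary-compatibility check.
ProblemFilters.exclude[DirectMissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticRegressionModel.this")
```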

@sethah
Contributor Author

sethah commented Sep 2, 2016

@yanboliang Thanks for the tip. Done.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64832 has finished for PR 14834 at commit c52ef66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64831 has finished for PR 14834 at commit 5bce1ba.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@jkbradley
Member

+1 for Option 2: Store the binomial coefficients as a 1 x numFeatures matrix.

It's such an important code path that I think it's worth avoiding the regression for current users.

@sethah
Contributor Author

sethah commented Sep 6, 2016

@jkbradley Thanks for your input. Let's see what @dbtsai thinks as well :)

@dbtsai
Member

dbtsai commented Sep 7, 2016

@sethah Thank you for putting together this PR with detailed documentation. For option 2, if a two-class model is trained with the multinomial family, how do you store it? I was thinking that maybe we could always store the coefficients as nClasses x numFeatures, and when nClasses == 2, convert them to 1 x numFeatures for prediction. That way we don't lose performance and we also have a consistent representation of models.

@sethah
Contributor Author

sethah commented Sep 7, 2016

numClasses  isMultinomial  coefficientMatrix size
3+          true           3+ x numFeatures
2           true           2 x numFeatures
2           false          1 x numFeatures

The current behavior is as follows:

  • If it is binary classification trained with multinomial family, then we store 2 x numFeatures coefficients in a matrix. We will predict with this matrix (i.e. we do not convert to 1 x numFeatures).
  • If it is binary classification trained with binomial family, then we store 1 x numFeatures (i.e. these coefficients are pivoted) and we use a DenseVector instead of a matrix for prediction.

The coefficients are always stored in a single array. There is always a coefficientMatrix backed by that array, which in some cases has only 1 row. When the family is binomial, we also have a coefficients vector backed by the same array as the matrix, and we use that vector for prediction in the binomial case.
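The two prediction paths described above can be sketched with plain arrays standing in for Spark's linalg types (function names are illustrative, not Spark's actual API):

```scala
// Binomial case: a single pivoted coefficient row and one intercept.
def binomialMargin(coef: Array[Double], intercept: Double,
    features: Array[Double]): Double =
  coef.zip(features).map { case (w, x) => w * x }.sum + intercept

// Multinomial case: one margin per class, from a numClasses x numFeatures
// matrix of coefficients and a vector of intercepts.
def multinomialMargins(coefRows: Array[Array[Double]], intercepts: Array[Double],
    features: Array[Double]): Array[Double] =
  coefRows.zip(intercepts).map { case (row, b) =>
    row.zip(features).map { case (w, x) => w * x }.sum + b
  }
```

The binomial path does half the multiply-adds of a 2 x numFeatures multinomial model, which is the regression option 2 avoids.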

Hopefully that clears it up. I don't think it's necessary to convert the case of multinomial family but binary classification to 1 x numFeatures for prediction since it won't be a regression and users would have to explicitly specify that family (hopefully knowing the consequences of that choice).

I also vote for Option 2 in the original description. We can avoid any regressions with past versions and the implementation isn't too messy.

@jkbradley
Member

jkbradley commented Sep 7, 2016

@dbtsai For numClasses = 2, this conversion would involve copying half of the array (since Vector constructors require Arrays, not views), and it would mean doubling the size of the model. That's fine for small models but pretty expensive for large ones, and it would happen on the driver. Just saw @sethah's new comment. +1 for his approach!

@dbtsai
Member

dbtsai commented Sep 7, 2016

@sethah +1 for this approach. A couple of minor questions. With L1, the coefficients can be very sparse. Currently, we store them as a sparse vector and use the sparse vector for prediction. (The decision to store as a sparse or dense vector is based on size, not prediction speed, and we probably need to do some experiments around that.) Do you plan to always store the coefficients as a dense matrix even in the binomial case?

Also, for 2 classes LOR with multinomial family, will users be able to def coefficients: Vector and def intercept: Double by pivoting the coefficients?

@sethah
Contributor Author

sethah commented Sep 7, 2016

@dbtsai Good point. This patch in its current state would change the behavior of binomial LOR to always have dense coefficients. I think we need to find a solution to this. I wonder why there isn't a compressed method for Matrix?

If we store the coefficients as SparseMatrix in some L1 cases, then before prediction we have to convert it to a SparseVector. This amounts to an extra 4 * nnz bytes being stored (we have to create the sparse vector indices since we cannot reuse them from the matrix case). We could implement a compressed method for matrices if we are ok with the extra storage overhead.

Otherwise I guess we'd have to store the binomial case as a vector and then do some conversion to matrix iff coefficientMatrix is called.

Finally, I don't think it's necessary to pivot the coefficients in the case of 2 classes with multinomial family. Currently, we throw an exception.

@dbtsai
Member

dbtsai commented Sep 7, 2016

@sethah I remember that a compressed method for Matrix is one of the todos in the follow-up tasks. For sparse binary logistic regression, if we store the models as 1 x numFeatures compressed sparse row (CSR) matrices, I think the space will be the same as the current sparse vector implementation. And the CSR format should be convertible to a sparse vector without changing the underlying data structure.
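The zero-copy conversion described here can be illustrated with plain arrays (a hypothetical sketch, not Spark's actual linalg code): for a single-row row-major sparse matrix, the column-index array is exactly the index array a sparse vector needs.

```scala
// A 1 x 6 CSR matrix with nonzeros at columns 1 and 4. For a single-row
// row-major sparse matrix, the column indices double as SparseVector
// indices, so conversion reuses both underlying arrays without copying.
val numFeatures = 6
val colIndices = Array(1, 4)
val values = Array(0.7, -0.3)
// "SparseVector" sketched as (size, indices, values): arrays reused as-is.
val sparseVec = (numFeatures, colIndices, values)
assert(sparseVec._2 eq colIndices) // indices array shared, no copy
assert(sparseVec._3 eq values)     // values array shared, no copy
```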

Throwing an exception in the case of 2 classes with the multinomial family sounds good to me.

@sethah
Contributor Author

sethah commented Sep 8, 2016

@dbtsai Yeah, if we store it as a row major sparse matrix then the rowIndices will exactly be the indices needed for the sparse vector. We'll have to add some functionality to the linalg classes to accomplish this. I can look into it. We can continue moving forward for this PR without it, and address the compressed option later, but IMO it must be done before 2.1 release. Otherwise, we can block this PR until it is done.

@dbtsai
Member

dbtsai commented Sep 8, 2016

@sethah For sparse MLOR problems with L1, the models will be sparse by row. As a result, in the sparse case we need to store the models in CSR format, and CSR models can be used for prediction with a potential speedup (although we need to benchmark to see how much speedup we get). Let's have a separate PR to implement the compressed option for matrices. It will be a bit complicated: by default, compressed has to decide between CSR and CSC depending on the compression rate, and users need an option to choose the format as well.

Member

Could you import with the full classpath instead of using a wildcard?

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65476 has finished for PR 14834 at commit 38fad98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

set(threshold, value)
}


Member

remove this extra new line.

Contributor Author

Done.

@Since("1.3.0") val intercept: Double)
@Since("2.1.0") val coefficientMatrix: Matrix,
@Since("2.1.0") val interceptVector: Vector,
@Since("1.3.0") override val numClasses: Int,
Member

How about we make numClasses a function determined by isMultinomial and coefficientMatrix.numRows? That way we can remove one parameter from the constructor.

Contributor Author

Actually that won't work under the current edge-case behaviors. When the labels are all 0.0, the coefficient matrix will have only one row regardless of multinomial or binomial. We could potentially change this behavior, though, e.g. if we always assume there will be at minimum two classes.

Member

Okay, let's merge it as is for now. I want to make a change so we don't train on classes that are unseen. I'll address these together. Thanks.

@Since("2.1.0") val coefficientMatrix: Matrix,
@Since("2.1.0") val interceptVector: Vector,
@Since("1.3.0") override val numClasses: Int,
private val isMultinomial: Boolean)
Member

Actually, isMultinomial can be determined by coefficientMatrix.numRows as well.

private val isMultinomial: Boolean)
extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
with LogisticRegressionParams with MLWritable {

Member

Can we have a require(coefficientMatrix.numRows == interceptVector.length) here?

Contributor Author

Done.

* $$
* </blockquote></p>
*
*
Member

remove extra line

Contributor Author

Done.

import org.apache.spark.ml.feature.{Instance, LabeledPoint}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg._
Member

avoid import _ if possible

Contributor Author

Done.

ParamsSuite.checkParams(new LogisticRegression)
val model = new LogisticRegressionModel("logReg", Vectors.dense(0.0), 0.0)
val model = new LogisticRegressionModel("logReg",
new DenseMatrix(1, 1, Array(0.0)), Vectors.dense(0.0), 2, isMultinomial = false)
Member

If we have the old constructor, revert this.

Contributor Author

Done, here and below.

.setThreshold(0.6)
val lrModel = new LogisticRegressionModel(lr.uid, Vectors.dense(1.0, 2.0), 1.2)
val lrModel = new LogisticRegressionModel(lr.uid,
new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
Member

ditto. revert this.

.setThreshold(0.6)
val lrModel = new LogisticRegressionModel(lr.uid, Vectors.dense(1.0, 2.0), 1.2)
val lrModel = new LogisticRegressionModel(lr.uid,
new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
Member

ditto. revert this.

ParamsSuite.checkParams(new OneVsRest)
val lrModel = new LogisticRegressionModel("lr", Vectors.dense(0.0), 0.0)
val lrModel = new LogisticRegressionModel("logReg",
new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
Member

ditto. revert this.

@dbtsai
Member

dbtsai commented Sep 20, 2016

LGTM. Wait for the test.

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65622 has finished for PR 14834 at commit 4dae595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Sep 20, 2016

Merged into master. Thanks.
