
[SPARK-17163][ML] Unified LogisticRegression interface#14834

Closed
sethah wants to merge 24 commits into apache:master from sethah:SPARK-17163

Conversation

@sethah
Contributor

@sethah sethah commented Aug 26, 2016

What changes were proposed in this pull request?

Merge MultinomialLogisticRegression into LogisticRegression and remove MultinomialLogisticRegression.

Marked as WIP because we should discuss the coefficients API in the model. See discussion below.

JIRA: SPARK-17163

How was this patch tested?

Merged test suites and added some new unit tests.

Design

Switching between binomial and multinomial

We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called family which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.
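A minimal sketch of the family-resolution rule described above (the function name and structure are illustrative, not the actual Spark implementation):

```scala
// Sketch of the "family" resolution rule; returns true when multinomial
// LOR should be used. Names are illustrative, not Spark's actual code.
def useMultinomial(family: String, numClasses: Int): Boolean = family match {
  case "auto" => numClasses > 2 // 1 or 2 classes -> binomial with pivoting
  case "multinomial" => true
  case "binomial" =>
    require(numClasses <= 2,
      s"Binomial family only supports 1 or 2 outcome classes but found $numClasses.")
    false
}
```

So `useMultinomial("auto", 2)` runs the binomial path, while an explicit `"binomial"` with 3 classes fails fast.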

coefficients/intercept model API (TODO)

This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:

  • We must maintain compatibility with the old API, i.e. we must expose def coefficients: Vector and def intercept: Double
  • There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have numClasses sets of coefficients and numClasses intercepts.

Some options:

  1. Store the binomial coefficients as a 2 x numFeatures matrix. This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives 1 * numFeatures coefficients, but we would convert them to 2 x numFeatures coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less if (isMultinomial) ... else ...) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in def coefficients: Vector every time.
  2. Store the binomial coefficients as a 1 x numFeatures matrix. We still store the coefficients as a matrix and the intercepts as a vector. When users call coefficients we return them a Vector that is backed by the same underlying array as the coefficientMatrix, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.

If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.

Update: We have chosen option 2
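A tiny illustration of option 2's zero-copy view, with plain arrays standing in for Spark's Matrix/Vector types (all names here are hypothetical):

```scala
// Option 2 sketch: the binomial `coefficients` vector is a view over the
// same array that backs the 1 x numFeatures coefficient matrix, so no
// data is duplicated. Plain arrays stand in for Spark's linalg types.
val numFeatures = 3
val backing = Array(0.5, -1.2, 2.0)    // written once by the optimizer
val coefficientMatrixValues = backing  // conceptually a 1 x numFeatures matrix
val coefficientsVectorValues = backing // the Vector view reuses the array
assert(coefficientsVectorValues eq coefficientMatrixValues) // same object, no copy
```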

Threshold/thresholds (TODO)

Currently, when threshold is set we clear whatever value is in thresholds and when thresholds is set we clear whatever value is in threshold. SPARK-11543 was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.

Update: Let's leave it for a follow up PR

Follow up

Contributor Author

I did an offline test to make sure that we can successfully load old models into the new API

Member

How about 2.0.1?

Contributor Author

Will this patch make it into 2.0.1? If so, we'd need to change this to also check the "micro" version number. Otherwise, this check should still be valid.

Member

We're not backporting MLOR to 2.0.x. I get it now: since you do minor.toInt, even a minor version of 0.1 will be loaded the old way.

@sethah
Contributor Author

sethah commented Aug 26, 2016

cc @yanboliang @jkbradley @dbtsai

@SparkQA

SparkQA commented Aug 26, 2016

Test build #64494 has finished for PR 14834 at commit 7cfbcd3.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2016

Test build #64495 has finished for PR 14834 at commit 4048570.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor Author

sethah commented Sep 1, 2016

Also, I'm not sure I understand the MiMa failure. It's complaining about the constructor being different for LogisticRegressionModel, but that constructor has always been private[spark]. I appreciate any thoughts on this.

@yanboliang
Contributor

MiMa does binary compatibility checks on the model constructor even though it's private, so we should exclude it in MimaExcludes.scala.
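(For reference, an exclusion of the kind suggested here would look roughly like this in MimaExcludes.scala; the problem type and target below are assumptions for illustration, not the actual entry from this PR.)

```scala
// Hypothetical MimaExcludes.scala entry excluding the private[spark]
// LogisticRegressionModel constructor from MiMa's binary-compatibility check.
ProblemFilters.exclude[DirectMissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticRegressionModel.this")
```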

@sethah
Contributor Author

sethah commented Sep 2, 2016

@yanboliang Thanks for the tip. Done.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64832 has finished for PR 14834 at commit c52ef66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64831 has finished for PR 14834 at commit 5bce1ba.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@jkbradley
Member

+1 for Option 2: Store the binomial coefficients as a 1 x numFeatures matrix.

It's such an important code path that I think it's worth avoiding the regression for current users.

@sethah
Contributor Author

sethah commented Sep 6, 2016

@jkbradley Thanks for your input. Let's see what @dbtsai thinks as well :)

@dbtsai
Member

dbtsai commented Sep 7, 2016

@sethah Thank you for putting together this PR with detailed documentation. For option 2, if a two-class model is trained with the multinomial family, how do you store it? I was thinking that maybe we could always store the coefficients as nClasses x numFeatures, and when nClasses == 2, convert them to 1 x numFeatures for prediction. That way we don't lose performance and we also have a consistent representation of models.

@sethah
Contributor Author

sethah commented Sep 7, 2016

numClasses  isMultinomial  coefficientMatrix size
3+          true           3+ x numFeatures
2           true           2 x numFeatures
2           false          1 x numFeatures

The current behavior is as follows:

  • If it is binary classification trained with multinomial family, then we store 2 x numFeatures coefficients in a matrix. We will predict with this matrix (i.e. we do not convert to 1 x numFeatures).
  • If it is binary classification trained with binomial family, then we store 1 x numFeatures (i.e. these coefficients are pivoted) and we use a DenseVector instead of a matrix for prediction.

The coefficients are always stored in a single array. There is always a coefficientMatrix backed by that array, which in some cases has only 1 row. When the family is binomial, we also have a coefficients vector backed by the same array as the matrix, and we use that vector for prediction in the binomial case.
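The two prediction paths described above can be sketched with plain arrays standing in for Spark's linalg types (function names are illustrative, not Spark's actual API):

```scala
// Binomial case: a single pivoted coefficient row and one intercept.
def binomialMargin(coef: Array[Double], intercept: Double,
    features: Array[Double]): Double =
  coef.zip(features).map { case (w, x) => w * x }.sum + intercept

// Multinomial case: one margin per class, from a numClasses x numFeatures
// matrix of coefficients and a vector of intercepts.
def multinomialMargins(coefRows: Array[Array[Double]], intercepts: Array[Double],
    features: Array[Double]): Array[Double] =
  coefRows.zip(intercepts).map { case (row, b) =>
    row.zip(features).map { case (w, x) => w * x }.sum + b
  }
```

The binomial path does half the multiply-adds of a 2 x numFeatures multinomial model, which is the regression option 2 avoids.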

Hopefully that clears it up. I don't think it's necessary to convert the case of multinomial family but binary classification to 1 x numFeatures for prediction since it won't be a regression and users would have to explicitly specify that family (hopefully knowing the consequences of that choice).

I also vote for Option 2 in the original description. We can avoid any regressions with past versions and the implementation isn't too messy.

@jkbradley
Member

jkbradley commented Sep 7, 2016

@dbtsai For numClasses = 2, this conversion would involve copying half of the array (since Vector constructors require Arrays, not views), and it would mean doubling the size of the model. That's fine for small models but pretty expensive for large ones, and it would happen on the driver. Just saw @sethah's new comment. +1 for his approach!

@dbtsai
Member

dbtsai commented Sep 7, 2016

@sethah +1 for this approach. A couple of minor questions. With L1, the coefficients can be very sparse. Currently, we store them as a sparse vector and use the sparse vector for prediction. (The decision to store as a sparse or dense vector is based on size, not prediction speed, and we probably need to do some experiments around that.) Do you plan to always store the coefficients as a dense matrix even in the binomial case?

Also, for 2 classes LOR with multinomial family, will users be able to def coefficients: Vector and def intercept: Double by pivoting the coefficients?

@sethah
Contributor Author

sethah commented Sep 7, 2016

@dbtsai Good point. This patch in its current state would change the behavior of binomial LOR to always have dense coefficients. I think we need to find a solution to this. I wonder why there isn't a compressed method for Matrix?

If we store the coefficients as SparseMatrix in some L1 cases, then before prediction we have to convert it to a SparseVector. This amounts to an extra 4 * nnz bytes being stored (we have to create the sparse vector indices since we cannot reuse them from the matrix case). We could implement a compressed method for matrices if we are ok with the extra storage overhead.

Otherwise I guess we'd have to store the binomial case as a vector and then do some conversion to matrix iff coefficientMatrix is called.

Finally, I don't think it's necessary to pivot the coefficients in the case of 2 classes with multinomial family. Currently, we throw an exception.

@dbtsai
Member

dbtsai commented Sep 7, 2016

@sethah I remember that a compressed method for Matrix is one of the todos in the follow-up tasks. For sparse binary logistic regression, if we store the models as 1 x numFeatures compressed sparse row (CSR) matrices, I think the space will be the same as the current sparse vector implementation. And the CSR format should be convertible to a sparse vector without changing the underlying data structure.
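The zero-copy conversion described here can be illustrated with plain arrays (a hypothetical sketch, not Spark's actual linalg code): for a single-row row-major sparse matrix, the column-index array is exactly the index array a sparse vector needs.

```scala
// A 1 x 6 CSR matrix with nonzeros at columns 1 and 4. For a single-row
// row-major sparse matrix, the column indices double as SparseVector
// indices, so conversion reuses both underlying arrays without copying.
val numFeatures = 6
val colIndices = Array(1, 4)
val values = Array(0.7, -0.3)
// "SparseVector" sketched as (size, indices, values): arrays reused as-is.
val sparseVec = (numFeatures, colIndices, values)
assert(sparseVec._2 eq colIndices) // indices array shared, no copy
assert(sparseVec._3 eq values)     // values array shared, no copy
```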

Throwing an exception in the case of 2 classes with the multinomial family sounds good to me.

@sethah
Contributor Author

sethah commented Sep 8, 2016

@dbtsai Yeah, if we store it as a row major sparse matrix then the rowIndices will exactly be the indices needed for the sparse vector. We'll have to add some functionality to the linalg classes to accomplish this. I can look into it. We can continue moving forward for this PR without it, and address the compressed option later, but IMO it must be done before 2.1 release. Otherwise, we can block this PR until it is done.

@dbtsai
Member

dbtsai commented Sep 8, 2016

@sethah For sparse MLOR problems with L1, the models will be sparse by row. As a result, in the sparse case we need to store the models in CSR format, and CSR models can be used for prediction with a potential speedup (although we need to benchmark to see how much speedup we get). Let's have a separate PR to implement the compressed option for matrices. It will be a bit complicated: by default, compressed has to decide between CSR and CSC depending on the compression rate, and users need an option to choose the format as well.

Member

Could you import with the full classpath instead of using a wildcard?

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65476 has finished for PR 14834 at commit 38fad98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

set(threshold, value)
}


Member

remove this extra new line.

Contributor Author

Done.

@Since("1.3.0") val intercept: Double)
@Since("2.1.0") val coefficientMatrix: Matrix,
@Since("2.1.0") val interceptVector: Vector,
@Since("1.3.0") override val numClasses: Int,
Member

How about we make numClasses a function determined by isMultinomial and coefficientMatrix.numRows? That way we can remove one parameter from the constructor.

Contributor Author

Actually that won't work under the current edge-case behaviors. When the labels are all 0.0, the coefficient matrix will have only one row regardless of multinomial or binomial. We could potentially change this behavior, though, e.g. if we always assume there will be at minimum two classes.

Member

Okay, let's merge it as is for now. I want to make a change so we don't train on classes that are unseen. I'll address these together. Thanks.

@Since("2.1.0") val coefficientMatrix: Matrix,
@Since("2.1.0") val interceptVector: Vector,
@Since("1.3.0") override val numClasses: Int,
private val isMultinomial: Boolean)
Member

Actually, isMultinomial can be determined by coefficientMatrix.numRows as well.

private val isMultinomial: Boolean)
extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
with LogisticRegressionParams with MLWritable {

Member

Can we have a require(coefficientMatrix.numRows == interceptVector.length) here?

Contributor Author

Done.

* $$
* </blockquote></p>
*
*
Member

remove extra line

Contributor Author

Done.

import org.apache.spark.ml.feature.{Instance, LabeledPoint}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg._
Member

avoid import _ if possible

Contributor Author

Done.

ParamsSuite.checkParams(new LogisticRegression)
val model = new LogisticRegressionModel("logReg", Vectors.dense(0.0), 0.0)
val model = new LogisticRegressionModel("logReg",
new DenseMatrix(1, 1, Array(0.0)), Vectors.dense(0.0), 2, isMultinomial = false)
Member

If we have the old constructor, revert this.

Contributor Author

Done, here and below.

.setThreshold(0.6)
val lrModel = new LogisticRegressionModel(lr.uid, Vectors.dense(1.0, 2.0), 1.2)
val lrModel = new LogisticRegressionModel(lr.uid,
new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
Member

ditto. revert this.

.setThreshold(0.6)
val lrModel = new LogisticRegressionModel(lr.uid, Vectors.dense(1.0, 2.0), 1.2)
val lrModel = new LogisticRegressionModel(lr.uid,
new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
Member

ditto. revert this.

ParamsSuite.checkParams(new OneVsRest)
val lrModel = new LogisticRegressionModel("lr", Vectors.dense(0.0), 0.0)
val lrModel = new LogisticRegressionModel("logReg",
new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
Member

ditto. revert this.

@dbtsai
Member

dbtsai commented Sep 20, 2016

LGTM. Wait for the test.

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65622 has finished for PR 14834 at commit 4dae595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Sep 20, 2016

Merged into master. Thanks.
