[SPARK-17163][ML] Unified LogisticRegression interface#14834
sethah wants to merge 24 commits into apache:master from
Conversation
I did an offline test to make sure that we can successfully load old models into the new API
Will this patch make it into 2.0.1? If so, we'd need to change this to also check the "micro" version number. Otherwise, this check should still be valid.
We're not backporting MLOR to 2.0.x. I get it now: since you do `minor.toInt`, even if the minor version is 0.1 you will load it in the old way.
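The version gate being discussed can be sketched in plain Scala: split the saved Spark version into major/minor and pick the load path from that, so the micro version never matters. The object and method names below are illustrative, not Spark's actual metadata schema.

```scala
object ModelVersionGate {
  // Parse "major.minor[.micro]" into (major, minor). Any micro component
  // is simply ignored, which is why only a 2.0.x backport would require
  // checking it.
  def majorMinor(sparkVersion: String): (Int, Int) = {
    val parts = sparkVersion.split("\\.")
    (parts(0).toInt, parts(1).toInt)
  }

  // Models saved before 2.1 stored a single coefficients vector; 2.1+
  // stores a coefficient matrix, so choose the load path accordingly.
  def useLegacyLoadPath(sparkVersion: String): Boolean = {
    val (major, minor) = majorMinor(sparkVersion)
    major < 2 || (major == 2 && minor == 0)
  }
}
```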
Test build #64494 has finished for PR 14834 at commit
Test build #64495 has finished for PR 14834 at commit
Also, I'm not sure I understand the MiMa failure. It's complaining about the constructor being different for
MiMa does a binary compatibility check on the model constructor even though it's private, so we should exclude it at
@yanboliang Thanks for the tip. Done.
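For reference, Spark's MiMa exclusions live in `project/MimaExcludes.scala`; an exclusion for a changed private constructor looks roughly like the following. The specific problem-filter class shown is an assumption, since MiMa reports the exact filter to use.

```scala
// project/MimaExcludes.scala (sketch): suppress the binary-compatibility
// report for the private LogisticRegressionModel constructor. The filter
// type shown is illustrative; use the one MiMa actually reports.
ProblemFilters.exclude[DirectMissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticRegressionModel.this")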
Test build #64832 has finished for PR 14834 at commit
Test build #64831 has finished for PR 14834 at commit
+1 for Option 2: Store the binomial coefficients as a 1 x numFeatures matrix. It's such an important code path that I think it's worth avoiding the regression for current users.
@jkbradley Thanks for your input. Let's see what @dbtsai thinks as well :)
@sethah Thank you for coming up with a PR with detailed documentation. For option 2, if a two-class model is trained with the multinomial family, how do you store it? I was thinking that maybe we could always store the coefficients as
The current behavior is as follows:
The coefficients are stored in an array, truly. There is always

Hopefully that clears it up. I don't think it's necessary to convert the case of multinomial family with binary classification to

I also vote for Option 2 in the original description. We can avoid any regressions with past versions and the implementation isn't too messy.
@sethah +1 for this approach. A couple of minor questions. With L1, the coefficients can be very sparse. Currently, we store them as a sparse vector and use the sparse vector for prediction. (Whether to store as a sparse or dense vector is decided based on size, not prediction speed, and we probably need to do some experiments around it.) Do you plan to always store the coefficients as a dense matrix even for the binomial case? Also, for 2-class LOR with the multinomial family, will users be able to
@dbtsai Good point. This patch in its current state would change the behavior of binomial LOR to always have dense coefficients. I think we need to find a solution to this. I wonder why there isn't a

If we store the coefficients as

Otherwise I guess we'd have to store the binomial case as a vector and then do some conversion to a matrix iff

Finally, I don't think it's necessary to pivot the coefficients in the case of 2 classes with the multinomial family. Currently, we throw an exception.
@sethah I remember that

Throwing an exception in the case of 2 classes with the multinomial family sounds good to me.
@dbtsai Yeah, if we store it as a row-major sparse matrix then the
@sethah For sparse MLOR problems with L1, the models will be sparse by row. As a result, in the sparse case we need to store the models in CSR format, and CSR models can be used for model prediction with a potential speedup (although we need to run benchmarks to see how much speedup we get). Let's have a separate PR to implement
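The row-wise CSR layout mentioned above can be sketched with plain arrays, without Spark's `SparseMatrix` type. The point is that each row's nonzeros are stored contiguously, so the per-class margin for MLOR prediction only scans that row's entries. The type and method names are illustrative.

```scala
// Minimal CSR (compressed sparse row) sketch, not Spark's SparseMatrix.
// rowPtrs(r) .. rowPtrs(r + 1) delimits row r's nonzeros in
// colIndices/values, so a row dot product touches only that slice.
case class CsrMatrix(
    numRows: Int,
    numCols: Int,
    rowPtrs: Array[Int],     // length numRows + 1
    colIndices: Array[Int],  // column index of each nonzero
    values: Array[Double]) { // value of each nonzero

  // margin(row) = dot(coefficients(row, ::), x) for a dense feature vector x
  def rowDot(row: Int, x: Array[Double]): Double = {
    var sum = 0.0
    var i = rowPtrs(row)
    while (i < rowPtrs(row + 1)) {
      sum += values(i) * x(colIndices(i))
      i += 1
    }
    sum
  }
}
```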
Could you import with the full classpath instead of using a wildcard?
Test build #65476 has finished for PR 14834 at commit
    set(threshold, value)
    }
    // before:
    @Since("1.3.0") val intercept: Double)
    // after:
    @Since("2.1.0") val coefficientMatrix: Matrix,
    @Since("2.1.0") val interceptVector: Vector,
    @Since("1.3.0") override val numClasses: Int,
How about we make numClasses a function determined by isMultinomial and coefficientMatrix.numRows? That way we can drop one parameter from the constructor.
Actually, that won't work under the current edge-case behavior. When the labels are all 0.0, the coefficient matrix has only one row regardless of multinomial or binomial. We could potentially change this behavior, though, if we always assume there will be at minimum two classes.
Okay, let's merge it as is for now. I want to make a change so we don't train on classes that are not seen. I'll address them together. Thanks.
    @Since("2.1.0") val coefficientMatrix: Matrix,
    @Since("2.1.0") val interceptVector: Vector,
    @Since("1.3.0") override val numClasses: Int,
    private val isMultinomial: Boolean)
Actually, isMultinomial can be determined by coefficientMatrix.numRows as well.
    private val isMultinomial: Boolean)
      extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
      with LogisticRegressionParams with MLWritable {
Can we have a `require(coefficientMatrix.numRows == interceptVector.size)` here?
    * $$
    * </blockquote></p>
    *
    // before:
    import org.apache.spark.ml.feature.{Instance, LabeledPoint}
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    // after:
    import org.apache.spark.ml.feature.LabeledPoint
    import org.apache.spark.ml.linalg._
    ParamsSuite.checkParams(new LogisticRegression)
    // before:
    val model = new LogisticRegressionModel("logReg", Vectors.dense(0.0), 0.0)
    // after:
    val model = new LogisticRegressionModel("logReg",
      new DenseMatrix(1, 1, Array(0.0)), Vectors.dense(0.0), 2, isMultinomial = false)
If we keep the old constructor, revert this.
    .setThreshold(0.6)
    // before:
    val lrModel = new LogisticRegressionModel(lr.uid, Vectors.dense(1.0, 2.0), 1.2)
    // after:
    val lrModel = new LogisticRegressionModel(lr.uid,
      new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
    ParamsSuite.checkParams(new OneVsRest)
    // before:
    val lrModel = new LogisticRegressionModel("lr", Vectors.dense(0.0), 0.0)
    // after:
    val lrModel = new LogisticRegressionModel("logReg",
      new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
|
LGTM. Waiting for the tests.
Test build #65622 has finished for PR 14834 at commit
Merged into master. Thanks.
What changes were proposed in this pull request?
Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`. Marked as WIP because we should discuss the coefficients API in the model. See discussion below.
JIRA: SPARK-17163
How was this patch tested?
Merged test suites and added some new unit tests.
Design
Switching between binomial and multinomial
We default to automatically detecting whether we should run binomial or multinomial LOR. We expose a new parameter called `family`, which defaults to "auto". When "auto" is used, we run normal binomial LOR with pivoting if there are 1 or 2 label classes; otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass LOR is detected, we throw an error.

coefficients/intercept model API (TODO)
This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:
- The old binomial API exposes `def coefficients: Vector` and `def intercept: Double`.
- The multinomial model has `numClasses` sets of coefficients and `numClasses` intercepts.

Some options:
1. Store the binomial coefficients as a `2 x numFeatures` matrix. This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 x numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time.
2. Store the binomial coefficients as a `1 x numFeatures` matrix. We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for the predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.

If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.
Update: We have chosen option 2
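The zero-copy idea behind option 2 can be sketched without Spark's linalg types: the public coefficients vector and the internal `1 x numFeatures` matrix wrap the same underlying array, so no data is duplicated. The classes below are simplified stand-ins, not Spark's `DenseMatrix`/`DenseVector`.

```scala
// Simplified stand-ins for the matrix and vector types, to show that
// option 2 duplicates no data: both views share one underlying array.
class RowMajorMatrix(val numRows: Int, val numCols: Int, val values: Array[Double])

class VectorView(val values: Array[Double]) {
  def apply(i: Int): Double = values(i)
}

class BinomialModelSketch(coefficientMatrix: RowMajorMatrix) {
  require(coefficientMatrix.numRows == 1, "binomial model stores 1 x numFeatures")

  // Backed by the same array as the matrix: zero extra storage, matching
  // the option 2 design.
  def coefficients: VectorView = new VectorView(coefficientMatrix.values)
}
```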
Threshold/thresholds (TODO)
Currently, when `threshold` is set we clear whatever value is in `thresholds`, and when `thresholds` is set we clear whatever value is in `threshold`. SPARK-11543 was created to prefer `thresholds` over `threshold`. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.

Update: Let's leave it for a follow up PR
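In the binary case the two params encode the same decision rule: predict class 1 when p(1)/thresholds(1) > p(0)/thresholds(0), which reduces to p(1) > threshold. A sketch of the conversion between the two representations (helper names are made up for illustration):

```scala
// Binary-case relationship between the scalar `threshold` param and the
// two-element `thresholds` param. Helper names are illustrative only.
object ThresholdConversion {
  // A scalar threshold t corresponds to thresholds (1 - t, t).
  def toThresholds(threshold: Double): Array[Double] =
    Array(1.0 - threshold, threshold)

  // Recover the scalar threshold from a two-element thresholds array:
  // p(1) / t(1) > p(0) / t(0) with p(0) = 1 - p(1) solves to
  // p(1) > 1 / (1 + t(0) / t(1)).
  def toThreshold(thresholds: Array[Double]): Double = {
    require(thresholds.length == 2, "binary case only")
    1.0 / (1.0 + thresholds(0) / thresholds(1))
  }
}
```

The round trip is the identity, which is why keeping both params in sync (rather than clearing one) is feasible in the binary case.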
Follow up