[SPARK-9112] [ML] Implement Stats for LogisticRegression #7538

MechCoder · 2015-07-20T14:12:03Z

I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression, with LogisticRegressionTrainingSummary and LogisticRegressionSummary.

I have some questions, which I have asked inline.

MechCoder · 2015-07-20T14:12:10Z

@feynmanliang

MechCoder · 2015-07-20T14:12:30Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Should this be returned as a dataframe?

Yes; ditto for the other metrics.

What is the stepSize for the ROC curve (maybe put in doc)?

Does this (and the other metrics) even need to be a distributed data structure? It's hard to imagine we care about so many decision thresholds that they won't fit on a single machine. I understand the RDD used in BinaryClassificationMetrics is used to parallelize evaluation, but it's probably fine to collect them here and use a local data structure.

If we need to keep these distributed, I suggest making them transient, since this summary will be sent to every executor that uses the model (e.g. during prediction on an RDD, the enclosing class of model.predict is serialized in the closure).

It seems that the size of the ROC curve and all other metrics is equal to the size of the data (i.e. it chooses every possible score as a threshold), hence they are stored in a distributed way. I'm not sure that this is necessary (especially when the data is very large).

This is controlled by the numBins parameter (which I had missed). Any idea how to make this accessible to the user? Maybe have a setBins parameter in BinaryClassificationMetrics?
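To make the size concern above concrete, here is a minimal local-collection sketch, not Spark's implementation. `RocSketch`, `rocPoints`, and `thin` are hypothetical names, and the thinning step is only one plausible way to cap the number of points in the spirit of numBins. It assumes labels are 1.0 or 0.0 and that both classes are present.

```scala
object RocSketch {
  /** ROC points (falsePositiveRate, truePositiveRate), one per distinct score.
    * Assumes labels are 1.0 (positive) or 0.0 (negative) and both classes occur. */
  def rocPoints(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Double)] = {
    val positives = scoreAndLabels.count(_._2 == 1.0).toDouble
    val negatives = scoreAndLabels.size - positives
    // Every distinct score becomes a candidate threshold, highest first.
    val thresholds = scoreAndLabels.map(_._1).distinct.sortWith(_ > _)
    val points = thresholds.map { t =>
      val predictedPositive = scoreAndLabels.filter(_._1 >= t)
      val truePositives = predictedPositive.count(_._2 == 1.0)
      val falsePositives = predictedPositive.size - truePositives
      (falsePositives / negatives, truePositives / positives)
    }
    // Prepend the origin so the curve starts at (0, 0).
    (0.0, 0.0) +: points
  }

  /** Hypothetical thinning: keep the last point of each group so that
    * at most numBins points remain and the final point is preserved. */
  def thin[A](points: Seq[A], numBins: Int): Seq[A] = {
    require(numBins > 0, "numBins must be positive")
    val step = math.max(1, math.ceil(points.size.toDouble / numBins).toInt)
    points.grouped(step).map(_.last).toSeq
  }
}
```

With n distinct scores, `rocPoints` returns n + 1 points, which is why an unbinned curve grows with the data; `thin(points, numBins)` bounds that regardless of n.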

SparkQA · 2015-07-20T16:14:09Z

Test build #37832 has finished for PR 7538 at commit 70a0fc4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

feynmanliang · 2015-07-20T21:30:21Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Haha, this is embarrassing :P

feynmanliang · 2015-07-20T21:57:39Z

Made a first pass.

Maybe it might make sense to make a HasSummary trait with the hasSummary and setSummary logic for *Models to mix-in? Then we can introduce a ModelSummary trait implemented by all the summary objects.

feynmanliang · 2015-07-20T22:05:00Z

Forgot to add, would be nice to include @transient confusion : RDD[(threshold: Double, confusionMatrix: BinaryConfusionMatrix)] in LogisticRegressionSummary since it's a good diagnostic tool despite already being summarized by precision/recall/etc (although may be hard to get this into a DataFrame since BinaryConfusionMatrix doesn't have a UDT, @jkbradley ?) ... We do that in LinearRegressionSummary with residuals

MechCoder · 2015-07-21T08:57:13Z

Thanks a lot for your kind reviews :)

Maybe it might make sense to make a HasSummary trait with the hasSummary and setSummary

Yes, indeed. Where should such a trait go? Should we have an ml/summary? Would it be better to refactor this in a different PR or in this one?

Also, it might help to make a RegressionSummary or a ClassificationSummary after this has been done, because most of the regression metrics and classification metrics are common to different ML algorithms.

MechCoder · 2015-07-21T08:58:28Z

Btw, I assume it has been decided not to add hinge loss, log loss, etc.?

MechCoder · 2015-07-21T09:20:15Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Can you please explain why this copyValues is necessary? Also, I'm unable to understand how $(probabilityCol) gives a string, because when I do this:

val model = lr.fit(dataset)
$(lr.probabilityCol)

I get

error: not found: value $
$(probabilityCol)

$ is defined in Params, which LogisticRegression mixes in via LogisticRegressionParams. See https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L463

Without copyValues, the model you return will not contain any non-default user-specified params (e.g. predictionCol).

Ah, I see thanks !
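The two points in this exchange can be illustrated with a toy model of the mechanism. Everything below (`Param`, `MiniParams`, `MiniEstimator`, `MiniModel`) is a hypothetical, heavily simplified stand-in and not the spark.ml source; it only assumes that `$` is a protected method of the params trait and that copyValues transfers user-set values to another instance.

```scala
import scala.collection.mutable

// Hypothetical, heavily simplified stand-ins for spark.ml's Param and Params.
final case class Param[T](name: String, default: T)

trait MiniParams {
  // Values explicitly set by the user, keyed by param name.
  private val userValues = mutable.Map.empty[String, Any]

  def set[T](param: Param[T], value: T): this.type = {
    userValues(param.name) = value
    this
  }

  // `$` is an ordinary protected method, so it is only in scope inside
  // classes that mix in the trait; a bare `$(...)` elsewhere does not compile.
  protected final def $[T](param: Param[T]): T =
    userValues.getOrElse(param.name, param.default).asInstanceOf[T]

  // Copies user-set values onto another instance and returns that instance.
  protected def copyValues[M <: MiniParams](to: M): M = {
    val target: MiniParams = to
    target.userValues ++= userValues
    to
  }
}

class MiniModel extends MiniParams {
  val probabilityCol: Param[String] = Param("probabilityCol", "probability")
  def probabilityColName: String = $(probabilityCol)
}

class MiniEstimator extends MiniParams {
  val probabilityCol: Param[String] = Param("probabilityCol", "probability")
  def setProbabilityCol(value: String): this.type = set(probabilityCol, value)

  // Without copyValues the model only sees defaults.
  def fitWithoutCopy(): MiniModel = new MiniModel
  // With copyValues the user's settings carry over to the model.
  def fit(): MiniModel = copyValues(new MiniModel)
}
```

In this sketch, a model built by `fitWithoutCopy()` reports the default column name, while one built by `fit()` reports whatever the user set on the estimator.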

SparkQA · 2015-07-21T11:30:02Z

Test build #37947 has finished for PR 7538 at commit fbed861.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2015-07-21T13:29:27Z

I've addressed your comments about the dataframe storage.

SparkQA · 2015-07-21T14:00:37Z

Test build #37952 has finished for PR 7538 at commit 80d9954.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2015-07-25T14:21:52Z

@feynmanliang any news on this? thanks.

feynmanliang · 2015-07-25T15:28:51Z

@MechCoder sorry for the delays! We are having a hackathon at my work; I will review when I am in the office on Monday.

MechCoder · 2015-07-25T15:30:30Z

okay, thanks :)

feynmanliang · 2015-07-27T19:40:46Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

This should be a def; we only want to lazily evaluate roc if the user asks for it (same thing is going on in BinaryClassificationMetrics). Ditto for others
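For reference, a small sketch of the evaluation-time difference behind this comment. The class names and the `compute` callback are hypothetical; the final commit in this PR (2e9f7c7, "Change defs into lazy vals") ends up with the lazy val form.

```scala
// `compute` stands in for an expensive metric computation such as roc.
class EagerSummary(compute: () => Seq[Double]) {
  val roc: Seq[Double] = compute() // evaluated when the summary is constructed
}

class LazySummary(compute: () => Seq[Double]) {
  lazy val roc: Seq[Double] = compute() // evaluated on first access, then cached
}

class DefSummary(compute: () => Seq[Double]) {
  def roc: Seq[Double] = compute() // re-evaluated on every access
}

object EvaluationCounter {
  /** Runs `use` with a counting computation and returns how often it was invoked. */
  def countCalls(use: (() => Seq[Double]) => Unit): Int = {
    var calls = 0
    use(() => { calls += 1; Seq(0.0, 1.0) })
    calls
  }
}
```

A plain val pays the cost even if the user never asks for the metric; a def defers it but repeats the work on each call; a lazy val defers it and computes it at most once.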

MechCoder · 2015-08-04T07:42:52Z

retest this please

SparkQA · 2015-08-04T08:28:48Z

Test build #201 has finished for PR 7538 at commit d775371.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sealed trait LogisticRegressionTrainingSummary extends Serializable
- sealed trait LogisticRegressionSummary extends Serializable

SparkQA · 2015-08-04T08:32:55Z

Test build #39685 has finished for PR 7538 at commit d775371.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sealed trait LogisticRegressionTrainingSummary extends Serializable
- sealed trait LogisticRegressionSummary extends Serializable

jkbradley · 2015-08-05T22:02:52Z

Are we planning to have a MulticlassLogisticRegressionSummary inheriting from LogisticRegressionSummary in the future? Without that, I'm unable to understand how using a trait would help, since there is no access to the predictions DataFrame.

Yes, MulticlassLogisticRegressionSummary should be analogous to the binary version, with both inheriting from LogisticRegressionSummary.

Can you give a concrete example of how not using a sealed trait will break the API?

Adding a method to a trait is a breaking API change. If a user has implemented some class which extends the trait, then adding a method to the trait will mean the user's class will no longer implement all of the methods it needs to. Marking it sealed will prevent users from extending the trait so that we can add more methods in the future.
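A minimal sketch of the sealed-trait argument, using hypothetical names (`ModelSummary`, `BasicModelSummary`, `ModelSummaries`) rather than the classes in this PR:

```scala
// Library side: every class that extends a sealed trait must live in the
// same source file as the trait.
sealed trait ModelSummary extends Serializable {
  def numPredictions: Long
  // A later release could add another abstract member here without breaking
  // users, because no user code can contain a subclass that lacks it.
}

// The only implementation, owned by the library.
final class BasicModelSummary(val numPredictions: Long) extends ModelSummary

object ModelSummaries {
  def forCount(n: Long): ModelSummary = new BasicModelSummary(n)
}

// User side: calling members of ModelSummary works as usual, but in another
// file a declaration such as
//   class MySummary extends ModelSummary { def numPredictions = 0L }
// is rejected by the compiler, so the library keeps the freedom to add methods.
```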

jkbradley · 2015-08-05T22:15:18Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

This should be the generic summary, not the binary one.

Now it makes sense. Sorry about this.

jkbradley · 2015-08-05T22:16:05Z

Just a few items remain

SparkQA · 2015-08-06T10:13:31Z

Test build #40027 has finished for PR 7538 at commit 2e9f7c7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
- sealed trait LogisticRegressionSummary extends Serializable

MechCoder · 2015-08-06T15:22:35Z

@jkbradley I have addressed your comments in the last commit. I have a few last minor questions.

For accessing the pr(), fMeasureByThreshold etc, I'll have to do model.summary.asInstanceOf[Binary..] I suppose that should be okay, right? (Similar things are being done in LDAModel etc)
If I make objectiveHistory and totalIterations defs, then that would be different from the LinearRegressionSummary, where they would be vals. This would create differences when being called from Java, i.e. I'll have to do objectiveHistory() for Logistic and objectiveHistory for Linear.

jkbradley · 2015-08-06T17:07:41Z

For accessing the pr(), fMeasureByThreshold etc, I'll have to do model.summary.asInstanceOf[Binary..] I suppose that should be okay, right? (Similar things are being done in LDAModel etc)

I agree it's a bit awkward, but I prefer that to providing null/bad values. The other big choice we could have made when creating spark.ml is separate binary and multiclass algorithms, but that would have created a bunch of copied APIs.
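The trade-off discussed here can be sketched as follows. The trait and class names are hypothetical; the pattern match is shown as an alternative to a bare asInstanceOf, which would throw a ClassCastException when the summary is not binary.

```scala
// Hypothetical hierarchy: a generic summary and a binary-only extension.
sealed trait LrSummary {
  def predictionCol: String
}

sealed trait BinaryLrSummary extends LrSummary {
  def areaUnderROC: Double
}

final class GenericLrSummaryImpl(val predictionCol: String) extends LrSummary

final class BinaryLrSummaryImpl(val predictionCol: String, val areaUnderROC: Double)
  extends BinaryLrSummary

object SummaryAccess {
  /** Binary-only metrics are reachable only after narrowing the type;
    * the generic summary never has to return null or a placeholder value. */
  def areaUnderROC(summary: LrSummary): Option[Double] = summary match {
    case binary: BinaryLrSummary => Some(binary.areaUnderROC)
    case _ => None
  }
}
```

Callers who know the model is binary can still write `summary.asInstanceOf[BinaryLrSummary]`, as proposed in the thread; the match simply makes the non-binary case explicit.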

If I make objectiveHistory and totalIterations defs, then that would be different from the LinearRegressionSummary, where they would be vals. This would create differences when being called from Java, i.e. I'll have to do objectiveHistory() for Logistic and objectiveHistory for Linear.

I don't think def and val look different from Java. The Scala compiler creates both as methods, so they should appear to be the same for the Java and Scala APIs.

LGTM. Thanks for iterating through updates with me! I'll merge this with master and branch-1.5

I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression with LogisticRegressionTrainingSummary and LogisticRegressionSummary

I have some queries and asked them inline.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7538 from MechCoder/log_reg_stats and squashes the following commits:

2e9f7c7 [MechCoder] Change defs into lazy vals
d775371 [MechCoder] Clean up class inheritance
9586125 [MechCoder] Add abstraction to handle Multiclass Metrics
40ad8ef [MechCoder] minor
640376a [MechCoder] remove unnecessary dataframe stuff and add docs
80d9954 [MechCoder] Added tests
fbed861 [MechCoder] DataFrame support for metrics
70a0fc4 [MechCoder] [SPARK-9112] [ML] Implement Stats for LogisticRegression

(cherry picked from commit c5c6ade)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

MechCoder · 2015-08-06T17:10:41Z

All right, the reason I thought it would be different is that in the last commit (MechCoder@2e9f7c7#diff-1747fe912f0ee426f29b3613e6b0a197R156) just doing model.summary raised an error, while in Scala it works.

jkbradley · 2015-08-06T17:12:12Z

Oh, I see. That's because Scala can have method calls without parentheses, whereas Java requires the parentheses.

MechCoder · 2015-08-06T17:12:43Z

Yes, that's what I had meant. Would that be okay?

jkbradley · 2015-08-06T17:19:26Z

Yes, that's fine since the same method works in both languages.
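The point that a val and a def look the same from Java can be checked with reflection. This is a small sketch with hypothetical class names; it relies only on the fact that the Scala compiler exposes a public val through a zero-argument accessor method of the same name.

```scala
// Two hypothetical summaries exposing the same member, once as a val and once as a def.
class SummaryWithVal {
  val totalIterations: Int = 10
}

class SummaryWithDef {
  def totalIterations: Int = 10
}

object AccessorCheck {
  /** True if the class exposes a public zero-argument method with the given name,
    * which is what a Java caller would invoke as name(). */
  def hasZeroArgMethod(cls: Class[_], name: String): Boolean =
    cls.getMethods.exists(m => m.getName == name && m.getParameterCount == 0)
}
```

Both classes expose `totalIterations()` to Java; only Scala callers may drop the parentheses.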

MechCoder · 2015-08-06T17:29:36Z

Should I open a JIRA to refactor again into a general RegressionSummary and ClassificationSummary since almost all of these metrics would be common to all algorithms? I can open a JIRA for the setBins as well.

jkbradley · 2015-08-06T18:01:07Z

Sure, the refactoring sounds great, thanks! Please link to the R-like stats for models JIRA.
I think setBins is low priority for now but would be good eventually.

[SPARK-9112] [ML] Implement Stats for LogisticRegression

70a0fc4

MechCoder reviewed Jul 20, 2015

feynmanliang reviewed Jul 20, 2015

MechCoder reviewed Jul 21, 2015

DataFrame support for metrics

fbed861

Added tests

80d9954

MechCoder force-pushed the log_reg_stats branch from ce0aaea to 80d9954 Compare July 21, 2015 13:21

feynmanliang reviewed Jul 27, 2015

jkbradley reviewed Aug 5, 2015

Change defs into lazy vals

2e9f7c7

asfgit closed this in c5c6ade Aug 6, 2015

MechCoder deleted the log_reg_stats branch August 6, 2015 17:10

MechCoder mentioned this pull request Aug 18, 2015

[SPARK-9906] [ML] User guide for LogisticRegressionSummary #8197

Closed

sethah mentioned this pull request Oct 13, 2016

[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression #15435

Closed
