@MechCoder
Contributor

I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression, with LogisticRegressionTrainingSummary and LogisticRegressionSummary.

I have some questions, which I have asked inline.

@MechCoder
Contributor Author

@feynmanliang

Contributor Author

Should this be returned as a dataframe?

Contributor

Yes; ditto for the other metrics.

What is the stepSize for the ROC curve (maybe put in doc)?

Does this (and the other metrics) even need to be a distributed data structure? It's hard to imagine we care about so many decision thresholds that they won't fit on a single machine. I understand the RDD used in BinaryClassificationMetrics is used to parallelize evaluation, but it's probably fine to collect them here and use a local data structure.

If we need to keep these distributed, I suggest making them transient, since this summary will be sent to every executor that uses the model (e.g. during prediction on an RDD, the enclosing class of model.predict is serialized in the closure).
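
For readers following the transient suggestion, here is a minimal sketch of what marking a field `@transient` does under plain Java serialization. `ToySummary` and `TransientDemo` are hypothetical names made up for this illustration; they are not classes from this PR, and Spark's closure serialization is only assumed to behave analogously.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a summary object; not Spark's actual class.
class ToySummary(curve: Array[(Double, Double)]) extends Serializable {
  // Skipped by Java serialization, so it comes back as null after a round trip.
  @transient val roc: Array[(Double, Double)] = curve
  // Small scalar metric that is cheap to ship along with the model.
  val numPoints: Int = curve.length
}

object TransientDemo {
  // Serializes a value to bytes and reads it back, as shipping it to another JVM would.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

After `TransientDemo.roundTrip(new ToySummary(...))`, the restored object keeps `numPoints` but its `roc` field is `null`, so the large structure is not shipped with the object.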

Contributor Author

It seems that the size of the ROC curve and of all the other metrics is equal to the size of the data (i.e. every possible score is chosen as a threshold), hence they are stored in a distributed way. I'm not sure that this is necessary (especially when the data is very large).

Contributor Author

This is controlled by the numBins parameter (which I had not seen). Any idea how to make this accessible to the user? Maybe have a setBins parameter in BinaryClassificationMetrics?
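
To make the trade-off concrete, here is a rough sketch of the difference between using every distinct score as a threshold and keeping only about `numBins` of them. This is an illustration under assumptions, not how BinaryClassificationMetrics implements its binning; `ThresholdSketch` and both method names are invented for the example.

```scala
object ThresholdSketch {
  // One candidate threshold per distinct score, highest first.
  def allThresholds(scores: Seq[Double]): Seq[Double] =
    scores.distinct.sortWith(_ > _)

  // Keeps at most numBins thresholds by taking the first threshold
  // of each equally sized chunk of the sorted list.
  def downsample(thresholds: Seq[Double], numBins: Int): Seq[Double] = {
    require(numBins > 0, "numBins must be positive")
    val chunkSize = math.max(1, math.ceil(thresholds.size.toDouble / numBins).toInt)
    thresholds.grouped(chunkSize).map(_.head).toList
  }
}
```

With the full list, the curve has as many points as there are distinct scores. After downsampling, it has at most `numBins` points, which is small enough to hold locally.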

@SparkQA

SparkQA commented Jul 20, 2015

Test build #37832 has finished for PR 7538 at commit 70a0fc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Update doc

Contributor Author

Haha, this is embarrassing :P

@feynmanliang
Contributor

Made a first pass.

Maybe it might make sense to make a HasSummary trait with the hasSummary and setSummary logic for *Models to mix-in? Then we can introduce a ModelSummary trait implemented by all the summary objects.

@feynmanliang
Contributor

I forgot to add: it would be nice to include @transient confusion: RDD[(threshold: Double, confusionMatrix: BinaryConfusionMatrix)] in LogisticRegressionSummary, since it's a good diagnostic tool despite already being summarized by precision/recall/etc. (It may be hard to get this into a DataFrame, though, since BinaryConfusionMatrix doesn't have a UDT, @jkbradley?) We do something similar in LinearRegressionSummary with residuals.

@MechCoder
Contributor Author

Thanks a lot for your kind reviews :)

> Maybe it might make sense to make a HasSummary trait with the hasSummary and setSummary

Yes, indeed. Where should such a trait go? Should we have an ml/summary? Would it be better to refactor this in a different PR or in this one?

Also, it might help to make a RegressionSummary or a ClassificationSummary after this has been done, because most of the regression metrics and classification metrics are common to different ML algorithms.

@MechCoder
Contributor Author

By the way, I assume it has been decided not to add hinge loss, log loss, etc.?

Contributor Author

Can you please explain why this copyValues is necessary? Also, I'm unable to understand how $(probabilityCol) gives a string, because when I do this:

val model = lr.fit(dataset)
$(lr.probabilityCol)

I get

error: not found: value $
$(probabilityCol)

Contributor

$ is defined in Params, which LogisticRegression mixes in via LogisticRegressionParams. See https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L463

Without copyValues, the model you return will not contain any non-default user-specified params (e.g. predictionCol).
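
The two points in this reply can be sketched with a toy imitation of the pattern. Everything here (`ToyParam`, `ToyParams`, `ToyEstimator`, `ToyModel`) is hypothetical and heavily simplified; it is not Spark's code. It only assumes, as the reply says, that `$` is a member defined on the params trait, which is why it resolves inside a class that mixes the trait in but is "not found" at the top level of a shell session.

```scala
// Hypothetical, simplified imitation of the Params pattern; not Spark's code.
final case class ToyParam[T](name: String, default: T)

trait ToyParams {
  // Holds only the values the user set explicitly.
  private val values = scala.collection.mutable.Map.empty[String, Any]

  protected def set[T](param: ToyParam[T], value: T): this.type = {
    values(param.name) = value
    this
  }

  // Shorthand getter: a member of the trait, so it is only in scope inside subclasses.
  protected final def $[T](param: ToyParam[T]): T =
    values.getOrElse(param.name, param.default).asInstanceOf[T]

  // Copies the explicitly set values onto another instance.
  protected def copyValues[M <: ToyParams](to: M): M = {
    (to: ToyParams).values ++= values
    to
  }
}

class ToyModel extends ToyParams {
  val probabilityCol = ToyParam[String]("probabilityCol", "probability")
  def probabilityColName: String = $(probabilityCol)
}

class ToyEstimator extends ToyParams {
  val probabilityCol = ToyParam[String]("probabilityCol", "probability")
  def setProbabilityCol(value: String): this.type = set(probabilityCol, value)
  // With the copy, the model sees the user's setting; without it, only defaults.
  def fitWithCopy(): ToyModel = copyValues(new ToyModel)
  def fitWithoutCopy(): ToyModel = new ToyModel
}
```

In this sketch, a model produced by `fitWithoutCopy()` silently falls back to the default column name, which is the failure mode the reply describes.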

Contributor Author

Ah, I see, thanks!

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37947 has finished for PR 7538 at commit fbed861.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

I've addressed your comments about the dataframe storage.

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37952 has finished for PR 7538 at commit 80d9954.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

@feynmanliang any news on this? thanks.

@feynmanliang
Contributor

@MechCoder sorry for the delays! We are having a hackathon at my work; I will review when I am in the office on Monday.

@MechCoder
Contributor Author

okay, thanks :)

Contributor

This should be a def; we only want to lazily evaluate roc if the user asks for it (the same thing is going on in BinaryClassificationMetrics). Ditto for the others.
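
The evaluation difference behind this comment can be seen in a small, self-contained sketch. The names (`LazyEvalDemo`, `EagerSummary`, and so on) are hypothetical and the "expensive" computation is a stand-in. A `val` computes at construction, a `def` computes on every call, and a `lazy val` (which the final commit of this PR switched to) computes once on first access.

```scala
object LazyEvalDemo {
  // Counts how often the pretend-expensive metric computation runs.
  var computations = 0

  private def expensiveRoc(): Seq[(Double, Double)] = {
    computations += 1
    Seq((0.0, 0.0), (0.5, 0.8), (1.0, 1.0))
  }

  // val: computed when the summary is constructed, even if nobody reads it.
  class EagerSummary { val roc: Seq[(Double, Double)] = expensiveRoc() }

  // def: nothing is computed at construction; recomputed on every access.
  class DefSummary { def roc: Seq[(Double, Double)] = expensiveRoc() }

  // lazy val: nothing is computed at construction; computed once, then cached.
  class LazySummary { lazy val roc: Seq[(Double, Double)] = expensiveRoc() }
}
```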

@MechCoder
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 4, 2015

Test build #201 has finished for PR 7538 at commit d775371.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sealed trait LogisticRegressionTrainingSummary extends Serializable
    • sealed trait LogisticRegressionSummary extends Serializable

@SparkQA

SparkQA commented Aug 4, 2015

Test build #39685 has finished for PR 7538 at commit d775371.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sealed trait LogisticRegressionTrainingSummary extends Serializable
    • sealed trait LogisticRegressionSummary extends Serializable

@jkbradley
Member

> Are we planning to have a MulticlassLogisticRegressionSummary inheriting from LogisticRegressionSummary in the future? Without that, I'm unable to understand how using a trait would help, since there is no access to the predictions dataframe.

Yes, MulticlassLogisticRegressionSummary should be analogous to the binary version, with both inheriting from LogisticRegressionSummary.

> Can you give a concrete example of how not using a sealed trait will break the API?

Adding a method to a trait is a breaking API change. If a user has implemented some class which extends the trait, then adding a method to the trait will mean the user's class will no longer implement all of the methods it needs to. Marking it sealed will prevent users from extending the trait so that we can add more methods in the future.
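
A minimal sketch of the pattern, using made-up names (`SketchSummary`, `SketchSummaryImpl`) rather than the classes added in this PR. Because the trait is sealed, every implementation must live in the same source file, so adding a method later only requires updating code the library itself owns.

```scala
// Hypothetical names; a sketch of the pattern, not the classes added in this PR.
sealed trait SketchSummary extends Serializable {
  def numIterations: Int
  // Imagine this method was added in a later release. All implementations are
  // in this file, so they can be updated in the same change.
  def converged: Boolean
}

// The only implementation. Code in other files cannot extend SketchSummary,
// so no user class can be left without an implementation of a new method.
final class SketchSummaryImpl(val numIterations: Int, maxIter: Int) extends SketchSummary {
  def converged: Boolean = numIterations < maxIter
}

object SketchSummary {
  def apply(numIterations: Int, maxIter: Int): SketchSummary =
    new SketchSummaryImpl(numIterations, maxIter)
}
```

If `SketchSummary` were not sealed, a user class written against the earlier version of the trait would stop compiling as soon as `converged` was added.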

Member

This should be the generic summary, not the binary one.

Contributor Author

Now it makes sense. Sorry about this.

@jkbradley
Member

Just a few items remain

@SparkQA

SparkQA commented Aug 6, 2015

Test build #40027 has finished for PR 7538 at commit 2e9f7c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
    • sealed trait LogisticRegressionSummary extends Serializable

@MechCoder
Contributor Author

@jkbradley I have addressed your comments in the last commit. I have a few last minor questions.

  1. To access pr(), fMeasureByThreshold, etc., I'll have to do model.summary.asInstanceOf[Binary..]. I suppose that should be okay, right? (Similar things are being done in LDAModel, etc.)
  2. If I make objectiveHistory and totalIterations defs, that would differ from LinearRegressionSummary, where they are vals. This would create differences when called from Java, i.e. I'd have to write objectiveHistory() for Logistic and objectiveHistory for Linear.

@jkbradley
Member

> To access pr(), fMeasureByThreshold, etc., I'll have to do model.summary.asInstanceOf[Binary..]. I suppose that should be okay, right? (Similar things are being done in LDAModel, etc.)

I agree it's a bit awkward, but I prefer that to providing null/bad values. The other big choice we could have made when creating spark.ml is separate binary and multiclass algorithms, but that would have created a bunch of copied APIs.
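
As an illustration of the access pattern being discussed, the sketch below uses invented types (`GenericSummary`, `BinarySummary`, `MulticlassSummary`); they are assumptions for the example, not the PR's classes. A caller who holds the generic summary has to narrow it before reading binary-only metrics. A pattern match does the same narrowing as `asInstanceOf` but returns `None` instead of throwing a `ClassCastException` when the summary is not binary.

```scala
// Hypothetical hierarchy, used only to illustrate the downcast.
sealed trait GenericSummary { def numInstances: Long }

final class BinarySummary(val numInstances: Long, val areaUnderROC: Double)
  extends GenericSummary

final class MulticlassSummary(val numInstances: Long, val numClasses: Int)
  extends GenericSummary

object SummaryAccess {
  // Binary-only metrics exist only on the binary subtype, so the generic type
  // never has to expose null or placeholder values for the multiclass case.
  def areaUnderROC(summary: GenericSummary): Option[Double] = summary match {
    case b: BinarySummary     => Some(b.areaUnderROC)
    case _: MulticlassSummary => None
  }
}
```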

> If I make objectiveHistory and totalIterations defs, that would differ from LinearRegressionSummary, where they are vals. This would create differences when called from Java, i.e. I'd have to write objectiveHistory() for Logistic and objectiveHistory for Linear.

I don't think def and val look different from Java. The Scala compiler creates both as methods, so they should appear to be the same for the Java and Scala APIs.

LGTM. Thanks for iterating through updates with me! I'll merge this with master and branch-1.5

asfgit pushed a commit that referenced this pull request Aug 6, 2015
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7538 from MechCoder/log_reg_stats and squashes the following commits:

2e9f7c7 [MechCoder] Change defs into lazy vals
d775371 [MechCoder] Clean up class inheritance
9586125 [MechCoder] Add abstraction to handle Multiclass Metrics
40ad8ef [MechCoder] minor
640376a [MechCoder] remove unnecessary dataframe stuff and add docs
80d9954 [MechCoder] Added tests
fbed861 [MechCoder] DataFrame support for metrics
70a0fc4 [MechCoder] [SPARK-9112] [ML] Implement Stats for LogisticRegression

(cherry picked from commit c5c6ade)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit closed this in c5c6ade Aug 6, 2015
@MechCoder
Contributor Author

All right, the reason I thought it would be different is that in the last commit (MechCoder@2e9f7c7#diff-1747fe912f0ee426f29b3613e6b0a197R156) just writing model.summary from Java raised an error, while in Scala it works.

@MechCoder deleted the log_reg_stats branch August 6, 2015 17:10
@jkbradley
Member

Oh, I see. That's because Scala can have method calls without parentheses, whereas Java requires the parentheses.
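
This can be checked with plain JVM reflection. In the sketch below (class and object names are hypothetical), a member declared as a `val` and one declared as a `def` both show up as a public zero-argument method. Java callers therefore write `objectiveHistory()` in either case, while Scala callers may drop the parentheses.

```scala
// Hypothetical classes: the same member declared once as a val and once as a def.
class HistoryAsVal { val objectiveHistory: Array[Double] = Array(0.7, 0.5) }
class HistoryAsDef { def objectiveHistory: Array[Double] = Array(0.7, 0.5) }

object AccessorDemo {
  // True if the class exposes a public zero-argument method with this name,
  // which is what a Java caller would invoke as name().
  def hasZeroArgMethod(cls: Class[_], name: String): Boolean =
    cls.getMethods.exists(m => m.getName == name && m.getParameterCount == 0)
}
```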

@MechCoder
Contributor Author

Yes, that's what I had meant. Would that be okay?

@jkbradley
Member

Yes, that's fine since the same method works in both languages.

@MechCoder
Contributor Author

Should I open a JIRA to refactor again into a general RegressionSummary and ClassificationSummary since almost all of these metrics would be common to all algorithms? I can open a JIRA for the setBins as well.

@jkbradley
Member

Sure, the refactoring sounds great, thanks! Please link to the R-like stats for models JIRA.
I think setBins is low priority for now but would be good eventually.
