-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-9112] [ML] Implement Stats for LogisticRegression #7538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be returned as a dataframe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; ditto for the other metrics.
What is the stepSize for the ROC curve (maybe put in doc)?
Does this (and the other metrics) even need to be a distributed data structure? It's hard to imagine we care about so many decision thresholds that they won't fit on a single machine. I understand the RDD used in BinaryClassificationMetrics is used to parallelize evaluation, but it's probably fine to collect them here and use a local data structure.
If we need to keep these distributed, I suggest making it transient since this summary will be sent to every executor that uses the model during (e.g. during prediction on an RDD, the enclosing class of model.predict is serialized in the closure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the size of the ROC curve and all other metrics is equal to the size of the data. (i.e it chooses every possible score as a threshold) , hence they are stored in a distributed way. I'm not sure that this is necessary (especially when the data is very large)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is controlled by the numBins parameter (that I did not see). Any idea how to make this accessible to the user? Maybe have a setBins parameter in BinaryClassificationMetrics?
|
Test build #37832 has finished for PR 7538 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, this is embarassing :P
|
Made a first pass. Maybe it might make sense to make a |
|
Forgot to add, would be nice to include |
|
Thanks a lot for your kind reviews :)
Yes, indeed. Where should such a trait go? Should we have a ml/summary ? Would it better to refactor this in a different PR or this one? Also It might help to make a |
|
Btw, I assume it had been decided not to add hinge loss, log loss etc ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain why this copyValues is necessary? and I'm unable to understand how $(probabilityCol) gives a string because when I do this.
val model = lr.fit(dataset)
$(lr.probabilityCol)
I get
error: not found: value $
$(probabilityCol)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ is defined in Params, which LogisticRegression mixes in via LogisticRegressionParams. See https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L463
Without copyValues, the model you return will not contain any non-default user-specified params (e.g. predictionCol).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see thanks !
|
Test build #37947 has finished for PR 7538 at commit
|
|
I've addressed your comments about the dataframe storage. |
|
Test build #37952 has finished for PR 7538 at commit
|
|
@feynmanliang any news on this? thanks. |
|
@MechCoder sorry for the delays! We are having a hackathon at my work; I will review when I am in the office on monday. |
|
okay, thanks :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a def; we only want to lazily evaluate roc if the user asks for it (same thing is going on in BinaryClassificationMetrics). Ditto for others
|
retest this please |
|
Test build #201 has finished for PR 7538 at commit
|
|
Test build #39685 has finished for PR 7538 at commit
|
Yes, MulticlassLogisticRegressionSummary should be analogous to the binary version, with both inheriting from LogisticRegressionSummary.
Adding a method to a trait is a breaking API change. If a user has implemented some class which extends the trait, then adding a method to the trait will mean the user's class will no longer implement all of the methods it needs to. Marking it sealed will prevent users from extending the trait so that we can add more methods in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be generic summary, not binary one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it makes sense. Sorry about this.
|
Just a few items remain |
|
Test build #40027 has finished for PR 7538 at commit
|
|
@jkbradley I have addressed your comments in the last commit. I have a few last minor questions.
|
I agree it's a bit awkward, but I prefer that to providing null/bad values. The other big choice we could have made when creating spark.ml is separate binary and multiclass algorithms, but that would have created a bunch of copied APIs.
I don't think def and val look different from Java. The Scala compiler creates both as methods, so they should appear to be the same for the Java and Scala APIs. LGTM. Thanks for iterating through updates with me! I'll merge this with master and branch-1.5 |
I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression with LogisticRegressionTrainingSummary and LogisticRegressionSummary I have some queries and asked them inline. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7538 from MechCoder/log_reg_stats and squashes the following commits: 2e9f7c7 [MechCoder] Change defs into lazy vals d775371 [MechCoder] Clean up class inheritance 9586125 [MechCoder] Add abstraction to handle Multiclass Metrics 40ad8ef [MechCoder] minor 640376a [MechCoder] remove unnecessary dataframe stuff and add docs 80d9954 [MechCoder] Added tests fbed861 [MechCoder] DataFrame support for metrics 70a0fc4 [MechCoder] [SPARK-9112] [ML] Implement Stats for LogisticRegression (cherry picked from commit c5c6ade) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
|
All right, the reason I thought it would be different is that in the last commit (MechCoder@2e9f7c7#diff-1747fe912f0ee426f29b3613e6b0a197R156) just doing |
|
Oh, I see. That's because Scala can have method calls without parentheses, whereas Java requires the parentheses. |
|
Yes, that's what I had meant. Would that be okay? |
|
Yes, that's fine since the same method works in both languages. |
|
Should I open a JIRA to refactor again into a general |
|
Sure, the refactoring sounds great, thanks! Please link to the R-like stats for models JIRA. |
I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression with LogisticRegressionTrainingSummary and LogisticRegressionSummary
I have some queries and asked them inline.