[SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel #28960

huaxingao · 2020-06-30T22:22:43Z

What changes were proposed in this pull request?

Add training summary for FMClassificationModel...

Why are the changes needed?

So that users can get the training process status, such as the loss value of each iteration and the total number of iterations.

Does this PR introduce any user-facing change?

Yes
FMClassificationModel.summary
FMClassificationModel.evaluate

How was this patch tested?

new tests

SparkQA · 2020-06-30T22:30:30Z

Test build #124694 has finished for PR 28960 at commit 0fac436.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-30T22:39:09Z

Test build #124695 has finished for PR 28960 at commit 0a43c5f.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2020-06-30T22:55:51Z

mllib/src/main/scala/org/apache/spark/ml/regression/FMRegressor.scala

 private[ml] trait FactorizationMachinesParams extends PredictorParams
  with HasMaxIter with HasStepSize with HasTol with HasSolver with HasSeed
-  with HasFitIntercept with HasRegParam {
+  with HasFitIntercept with HasRegParam with HasWeightCol {


Add with HasWeightCol because ClassificationSummary uses weightCol. However, FM doesn't really support instance weights yet, so all weights default to 1.0.

huaxingao · 2020-06-30T22:56:02Z

mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala

    }

-    val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
+    val stochasticLossHistory = new ArrayBuffer[Double](numIterations + 1)


Make stochasticLossHistory contain the initial state plus the state for each iteration, so that it is consistent with the objectiveHistory in LogisticRegression and LinearRegression.

huaxingao · 2020-06-30T22:56:18Z

mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala

+            * and regVal is the regularization value computed in the previous iteration as well.
+            */
+          stochasticLossHistory += lossSum / miniBatchSize + regVal
+          if (converged || i == (numIterations + 1)) break


Currently, stochasticLossHistory only contains the initial state plus the states from iterations 1 to n-1, so we need to add the state for the last iteration too. After adding the last state, exit the loop.
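
The loop change described above can be illustrated outside Spark. The following is a minimal pure-Python sketch, not the GradientDescent.scala code: the function name descend_with_history, the scalar weight, the full-batch gradient, and the toy quadratic loss are all assumptions made for illustration. It shows why running numIterations + 1 passes, with the last pass only recording the loss, gives a history that holds the initial state plus one entry per iteration.

```python
from typing import Callable, List, Tuple


def descend_with_history(
    grad_and_loss: Callable[[float], Tuple[float, float]],
    initial_weight: float,
    step_size: float,
    num_iterations: int,
    tol: float = 0.0,
) -> Tuple[float, List[float]]:
    """Hypothetical helper: gradient descent on one scalar weight.

    loss_history[0] is the loss at the initial weight and
    loss_history[k] is the loss after k updates.
    """
    weight = initial_weight
    loss_history: List[float] = []
    converged = False
    i = 1
    while i <= num_iterations + 1:
        # The loss is evaluated at the weight produced by the previous pass.
        grad, loss = grad_and_loss(weight)
        loss_history.append(loss)
        # The extra (num_iterations + 1)-th pass only records the final loss.
        if converged or i == num_iterations + 1:
            break
        new_weight = weight - step_size * grad
        converged = abs(new_weight - weight) < tol
        weight = new_weight
        i += 1
    return weight, loss_history


def quadratic(w: float) -> Tuple[float, float]:
    """Gradient and loss of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0), (w - 3.0) ** 2


final_weight, history = descend_with_history(quadratic, 0.0, 0.25, 4)
print(len(history))  # 5 entries: initial state + 4 iterations
```

Without the extra pass, the same loop would stop after four recorded losses and the loss at the final weight would be missing from the history.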

SparkQA · 2020-06-30T23:11:28Z

Test build #124696 has finished for PR 28960 at commit 35edf01.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-01T01:47:57Z

Test build #124711 has finished for PR 28960 at commit ba3384d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-07-01T13:13:13Z

Looks like it needs a rebase after I merged your other commit

SparkQA · 2020-07-02T01:05:51Z

Test build #124830 has finished for PR 28960 at commit 5b6ecb9.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-07-06T13:27:42Z

Weird, a Python 2 failure?

======================================================================
FAIL: test_fm_classification_summary (pyspark.ml.tests.test_training_summary.TrainingSummaryTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@4/python/pyspark/ml/tests/test_training_summary.py", line 345, in test_fm_classification_summary
    self.assertAlmostEqual(s.weightedTruePositiveRate, 0.5, 2)
AssertionError: 1.0 != 0.5 within 2 places
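
For context on the asserted metric: a weighted true positive rate averages the per-label recall, weighting each label by its share of the instances. The pure-Python sketch below is only an illustration under assumptions: the helper name and the toy labels are hypothetical, every instance has weight 1.0, and this is neither the Spark implementation nor the data used in the failing test.

```python
from collections import Counter
from typing import Sequence


def weighted_true_positive_rate(
    labels: Sequence[int], predictions: Sequence[int]
) -> float:
    """Hypothetical helper: per-label recall weighted by label frequency."""
    label_counts = Counter(labels)
    # Count correct predictions separately for each true label.
    true_positives = Counter(
        label for label, pred in zip(labels, predictions) if label == pred
    )
    total = len(labels)
    return sum(
        (true_positives[label] / count) * (count / total)
        for label, count in label_counts.items()
    )


toy_labels = [1, 0, 1, 0]
all_correct = weighted_true_positive_rate(toy_labels, [1, 0, 1, 0])
always_one = weighted_true_positive_rate(toy_labels, [1, 1, 1, 1])
print(all_correct, always_one)  # 1.0 0.5
```

On this toy data, a model that predicts every label correctly scores 1.0, while one that always predicts a single class scores 0.5, so a small difference in the fitted model can move the metric between those two values.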

huaxingao · 2020-07-06T17:13:30Z

This is a Python 2 failure only; Python 3 is OK. I think I can simply change the test data to get around this, but I found one more problem that I haven't had time to fix yet.
I will be extremely slow over the next couple of weeks. Taking some time off :)

srowen

Looks fine if it doesn't change existing APIs and is just adding more consistent functionality

srowen · 2020-07-12T16:13:15Z

mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala

+        // compute and sum up the subgradients on this subset (this is one map-reduce)
+        val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
+          .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
+            seqOp = (c, v) => {


I forget, can you write stuff like case ((foo, bar, baz), v) => here to avoid all the ._1? I keep thinking it's possible but then I find it isn't.

It seems not. I just tried it, and it doesn't work.

nit: it seems that breakable is not used in Spark (except in two suites):

➜ spark git:(master) ag --scala 'breakable' .
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
2941: breakable {
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
142: breakable {

I am not sure whether it is suitable

Yeah, it's a little unusual unless it significantly simplifies the code. Can !converged be added back to the while condition, and the if (X) break condition below turned into if (!X) { ... code that follows ... }? It should be the same, as i will increment and end the loop right after anyway.

Fixed. Thanks!
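
The restructuring suggested above can be sketched in plain Python. This is an illustration under assumptions, not the merged Scala code: the function name descend_without_break, the scalar weight, and the toy quadratic loss are hypothetical. The convergence flag moves into the loop condition, and the former break test is inverted to guard the update step.

```python
from typing import Callable, List, Tuple


def descend_without_break(
    grad_and_loss: Callable[[float], Tuple[float, float]],
    initial_weight: float,
    step_size: float,
    num_iterations: int,
    tol: float = 0.0,
) -> Tuple[float, List[float]]:
    """Hypothetical helper: same loop shape, with no break statement."""
    weight = initial_weight
    loss_history: List[float] = []
    converged = False
    i = 1
    # Convergence is part of the loop condition, so no break is needed.
    while not converged and i <= num_iterations + 1:
        grad, loss = grad_and_loss(weight)
        loss_history.append(loss)
        # Inverted form of "if i == num_iterations + 1: break":
        # the last pass records the loss and skips the update.
        if i != num_iterations + 1:
            new_weight = weight - step_size * grad
            converged = abs(new_weight - weight) < tol
            weight = new_weight
        i += 1
    return weight, loss_history


def toy_quadratic(w: float) -> Tuple[float, float]:
    """Gradient and loss of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0), (w - 3.0) ** 2


weight_full, history_full = descend_without_break(toy_quadratic, 0.0, 0.25, 4)
weight_early, history_early = descend_without_break(
    toy_quadratic, 0.0, 0.25, 4, tol=10.0
)
print(len(history_full), len(history_early))  # 5 1
```

When the iteration limit ends the run, the history still has numIterations + 1 entries. Note that in this sketch, once converged becomes true the loop stops at the next condition check, so the loss at the converged weight is not appended.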

SparkQA · 2020-07-12T17:10:22Z

Test build #125718 has finished for PR 28960 at commit 77aefd8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-14T07:05:02Z

Test build #125809 has finished for PR 28960 at commit 0767117.

This patch fails due to an unknown error code, -9.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2020-07-14T08:06:28Z

Test build #125810 has finished for PR 28960 at commit 9a58603.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-07-15T14:31:14Z

I think you can go ahead and merge this

huaxingao · 2020-07-15T17:15:54Z

Merged to master. Thanks @srowen @zhengruifeng for reviewing!

probot-autolabeler bot added ML PYTHON labels Jun 30, 2020

huaxingao force-pushed the fm_summary branch from ba3384d to 5b6ecb9 Compare July 2, 2020 00:07

huaxingao added 6 commits July 13, 2020 23:56

[SPARK-32140][ML][PySpark] Add summary to FMClassificationModel

b450a50

remove println

4263515

fix comment format

755b7f1

fix python style error

ec25179

fix MiMa

63a1eec

addres comments

9a58603

huaxingao force-pushed the fm_summary branch from 0767117 to 9a58603 Compare July 14, 2020 06:57

huaxingao closed this in b05f309 Jul 15, 2020

huaxingao deleted the fm_summary branch July 15, 2020 17:15

zero323 mentioned this pull request Jul 18, 2020

[SPARK-32140]Add training summary to FMClassificationModel zero323/pyspark-stubs#440

Closed

huaxingao mentioned this pull request Jul 28, 2020

[SPARK-32310][ML][PySpark] ML params default value parity in feature and tuning #29153

Closed
