
[SPARK-37178][ML] Add Target Encoding to ml.feature #48347

Closed · wants to merge 18 commits

Conversation


@rebo16v commented Oct 4, 2024:

What changes were proposed in this pull request?

Adds support for target encoding of ml features.
Target Encoding maps a column of categorical indices into a numerical feature derived from the target.
By leveraging the relationship between categorical variables and the target variable, target encoding usually performs better than one-hot encoding, while avoiding the need to add extra columns.

Why are the changes needed?

Target Encoding is a well-known encoding technique for categorical features.
It is supported in most ML frameworks:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
https://search.r-project.org/CRAN/refmans/dataPreparation/html/target_encode.html

Does this PR introduce any user-facing change?

The Spark API now includes two new classes in package org.apache.spark.ml.feature (a usage sketch follows the list):

  • TargetEncoder (estimator)
  • TargetEncoderModel (transformer)
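
For illustration, putting the two classes together might look like this (a sketch based on the parameters described in this PR; trainingDf and the column names are made up):

```scala
import org.apache.spark.ml.feature.TargetEncoder

// Hypothetical usage sketch; trainingDf and the column names are made up.
val encoder = new TargetEncoder()
  .setInputCols(Array("categoryIndex"))    // categorical indices, e.g. from StringIndexer
  .setOutputCols(Array("categoryEncoded"))
  .setLabelCol("label")
  .setTargetType("binary")                 // or "continuous"
  .setHandleInvalid("keep")                // or "error"
  .setSmoothing(10.0)

val model = encoder.fit(trainingDf)        // TargetEncoderModel
val encoded = model.transform(trainingDf)
```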

How was this patch tested?

Scala => org.apache.spark.ml.feature.TargetEncoderSuite
Java => org.apache.spark.ml.feature.JavaTargetEncoderSuite
Python => pyspark.ml.tests.test_feature.FeatureTests (added 2 tests)

Was this patch authored or co-authored using generative AI tooling?

No

Some design notes (a rough sketch of the fitting statistics follows the list):

  • binary and continuous target types (no multi-label yet)

  • available in Scala, Java and Python APIs

  • fitting implemented on RDD API (treeAggregate)

  • transformation implemented on Dataframe API (no UDFs)

  • categorical features must be indices (integers) in Double-typed columns (as if StringIndexer were used before)

  • unseen categories in training are represented as class -1.0

  • Encodings structure

    • Map[String, Map[Double, Double]] => Map[ feature_name, Map[ original_category, encoded_category ] ]
  • Parameters

    • inputCol(s) / outputCol(s) / labelCol => as usual
    • targetType
      • binary => encodings calculated as in-category conditional probability (counting)
      • continuous => encodings calculated as in-category target mean (incrementally)
    • handleInvalid
      • error => raises an error if trying to encode an unseen category
      • keep => encodes an unseen category with the overall statistics
    • smoothing => controls how in-category stats and overall stats are weighted to calculate final encodings (to avoid overfitting)
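
As a rough sketch of the fitting statistics described in these notes (simplified, illustrative code, not the PR's actual implementation; the per-partition maps would be merged pairwise in treeAggregate's combine step):

```scala
// Simplified, illustrative sketch (not the PR's actual code) of the
// per-category statistics: for each feature, category -> (count, stat).
//   binary:     stat counts rows with label == 1.0, so the in-category
//               conditional probability is stat / count
//   continuous: stat is the running mean of the label in the category
type CategoryStats = Map[Double, (Double, Double)]

def updateBinary(stats: CategoryStats, category: Double, label: Double): CategoryStats = {
  val (count, positives) = stats.getOrElse(category, (0.0, 0.0))
  stats + (category -> (count + 1, positives + (if (label == 1.0) 1.0 else 0.0)))
}

def updateContinuous(stats: CategoryStats, category: Double, label: Double): CategoryStats = {
  val (count, mean) = stats.getOrElse(category, (0.0, 0.0))
  // incremental mean: newMean = mean + (label - mean) / (count + 1)
  stats + (category -> (count + 1, mean + (label - mean) / (count + 1)))
}
```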

@srowen (Member) left a comment:

Good start, just need to clarify the implementation and consider supporting a few more cases

@@ -855,6 +855,46 @@ for more details on the API.

</div>

## TargetEncoder

Target Encoding maps a column of categorical indices into a numerical feature derived from the target. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
Member:

Let's drop at least a link to information on what target encoding is here.
Also, the explanation you give in the PR about what this actually does to which types of input is valuable and should probably be here too, either here or below in discussion of what the parameters do in some detail.

Author:

I think it's OK now. What do you think?

feature => {
try {
val field = schema(feature)
if (field.dataType != DoubleType) {
Member:

Do the features have to be floats? I'd imagine they aren't if they're categorical representations you're encoding. I think it's OK to demand they're not strings and are already passed through StringIndexer in that case, but it feels like any numeric type works here

Author:

I mimicked this behavior from other encoders (i.e. OneHotEncoder).
What would be your approach? Accepting integers? Checking for a nominal attribute in the metadata?

Member:

I think that's OK if it's what other encoders do. But I see checks like https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L93 - maybe follow that?

Author:

You're right.
Now accepting any subclass of NumericType for features & label
(maybe it doesn't make much sense in the continuous case, but it can be done anyway).

validateSchema(dataset.schema, fitting = true)

val stats = dataset
.select(ArraySeq.unsafeWrapArray(
Member:

Is the ArraySeq business necessary? You're just selecting columns with the `: _*` syntax, so any Seq would do.

Author:

It doesn't work in Scala 2.13:
"Passing an explicit array value to a Scala varargs method is deprecated (since 2.13.0) and will result in a defensive copy; Use the more efficient non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call"

Member:

OK, I'd say .toIndexedSeq is simpler

Author:

You're right. Done.

globalCounter._2 + ((label - globalCounter._2) / (1 + globalCounter._1))))
}
} catch {
case e: SparkRuntimeException =>
Member:

Indent

Author:

got resolved in the overall refactor

case e: SparkRuntimeException =>
if (e.getErrorClass == "ROW_VALUE_IS_NULL") {
throw new SparkException(s"Null value found in feature ${inputFeatures(feature)}." +
s" See Imputer estimator for completing missing values.")
Member:

It seems like you can still target-encode null; it's just another possible value, no?

Author:

Yes, it will be encoded as an unseen category (global statistics).
We could raise an error (as we do while fitting).

Member:

But this throws an exception? How is it handled as unseen but also raises an exception? It shouldn't, right?

Author:

Actually, it raises an exception while fitting and encodes as unseen category while transforming.
I'll check scikit-learn behavior.

Member:

Right, I mean while fitting. I don't feel like this is necessary

Author:

Following the scikit-learn approach, now treating null as another category
(category becomes Option[Double])
encodings: Map[String, Map[Option[Double], Double]]

val value = row.getDouble(feature)
if (value < 0.0 || value != value.toInt) throw new SparkException(
s"Values from column ${inputFeatures(feature)} must be indices, but got $value.")
val counter = agg(feature).getOrElse(value, (0.0, 0.0))
Member:

Use val (foo, bar) = syntax so you don't have to use more cryptic ._1 references later

Author:

done
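
For reference, the suggested idiom looks like this (toy values, not the PR's code):

```scala
// Toy per-category stats: category -> (count, stat)
val agg: Map[Double, (Double, Double)] = Map(1.0 -> (3.0, 2.0))
val value = 1.0

// Positional access forces cryptic ._1 / ._2 references later:
val counter = agg.getOrElse(value, (0.0, 0.0))
val bumped = (counter._1 + 1, counter._2)

// Destructuring gives readable names up front:
val (count, stat) = agg.getOrElse(value, (0.0, 0.0))
val bumpedNamed = (count + 1, stat)
```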

val globalCounter = agg(feature).getOrElse(TargetEncoder.UNSEEN_CATEGORY, (0.0, 0.0))
$(targetType) match {
case TargetEncoder.TARGET_BINARY =>
if (label == 1.0) agg(feature) +
Member:

These if-else clauses need to be indented and with braces around them for clarity

Author:

done

})(
(agg, row: Row) => {
val label = row.getDouble(inputFeatures.length)
Range(0, inputFeatures.length).map {
Member:

(1 until inputFeatures.length) feels a little more idiomatic, or even for ... yield

Author:

Finally changed to inputFeatures.indices.

val values = agg1(feature).keySet ++ agg2(feature).keySet
values.map(value =>
value -> {
val stat1 = agg1(feature).getOrElse(value, (0.0, 0.0))
Member:

Same, let's give names to the elements of this tuple
A comment or two in these blocks about what this sum is doing would help too

Author:

done

rest
.foldLeft(when(col === first._1, first._2))(
(new_col: Column, encoding) =>
if (encoding._1 != TargetEncoder.UNSEEN_CATEGORY) {
Member:

And same again around here - some comments and more descriptive var names are important, as I have trouble evaluating the logic

Author:

done

Map.empty[Option[Double], (Double, Double)]
})(
(agg, row: Row) => {
val label = label_type match {
Author:

Didn't work yet on handling null labels.
I checked scikit-learn and it fails at this (encoding everything to NaN).
We could:

  1. raise an exception
  2. drop the observation and keep going

What do you think?

Member:

I see. I guess I think it's most sensible to ignore nulls then

Author:

done

`TargetEncoder` supports the `targetType` parameter to choose the label type when fitting data, affecting how statistics are calculated.
Available options include 'binary' and 'continuous' (mean-encoding).
When set to 'binary', encodings will be fitted from target conditional probabilities (a.k.a. bin-counting).
When set to 'continuous', encodings will be fitted according to the target mean (a.k.a. mean-encoding).
Member:

I think you can describe this a little bit more somewhere, could be here or at the top - what does target encoding actually do? with a simplistic example of a few rows?

Just want to make it immediately clear in one paragraph what this is doing for binary vs continuous targets.

Author:

done
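
For the record, a tiny worked example of the two modes (illustrative numbers, not taken from the PR): given rows (category, label) = (a, 1), (a, 0), (b, 1), (b, 1), the 'binary' encoding of a is the in-category conditional probability P(label = 1 | a) = 1/2 = 0.5 and of b is 2/2 = 1.0, each then smoothed toward the global rate 3/4 = 0.75. With a 'continuous' target, the per-category target means and the global mean play the same roles.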

case (cat, (class_count, class_stat)) => cat -> {
val weight = class_count / (class_count + $(smoothing))
$(targetType) match {
case TargetEncoder.TARGET_BINARY =>
Member:

This all might be worth a few lines of comments explaining the math here

Author:

done
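
To make the math concrete, here is a minimal sketch of the smoothing blend (hypothetical numbers; smoothing is the Param discussed above):

```scala
// Hypothetical numbers for one category of one feature.
val categoryCount = 40.0 // observations seen in this category
val categoryStat  = 0.9  // in-category conditional probability or mean
val globalStat    = 0.75 // overall (prior) statistic
val smoothing     = 10.0 // the smoothing Param

// The more data a category has, the closer weight is to 1 and the less its
// estimate is shrunk toward the global statistic (which limits overfitting).
val weight   = categoryCount / (categoryCount + smoothing)       // 0.8
val encoding = weight * categoryStat + (1 - weight) * globalStat // 0.87
```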

@srowen (Member) commented Oct 19, 2024:

Let me call in @zhengruifeng for a look at this too. I think it's pretty good

@Since("4.0.0")
val targetType: Param[String] = new Param[String](this, "targetType",
"How to handle invalid data during transform(). " +
"Options are 'keep' (invalid data presented as an extra categorical feature) " +
Contributor:

targetType's description is the same as handleInvalid's?

Author:

Oops! Fixed for targetType & smoothing.

override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
"How to handle invalid data during transform(). " +
"Options are 'keep' (invalid data presented as an extra categorical feature) " +
"or error (throw an error). Note that this Param is only used during transform; " +
Contributor:

Suggested change:
- "or error (throw an error). Note that this Param is only used during transform; " +
+ "or 'error' (throw an error). Note that this Param is only used during transform; " +

Author:

done

Comment on lines 85 to 86
private[feature] def validateSchema(schema: StructType,
fitting: Boolean): StructType = {
Contributor:

Suggested change:
- private[feature] def validateSchema(schema: StructType,
-     fitting: Boolean): StructType = {
+ private[feature] def validateSchema(
+     schema: StructType,
+     fitting: Boolean): StructType = {

Author:

done

case ShortType => row.getShort(inputFeatures.length).toDouble
case IntegerType => row.getInt(inputFeatures.length).toDouble
case LongType => row.getLong(inputFeatures.length).toDouble
case DoubleType => row.getDouble(inputFeatures.length)
Contributor:

I would suggest making the casting happen before the aggregation (the dataset.select above) to simplify the process.

Author:

done

}.toArray)

// encodings: Map[feature, Map[Some(category), encoding]]
val encodings: Map[String, Map[Option[Double], Double]] =
Contributor:

I feel this computation is not very complex and could maybe be implemented with SQL functions,
but I am also fine to start with an RDD implementation.

dataset.withColumns(
inputFeatures.zip(outputFeatures).map {
feature =>
feature._2 -> (encodings.get(feature._1) match {
Contributor:

The model coefficient encodings stores the column names (inputCols) used in fit; does encodings.get(feature._1) require that inputCols in transform be exactly the same as inputCols in fit?

Author:

You're right. Fixed.
Now encodings: Array[Map[Some(category), encoding]]

@zhengruifeng (Contributor):
Also cc @WeichenXu123 for visibility.

@rebo16v (Author) commented Oct 21, 2024:

I think we should pass raw estimates to the model and calculate encodings in transform(),
so we can apply different smoothing factors without having to re-fit.
Makes sense? Will work on this...

val encodings: Array[Map[Option[Double], Double]] =

@rebo16v (Author) commented Oct 23, 2024:

> I think we should pass raw estimates to the model and calculate encodings in transform(), so we can apply different smoothing factors without having to re-fit. Makes sense? Will work on this...

> val encodings: Array[Map[Option[Double], Double]] =

done!
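
For context, a sketch of why keeping raw statistics enables this (hypothetical, simplified code, not the PR's actual signature):

```scala
// Because the model keeps raw (count, stat) pairs per category instead of
// final encodings, encodings can be recomputed for any smoothing factor at
// transform time without re-fitting. Hypothetical, simplified sketch:
def encode(stats: Map[Double, (Double, Double)],
           globalStat: Double,
           smoothing: Double): Map[Double, Double] =
  stats.map { case (category, (count, stat)) =>
    val weight = count / (count + smoothing)
    category -> (weight * stat + (1 - weight) * globalStat)
  }
```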

@rebo16v (Author) commented Oct 27, 2024:

@srowen @zhengruifeng


}

test("TargetEncoder - null label") {
@zhengruifeng (Contributor) commented Oct 28, 2024:

How does it handle NaN? Treated as a normal value or an invalid value?

@rebo16v (Author) commented Oct 28, 2024:

  • NaN features => invalid (only accepting indices and null)
  • NaN labels => was failing, it's fixed now (observation not considered)

@zhengruifeng (Contributor) commented Nov 7, 2024:

>   • NaN features => invalid (only accepting indices and null)
>   • NaN labels => was failing, it's fixed now (observation not considered)

hi @rebo16v

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#targetencoder

> TargetEncoder considers missing values, such as np.nan or None, as another category and encodes them like any other category. Categories that are not seen during fit are encoded with the target mean, i.e. target_mean_.

It seems scikit-learn's implementation also treats NaN as a valid missing value?

else if (isSet(outputCols)) $(outputCols)
else inputFeatures.map{field: String => s"${field}_indexed"}

private[feature] def validateSchema(
Contributor:

Would you mind checking the Scala style according to the Code style guide section in https://spark.apache.org/contributing.html?

Author:

fixed

@rebo16v (Author) commented Oct 28, 2024:

@zhengruifeng

@srowen (Member) commented Nov 2, 2024:

I think it looks good. There are 'failing' tests but it looks like a timeout. I'll run again to see if they complete. Anyone know about issues with the builder at the moment?

@Since("4.0.0")
class TargetEncoderModel private[ml] (
@Since("4.0.0") override val uid: String,
@Since("4.0.0") val stats: Array[Map[Option[Double], (Double, Double)]])
Contributor:

nit: is it possible to avoid the usage of Option in the model coefficient?

e.g.

val stats: Array[Map[Double, (Double, Double)]], # for valid values;
val statForInvalid: (Double, Double), # for invalid values;

Author:

None is reserved for the null category.
It's possible to avoid it by reserving another double value (e.g. -2).

Author:

done
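
Presumably the Option-free layout ends up looking something like this (a sketch: UNSEEN_CATEGORY = -1.0 matches the design notes above, while NULL_CATEGORY = -2.0 is the value the author suggests and should be treated as an assumption):

```scala
val UNSEEN_CATEGORY = -1.0 // per the design notes: unseen values get global stats
val NULL_CATEGORY   = -2.0 // assumed sentinel for the null category

// One map per input feature: category -> (count, stat)
val stats: Array[Map[Double, (Double, Double)]] = Array(
  Map(
    0.0             -> (40.0, 0.9),
    1.0             -> (60.0, 0.6),
    NULL_CATEGORY   -> (5.0, 0.4),
    UNSEEN_CATEGORY -> (105.0, 0.7)))
```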

@HyukjinKwon (Member) left a comment:

The failed test passes locally. Let's merge this and see if the test failure persists.

@rebo16v (Author) commented Nov 6, 2024:

@HyukjinKwon @zhengruifeng

@HyukjinKwon (Member):
Merged to master.
