Conversation

@MLnick (Contributor) commented Jul 3, 2017

This PR adds a FeatureHasher transformer, modeled on scikit-learn and Vowpal Wabbit.

The transformer operates on multiple input columns in one pass. Current behavior is:

  • for numeric columns, the values are assumed to be real-valued: the feature index is hash(column_name) and the feature value is the column's value
  • for string columns, the values are assumed to be categorical: the feature index is hash(column_name=feature_value) and the feature value is 1.0
  • on hash collisions, feature values are summed
  • null (missing) values are ignored

The following dataframe illustrates the basic semantics:

+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
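
For context, a minimal usage sketch of the transformer (assuming a SparkSession `spark`; the exact indices and values depend on the hash function, so the output need not match the table above byte-for-byte):

import org.apache.spark.ml.feature.FeatureHasher

// Toy data matching the example above.
val df = spark.createDataFrame(Seq(
  (3, 4.0, 5.0f, "1", "foo"),
  (6, 7.0, 8.0f, "2", "bar")
)).toDF("int", "double", "float", "stringNum", "string")

// Hash all five input columns into a single 16-dimensional sparse vector.
val hasher = new FeatureHasher()
  .setInputCols(Array("int", "double", "float", "stringNum", "string"))
  .setOutputCol("features")
  .setNumFeatures(16)

hasher.transform(df).show(false)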

How was this patch tested?

New unit tests and manual experiments.

@SparkQA commented Jul 3, 2017

Test build #79092 has finished for PR 18513 at commit 9edb3bd.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor, Author) commented Jul 3, 2017

Note 1: this is distinct from HashingTF, which handles vectorizing text to term frequencies (analogous to scikit-learn's HashingVectorizer). This feature hasher could be extended to also handle Seq[String] input columns, but I feel it conflates concerns - e.g. HashingTF handles minimum term frequencies, binarization, etc.

However we could later add basic support for Seq[String] columns - this would handle raw text in a similar way to Vowpal Wabbit, i.e. it all gets hashed into one feature vector (can be combined with namespaces later).

Note 2: some potential follow ups:

  • support specifying categorical columns explicitly. This would allow forcing some columns that are in numeric format to be treated as categorical. Strings would still be treated as categorical.
  • support using the sign of the hashed value as the sign of the feature value, and then support a non_negative param (see scikit-learn; a sketch of the idea follows this list)
  • support feature namespaces and feature interactions similar to Vowpal Wabbit (see here for an outline of the code used). This could provide an efficient and scalable form of PolynomialExpansion.
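
To illustrate the signed-hashing idea in the second bullet, a hypothetical sketch (not this PR's implementation; `.##` is just a stand-in for the real murmur3 hash):

// Hypothetical: use the sign of the hash as the sign of the feature value,
// so colliding features tend to cancel out rather than accumulate.
def hashWithSign(term: String, numFeatures: Int): (Int, Double) = {
  val h = term.##  // stand-in hash
  val index = ((h % numFeatures) + numFeatures) % numFeatures  // non-negative bucket
  val sign = if (h >= 0) 1.0 else -1.0
  (index, sign)
}
// A non_negative param (as in scikit-learn) would then take absolute values
// of the summed vector entries afterwards.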

cc @srowen @jkbradley @sethah @hhbyyh @yanboliang @BryanCutler @holdenk

@MLnick (Contributor, Author) commented Jul 3, 2017

I've moved the HashingTF numFeatures param to sharedParams, which results in the MiMa failure since it would now be marked final. I can't quite recall what we've done previously in this case - whether we accept that it breaks user code (in most cases users should not really have been extending or overriding these params), or whether we leave it as is.

I'm ok with the latter - numFeatures is not really that necessary to be a shared param.

@hhbyyh (Contributor) left a comment

This should be useful for reducing dimensionality.

Agree that numFeatures does not need to be in sharedParams.

I understand this may be WIP; several general comments.

import org.apache.spark.util.Utils
import org.apache.spark.util.collection.OpenHashMap


Contributor:

comment

Contributor (Author):

Yup, forgot that!


/** @group setParam */
@Since("2.3.0")
def setNumFeatures(value: Int): this.type = set(numFeatures, value)
Contributor:

need a way to know the default value.

Contributor (Author):

Not sure what you mean exactly

override def transformSchema(schema: StructType): StructType = {
val fields = schema($(inputCols).toSet)
require(fields.map(_.dataType).forall { case dt =>
dt.isInstanceOf[NumericType] || dt.isInstanceOf[StringType]
Contributor:
require message

val field = dataset.schema(colName)
field.dataType match {
case DoubleType | StringType => dataset(field.name)
case _: NumericType | BooleanType => dataset(field.name).cast(DoubleType).alias(field.name)
Contributor:

Is it possible to avoid casting to Double, since one key target of feature hashing is reducing memory usage?

Contributor (Author):

Fair point, have updated to handle this.
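
For reference, one way to avoid the upfront cast is to keep each column in its native type and convert per value at hashing time; a rough sketch (the Boolean mapping mirrors the BooleanType cast above):

// Sketch: convert a single raw value to Double only when it is hashed,
// instead of casting whole columns to DoubleType up front.
def getDouble(x: Any): Double = x match {
  case n: java.lang.Number => n.doubleValue()
  case b: Boolean          => if (b) 1.0 else 0.0
  case other =>
    throw new IllegalArgumentException(s"Unsupported value type ${other.getClass}")
}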


implicit val vectorEncoder = ExpressionEncoder[Vector]()

test("params") {
Contributor:

Maybe add a test for Unicode column names (e.g. Chinese, "中文").
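
A sketch of such a test (assuming the suite's implicits for toDF are imported; names are illustrative):

test("hashing supports Unicode column names") {
  // Column name containing non-ASCII (Chinese) characters.
  val df = Seq((1.0, "foo"), (2.0, "bar")).toDF("中文", "string")
  val hashed = new FeatureHasher()
    .setInputCols(Array("中文", "string"))
    .setOutputCol("features")
    .setNumFeatures(16)
    .transform(df)
  assert(hashed.select("features").count() === 2)
}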

@MLnick (Contributor, Author) commented Jul 12, 2017

@hhbyyh thanks for the comments. Have updated accordingly.

Thought about it, and while numFeatures could be shared, it's only used by 2 transformers, so to avoid any binary compat issues I backed out the shared param version.

@SparkQA commented Jul 12, 2017

Test build #79558 has finished for PR 18513 at commit b580a5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Jul 14, 2017

Just to clarify:

  • If I want to treat a column as categorical that is represented by integers, I'd have to map those integers to strings, right? I believe that's one of your bullets above.
  • This is effectively going to one-hot encode categorical columns, which will create linearly dependent columns since there is no parameter to drop the last column. Maybe there's a good solution, but I don't think we have to address it here. Just wanted to check.

@sethah (Contributor) left a comment

Nice PR! The tests are great. Only minor comments.

.setNumFeatures(n)
val output = hasher.transform(df)
val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
require(attrGroup.numAttributes === Some(n))
Contributor:

make this an assert


val hashFeatures = udf { row: Row =>
val map = new OpenHashMap[Int, Double]()
$(inputCols).foreach { case colName =>
Contributor:

case does nothing here

Contributor:

also, I think you'll serialize the entire object here by using $(inputCols). Maybe you can make a local pointer to it before the udf.
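
A sketch of the suggested fix: bind the param values to local vals before defining the udf, so the closure captures only those small values rather than the whole transformer (inner hashing logic elided):

// Local copies avoid serializing the FeatureHasher instance itself.
val localInputCols = $(inputCols)
val localNumFeatures = $(numFeatures)
val hashFeatures = udf { row: Row =>
  val map = new OpenHashMap[Int, Double]()
  localInputCols.foreach { colName =>
    // ... hash the column name/value and update `map` as before ...
  }
  Vectors.sparse(localNumFeatures, map.toSeq)
}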

Contributor (Author):

Ah thanks - this was left over from a previous code version


override def transformSchema(schema: StructType): StructType = {
val fields = schema($(inputCols).toSet)
fields.foreach { case fieldSchema =>
Contributor:

case does nothing

Contributor (Author):

Again, think it was left over from some previous version, will update


import HashingTFSuite.murmur3FeatureIdx

implicit val vectorEncoder = ExpressionEncoder[Vector]()
Contributor:

private

hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
}

override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
Contributor:

@Since tags on all public methods (copy, transformSchema, transform)

@hhbyyh (Contributor) left a comment

I'm a little worried that categorical data will be overwhelmed by the Double values in the case of hash collisions.
If it helps, we could divide the output vector space into two parts. Since all the real-valued columns are mapped to the same feature index, we could just reserve a certain range for the real values. E.g., if there are 5 categorical values and 3 Double values, and numFeatures=100, then we could reserve the last 3 indices (97, 98, 99) for Double values and map the categorical values into the first 97 indices. But I guess there's a problem when there are a lot of Double columns and the user just wants to combine and shrink them.

Another option is to just document this and demo it in the example code: users can use one FeatureHasher for categorical values and another for Doubles, and then assemble the output features (see the sketch below).

Just bringing up the idea, and sorry it's a little late. Other parts LGTM.
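
A sketch of that second option, using the example columns from the PR description (not part of this PR):

import org.apache.spark.ml.feature.{FeatureHasher, VectorAssembler}

// Hash categorical and real-valued columns separately so the two groups
// cannot collide with each other, then assemble the resulting vectors.
val catHasher = new FeatureHasher()
  .setInputCols(Array("stringNum", "string"))
  .setOutputCol("catFeatures")

val realHasher = new FeatureHasher()
  .setInputCols(Array("int", "double", "float"))
  .setOutputCol("realFeatures")

val assembled = new VectorAssembler()
  .setInputCols(Array("catFeatures", "realFeatures"))
  .setOutputCol("features")
  .transform(realHasher.transform(catHasher.transform(df)))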

f.dataType.isInstanceOf[NumericType]
}.map(_.name).toSet

def getDouble(x: Any): Double = {
Contributor:

maybe val getDouble...

Contributor (Author):

why?

Contributor (Author):

Hmm, this is a method not a function - so I don't think it will be faster to do val in this case?

val metadata = outputSchema($(outputCol)).metadata
dataset.select(
col("*"),
hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
Contributor:

.map(col)

@MLnick (Contributor, Author) commented Jul 17, 2017

@hhbyyh can you elaborate on your concerns in comment #18513 (review)?

I tend to agree that the hasher is perhaps best used for categorical features, while known real features could be "assembled" onto the resulting hashed feature vector. However, one nice thing about hashing is that it can handle everything at once in one pass. In practice, even with very high-cardinality categorical features and some real features, for "normal" settings of hash bits the collision rate is relatively low and has very little impact on performance (at least in my experiments). Of course this assumes highly sparse data - if the data is not sparse then it's usually best to use other mechanisms.

@MLnick (Contributor, Author) commented Jul 17, 2017

@sethah thanks for reviewing.

For the 1st question:

Yes, currently categorical columns that are numeric would need to be explicitly encoded as strings. I mentioned it as a follow-up improvement. It's easy to handle; it's just the API I'm not certain of yet. Here are the two options I see:

  1. The user can specify a param categoricalCols to explicitly set the categorical columns. But do we then assume that all other string columns are categorical, i.e. that this param is effectively only for numeric columns that must be treated as categorical? Or do we ignore all other non-numeric columns? etc.
  2. The user can specify a param realCols to explicitly set the numeric columns. All other columns are treated as categorical.

We could potentially offer both formats, though I tend to gravitate towards (2) above, since the default use case will be encoding many (usually high-cardinality) categorical columns, with maybe a few real columns in there (hypothetical sketch below).
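
A hypothetical sketch of option (2) - setRealCols does not exist in this PR and is shown for illustration only:

// Hypothetical API: explicitly mark the real-valued columns; every other
// input column (numeric or string) would be treated as categorical.
val hasher = new FeatureHasher()
  .setInputCols(Array("int", "double", "stringNum", "string"))
  .setRealCols(Array("int", "double"))  // hypothetical param
  .setOutputCol("features")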

For the second issue:

There is no way (at least that I know of) to provide a dropLast feature, since we don't know how many features there are - the whole point of hashing is to avoid keeping the feature <-> index mapping, for speed and memory efficiency.

@SparkQA commented Jul 18, 2017

Test build #79699 has finished for PR 18513 at commit 990b816.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Jul 18, 2017

Let's make sure to create doc and python JIRAs before this gets merged btw.

@MLnick (Contributor, Author) commented Jul 19, 2017

@hhbyyh (Contributor) left a comment

Thanks for the reply. The PR looks good to me.

* to map features to indices in the feature vector.
*
* The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
* (representing a real feature) or string (representing a categorical feature). Boolean columns
Contributor:

It might be good to make the behavior for each type of column clearer here, specifically for numeric columns that are meant to be categorical. Something like:

/**
 * Behavior
 *  -Numeric columns: For numeric features, the hash value of the column name is used to map the
 *                    feature value to its index in the feature vector. Numeric features are never
 *                    treated as categorical, even when they are integers. You must convert
 *                    categorical columns to strings first.
 *  -String columns: ...
 *  -Boolean columns: ...
 */

Anyway, this is a very minor suggestion and I think it's also ok to leave as is.

@sethah (Contributor) commented Jul 21, 2017

LGTM!

@MLnick (Contributor, Author) commented Jul 25, 2017

Thanks @sethah @hhbyyh for the review. I updated the behavior doc string as suggested.

Any other comments? cc @srowen @jkbradley @yanboliang

@SparkQA commented Jul 25, 2017

Test build #79934 has finished for PR 18513 at commit a91b53f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2017

Test build #79961 has finished for PR 18513 at commit d6a3117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment

I originally had the same concern as @hhbyyh that categorical data would be overwhelmed by the Double values in the case of hash collisions, but the comments convinced me. This looks pretty good now. Thanks.

s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " +
s"Column $fieldName was $dataType")
}
val attrGroup = new AttributeGroup($(outputCol), $(numFeatures))
Contributor:

It seems that we don't store Attributes in the AttributeGroup here, but we do in VectorAssembler, and both FeatureHasher and VectorAssembler can be followed by ML algorithms directly. I'd like to confirm: is this intentional? I understand it may be due to performance considerations, and users may not be interested in the attributes of hashed features. We can leave it as is until we find it affects some scenarios.

Contributor (Author):

Feature hashing doesn't keep the feature -> idx mapping for memory efficiency, so by extension it won't keep attribute info. This is by design, and the tradeoff is speed & efficiency vs. not being able to do the reverse mapping (or knowing the cardinality of each feature, for example).

If users want to keep the mapping & attribute info, then of course they can just use one-hot encoding and vector assembler.

Contributor:

@MLnick Thanks for clarifying.

@MLnick (Contributor, Author) commented Aug 16, 2017

jenkins retest this please

@SparkQA commented Aug 16, 2017

Test build #80724 has finished for PR 18513 at commit d6a3117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor, Author) commented Aug 16, 2017

Merged to master. Thanks all for reviews.

@asfgit closed this in 0bb8d1f on Aug 16, 2017