Conversation

@MLnick (Contributor) commented Jul 3, 2017

This PR adds a FeatureHasher transformer, modeled on scikit-learn and Vowpal Wabbit.

The transformer operates on multiple input columns in one pass. Current behavior is:

  • for numeric columns, the values are assumed to be real-valued: the feature index is hash(column_name) and the feature value is the column's value
  • for string columns, the values are assumed to be categorical: the feature index is hash(column_name=feature_value) and the feature value is 1.0
  • on hash collisions, feature values are summed
  • null (missing) values are ignored

The following dataframe illustrates the basic semantics:

+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
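
For context, a minimal usage sketch of the transformer (assuming a SparkSession `spark`; the exact indices and values depend on the hash function, so the output need not match the table above byte-for-byte):

import org.apache.spark.ml.feature.FeatureHasher

// Toy data matching the example above.
val df = spark.createDataFrame(Seq(
  (3, 4.0, 5.0f, "1", "foo"),
  (6, 7.0, 8.0f, "2", "bar")
)).toDF("int", "double", "float", "stringNum", "string")

// Hash all five input columns into a single 16-dimensional sparse vector.
val hasher = new FeatureHasher()
  .setInputCols(Array("int", "double", "float", "stringNum", "string"))
  .setOutputCol("features")
  .setNumFeatures(16)

hasher.transform(df).show(false)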

How was this patch tested?

New unit tests and manual experiments.

@SparkQA commented Jul 3, 2017

Test build #79092 has finished for PR 18513 at commit 9edb3bd.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor, Author) commented Jul 3, 2017

Note 1: this is distinct from HashingTF, which handles vectorizing text to term frequencies (analogous to scikit-learn's HashingVectorizer). This feature hasher could be extended to also handle Seq[String] input columns, but I feel it conflates concerns - e.g. HashingTF handles minimum term frequencies, binarization, etc.

However we could later add basic support for Seq[String] columns - this would handle raw text in a similar way to Vowpal Wabbit, i.e. it all gets hashed into one feature vector (can be combined with namespaces later).

Note 2: some potential follow ups:

  • support specifying categorical columns explicitly. This would allow forcing some columns that are in numeric format to be treated as categorical. Strings would still be treated as categorical.
  • support using the sign of the hashed value as the sign of the feature value, and then support a non_negative param (see scikit-learn; a sketch of the idea follows this list)
  • support feature namespaces and feature interactions similar to Vowpal Wabbit (see here for an outline of the code used). This could provide an efficient and scalable form of PolynomialExpansion.
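
To illustrate the signed-hashing idea in the second bullet, a hypothetical sketch (not this PR's implementation; `.##` is just a stand-in for the real murmur3 hash):

// Hypothetical: use the sign of the hash as the sign of the feature value,
// so colliding features tend to cancel out rather than accumulate.
def hashWithSign(term: String, numFeatures: Int): (Int, Double) = {
  val h = term.##  // stand-in hash
  val index = ((h % numFeatures) + numFeatures) % numFeatures  // non-negative bucket
  val sign = if (h >= 0) 1.0 else -1.0
  (index, sign)
}
// A non_negative param (as in scikit-learn) would then take absolute values
// of the summed vector entries afterwards.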

cc @srowen @jkbradley @sethah @hhbyyh @yanboliang @BryanCutler @holdenk

@MLnick (Contributor, Author) commented Jul 3, 2017

I've moved the HashingTF numFeatures param to sharedParams, which results in the MiMa failure since it would now be marked final. I can't quite recall what we've done previously in this case - whether we accept that it breaks user code (in most cases users should not really have been extending or overriding these params), or whether we leave it as is.

I'm ok with the latter - numFeatures is not really that necessary to be a shared param.

@hhbyyh (Contributor) left a comment

This should be useful for reducing dimensionality.

Agree that numFeatures does not need to be in sharedParams.

I understand this may be WIP; several general comments.

import org.apache.spark.util.Utils
import org.apache.spark.util.collection.OpenHashMap


Contributor:

comment

Contributor (Author):

Yup, forgot that!


/** @group setParam */
@Since("2.3.0")
def setNumFeatures(value: Int): this.type = set(numFeatures, value)
Contributor:

need a way to know the default value.

Contributor (Author):

Not sure what you mean exactly

override def transformSchema(schema: StructType): StructType = {
val fields = schema($(inputCols).toSet)
require(fields.map(_.dataType).forall { case dt =>
dt.isInstanceOf[NumericType] || dt.isInstanceOf[StringType]
Contributor:
require message

val field = dataset.schema(colName)
field.dataType match {
case DoubleType | StringType => dataset(field.name)
case _: NumericType | BooleanType => dataset(field.name).cast(DoubleType).alias(field.name)
Contributor:

Is it possible to avoid casting to Double, since one key target of feature hashing is reducing memory usage?

Contributor (Author):

Fair point, have updated to handle this.
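
For reference, one way to avoid the upfront cast is to keep each column in its native type and convert per value at hashing time; a rough sketch (the Boolean mapping mirrors the BooleanType cast above):

// Sketch: convert a single raw value to Double only when it is hashed,
// instead of casting whole columns to DoubleType up front.
def getDouble(x: Any): Double = x match {
  case n: java.lang.Number => n.doubleValue()
  case b: Boolean          => if (b) 1.0 else 0.0
  case other =>
    throw new IllegalArgumentException(s"Unsupported value type ${other.getClass}")
}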


implicit val vectorEncoder = ExpressionEncoder[Vector]()

test("params") {
Contributor:

Maybe add a test for Unicode column names (e.g. Chinese, "中文").
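
A sketch of such a test (assuming the suite's implicits for toDF are imported; names are illustrative):

test("hashing supports Unicode column names") {
  // Column name containing non-ASCII (Chinese) characters.
  val df = Seq((1.0, "foo"), (2.0, "bar")).toDF("中文", "string")
  val hashed = new FeatureHasher()
    .setInputCols(Array("中文", "string"))
    .setOutputCol("features")
    .setNumFeatures(16)
    .transform(df)
  assert(hashed.select("features").count() === 2)
}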

@MLnick (Contributor, Author) commented Jul 12, 2017

@hhbyyh thanks for the comments. Have updated accordingly.

Thought about it, and while numFeatures could be shared, it's only used by 2 transformers, so to avoid any binary compat issues I backed out the shared param version.

@SparkQA commented Jul 12, 2017

Test build #79558 has finished for PR 18513 at commit b580a5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Jul 14, 2017

Just to clarify:

  • If I want to treat a column as categorical that is represented by integers, I'd have to map those integers to strings, right? I believe that's one of your bullets above.
  • This is effectively going to one-hot encode categorical columns, which will create linearly dependent columns since there is no parameter to drop the last column. Maybe there's a good solution, but I don't think we have to address it here. Just wanted to check.

@sethah (Contributor) left a comment

Nice PR! The tests are great. Only minor comments.

.setNumFeatures(n)
val output = hasher.transform(df)
val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
require(attrGroup.numAttributes === Some(n))
Contributor:

make this an assert


val hashFeatures = udf { row: Row =>
val map = new OpenHashMap[Int, Double]()
$(inputCols).foreach { case colName =>
Contributor:

case does nothing here

Contributor:

also, I think you'll serialize the entire object here by using $(inputCols). Maybe you can make a local pointer to it before the udf.
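
A sketch of the suggested fix: bind the param values to local vals before defining the udf, so the closure captures only those small values rather than the whole transformer (inner hashing logic elided):

// Local copies avoid serializing the FeatureHasher instance itself.
val localInputCols = $(inputCols)
val localNumFeatures = $(numFeatures)
val hashFeatures = udf { row: Row =>
  val map = new OpenHashMap[Int, Double]()
  localInputCols.foreach { colName =>
    // ... hash the column name/value and update `map` as before ...
  }
  Vectors.sparse(localNumFeatures, map.toSeq)
}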

Contributor (Author):

Ah thanks - this was left over from a previous code version


override def transformSchema(schema: StructType): StructType = {
val fields = schema($(inputCols).toSet)
fields.foreach { case fieldSchema =>
Contributor:

case does nothing

Contributor (Author):

Again, think it was left over from some previous version, will update


import HashingTFSuite.murmur3FeatureIdx

implicit val vectorEncoder = ExpressionEncoder[Vector]()
Contributor:

private

hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
}

override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
Contributor:

@Since tags on all public methods (copy, transformSchema, transform)

@hhbyyh (Contributor) left a comment

I'm a little worried that categorical data will be overwhelmed by the Double values in the case of hash collisions.
If it helps, we could divide the output vector space into two parts. Since all the real-valued columns are mapped to the same feature index, we could just reserve a certain range for the real values. E.g., if there are 5 categorical values and 3 Double values, and numFeatures=100, then we could reserve the last 3 indices (97, 98, 99) for Double values and map the categorical values into the first 97 indices. But I guess there's a problem when there are a lot of Double columns and the user just wants to combine and shrink them.

Another option is to just document this and demo it in the example code: users can use one FeatureHasher for categorical values and another for Doubles, and then assemble the output features (see the sketch below).

Just bringing up the idea, and sorry it's a little late. Other parts LGTM.
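
A sketch of that second option, using the example columns from the PR description (not part of this PR):

import org.apache.spark.ml.feature.{FeatureHasher, VectorAssembler}

// Hash categorical and real-valued columns separately so the two groups
// cannot collide with each other, then assemble the resulting vectors.
val catHasher = new FeatureHasher()
  .setInputCols(Array("stringNum", "string"))
  .setOutputCol("catFeatures")

val realHasher = new FeatureHasher()
  .setInputCols(Array("int", "double", "float"))
  .setOutputCol("realFeatures")

val assembled = new VectorAssembler()
  .setInputCols(Array("catFeatures", "realFeatures"))
  .setOutputCol("features")
  .transform(realHasher.transform(catHasher.transform(df)))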

f.dataType.isInstanceOf[NumericType]
}.map(_.name).toSet

def getDouble(x: Any): Double = {
Contributor:

maybe val getDouble...

Contributor (Author):

why?

Contributor (Author):

Hmm, this is a method not a function - so I don't think it will be faster to do val in this case?

val metadata = outputSchema($(outputCol)).metadata
dataset.select(
col("*"),
hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
Contributor:

.map(col)

@MLnick (Contributor, Author) commented Jul 17, 2017

@hhbyyh can you elaborate on your concerns in comment #18513 (review)?

I tend to agree that the hasher is perhaps best used for categorical features, while known real features could be "assembled" onto the resulting hashed feature vector. However, one nice thing about hashing is that it can handle everything at once in one pass. In practice, even with very high-cardinality categorical features and some real features, for "normal" settings of hash bits the collision rate is relatively low and has very little impact on performance (at least in my experiments). Of course this assumes highly sparse data - if the data is not sparse then it's usually best to use other mechanisms.

@MLnick (Contributor, Author) commented Jul 17, 2017

@sethah thanks for reviewing.

For the 1st question:

Yes, currently categorical columns that are numeric would need to be explicitly encoded as strings. I mentioned it as a follow-up improvement. It's easy to handle; it's just the API I'm not certain of yet. Here are the two options I see:

  1. The user can specify a param categoricalCols to explicitly set the categorical columns. But do we then assume that all other string columns are categorical, i.e. that this param is effectively only for numeric columns that must be treated as categorical? Or do we ignore all other non-numeric columns? etc.
  2. The user can specify a param realCols to explicitly set the numeric columns. All other columns are treated as categorical.

We could potentially offer both formats, though I tend to gravitate towards (2) above, since the default use case will be encoding many (usually high-cardinality) categorical columns, with maybe a few real columns in there (hypothetical sketch below).
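
A hypothetical sketch of option (2) - setRealCols does not exist in this PR and is shown for illustration only:

// Hypothetical API: explicitly mark the real-valued columns; every other
// input column (numeric or string) would be treated as categorical.
val hasher = new FeatureHasher()
  .setInputCols(Array("int", "double", "stringNum", "string"))
  .setRealCols(Array("int", "double"))  // hypothetical param
  .setOutputCol("features")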

For the second issue:

There is no way (at least that I know of) to provide a dropLast feature, since we don't know how many features there are - the whole point of hashing is to avoid keeping the feature <-> index mapping, for speed and memory efficiency.

@SparkQA commented Jul 18, 2017

Test build #79699 has finished for PR 18513 at commit 990b816.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Jul 18, 2017

Let's make sure to create doc and python JIRAs before this gets merged btw.

@MLnick (Contributor, Author) commented Jul 19, 2017

@hhbyyh (Contributor) left a comment

Thanks for the reply. The PR looks good to me.

* to map features to indices in the feature vector.
*
* The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
* (representing a real feature) or string (representing a categorical feature). Boolean columns
Contributor:

It might be good to make the behavior for each type of column clearer here, specifically for numeric columns that are meant to be categorical. Something like:

/**
 * Behavior
 *  -Numeric columns: For numeric features, the hash value of the column name is used to map the
 *                    feature value to its index in the feature vector. Numeric features are never
 *                    treated as categorical, even when they are integers. You must convert
 *                    categorical columns to strings first.
 *  -String columns: ...
 *  -Boolean columns: ...
 */

Anyway, this is a very minor suggestion and I think it's also ok to leave as is.

@sethah (Contributor) commented Jul 21, 2017

LGTM!

@MLnick (Contributor, Author) commented Jul 25, 2017

Thanks @sethah @hhbyyh for the review. I updated the behavior doc string as suggested.

Any other comments? cc @srowen @jkbradley @yanboliang

@SparkQA commented Jul 25, 2017

Test build #79934 has finished for PR 18513 at commit a91b53f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2017

Test build #79961 has finished for PR 18513 at commit d6a3117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment

I originally had the same concern as @hhbyyh that categorical data would be overwhelmed by the Double values in the case of hash collisions, but the comments convinced me. This looks pretty good now. Thanks.

s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " +
s"Column $fieldName was $dataType")
}
val attrGroup = new AttributeGroup($(outputCol), $(numFeatures))
Contributor:

It seems that we don't store Attributes in the AttributeGroup here, but we do in VectorAssembler, and both FeatureHasher and VectorAssembler can be followed by ML algorithms directly. I'd like to confirm: is this intentional? I understand it may be due to performance considerations, and users may not be interested in the attributes of hashed features. We can leave it as is until we find it affects some scenarios.

Contributor (Author):

Feature hashing doesn't keep the feature -> idx mapping for memory efficiency, so by extension it won't keep attribute info. This is by design, and the tradeoff is speed & efficiency vs. not being able to do the reverse mapping (or knowing the cardinality of each feature, for example).

If users want to keep the mapping & attribute info, then of course they can just use one-hot encoding and vector assembler.

Contributor:

@MLnick Thanks for clarifying.

@MLnick (Contributor, Author) commented Aug 16, 2017

jenkins retest this please

@SparkQA commented Aug 16, 2017

Test build #80724 has finished for PR 18513 at commit d6a3117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor, Author) commented Aug 16, 2017

Merged to master. Thanks all for reviews.

@asfgit closed this in 0bb8d1f on Aug 16, 2017