15 changes: 12 additions & 3 deletions docs/ml-features.md
@@ -22,10 +22,19 @@ This section covers algorithms for working with features, roughly divided into t

 [Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step. In Spark ML, TF-IDF is separated into two parts: TF (+hashing) and IDF.
 
-**TF**: `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
+
+`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
+fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
+The algorithm combines Term Frequency (TF) counts with the
+[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+
+`CountVectorizer` converts text documents to vectors of term counts. Refer to
+[CountVectorizer](ml-features.html#countvectorizer) for more details.
 
-**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
+**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
+Intuitively, it down-weights columns which appear frequently in a corpus.
 
 Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
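
As context for the down-weighting described above, the linked MLlib guide defines IDF as follows (restated here from that guide), where `|D|` is the number of documents in the corpus and `DF(t, D)` is the number of documents containing term `t`:

$$ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}, \qquad TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D) $$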

2 changes: 2 additions & 0 deletions examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
@@ -63,6 +63,8 @@ public static void main(String[] args) {
       .setOutputCol("rawFeatures")
       .setNumFeatures(numFeatures);
     Dataset<Row> featurizedData = hashingTF.transform(wordsData);
+    // alternatively, CountVectorizer can also be used to get term frequency vectors
+
     IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
     IDFModel idfModel = idf.fit(featurizedData);
     Dataset<Row> rescaledData = idfModel.transform(featurizedData);
2 changes: 2 additions & 0 deletions examples/src/main/python/ml/tf_idf_example.py
@@ -37,6 +37,8 @@
 wordsData = tokenizer.transform(sentenceData)
 hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
 featurizedData = hashingTF.transform(wordsData)
+# alternatively, CountVectorizer can also be used to get term frequency vectors
+
 idf = IDF(inputCol="rawFeatures", outputCol="features")
 idfModel = idf.fit(featurizedData)
 rescaledData = idfModel.transform(featurizedData)
2 changes: 2 additions & 0 deletions examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala
@@ -43,6 +43,8 @@ object TfIdfExample {
     val hashingTF = new HashingTF()
       .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
     val featurizedData = hashingTF.transform(wordsData)
+    // alternatively, CountVectorizer can also be used to get term frequency vectors
+
     val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
     val idfModel = idf.fit(featurizedData)
     val rescaledData = idfModel.transform(featurizedData)
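
The comment added to all three examples only names the alternative. As a rough sketch (not part of this PR's diff), the `CountVectorizer` route could look like the following in Scala, assuming the `wordsData` DataFrame produced by the tokenizer in the example above:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// CountVectorizer is an Estimator: fitting it scans the corpus to learn a
// vocabulary, and the fitted model maps each document to a term-count vector.
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setVocabSize(20) // caps the vocabulary size, loosely analogous to numFeatures above
  .fit(wordsData)
val featurizedData = cvModel.transform(wordsData)

// The IDF stage is unchanged from the example above.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
```

Unlike `HashingTF`, this keeps an explicit vocabulary (so indices can be mapped back to terms), at the cost of an extra pass over the data to fit the model.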