@jkbradley
Member

Various ML guide cleanups.

  • ml-guide.md: Make it easier to access the algorithm-specific guides.
  • LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
  • mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
  • Clean up Binarizer user guide a little.
  • Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
  • spark.ml Word2Vec user guide: clean up grammar/writing
  • Chi Sq Feature Selector docs: Improve text in doc.
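For context on the `ElementwiseProduct` item: the transformer multiplies each input vector by a fixed scaling vector, component by component (a Hadamard product), which is why `scalingVec` describes the parameter better than "w". Below is a minimal pure-Python sketch of that operation. It is not Spark's implementation, and the function name `elementwise_product` is hypothetical.

```python
from typing import List, Sequence


def elementwise_product(vector: Sequence[float],
                        scaling_vec: Sequence[float]) -> List[float]:
    """Multiply each component of `vector` by the matching component of `scaling_vec`."""
    # The two vectors must have the same dimension for a Hadamard product.
    if len(vector) != len(scaling_vec):
        raise ValueError("vector and scaling_vec must have the same length")
    return [v * s for v, s in zip(vector, scaling_vec)]


scaled = elementwise_product([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
print(scaled)  # [0.0, 2.0, 6.0]
```

The first component is zeroed out, the second is unchanged, and the third is doubled.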

CC: @mengxr @feynmanliang

@SparkQA

SparkQA commented Sep 14, 2015

Test build #42436 has finished for PR 8752 at commit 53d757a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • [ChiSqSelector](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
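
The selection procedure described in that doc text can be sketched in pure Python as follows. This is an illustration under stated assumptions, not Spark's `ChiSqSelector` code: the helper names `chi_squared` and `select_top_features` are hypothetical, features and labels are assumed to be small categorical integers, and features are ranked by the raw chi-squared statistic.

```python
from collections import Counter
from typing import List, Sequence


def chi_squared(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-squared statistic for independence of one feature and the label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # joint counts; missing pairs count as 0
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            # Expected count if the feature and the label were independent.
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def select_top_features(rows: Sequence[Sequence[int]],
                        labels: Sequence[int],
                        num_top_features: int) -> List[int]:
    """Return the indices of the features on which the label depends the most."""
    num_features = len(rows[0])
    scores = [chi_squared([row[j] for row in rows], labels)
              for j in range(num_features)]
    # Highest statistic first, then keep the top entries in index order.
    ranked = sorted(range(num_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:num_top_features])


rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 1, 1]
print(select_top_features(rows, labels, 1))  # [0]
```

Feature 0 matches the label exactly (statistic 4.0), while feature 1 is independent of it (statistic 0.0), so only feature 0 is kept.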

Contributor

Not sure if we should include model summaries in this description; I had a mailing list question about where that feature is documented.

Member Author

Yeah, there's not a great place. I'll try sticking a note here.

@jkbradley
Member Author

@feynmanliang Thanks for reviewing. Just updated per your comments.

Contributor

The class name is backticked in ChiSqSelector but not here or in Binarizer; we should choose one and be consistent. I would vote for backticking everything, since that's what I've been doing.

@feynmanliang
Contributor

LGTM after changes

@SparkQA

SparkQA commented Sep 15, 2015

Test build #42500 has finished for PR 8752 at commit 91f4edd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • and then filters (selects) the top features which the class label depends on the most.

@mengxr
Contributor

mengxr commented Sep 16, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in b921fe4 Sep 16, 2015
@jkbradley jkbradley deleted the mlguide-fixes-1.5 branch September 16, 2015 04:48