Complete implementation of remaining Spark transformers #8

Open
9 tasks
MLnick opened this issue Jun 8, 2018 · 3 comments

Comments

@MLnick
Collaborator

MLnick commented Jun 8, 2018

  • OneHotEncoder
  • RFormula
  • PolynomialExpansion
  • Interaction
  • Imputer
  • VectorIndexer
  • Word2Vec

These require MurmurHash3 to be added as a built-in PFA function (refer to related Hadrian issue):

  • HashingTF
  • FeatureHasher
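
For context on why MurmurHash3 is the blocker here: HashingTF-style transformers map each term to a vector index by hashing it and taking the result modulo the feature dimension, so the scoring engine needs the same hash function as the training side. Below is a minimal Python sketch of that idea, assuming the standard 32-bit x86 variant of MurmurHash3. The names `murmur3_32` and `hashing_tf`, the seed of 0, and the small `num_features` are illustrative assumptions, not aardpfark or Hadrian APIs. Spark's own implementation may differ in details such as the seed and tail-byte handling, so this will not necessarily reproduce Spark's indices.

```python
from typing import Dict, Iterable

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 x86 32-bit hash, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hashing_tf(terms: Iterable[str], num_features: int, seed: int = 0) -> Dict[int, float]:
    """Hashing trick: sparse term-frequency vector as {index: count}."""
    counts: Dict[int, float] = {}
    for term in terms:
        index = murmur3_32(term.encode("utf-8"), seed) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts
```

Calling `hashing_tf(["spark", "pfa", "spark"], num_features=16)` returns a sparse mapping whose counts sum to 3.0. Distinct terms can collide on the same index, which is the usual trade-off of the hashing trick.
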
@Paxanator

Paxanator commented Nov 18, 2018

Hey @MLnick, I'm looking into picking up one of these transformers to start learning more about aardpfark, starting with OneHotEncoder. It looks like OneHotEncoder relies on a StringIndexer to determine the length of its output, but the transformer itself doesn't require one in Spark (i.e. the data tells the OneHotEncoder how to transform it, as opposed to the encoder being fit).

As of 2.3, it seems this has been addressed with OneHotEncoderEstimator, which has a fit method and returns a OneHotEncoderModel with categorySizes:
https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator

Should support be added for 2.3 (I can try to upgrade), with the transformer built on that instead?
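
To make the difference concrete, here is a small Python sketch (not Spark or aardpfark code) of why a fitted model is easier to export: once the category size is known up front, encoding a single index is a pure function with a fixed output length. The helper names `fit_category_size` and `one_hot` are hypothetical, and the `drop_last` flag is modelled on Spark's `dropLast` parameter, which defaults to true.

```python
from typing import List, Sequence


def fit_category_size(indices: Sequence[int]) -> int:
    """Mimic the 'fit' step: number of categories is the largest index + 1."""
    return max(indices) + 1


def one_hot(index: int, category_size: int, drop_last: bool = True) -> List[float]:
    """Encode one category index as a fixed-length one-hot vector.

    With drop_last=True the last category is represented by the all-zeros
    vector, so the output has category_size - 1 entries.
    """
    if not 0 <= index < category_size:
        raise ValueError(f"index {index} outside [0, {category_size})")
    length = category_size - 1 if drop_last else category_size
    vector = [0.0] * length
    if index < length:
        vector[index] = 1.0
    return vector


# The size is learned once from training data, then reused for every row.
size = fit_category_size([0, 2, 1, 0])
encoded = [one_hot(i, size) for i in [0, 1, 2]]
```

Without the fitted size, the output length would have to be inferred from column metadata at transform time, which is the dependency on StringIndexer described above.
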

@MLnick
Collaborator Author

MLnick commented Nov 20, 2018

Hi @Paxanator, thanks for your interest in Aardpfark!

Yes, I agree: OneHotEncoder as of Spark 2.3 would be the best way forward for this transformer. Let me know if you need some assistance.

I'll take a look at upgrading the Spark version. Hopefully it shouldn't be much of a problem.

@Paxanator

Thank you for putting the library together! I'll wait for the Spark version bump before trying to tackle it.
