Automatically cleaning unicode text #13

dimenwarper opened this issue May 25, 2017 · 2 comments

@dimenwarper

Thanks for this awesome tool! I was wondering if we could include some sanity checking/cleanup for badly behaved text (e.g. all those invalid unicode characters). Could be as simple as running ftfy on all text columns. I'd volunteer to integrate this into datacleaner.
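
A minimal sketch of the idea (hypothetical helper, not datacleaner's actual API): apply ftfy.fix_text to every object-dtype column of a pandas DataFrame before any other processing.

```python
import ftfy
import pandas as pd

def fix_text_columns(df):
    """Return a copy of df with ftfy.fix_text applied to every string cell."""
    df = df.copy()
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].map(
            lambda v: ftfy.fix_text(v) if isinstance(v, str) else v
        )
    return df

# ftfy's canonical example: mojibake like 'âœ” No problems'
# should come back as '✔ No problems'.
raw = pd.DataFrame({'text': ['âœ” No problems', 'plain ascii']})
print(fix_text_columns(raw))
```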

@rhiever (Owner) commented May 25, 2017

Sounds promising. Please submit a PR with the new functionality along with unit tests to demonstrate how it works.

@dimenwarper (Author) commented May 31, 2017

I've implemented a draft of this, but realized it may clash with the existing functionality of converting all text to numerical values. I wonder how to proceed; as I see it, there are two options:

  1. Fix the text before applying the encoding: This is what I'm doing right now, so strings like >=50 and >=50' get encoded to the same label (see the sketch after this list).
  2. Make encoding optional: This is tricky, since there will be some text-based columns where you want to preserve the text to featurize later (e.g. with a sklearn.feature_extraction.text.TfidfVectorizer) rather than convert it to a label with an encoder. The tricky part is how to specify which columns you want to encode or not.

One way to proceed would be to go with 1 and then tackle 2 in a later issue.
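
For reference, a minimal sketch of option 1 (hypothetical helper name, not the actual draft): run ftfy over a column before handing it to sklearn's LabelEncoder, so variant spellings of the same string collapse to a single label.

```python
import ftfy
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def clean_then_encode(series):
    """Fix unicode problems with ftfy, then label-encode the column."""
    fixed = series.map(lambda v: ftfy.fix_text(v) if isinstance(v, str) else v)
    return LabelEncoder().fit_transform(fixed.astype(str))

# With ftfy's default fixers, the fullwidth variant '＞＝５０' should normalize
# to '>=50', so the first two rows get the same label.
col = pd.Series(['>=50', '＞＝５０', '<50'])
print(clean_then_encode(col))
```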
