Automatically cleaning unicode text #13

dimenwarper opened this issue May 25, 2017 · 2 comments

@dimenwarper

Thanks for this awesome tool! I was wondering if we could include some sanity checking/cleanup for badly behaved text (e.g. all those invalid unicode characters). Could be as simple as running ftfy on all text columns. I'd volunteer to integrate this into datacleaner.
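
A minimal sketch of the idea (hypothetical helper, not datacleaner's actual API): apply ftfy.fix_text to every object-dtype column of a pandas DataFrame before any other processing.

```python
import ftfy
import pandas as pd

def fix_text_columns(df):
    """Return a copy of df with ftfy.fix_text applied to every string cell."""
    df = df.copy()
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].map(
            lambda v: ftfy.fix_text(v) if isinstance(v, str) else v
        )
    return df

# ftfy's canonical example: mojibake like 'âœ” No problems'
# should come back as '✔ No problems'.
raw = pd.DataFrame({'text': ['âœ” No problems', 'plain ascii']})
print(fix_text_columns(raw))
```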

@rhiever (Owner) commented May 25, 2017

Sounds promising. Please submit a PR with the new functionality along with unit tests to demonstrate how it works.

@dimenwarper (Author) commented May 31, 2017

I've implemented a draft of this, but realized it may clash with the existing functionality of converting all text to numerical values. I wonder how to proceed; as I see it, there are two options:

  1. Fix the text before applying the encoding: This is what I'm doing right now, so strings like >=50 and >=50' get encoded to the same label (see the sketch after this list).
  2. Make encoding optional: This is tricky, since there will be some text-based columns where you want to preserve the text to featurize later (e.g. with a sklearn.feature_extraction.text.TfidfVectorizer) rather than convert it to a label with an encoder. The tricky part is how to specify which columns you want to encode or not.

One way to proceed would be to go with 1 and then tackle 2 in a later issue.
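
For reference, a minimal sketch of option 1 (hypothetical helper name, not the actual draft): run ftfy over a column before handing it to sklearn's LabelEncoder, so variant spellings of the same string collapse to a single label.

```python
import ftfy
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def clean_then_encode(series):
    """Fix unicode problems with ftfy, then label-encode the column."""
    fixed = series.map(lambda v: ftfy.fix_text(v) if isinstance(v, str) else v)
    return LabelEncoder().fit_transform(fixed.astype(str))

# With ftfy's default fixers, the fullwidth variant '＞＝５０' should normalize
# to '>=50', so the first two rows get the same label.
col = pd.Series(['>=50', '＞＝５０', '<50'])
print(clean_then_encode(col))
```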
