Thanks for this awesome tool! I was wondering if we could include some sanity checking/cleanup for badly behaved text (e.g. all those invalid unicode characters). Could be as simple as running ftfy on all text columns. I'd volunteer to integrate this into datacleaner.
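A minimal sketch of what that cleanup pass could look like, applying a fixer to every object-dtype column of a DataFrame. Here `fix_text` is a self-contained stand-in that only repairs the common UTF-8-read-as-latin-1 mojibake; the real implementation would call `ftfy.fix_text`, which handles many more cases:

```python
import pandas as pd

def fix_text(s):
    # Stand-in for ftfy.fix_text: repair text that was UTF-8 encoded
    # but mistakenly decoded as latin-1 (e.g. 'cafÃ©' -> 'café').
    try:
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # leave already-sane text untouched

def clean_text_columns(df):
    # Apply the fixer to every string value in every text column.
    df = df.copy()
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].map(lambda v: fix_text(v) if isinstance(v, str) else v)
    return df

df = pd.DataFrame({'city': ['cafÃ©', 'Paris'], 'n': [1, 2]})
print(clean_text_columns(df)['city'].tolist())  # ['café', 'Paris']
```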
I've implemented a draft of this, but realized it may clash with the existing functionality of converting all text to numerical values. I'm not sure how to proceed; as I see it, there are two options:

1. Fix the text before applying the encoding. This is what I'm doing right now, so strings like `>=50` and `>=50'` get encoded to the same label.
2. Make encoding optional. This is tricky: for some text columns you want to preserve the raw text to featurize later (e.g. with a `sklearn.feature_extraction.text.TfidfVectorizer`) rather than convert it to a label with an encoder. The hard part is how to specify which columns should be encoded and which should not.
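One possible shape for option 2, sketched with a hypothetical `encode_columns` parameter that whitelists which text columns get label-encoded (the function name and parameter are invented for illustration, and `pd.factorize` stands in for the encoder):

```python
import pandas as pd

def autoclean(df, encode_columns=None):
    # Hypothetical knob: only columns listed in encode_columns are
    # label-encoded; other text columns are left as raw strings so they
    # can be featurized later (e.g. with a TfidfVectorizer).
    df = df.copy()
    text_cols = df.select_dtypes(include='object').columns
    if encode_columns is None:
        targets = list(text_cols)  # default: encode everything, as today
    else:
        targets = [c for c in text_cols if c in encode_columns]
    for col in targets:
        df[col], _ = pd.factorize(df[col])
    return df

df = pd.DataFrame({'income': ['>=50', '<50'], 'review': ['great', 'bad']})
out = autoclean(df, encode_columns=['income'])
# 'income' is now integer codes; 'review' stays text for TF-IDF later.
print(out['income'].tolist(), out['review'].tolist())  # [0, 1] ['great', 'bad']
```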
One way to proceed would be to go with 1 and then tackle 2 in a later issue.
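A minimal sketch of option 1: clean the text first, then encode, so variants of the same value collapse to one label. The `clean` rule here (stripping stray quotes) is a hypothetical placeholder; `ftfy.fix_text` would be the real fixer, and `pd.factorize` stands in for the label encoder:

```python
import pandas as pd

def clean(s):
    # Hypothetical cleanup: strip stray surrounding quotes and whitespace.
    # ftfy.fix_text would additionally repair mojibake and other Unicode issues.
    return s.strip("'\"").strip()

values = [">=50", ">=50'", "<50"]
fixed = [clean(v) for v in values]
codes, labels = pd.factorize(fixed)
print(list(labels))  # ['>=50', '<50']
print(list(codes))   # [0, 0, 1] -- both '>=50' variants share label 0
```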