Convert all space symbols to one form #278
Comments
Hi @saippuakauppias, this seems like an easy task for regular expressions. Does |
Not all of these symbols are replaced by this regular expression :)
import re
test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='
print(re.sub(r"\s+", "+", test, flags=re.UNICODE))
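To see which of the code points in the test string are left untouched, each character can be checked against `\s` individually. This is only a diagnostic sketch: the variable name `unmatched` is made up for illustration, and the result depends on how Python 3's `re` module classifies whitespace for `str` patterns.

```python
import re
import unicodedata

test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='

# Collect every character (other than the "=" separators) that \s does not match.
unmatched = [ch for ch in test if ch != '=' and not re.fullmatch(r"\s", ch)]

for ch in unmatched:
    # Print the code point, its Unicode name, and its general category.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
```

The characters reported here are in the "Cf" (format) general category rather than a space category, which is why `\s` skips them.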
|
Hi @saippuakauppias, sorry about the belated reply. It looks like three of those code points don't match
So, for purposes of "normalizing whitespace", I think the thing to do here is to replace each of these three code points by an empty string, i.e.
Does that seem reasonable to you? |
Yes, I think it's a good solution for this case :) |
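A minimal sketch of the approach agreed on above, assuming the three code points in question are U+200B, U+2060 and U+FEFF (the ones from the test string that `\s` does not match in Python 3). The function name `normalize_spaces` is hypothetical and not part of any library API.

```python
import re

# Assumption: these are the three zero-width code points from the test string
# that \s does not match; they are removed rather than turned into a space.
ZERO_WIDTH = re.compile("[\u200B\u2060\uFEFF]")

# In Python 3, \s on str patterns matches Unicode whitespace such as U+00A0 and U+3000.
WHITESPACE = re.compile(r"\s")


def normalize_spaces(text: str) -> str:
    """Drop zero-width characters, then map each whitespace character to an ASCII space."""
    text = ZERO_WIDTH.sub("", text)
    return WHITESPACE.sub(" ", text)


test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='
print(normalize_spaces(test))
```

Note that `\s` also matches tabs and newlines, so this sketch turns those into spaces as well. If line breaks should survive, the second pattern would need to exclude them.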
context
Unicode contains many space symbols: https://www.htmlsymbols.xyz/punctuation-symbols/space-symbols
proposed solution
All space symbols should be converted to one form (by default, the ASCII space).