Convert all space symbols to one form #278

Closed
saippuakauppias opened this issue Sep 30, 2019 · 4 comments

Comments

@saippuakauppias

saippuakauppias commented Sep 30, 2019

context

Unicode contains many space symbols: https://www.htmlsymbols.xyz/punctuation-symbols/space-symbols

proposed solution

All space symbols should be converted to one form (by default, the ASCII space).

@bdewilde
Collaborator

bdewilde commented Nov 5, 2019

Hi @saippuakauppias, this seems like an easy task for regular expressions. Does re.sub(r"\s+", " ", text, flags=re.UNICODE) work for you?

@saippuakauppias
Author

saippuakauppias commented Nov 5, 2019

Not all of these symbols are replaced by this regular expression :)

import re
test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='
print(re.sub(r"\s+", "+", test, flags=re.UNICODE))

=+=+=+=+=+=+=+=+=+=+=+=​=⁠=+==
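Because the unmatched characters in that output are invisible, it can help to list them explicitly. The following is a minimal sketch, not part of the original report: it checks each code point from the test string against `\s` and prints the Unicode general category of the ones that do not match. The helper name `unmatched_by_whitespace_class` is hypothetical.

```python
import re
import unicodedata

# The same code points as in the test string above, without the "=" separators.
TEST_CHARS = (
    "\u00A0\u2000\u2001\u2002\u2003\u2004\u2005\u2007"
    "\u2008\u2009\u200A\u200B\u2060\u3000\uFEFF"
)

WHITESPACE_CHAR = re.compile(r"\s")


def unmatched_by_whitespace_class(chars):
    """Return (code point, general category) for characters the whitespace class misses."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in chars
        if WHITESPACE_CHAR.fullmatch(ch) is None
    ]


for code_point, category in unmatched_by_whitespace_class(TEST_CHARS):
    print(code_point, category)
```

The three code points it reports carry the general category `Cf` (format) rather than `Zs` (space separator), which is why the whitespace class skips them.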

@bdewilde
Collaborator

Hi @saippuakauppias, sorry about the belated reply. It looks like three of those code points don't match r"\s+": \u200B, which is a zero-width space; \u2060, which is a word joiner; and \uFEFF, which is a zero-width no-break space. I'm pretty confident that the zero-width spaces should not be replaced by a single space, and according to Wikipedia:

U+2060 WORD JOINER (HTML ⁠ · WJ): encoded in Unicode since version 3.2. The word-joiner does not produce any space, and prohibits a line break at its position

So, for purposes of "normalizing whitespace", I think the thing to do here is to replace each of these three code points by an empty string, i.e.

re.sub(r"[\u200B\u2060\uFEFF]", "", text)

Does that seem reasonable to you?
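Putting the two substitutions from this thread together, a minimal sketch of the whole normalization might look like the following. The function name `normalize_spaces` and the order of the two steps are assumptions made for illustration; this is not necessarily how the referenced commit implements it.

```python
import re

# Zero-width code points discussed in this thread: removed outright.
ZERO_WIDTH = re.compile(r"[\u200B\u2060\uFEFF]")
# Any run of Unicode whitespace: collapsed to a single ASCII space.
WHITESPACE_RUN = re.compile(r"\s+")


def normalize_spaces(text):
    """Drop zero-width characters, then collapse whitespace runs to one ASCII space."""
    # Remove zero-width characters first so that whitespace on either side
    # of one ends up in the same run and collapses to a single space.
    text = ZERO_WIDTH.sub("", text)
    return WHITESPACE_RUN.sub(" ", text)


print(normalize_spaces("a\u00A0b\u200Bc\u2003\u3000d"))
```

Removing the zero-width characters before collapsing means an input such as `"a \u200B b"` yields `"a b"` rather than keeping two spaces.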

bdewilde added a commit that referenced this issue Dec 21, 2019
@saippuakauppias
Author

Yes, I think it's a good solution for this case :)
