Convert all space symbols to one form #278

saippuakauppias · 2019-09-30T20:29:28Z

context

Unicode contains many space symbols: https://www.htmlsymbols.xyz/punctuation-symbols/space-symbols

proposed solution

All space symbols should be converted to one form (by default, the ASCII space).

bdewilde · 2019-11-05T17:21:24Z

Hi @saippuakauppias, this seems like an easy task for regular expressions. Does re.sub(r"\s+", " ", text, flags=re.UNICODE) work for you?

saippuakauppias · 2019-11-05T18:51:43Z

Not all of these symbols are replaced by this regular expression :)

import re
test = '=\u00A0=\u2000=\u2001=\u2002=\u2003=\u2004=\u2005=\u2007=\u2008=\u2009=\u200A=\u200B=\u2060=\u3000=\uFEFF='
print(re.sub(r"\s+", "+", test, flags=re.UNICODE))

=+=+=+=+=+=+=+=+=+=+=+==⁠=+==
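Because several of the unmatched characters are invisible in the printed output, it can help to test each code point on its own. The following is a minimal sketch, not code from the project; the names CANDIDATES and unmatched_by_whitespace_class are hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Same code points as in the test string above (hypothetical constant name).
CANDIDATES = (
    "\u00A0\u2000\u2001\u2002\u2003\u2004\u2005\u2007"
    "\u2008\u2009\u200A\u200B\u2060\u3000\uFEFF"
)


def unmatched_by_whitespace_class(chars):
    """Return the characters in `chars` that the regex whitespace class does not match."""
    return [ch for ch in chars if re.fullmatch(r"\s", ch) is None]


# Print code point, Unicode name, and general category for each unmatched character.
for ch in unmatched_by_whitespace_class(CANDIDATES):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})")
```

The characters this reports all have the Unicode general category Cf (format) rather than a space-separator category, which is why the whitespace class skips them.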

bdewilde · 2019-12-21T21:39:28Z

Hi @saippuakauppias, sorry about the belated reply. It looks like three of those code points don't match r"\s+": \u200B, which is a zero-width space; \u2060, which is a word joiner; and \uFEFF, which is a zero-width no-break space. I'm pretty confident that the zero-width spaces should not be replaced by a single space, and according to Wikipedia:

U+2060 WORD JOINER (HTML &#8288; · WJ): encoded in Unicode since version 3.2. The word-joiner does not produce any space, and prohibits a line break at its position.

So, for purposes of "normalizing whitespace", I think the thing to do here is to replace each of these three code points by an empty string, i.e.

re.sub(r"[\u200B\u2060\uFEFF]", "", text)

Does that seem reasonable to you?

saippuakauppias · 2019-12-21T22:28:26Z

Yes, I think it's a good solution for this case :)

saippuakauppias added the enhancement label Sep 30, 2019

bdewilde added a commit that referenced this issue Dec 21, 2019

Normalize zero-width spaces; close #278

41d47a7

bdewilde closed this as completed in f1da397 Mar 1, 2020
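For reference, the two substitutions discussed in this thread could be combined roughly as follows. This is a sketch under stated assumptions, not the code from the referenced commits: the function name normalize_spaces and the constant names are hypothetical, and the order (strip zero-width characters first, then collapse whitespace) is one reasonable choice.

```python
import re

# Hypothetical names; the patterns are the ones proposed in the thread.
RE_ZERO_WIDTH = re.compile(r"[\u200B\u2060\uFEFF]")
RE_WHITESPACE = re.compile(r"\s+")


def normalize_spaces(text: str) -> str:
    """Drop zero-width characters, then collapse whitespace runs to one ASCII space."""
    # Zero-width space, word joiner, and zero-width no-break space become "".
    text = RE_ZERO_WIDTH.sub("", text)
    # In Python 3, \s on str patterns already covers Unicode whitespace.
    return RE_WHITESPACE.sub(" ", text)


print(normalize_spaces("a\u00A0b\u200Bc\u2003\u3000d\uFEFF"))
```

Note that r"\s+" also matches tabs and newlines, so this sketch flattens line breaks to spaces as well; a narrower character class would be needed if line breaks must be preserved.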
