Parse a minimal set of fullwidth punctuation as synonyms #5903
Comments
The rabbit hole really is deep with this one. What does the average CJK user do when writing code: just put the IME into English by default and maybe switch back for comments?
Haha, this one is quite fun. Julia 0.2:
Unicode macros could make for some fun-looking code...
I can't speak for everyone else, but I do switch back and forth between US ANSI and Pinyin constantly to type the proper halfwidth punctuation. I think many people don't even bother to try typing non-Roman characters into their code.
The idea here is that we only use a small number of ASCII punctuation symbols, and so if other Unicode characters are really aliases of those they should be treated the same. For example, we already treat 26 Unicode characters as whitespace. I think the fullwidth colon and equals are pretty obvious, but it does get murkier. I'm not sure what to do with the large number of quote characters in particular.
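To make the alias idea concrete, here is a minimal Python sketch (Python only because its standard library makes this easy to run; this is not Julia's parser). The alias list and the `fold_punctuation` helper are hypothetical, since the thread does not settle on an exact set, and quote characters are deliberately left out because their handling is still open.

```python
# Hypothetical minimal alias set: fullwidth punctuation -> ASCII equivalent.
# The thread only calls the colon and equals sign "pretty obvious"; the other
# entries are assumptions added for illustration.
FULLWIDTH_ALIASES = {
    "\uFF1A": ":",  # FULLWIDTH COLON
    "\uFF1D": "=",  # FULLWIDTH EQUALS SIGN
    "\uFF0C": ",",  # FULLWIDTH COMMA
    "\uFF1B": ";",  # FULLWIDTH SEMICOLON
    "\uFF08": "(",  # FULLWIDTH LEFT PARENTHESIS
    "\uFF09": ")",  # FULLWIDTH RIGHT PARENTHESIS
}
_ALIAS_TABLE = str.maketrans(FULLWIDTH_ALIASES)


def fold_punctuation(source: str) -> str:
    """Replace each aliased fullwidth character with its ASCII equivalent.

    This naive version rewrites the whole input, including string literals;
    a real parser would apply the aliases only while scanning operators and
    delimiters.
    """
    return source.translate(_ALIAS_TABLE)
```

For example, `fold_punctuation("x\uFF1D1")` gives `"x=1"`. Every entry in the table differs from its ASCII counterpart by the fixed offset 0xFEE0, which is how the Fullwidth Forms block is laid out in Unicode.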
I switch back and forth for this. I often have trouble telling fullwidth and halfwidth commas, colons, semicolons, exclamation marks, question marks, and parentheses apart. Periods, quotes, and other bracket marks seem fine because they are easy to identify. This is the first time I have realized that there is a fullwidth equals sign.
I think the equals sign should be dealt with, since it is seldom used in strings. Let's just leave the others as they are; maybe you'll need fullwidth marks in a string someday.
Here are some Unicode normalization tables that may be useful, particularly the ones for punctuation.
Rather than starting to add custom exceptions to NFC, my preference would be to start with NFKC (which solves the issue here of multiple input modes in asian languages, as well as e.g. ligatures in Latin scripts or
Since we settled on NFC, it might be useful to revisit this issue and add a limited set of custom additions to our Unicode normalization. The µ (micro) vs. μ (mu) issue just came up again (Keno/SIUnits.jl#23) for example, and I would tend to include this exception as well simply because µ is so easy to type on MacOS (option-m).
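The difference between the two normalization forms discussed above can be checked with Python's standard `unicodedata` module, used here only as a convenient way to query the Unicode tables. The sketch also shows one possible shape for "NFC plus a limited set of custom additions"; the `CUSTOM_FOLDS` table and `normalize_identifier` helper are hypothetical names, not anything from the Julia code base.

```python
import unicodedata

MICRO_SIGN = "\u00B5"        # µ (micro sign)
GREEK_MU = "\u03BC"          # μ (Greek small letter mu)
FULLWIDTH_EQUALS = "\uFF1D"  # fullwidth equals sign

# NFC performs canonical composition only, so compatibility characters such
# as the micro sign and fullwidth forms pass through unchanged. NFKC also
# applies compatibility mappings, folding them to their plain counterparts.

# Hypothetical exception list layered on top of NFC; the single entry
# mirrors the micro-vs-mu case mentioned in the thread.
CUSTOM_FOLDS = str.maketrans({MICRO_SIGN: GREEK_MU})


def normalize_identifier(name: str) -> str:
    """NFC-normalize an identifier, then apply the custom folds."""
    return unicodedata.normalize("NFC", name).translate(CUSTOM_FOLDS)
```

With this policy `normalize_identifier("\u00B5m")` and `normalize_identifier("\u03BCm")` return the same string, while fullwidth punctuation is left untouched unless it is added to the exception table.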
Bump. The distinction between micro vs mu is pretty annoying. It would be great to have a decision on this for 0.3.
@IainNZ I just had to try it out for myself, and your use case looks like it's working in 0.3!
The best part about this is that TAB-completion actually works, so I can type
The lack of attention for several releases makes me think we can probably let this go until some indefinite time in the future.
Not actually implemented in my PR, though now it's easy to add
Full-width punctuation characters now give "invalid character" parse errors, so I think adding this would be non-breaking. Can probably be deferred.
The current Unicode normalization policy (#5576, #5434) is to employ the NFC normalization to canonicalize identifiers. However, NFC is overly conservative as a choice of canonicalization, since it does not alleviate the possibility of writing obfuscated code using, for example, full-width punctuation characters in identifiers.
Example:
While in general we probably don't want to get into the business of building semantic knowledge of natural languages into the parser, I think at the very least we should support as synonyms the default output produced by standard input method editors. As an example, setting the input method to Pinyin - Simplified IME on OSX 10.9, typing on the keyboard
bing1=3
selects the first Chinese character with phonetic spelling bing, then continues with =3 as part of the input stream. The result, when typed directly into the Julia REPL, is
which stems from the full-width = being parsed as part of the identifier rather than the assignment operator, which is arguably what the typical user would have intended.
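As a rough illustration of the behaviour being requested, the following Python sketch treats the fullwidth equals sign as a synonym for `=` when splitting a simple assignment. It is a toy, not a model of Julia's parser: `split_assignment` is a hypothetical helper, and the character 冰 is only an assumed stand-in for whatever the IME actually produces.

```python
# Assumed alias: FULLWIDTH EQUALS SIGN (U+FF1D) is treated as ASCII "=".
ASSIGN_ALIASES = str.maketrans({"\uFF1D": "="})


def split_assignment(line: str):
    """Split a simple `name = value` line, accepting the fullwidth alias.

    Returns (name, value), or None if no assignment operator is found.
    Only the first "=" is considered, so this toy does not handle "==".
    """
    folded = line.translate(ASSIGN_ALIASES)
    name, sep, value = folded.partition("=")
    if not sep:
        return None
    return name.strip(), value.strip()


ime_output = "\u51B0\uFF1D3"  # 冰 followed by a fullwidth equals sign and 3
```

Without the alias, `ime_output` contains no ASCII `=` at all, so a scanner looking only for U+003D never sees an assignment. With it, `split_assignment(ime_output)` returns `("冰", "3")`.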