Support for unicode and octal escapes in string literals. #65
@richhickey, the absence of unicode escapes in string literals is really limiting. And the reason for that is unclear, given that unicode escapes are supported for characters.
This is in response to edn-format/edn#65.

This is an extension, as string literals as currently documented do not specify support for \uXXXX escapes.
https://github.com/edn-format/edn/tree/a51127aecd318096667ae0dafa25353ecb07c9c3

Notes:
- A Unicode escape must begin with "\u". This is case sensitive: "\U" will be rejected.
- "\u" must be followed by exactly four hex digits taken from this set: 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
- The digits are not case sensitive.
- Each such Unicode escape encodes a single 16-bit Java char. Since Java uses UTF-16 internally (for historical reasons), code points beyond the Basic Multilingual Plane are written as a pair of Unicode escapes (see also "surrogate pairs").
This is in response to edn-format/edn#65.

This is an extension, as string literals as currently documented do not specify support for \uXXXX escapes.
https://github.com/edn-format/edn/tree/a51127aecd318096667ae0dafa25353ecb07c9c3

Syntax Notes:
- A Unicode escape must begin with "\u". This is case sensitive: "\U" will be rejected.
- "\u" must be followed by exactly four hex digits taken from this set: 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
- The digits are not case sensitive.
- Each such Unicode escape encodes a single 16-bit Java char. Since Java uses UTF-16 internally (for historical reasons), code points beyond the Basic Multilingual Plane are written as a pair of Unicode escapes (see also "surrogate pairs").

Disabling:
By default, \uXXXX escapes are now supported in String literals. Parser.Config (and Parser.Config.Builder) now support a flag which can be set to false to disable support for \uXXXX in string literals. This restores the old behavior of throwing an EdnSyntaxException when such escapes are encountered.
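The syntax notes above can be sketched in a few lines of Java. This is a minimal illustration, not the edn-java implementation: the class name `UnicodeEscapeSketch`, the method `decode`, and the choice of `IllegalArgumentException` are all assumptions made for the example. It only handles \uXXXX and copies every other escape through untouched, on the assumption that the rest of a parser would deal with those.

```java
final class UnicodeEscapeSketch {

    // Value of one ASCII hex digit, or -1 for anything else.
    // Character.digit is avoided because it also accepts non-ASCII digits.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Decodes backslash-u escapes in the body of a string literal.
    // Any other escape (backslash plus one character) is copied through as-is.
    static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
                continue;
            }
            boolean hasNext = i + 1 < raw.length();
            char next = hasNext ? raw.charAt(i + 1) : '\0';
            if (next == 'U') {
                // The escape introducer is case sensitive.
                throw new IllegalArgumentException("uppercase U escape at index " + i);
            }
            if (next != 'u') {
                out.append(c);
                if (hasNext) out.append(next);
                i += 2;
                continue;
            }
            if (i + 6 > raw.length()) {
                throw new IllegalArgumentException("expected four hex digits at index " + i);
            }
            int value = 0;
            for (int j = i + 2; j < i + 6; j++) {
                int digit = hexValue(raw.charAt(j));
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + j);
                }
                value = value * 16 + digit;
            }
            // Each escape yields exactly one 16-bit char, possibly half of a surrogate pair.
            out.append((char) value);
            i += 6;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it is written as two escapes.
        String smiley = decode("\\uD83D\\uDE00");
        System.out.println(smiley.length());                           // 2
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1
        System.out.printf("U+%X%n", smiley.codePointAt(0));            // U+1F600
    }
}
```

Because each escape is appended as a single `char`, the surrogate pair only becomes one code point once both halves sit next to each other in the resulting string.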
The maintainer of the edn-java library kindly agreed to implement unicode escapes in the library. Initially, it was planned as an option, disabled by default. After implementing it that way, it was discovered that https://github.com/clojure/tools.reader supports unicode escapes by default, so edn-java finally implemented unicode escapes enabled by default.

It turns out https://github.com/clojure/tools.reader also supports octal escapes in string and character literals, same as in the clojure language. (The current edn spec includes unicode escapes for characters, but misses octal escapes.)

@richhickey IMHO clarity is needed in the spec. It's strange that unicode escapes are not specified for strings while they are specified for characters. And what about octal escapes?

@wagjo, if your pull request includes octal escapes for string literals, it makes sense to include them for characters too (the clojure language and tools.reader support them in the form \oNNN).

As for backwards compatibility, I would suggest including the escapes in the spec and adding a comment: "Unicode and octal escapes in string literals and octal escapes in character literals were only added to the spec in 2020. Some implementations supported them before that. For compatibility, consumers of EDN documents (including parsing libraries) should always support the escapes. The suppliers of EDN documents should avoid the escapes, unless they verified all the consumers of their documents support the escapes"
BTW, in Java octal escapes in string literals can contain up to 3 digits (https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html), while the clojure reader and clojure.tools.reader.edn require exactly 3 digits after the backslash. So @wagjo, the wording "as in Java" in the pull request does not precisely match the current implementations.
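To make the difference between the two conventions concrete, here is a small Java sketch. It is not taken from the Clojure reader, tools.reader, or edn-java: the class `OctalEscapeSketch`, the `exactlyThreeDigits` flag, and the upper bound of octal 377 in strict mode are assumptions for illustration. The lenient branch follows the Java Language Specification rule (one to three digits, with three digits allowed only when the first is 0 to 3); the strict branch models the "exactly 3 digits" behaviour as described in the comment above.

```java
final class OctalEscapeSketch {

    private static boolean isOctal(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '7';
    }

    // Decodes octal escapes only; any other escape is copied through as-is.
    static String decode(String raw, boolean exactlyThreeDigits) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
                continue;
            }
            if (!isOctal(raw, i + 1)) {
                // Not an octal escape: keep the backslash and the character after it.
                out.append(c);
                if (i + 1 < raw.length()) out.append(raw.charAt(i + 1));
                i += 2;
                continue;
            }
            int end; // index just past the last digit of this escape
            if (exactlyThreeDigits) {
                if (!isOctal(raw, i + 2) || !isOctal(raw, i + 3)) {
                    throw new IllegalArgumentException("expected three octal digits at index " + i);
                }
                end = i + 4;
            } else {
                // JLS rule: a third digit is only allowed when the first digit is 0..3.
                int maxDigits = raw.charAt(i + 1) <= '3' ? 3 : 2;
                end = i + 2;
                while (end < i + 1 + maxDigits && isOctal(raw, end)) end++;
            }
            int value = Integer.parseInt(raw.substring(i + 1, end), 8);
            if (value > 0377) {
                // Only reachable in strict mode (assumed limit for this sketch).
                throw new IllegalArgumentException("octal escape out of range at index " + i);
            }
            out.append((char) value);
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\101", false)); // A
        System.out.println(decode("\\101", true));  // A
        System.out.println(decode("\\477", false)); // '7
    }
}
```

With the lenient rule, `\7` decodes to a single char and `\477` splits into `\47` followed by a literal `7`; with the strict rule both inputs are rejected, which is the mismatch the "as in Java" wording would hide.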
Specs do not mention whether unicode and octal escapes are supported or not. As clojure.edn supports it [1], I've added an explicit mention in the specs. I'm a registered clojure contributor (signed CA).
[1] https://github.com/clojure/clojure/blob/c6756a8bab137128c8119add29a25b0a88509900/src/jvm/clojure/lang/EdnReader.java#L580