Change definition of string to remove "Printable" and discourage use … #485

timbray · 2019-08-14T16:37:22Z

…of control characters.

Signed-off-by: Tim Bray timbray@amazon.com

…of control characters. Signed-off-by: Tim Bray <timbray@amazon.com>

timbray · 2019-08-14T18:05:55Z

BTW I'd be happy to change SHOULD NOT to MUST NOT.

clemensv · 2019-08-14T18:46:47Z

Great. How about TAB?

timbray · 2019-08-14T20:36:18Z

Oh yeah, U+0009, you're right.

evankanderson · 2019-08-14T23:28:16Z

This is definitely an improvement, thanks!

A previously-ambiguous example is: 🏃‍♀️, which seems handled pretty clearly. That glyph is:

U+1F3C3 : RUNNER {running, marathon, track and fields}
U+200D : ZERO WIDTH JOINER [ZWJ]
U+2640 : FEMALE SIGN {Venus; alchemical symbol for copper}
U+FE0F : VARIATION SELECTOR-16 [VS16] {emoji variation selector}

It's not clear that either ZWJ or VS16 is a "printable character", even though the resulting glyph is printable. So, thanks for writing this up more precisely!

Nits incoming:

I'm wondering if we want to define this either in terms of unicode character classes or some other way that will adjust to Unicode as it changes over time.
Alternatively, we should support all the unicode characters, even ones that don't make sense in this context like <DEL>: "A�B". It looks like JSON describes how to encode "problematic" Unicode characters in values; perhaps each of the transport encodings should explicitly do the same?

Nits:

Why are we allowing U+000A and U+000D? At least for HTTP, it seems like these would break header parsing particularly badly (though we probably need to update the HTTP spec to describe how to encode non-ASCII String attribute values, so that may be separate work).
I think we might want to exclude at least all characters in the Cc category, which includes U+0000 through U+001F, but also U+007F (<DEL>) through U+009F:

https://www.compart.com/en/unicode/category/Cc

(As further evidence, the start/end selected area and other device control characters don't seem to have a good use as event metadata...)

Similarly, there are a set of reserved "noncharacters" which are only intended for internal use which we might want to exclude: http://www.unicode.org/faq/private_use.html#nonchar4

cneijenhuis · 2019-08-15T07:18:46Z

Thanks for addressing #483 👍

Assuming we adopt #484 and don't enforce these rules on data, I agree with @evankanderson that we probably shouldn't allow multi-line strings because of the HTTP headers.

timbray · 2019-08-15T15:16:07Z

I'm wondering if we want to define this either in terms of unicode character classes or some other way that will adjust to Unicode as it changes over time.

That would certainly be fairly conventional. A variety of protocols exclude certain character classes. I'm particularly in favor of excluding C0 controls for a variety of reasons.

Alternatively, we should support all the unicode characters, even ones that don't make sense in this context like ~~: "A�B". It looks like JSON describes how to encode "problematic" Unicode characters in values; perhaps each of the transport encodings should explicitly do the same?~~

I think we should stay away from encoding and keep our discussion strictly in terms of Unicode characters. Every transport I know of has well-established practices for turning them into bits on the wire and we shouldn't get in the way.

Why are we allowing U+000A and U+000D? At least for HTTP, it seems like these would break header parsing particularly badly (though we probably need to update the HTTP spec to describe how to encode non-ASCII String attribute values, so that may be separate work).

Well, in the data payload you obviously need them. Other than that I can't see why.

I think we might want to exclude at least all characters in the Cc category, which includes U+0000 through U+001F, but also U+007F () through U+009F:
https://www.compart.com/en/unicode/category/Cc
(As further evidence, the start/end selected area and other device control characters don't seem to have a good use as event metadata...)
Similarly, there are a set of reserved "noncharacters" which are only intended for internal use which we might want to exclude: http://www.unicode.org/faq/private_use.html#nonchar4

I'm in general in favor of excluding trash characters.

duglin · 2019-08-15T15:19:55Z

BTW I'd be happy to change SHOULD NOT to MUST NOT.

Just a personal preference, but I'd prefer SHOULD NOT. I'm not a fan of being too parental. In general I think people who care about interop will "do the right thing" or people will stop using them. But because there are times when people need to do things that are not best for interop, but are necessary for their case, I like to give people the option to do so. That's why I think SHOULD NOT in this case is better - gives a strong best-practice but also an "out".

evankanderson · 2019-08-15T20:26:30Z

HTTP headers are defined in US-ASCII, so I think the HTTP binary transport needs to define a way to take a unicode string value and map it to an ASCII string:

https://tools.ietf.org/html/rfc7230#section-3.2.6

On the other hand, it seems like this is something which is HTTP's fault, and we should make HTTP handle it rather than doing so generically.

I put together #488 for this.

…of control characters. Signed-off-by: Tim Bray <timbray@amazon.com>

cneijenhuis

Minor comments, overall LGTM 👍

cneijenhuis · 2019-08-16T06:47:05Z

spec.md

-  meaning and predictably cause interoperability problems.
+- `String` - Sequence of allowable Unicode characters. The following characters
+  are disallowed:
+  - the "control characters" in the ranges U+0000-U+001F and U+007F-009F (both


Either U+0000-U+001F and U+007F-U+009F or U+0000-001F and U+007F-009F 😉

I had to look at the comment for an endless time to spot the fix but I did eventually …

cneijenhuis · 2019-08-16T06:48:36Z

spec.md

+  are disallowed:
+  - the "control characters" in the ranges U+0000-U+001F and U+007F-009F (both
+	ranges inclusive), since most have no agreed-on meaning, and some, such as
+	U+000A (newline), are not usable in contexts such as HTTP headers. 


Indentation (both lines should start with 4 spaces)

…hristoph's nits

duglin · 2019-08-22T18:15:35Z

Approved on the 8/22 call.

However, @timbray there's a merge conflict - can you resolve that? Then I can merge it.

duglin · 2019-08-27T00:37:30Z

replaced by #490

Change definition of string to remove "Printable" and discourage use …

f8181a1

…of control characters. Signed-off-by: Tim Bray <timbray@amazon.com>

duglin added the v1.0 label Aug 14, 2019

duglin mentioned this pull request Aug 14, 2019

String type: "printable Unicode characters" and follow-on problems #483

Closed

Tim Bray added 2 commits August 15, 2019 15:21

Change definition of string to remove "Printable" and discourage use …

8259e18

…of control characters. Signed-off-by: Tim Bray <timbray@amazon.com>

Merge branch 'master' of https://github.com/timbray/spec

c61827a

duglin mentioned this pull request Aug 16, 2019

How to deal with odd chars in strings #473

Closed

cneijenhuis reviewed Aug 16, 2019

View reviewed changes

Tim Bray added 2 commits August 19, 2019 11:11

Merge branch 'master' of https://github.com/timbray/spec and adress C…

a6fa5f6

…hristoph's nits

Merge branch 'master' of https://github.com/timbray/spec

2366db3

Merge branch 'master' into master

a59d2fe

timbray closed this Aug 26, 2019

duglin mentioned this pull request Aug 27, 2019

Remove "printable", disallow dodgy Unicode #490

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change definition of string to remove "Printable" and discourage use … #485

Change definition of string to remove "Printable" and discourage use … #485

timbray commented Aug 14, 2019

timbray commented Aug 14, 2019

clemensv commented Aug 14, 2019 •

edited

Loading

timbray commented Aug 14, 2019

evankanderson commented Aug 14, 2019

cneijenhuis commented Aug 15, 2019

timbray commented Aug 15, 2019

duglin commented Aug 15, 2019

evankanderson commented Aug 15, 2019 •

edited

Loading

cneijenhuis left a comment

cneijenhuis Aug 16, 2019

timbray Aug 19, 2019

cneijenhuis Aug 16, 2019

duglin commented Aug 22, 2019

duglin commented Aug 27, 2019

Change definition of string to remove "Printable" and discourage use … #485

Change definition of string to remove "Printable" and discourage use … #485

Conversation

timbray commented Aug 14, 2019

timbray commented Aug 14, 2019

clemensv commented Aug 14, 2019 • edited Loading

timbray commented Aug 14, 2019

evankanderson commented Aug 14, 2019

cneijenhuis commented Aug 15, 2019

timbray commented Aug 15, 2019

duglin commented Aug 15, 2019

evankanderson commented Aug 15, 2019 • edited Loading

cneijenhuis left a comment

Choose a reason for hiding this comment

cneijenhuis Aug 16, 2019

Choose a reason for hiding this comment

timbray Aug 19, 2019

Choose a reason for hiding this comment

cneijenhuis Aug 16, 2019

Choose a reason for hiding this comment

duglin commented Aug 22, 2019

duglin commented Aug 27, 2019

clemensv commented Aug 14, 2019 •

edited

Loading

evankanderson commented Aug 15, 2019 •

edited

Loading