Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change definition of string to remove "Printable" and discourage use … #485

Closed
wants to merge 6 commits into from

Conversation

timbray
Copy link

@timbray timbray commented Aug 14, 2019

…of control characters.

Signed-off-by: Tim Bray timbray@amazon.com

…of control characters.

Signed-off-by: Tim Bray <timbray@amazon.com>
@timbray
Copy link
Author

timbray commented Aug 14, 2019

BTW I'd be happy to change SHOULD NOT to MUST NOT.

@clemensv
Copy link
Contributor

clemensv commented Aug 14, 2019

Great. How about TAB?

@timbray
Copy link
Author

timbray commented Aug 14, 2019

Oh yeah, U+0009, you're right.

@evankanderson
Copy link
Contributor

This is definitely an improvement, thanks!

A previously-ambiguous example is: 🏃‍♀️, which seems handled pretty clearly. That glyph is:

U+1F3C3 : RUNNER {running, marathon, track and fields}
U+200D : ZERO WIDTH JOINER [ZWJ]
U+2640 : FEMALE SIGN {Venus; alchemical symbol for copper}
U+FE0F : VARIATION SELECTOR-16 [VS16] {emoji variation selector}

It's not clear that either ZWJ or VS16 is a "printable character", even though the resulting glyph is printable. So, thanks for writing this up more precisely!

Nits incoming:

  • I'm wondering if we want to define this either in terms of unicode character classes or some other way that will adjust to Unicode as it changes over time.
  • Alternatively, we should support all the unicode characters, even ones that don't make sense in this context like <DEL>: "A�B". It looks like JSON describes how to encode "problematic" Unicode characters in values; perhaps each of the transport encodings should explicitly do the same?

Nits:

  • Why are we allowing U+000A and U+000D? At least for HTTP, it seems like these would break header parsing particularly badly (though we probably need to update the HTTP spec to describe how to encode non-ASCII String attribute values, so that may be separate work).

  • I think we might want to exclude at least all characters in the Cc category, which includes U+0000 through U+001F, but also U+007F (<DEL>) through U+009F:

    https://www.compart.com/en/unicode/category/Cc

    (As further evidence, the start/end selected area and other device control characters don't seem to have a good use as event metadata...)

    Similarly, there are a set of reserved "noncharacters" which are only intended for internal use which we might want to exclude: http://www.unicode.org/faq/private_use.html#nonchar4

@cneijenhuis
Copy link
Contributor

Thanks for addressing #483 👍

Assuming we adopt #484 and don't enforce these rules on data, I agree with @evankanderson that we probably shouldn't allow multi-line strings because of the HTTP headers.

@timbray
Copy link
Author

timbray commented Aug 15, 2019

  • I'm wondering if we want to define this either in terms of unicode character classes or some other way that will adjust to Unicode as it changes over time.

That would certainly be fairly conventional. A variety of protocols exclude certain character classes. I'm particularly in favor of excluding C0 controls for a variety of reasons.

  • Alternatively, we should support all the unicode characters, even ones that don't make sense in this context like : "A�B". It looks like JSON describes how to encode "problematic" Unicode characters in values; perhaps each of the transport encodings should explicitly do the same?

I think we should stay away from encoding and keep our discussion strictly in terms of Unicode characters. Every transport I know of has well-established practices for turning them into bits on the wire and we shouldn't get in the way.

  • Why are we allowing U+000A and U+000D? At least for HTTP, it seems like these would break header parsing particularly badly (though we probably need to update the HTTP spec to describe how to encode non-ASCII String attribute values, so that may be separate work).

Well, in the data payload you obviously need them. Other than that I can't see why.

  • I think we might want to exclude at least all characters in the Cc category, which includes U+0000 through U+001F, but also U+007F () through U+009F:
    https://www.compart.com/en/unicode/category/Cc
    (As further evidence, the start/end selected area and other device control characters don't seem to have a good use as event metadata...)
    Similarly, there are a set of reserved "noncharacters" which are only intended for internal use which we might want to exclude: http://www.unicode.org/faq/private_use.html#nonchar4

I'm in general in favor of excluding trash characters.

@duglin
Copy link
Collaborator

duglin commented Aug 15, 2019

BTW I'd be happy to change SHOULD NOT to MUST NOT.

Just a personal preference, but I'd prefer SHOULD NOT. I'm not a fan of being too parental. In general I think people who care about interop will "do the right thing" or people will stop using them. But because there are times when people need to do things that are not best for interop, but are necessary for their case, I like to give people the option to do so. That's why I think SHOULD NOT in this case is better - gives a strong best-practice but also an "out".

@evankanderson
Copy link
Contributor

evankanderson commented Aug 15, 2019

HTTP headers are defined in US-ASCII, so I think the HTTP binary transport needs to define a way to take a unicode string value and map it to an ASCII string:

https://tools.ietf.org/html/rfc7230#section-3.2.6

On the other hand, it seems like this is something which is HTTP's fault, and we should make HTTP handle it rather than doing so generically.

I put together #488 for this.

Copy link
Contributor

@cneijenhuis cneijenhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, overall LGTM 👍

spec.md Outdated
meaning and predictably cause interoperability problems.
- `String` - Sequence of allowable Unicode characters. The following characters
are disallowed:
- the "control characters" in the ranges U+0000-U+001F and U+007F-009F (both
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either U+0000-U+001F and U+007F-U+009F or U+0000-001F and U+007F-009F 😉

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to look at the comment for an endless time to spot the fix but I did eventually …

spec.md Outdated
are disallowed:
- the "control characters" in the ranges U+0000-U+001F and U+007F-009F (both
ranges inclusive), since most have no agreed-on meaning, and some, such as
U+000A (newline), are not usable in contexts such as HTTP headers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation (both lines should start with 4 spaces)

@duglin
Copy link
Collaborator

duglin commented Aug 22, 2019

Approved on the 8/22 call.

However, @timbray there's a merge conflict - can you resolve that? Then I can merge it.

@timbray timbray closed this Aug 26, 2019
@duglin
Copy link
Collaborator

duglin commented Aug 27, 2019

replaced by #490

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants