-
Notifications
You must be signed in to change notification settings - Fork 584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define several tricky cases for encoding based on spec. #456
Conversation
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
d6eeaf6
to
151dc4b
Compare
Looking at the AMQP bindings, it's not clear that "Map" values can be represented as requested in AMQP, because the application-properties section does not allow values of the map type.
|
Hey, reading RFC7159 sections 4 and 8, I see the following productions:
I think that means the following is a valid JSON object:
Let me update the extension-map tests. |
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Looking further at maps, RFC7159 allows for duplicate keys, but cautions:
Our spec does not require that "Map" keys be unique, only that they be "String"-indexed. This is unlikely to be a problem in practice, but might be worth calling out duplicate "Map" keys as invalid CloudEvents. |
One thing that's not clear to me is what we do with these files. At a minimum, we should reference them from the README or Primer so that people notice them - otherwise they're just orphaned. But, beyond that, calling them testcases might imply some kind of testing infrastructure using them - which we don't have. Do you expect that to be developed at some point? If not, I wonder if the same point could be made via examples in the Primer? For example, testcases/unicode-strings.http-json is really just showing that funky chars are ok as values. So would it make more sense to actually have text in the Primer (or SDK.md) to call out these tricky cases and just give an example. For example: The set of value characters for attribute values may include non-ASCII characters. For example, This has the advantage of not just showing the thing to look out for but adds some explanatory text around it with some guidance. |
@evankanderson I tagged this PR as |
testcases/contenttype.http
Outdated
Host: handler.example.com | ||
Content-Type: application/xml | ||
Content-Length: 24 | ||
ce-specversion: 0.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these needs to be 0.4-wip
to align with all of the other samples in our docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@evankanderson any response to the comments ^^^ |
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
…into making-a-mess
I'd like to be able to remove them, by improving the spec. 😁 Here's how I'd address the three cases:
My thought was that these would provide some canonical test cases which could be incorporated into various SDK implementations -- i.e. for python, you might write: with open(json_file), open(http_file) as canonical, http_format:
event = marshaller.NewDefaultJSONMarshaller().FromRequest(json_file)
as_http = marshaller.NewDefaultHTTPMarshaller().ForEvent(event, target_url="/")
canonical_http = ReadReqFromString(http_format)
self.assertEqual(as_http, canonical_http) # Verify intermediate representation
as_read = marshaller.NewDefaultHTTPMarshaller().FromRequest(as_http)
self.assertEqual(event, as_read) # Verify round-trip
Actually, I just remembered the problem with the unicode case -- it's not clear whether |
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
One of the things I think we should add to the spec is something to explain how these values are to be used. For example, I thought of this because of the comment above about spaces. Spaces appearing in one of our attributes is only a potential problem when someone doesn't look at it as a random char, it's a problem when they try to look at those chars and try to interpret them to have some meaning - and that's where things go funky because we don't define those meaning. We need to make that clear. @evankanderson the DCO checker is not happy @evankanderson if you want to make any of the changes to the spec that you mentioned in your previous comment can you open PRs for those so we can discuss them. Or you can wait for feedback first if you want - but we do want to move quickly. My 2 cents... I'm ok with the direction of #467, but I think URI-reference for |
Host: handler.example.com | ||
Content-Length: 0 | ||
ce-specversion: 0.4-wip | ||
ce-id: %22%F0%9F%98%89%03%3c.is%2BFine%E7%81%AB%22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after looking at the http spec, I'm not sure this is correct. While someone could URL-encode the header, unless some spec tells both sides about it, I don't think it's supposed to be. By that I mean, I couldn't find anything in the http spec to indicate people are supposed to URL encode things - and therefore, there's no mechanism for a receiver to know if they're suppose to try to decode it or not.
If these are printable chars, then no encoding would be done. If they're non-printable chars then the spec bans them as part of our string definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://tools.ietf.org/html/rfc7230#section-3.2.4
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
Note that this conflicts with our definition of String. Maybe we should add some text on how to deal with unicode characters in the HTTP binary encoding (and maybe other binary encodings, depending on their allowed charset)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up on this, it turns out that someone back in April of 2018 spelled out how this is supposed to work, and it is by URL-encoding the strings:
Non-printable ASCII characters and non-ASCII characters MUST first be encoded according to UTF-8, and then each octet of the corresponding UTF-8 sequence MUST be percent-encoded to be represented as HTTP header characters, in compliance with RFC7230, sections 3, 3.2, 3.2.6. The rules for encoding of the percent character ('%') apply as defined in RFC 3986 Section 2.4..
testcases/unicode-strings.http-json
Outdated
"id": "\"😉\u033c.is+Fine火\u0022", // \u22\u1F609\u033c\u2e\u69\u73\u2b\u46\u69\u6e\u65\u706b\u22 | ||
"source": "https://cloudevents.io/unicode-strings", | ||
"type": "\"😉\u033c.is+Fine火\u0022", | ||
"subject": "\"😉\u033c.is+Fine火\u0022", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove trailing ,
testcases/urls.json
Outdated
// URI-reference: https://tools.ietf.org/html/rfc3986#appendix-A | ||
// Verify that % decoding is done properly. This is a 'path-noscheme' | ||
// that percent-decodes to a '"//" authority path-abempty'. | ||
"schemaurl": "%2f%2f%anonymous@example.com/a&b;?x'=%2f//#//", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove trailing ,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to say something in our spec about doing a percent-decoding?
I want to say 'no', but some kind of "heads-up" might be good, not from a strcmp
perspective, but rather to remind people that per RFC3986 someone might URL-encoding them even if they're not meant to be deference-able. Perhaps something in the primer?
what do people think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tricky case, because it is a valid URL, but if you percent-decode it, it is not a valid URL, but instead a hier-part
, which requires a scheme ":"
prefix to be a valid URI
production.
Clarified the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it looks like I didn't look at this quite closely enough (but may perhaps need to clarify the HTTP spec as well):
Non-printable ASCII characters and non-ASCII characters MUST first be encoded according to UTF-8, and then each octet of the corresponding UTF-8 sequence MUST be percent-encoded to be represented as HTTP header characters, in compliance with RFC7230, sections 3, 3.2, 3.2.6. The rules for encoding of the percent character ('%') apply as defined in RFC 3986 Section 2.4.
However, the characters I encoded are not in the set (non-printable ASCII characters and non-ASCII characters), so it's not clear whether it is allowed for them to be percent-encoded or not.
Thinking about this from a software-implementer point of view, it would be a nightmare to scan through each percent-encoded string and determine whether or not the character was from the (non-printable ASCII or non-ASCII) character set before deciding whether or not to process the escape sequence.
I'll clarify the HTTP transport bindings to make it clear that all HTTP Header Values should be decoded through a single round of percent-decoding, and update this PR when that lands.
testcases/urls.json
Outdated
"id": "123", | ||
|
||
// Source is a URI-reference, https://tools.ietf.org/html/rfc3986#appendix-A | ||
"source": "", // relative-ref -> relative-part -> path-empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If #478 goes thru then this will need to be changed
@evankanderson left a few minor comments. But, could you edit the main README.md to add a sentence that points to your new README? W/o that your stuff would be, kind of, orphaned and hard for people to find. |
We recently created the |
…review. Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
These are all intended to be valid CloudEvents, so it should be possible to read in the event in one format, and output it in another format. |
testcases/contenttype.http
Outdated
ce-id: 123 | ||
ce-source: https://cloudevents.io/contenttype | ||
ce-type: io.cloudevents.contenttype-test | ||
ce-datacontentencoding: Base64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be Content-Encoding
? the normal http header
I think we need to update the http spec to say this if so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#488 for this, and to clarify percent-encoding to be a bit more coherent (the current text does not indicate whether ASCII characters are allowed to be percent-encoded).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@evankanderson can you update hits per the latest stuff in the spec?
testcases/contenttype.mqtt
Outdated
id: 123 | ||
source: https://cloudevents.io/contenttype | ||
type: io.cloudevents.contenttype-test | ||
datacontentencoding: Base64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clemensv is there an MQTT equivalent property for datacontentencoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Searching for "encoding" in https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html produces 25 hits, all of which cover UTF-8 or property encoding, rather than payload encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@duglin No, there is no equivalent in MQTT. For MQTT 3.1.x, you even need to know the datacontenttype apriori as a convention on the topic. This particular field missing is not an issue, though, because the body is always a byte sequence.
testcases/unicode-strings.json
Outdated
"type": "\"😉\u033c.is+Fine火\u0022", | ||
|
||
// OPTIONAL field | ||
"subject": "\"😉\u0.4-wipc.is+Fine火\u0022" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the subject value here doesn't match the previous example. Perhaps a global search-n-replace gone wild?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, fixed.
@evankanderson wanna add some kind of pointer in the main README so people can find this? |
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
CI failure is ok and will be fixed if/when merged into master |
@evankanderson rebase needed. And there are some outstanding questions and tweaks needed based on recent merges. |
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
cloudevents#488 escaping in URLs as well. Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated -- for URL encoding, I assumed current #488 state for the set of characters that should be percent-encoded in HTTP headers.
testcases/contenttype.http
Outdated
ce-id: 123 | ||
ce-source: https://cloudevents.io/contenttype | ||
ce-type: io.cloudevents.contenttype-test | ||
ce-datacontentencoding: Base64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"id": "123", | ||
|
||
// Source is a URI-reference, https://tools.ietf.org/html/rfc3986#appendix-A | ||
"source": "/", // relative-ref -> relative-part -> path-absolute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this to a non-empty URL, "/".
@evankanderson do we need to wait fir #488 before we can merge this one? |
I've updated these to include percent-encoding all the characters not explicitly allowed in #488, but look at the URL examples and see this comment to weigh in on the character set choice: #488 (comment) |
@evankanderson I think if you make the last few updates (per the comments) then we can merge this next week. |
Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Updated again, but this may need further corrections after #505 Also, it's not clear that the HTTP Binary encoding header value rules are completely clear; take a look at the results ( |
it would help to have the user provided data in the examples to understand where the data comes from. |
@evankanderson since #508 was approved, go ahead and make the edits you mentioned that might need to be done. Also, see @n3wscott's suggestion too - it sounded like an interesting idea. |
I recently noticed https://github.com/CloudEvents/conformance Given that we've managed to clear out map value attributes and simplify data, I think the remaining cases can be moved to that repo (which looks like it will need an update for 1.0). |
Fixes #381
(At least, makes a start for MQTT and HTTP; I didn't cover AMQP, NATS, or Protocol buffers yet.)
A few interesting things to note:
data
and nodatacontenttype
. It looks like MQTT and HTTP/1.1 do not require a Content-Type header, though it is SHOULD in HTTP, with a default assumption ofapplication/octet-stream
.