
Define several tricky cases for encoding based on spec. #456

Closed
wants to merge 18 commits

Conversation

evankanderson
Contributor

Fixes #381

(At least, makes a start for MQTT and HTTP; I didn't cover AMQP, NATS, or Protocol buffers yet.)

A few interesting things to note:

  1. I attempted to create the minimal JSON objects required by the spec. This meant, in several cases, creating events with empty data and no datacontenttype. It looks like MQTT and HTTP/1.1 do not require a Content-Type header, though it is a SHOULD in HTTP, with a default assumption of application/octet-stream.
  2. I'm pretty sure that the "default string encoding" of extensions (particularly of the "Map" and "Integer" types) may not be unambiguously retrievable. I suspect this will simply mean that they will be less favored compared with the "String" and "Timestamp" formats.
     • Map-type extensions also allow header names which violate the HTTP spec when using binary encoding. I think these should go back to being a single header with a natural string encoding format, rather than being expanded onto map keys.
  3. "String" is defined as a "Sequence of printable Unicode characters", but Unicode does not define which characters are printable. Golang has one reasonable definition in its unicode package; we may want to use that to clarify requirements between implementers. Note that Go's definition allows for space characters; it wasn't clear from the spec whether " " is an allowed String character or not.

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
@evankanderson
Contributor Author

Looking at the AMQP bindings, it's not clear that "Map" values can be represented as requested in AMQP, because the application-properties section does not allow values of the map type.

The keys of this map are restricted to be of type string (which excludes the possibility of a null key) and the values are restricted to be of simple types only, that is, excluding map, list, and array types.

@evankanderson
Contributor Author

Hey, reading RFC7159 sections 4 and 8, I see the following productions:

object = begin-object [ member *( value-separator member ) ]
         end-object
member = string name-separator value
string = quotation-mark *char quotation-mark

I think that means the following is a valid JSON object:

{
  "": "💩"
}
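
A quick check with Python's json module (a sketch) confirms that an empty member name parses fine:

```python
import json

# RFC 7159's grammar (string = quotation-mark *char quotation-mark) allows
# zero characters in a member name, and common parsers accept that.
event = json.loads('{"": "\U0001F4A9"}')
print(event[""])  # the pile-of-poo character, U+1F4A9
```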

Let me update the extension-map tests.

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
@evankanderson
Contributor Author

Looking further at maps, RFC7159 allows for duplicate keys, but cautions:

An object whose names are all unique is interoperable in the sense
that all software implementations receiving that object will agree on
the name-value mappings. When the names within an object are not
unique, the behavior of software that receives such an object is
unpredictable.
Many implementations report the last name/value pair
only. Other implementations report an error or fail to parse the
object, and some implementations report all of the name/value pairs,
including duplicates.

Our spec does not require that "Map" keys be unique, only that they be "String"-indexed. This is unlikely to be a problem in practice, but it might be worth calling out duplicate "Map" keys as invalid in CloudEvents.
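
The divergent behaviors the RFC warns about are easy to demonstrate; for instance, Python's json module keeps only the last pair by default (a sketch):

```python
import json

doc = '{"k": 1, "k": 2}'
print(json.loads(doc))  # {'k': 2}: only the last name/value pair survives

# object_pairs_hook exposes every pair, so a strict parser could detect
# (and reject) duplicate keys instead of silently dropping them:
pairs = json.loads(doc, object_pairs_hook=lambda p: p)
print(pairs)  # [('k', 1), ('k', 2)]
```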

@duglin
Collaborator

duglin commented Jun 18, 2019

One thing that's not clear to me is what we do with these files. At a minimum, we should reference them from the README or Primer so that people notice them - otherwise they're just orphaned. But, beyond that, calling them testcases might imply some kind of testing infrastructure using them - which we don't have. Do you expect that to be developed at some point? If not, I wonder if the same point could be made via examples in the Primer?

For example, testcases/unicode-strings.http-json is really just showing that funky chars are ok as values. So would it make more sense to actually have text in the Primer (or SDK.md) to call out these tricky cases and just give an example. For example:


The set of value characters for attribute values may include non-ASCII characters. For example, "type": "\"😉\u033c.is+Fine火\u0022",. Care should be taken to ensure that the set of characters used can be accurately serialized, and deserialized, in the format and transport that will be used to transmit the events.


This has the advantage of not just showing the thing to look out for but adds some explanatory text around it with some guidance.

@duglin
Collaborator

duglin commented Jun 18, 2019

@evankanderson I tagged this PR as try-for-v1.0 instead of v1.0 because you're not proposing any normative changes to any specs. This implies that you don't see anything wrong with the spec, rather just some additional guidance might be needed - is that correct? If so, I'll tag the issue as try-for-v1.0 as well.

Host: handler.example.com
Content-Type: application/xml
Content-Length: 24
ce-specversion: 0.3
Collaborator

these need to be 0.4-wip to align with all of the other samples in our docs

Contributor Author

Done.

@duglin
Collaborator

duglin commented Jun 25, 2019

@evankanderson any response to the comments ^^^

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
@evankanderson
Contributor Author

One thing that's not clear to me is what we do with these files.

I'd like to be able to remove them, by improving the spec. 😁

Here's how I'd address the three cases:

  • extensions-map: Remove Map from allowed Context Attribute types #467

  • URL: Change source to be an absolute-URI (URI would also be acceptable), rather than a URI-reference. This would change the ABNF case from RFC 3986 to:

     hier-part     = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty
    
    URI-reference = URI / relative-ref
    
    absolute-URI  = scheme ":" hier-part [ "?" query ]
    
    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
    
    relative-part = "//" authority path-abempty
                  / path-absolute
                  / path-noscheme
                  / path-empty
    

    This would require a scheme, and the scheme ":" hier-part form would then presumably require either an authority, or a scheme which was likely to be unique.

    My concern here is two independent event creators accidentally choosing overlapping source values and violating the id uniqueness requirements. We could keep schemaurl as a URI-reference, it would be interpreted relative to source if it were a relative URI.
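
As a rough illustration, requiring an absolute-URI amounts to requiring a scheme, which urllib.parse can check (a sketch; note that a strict absolute-URI additionally forbids a fragment):

```python
from urllib.parse import urlsplit

# An absolute-URI must carry a scheme; a URI-reference may be a scheme-less
# relative-ref, which is exactly what the proposal above would rule out.
print(urlsplit("https://example.com/a").scheme)  # 'https': absolute
print(urlsplit("/a").scheme)                     # '': a relative-ref
```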

At a minimum, we should reference them from the README or Primer so that people notice them - otherwise they're just orphaned. But, beyond that, calling them testcases might imply some kind of testing infrastructure using them - which we don't have. Do you expect that to be developed at some point? If not, I wonder if the same point could be made via examples in the Primer?

My thought was that these would provide some canonical test cases which could be incorporated into various SDK implementations -- i.e. for python, you might write:

with open(json_path) as json_file, open(http_path) as http_file:
    event = marshaller.NewDefaultJSONMarshaller().FromRequest(json_file)
    as_http = marshaller.NewDefaultHTTPMarshaller().ForEvent(event, target_url="/")
    canonical_http = ReadReqFromString(http_file.read())
    self.assertEqual(as_http, canonical_http)  # Verify intermediate representation
    as_read = marshaller.NewDefaultHTTPMarshaller().FromRequest(as_http)
    self.assertEqual(event, as_read)  # Verify round-trip

For example, testcases/unicode-strings.http-json is really just showing that funky chars are ok as values. So would it make more sense to actually have text in the Primer (or SDK.md) to call out these tricky cases and just give an example. For example:

The set of value characters for attribute values may include non-ASCII characters. For example, "type": "\"😉\u033c.is+Fine火\u0022",. Care should be taken to ensure that the set of characters used can be accurately serialized, and deserialized, in the format and transport that will be used to transmit the events.

This has the advantage of not just showing the thing to look out for but adds some explanatory text around it with some guidance.

Actually, I just remembered the problem with the unicode case: it's not clear whether "is fine" is allowed as a String value (or only as Binary), because it contains space characters, while String is limited to "printable Unicode characters". (I selected \u033c (COMBINING SEAGULL BELOW) because it generates a mark; however, it's not clear whether it's actually a "printable Unicode character", because it's a combining character rather than a stand-alone one.)

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
@duglin
Collaborator

duglin commented Jul 3, 2019

One of the things I think we should add to the spec is something to explain how these values are to be used. For example, source, while defined as a URI-reference, has no real meaning to it. By this I mean, the CE spec does not indicate that people should parse the string and try to make sense of it. For our purposes strcmp, or some variant of strcmp when doing some kind of filtering/querying, is all people should do with it. As a concrete example, if there are two CEs with sources of com.ibm and com.ibm.cloud then there is no relationship between those two sources that our spec says people can infer - other than that they are different sources. Our spec does not suggest that someone can treat one as a subset of, or even related to, the other just because they share some characters. Likewise, just because it's a reverse DNS name does not mean it can be turned into a dereferenceable URL.

I thought of this because of the comment above about spaces. Spaces appearing in one of our attributes are not a problem when someone looks at them as opaque chars; they're a problem when people try to interpret those chars as having some meaning - and that's where things go funky, because we don't define those meanings. We need to make that clear.
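
In other words, a consumer following this guidance would treat source as an opaque string (a quick sketch of the point above):

```python
# Sketch: consumers compare source values byte-for-byte, nothing more.
a, b = "com.ibm", "com.ibm.cloud"
print(a == b)           # False: different sources; that is all the spec implies
print(b.startswith(a))  # True, but the spec assigns this overlap no meaning
```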

@evankanderson the DCO checker is not happy

@evankanderson if you want to make any of the changes to the spec that you mentioned in your previous comment can you open PRs for those so we can discuss them. Or you can wait for feedback first if you want - but we do want to move quickly.

My 2 cents... I'm ok with the direction of #467, but I think URI-reference for source is ok as-is because I think people will only use short (potentially conflicting) values in very scoped environments where they may not need the formality of absolute URIs. And I like the spec not to force unnecessary restrictions to enable as broad adoption as possible. I think the community/market will be self correcting if a "real world producer" actually produces a source of just foo.

Host: handler.example.com
Content-Length: 0
ce-specversion: 0.4-wip
ce-id: %22%F0%9F%98%89%03%3c.is%2BFine%E7%81%AB%22
Collaborator

after looking at the http spec, I'm not sure this is correct. While someone could URL-encode the header, unless some spec tells both sides about it, I don't think it's supposed to be done. By that I mean, I couldn't find anything in the http spec to indicate people are supposed to URL-encode things - and therefore, there's no mechanism for a receiver to know whether they're supposed to try to decode it or not.

If these are printable chars, then no encoding would be done. If they're non-printable chars then the spec bans them as part of our string definition.

Contributor Author

https://tools.ietf.org/html/rfc7230#section-3.2.4

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

Note that this conflicts with our definition of String. Maybe we should add some text on how to deal with unicode characters in the HTTP binary encoding (and maybe other binary encodings, depending on their allowed charset)?

Contributor Author

Following up on this, it turns out that someone back in April of 2018 spelled out how this is supposed to work, and it is by URL-encoding the strings:

https://github.com/cloudevents/spec/blob/master/http-transport-binding.md#3132-http-header-values

Non-printable ASCII characters and non-ASCII characters MUST first be encoded according to UTF-8, and then each octet of the corresponding UTF-8 sequence MUST be percent-encoded to be represented as HTTP header characters, in compliance with RFC7230, sections 3, 3.2, 3.2.6. The rules for encoding of the percent character ('%') apply as defined in RFC 3986 Section 2.4.
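
A minimal sketch of that encode step (assuming printable ASCII other than '%' passes through unencoded; encode_header_value is a hypothetical helper, not an SDK API):

```python
def encode_header_value(value: str) -> str:
    # Hypothetical helper: UTF-8 encode, then percent-encode every octet that
    # is not printable ASCII; '%' itself is also escaped, per RFC 3986 sec. 2.4.
    out = []
    for byte in value.encode("utf-8"):
        if 0x20 <= byte <= 0x7E and byte != 0x25:  # printable ASCII, not '%'
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)

print(encode_header_value("is Fine"))  # unchanged: all printable ASCII
print(encode_header_value("火"))       # %E7%81%AB: percent-encoded UTF-8 octets
```

Whether the space and other printable-but-delimiter characters should also be escaped is exactly the ambiguity discussed below.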

"id": "\"😉\u033c.is+Fine火\u0022", // \u22\u1F609\u033c\u2e\u69\u73\u2b\u46\u69\u6e\u65\u706b\u22
"source": "https://cloudevents.io/unicode-strings",
"type": "\"😉\u033c.is+Fine火\u0022",
"subject": "\"😉\u033c.is+Fine火\u0022",
Collaborator

remove trailing ,

// URI-reference: https://tools.ietf.org/html/rfc3986#appendix-A
// Verify that % decoding is done properly. This is a 'path-noscheme'
// that percent-decodes to a '"//" authority path-abempty'.
"schemaurl": "%2f%2f%anonymous@example.com/a&b;?x'=%2f//#//",
Collaborator

remove trailing ,

Collaborator

Do we need to say something in our spec about performing percent-decoding?
I want to say 'no', but some kind of "heads-up" might be good - not from a strcmp perspective, but rather to remind people that per RFC 3986 someone might URL-encode these values even if they're not meant to be dereferenceable. Perhaps something in the primer?

what do people think?

Contributor Author

This is a tricky case: it is a valid URL, but if you percent-decode it, the result is not a valid URL; instead it is a hier-part, which requires a scheme ":" prefix to be a valid URI production.

Clarified the comment.

Contributor Author

Hmm, it looks like I didn't look at this quite closely enough (and we may need to clarify the HTTP spec as well):

https://github.com/cloudevents/spec/blob/master/http-transport-binding.md#3132-http-header-values

Non-printable ASCII characters and non-ASCII characters MUST first be encoded according to UTF-8, and then each octet of the corresponding UTF-8 sequence MUST be percent-encoded to be represented as HTTP header characters, in compliance with RFC7230, sections 3, 3.2, 3.2.6. The rules for encoding of the percent character ('%') apply as defined in RFC 3986 Section 2.4.

However, the characters I encoded are not in the set (non-printable ASCII characters and non-ASCII characters), so it's not clear whether it is allowed for them to be percent-encoded or not.

Thinking about this from a software-implementer point of view, it would be a nightmare to scan through each percent-encoded string and determine whether or not the character was from the (non-printable ASCII or non-ASCII) character set before deciding whether or not to process the escape sequence.

I'll clarify the HTTP transport bindings to make it clear that all HTTP Header Values should be decoded through a single round of percent-decoding, and update this PR when that lands.
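
The single round of percent-decoding proposed above could look like this (a sketch using Python's urllib; decode_header_value is a hypothetical helper, not normative):

```python
from urllib.parse import unquote

def decode_header_value(value: str) -> str:
    # One round of percent-decoding, interpreting the decoded octets as UTF-8;
    # errors="strict" surfaces malformed UTF-8 instead of silently replacing it.
    return unquote(value, encoding="utf-8", errors="strict")

print(decode_header_value("%E7%81%AB"))  # 火
print(decode_header_value("is%20Fine"))  # is Fine
```

With an unconditional single round like this, the receiver never has to guess which characters the sender was permitted to escape.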

"id": "123",

// Source is a URI-reference, https://tools.ietf.org/html/rfc3986#appendix-A
"source": "", // relative-ref -> relative-part -> path-empty
Collaborator

If #478 goes thru then this will need to be changed

@duglin
Collaborator

duglin commented Aug 6, 2019

@evankanderson left a few minor comments. But, could you edit the main README.md to add a sentence that points to your new README? W/o that your stuff would be, kind of, orphaned and hard for people to find.

@duglin
Collaborator

duglin commented Aug 6, 2019

We recently created the cloudevents/conformance repo. Once that's populated we may want to consider moving these files into there. But until then, if people are ok with this PR, I think it's ok for this info to live in our main repo for now.

…review.

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
@evankanderson
Contributor Author

@evankanderson Nice PR!
All of these test cases expect success after processing against any SDK impl?

If not, I suggest defining some kind of expected result for each test case.

These are all intended to be valid CloudEvents, so it should be possible to read in the event in one format, and output it in another format.

ce-id: 123
ce-source: https://cloudevents.io/contenttype
ce-type: io.cloudevents.contenttype-test
ce-datacontentencoding: Base64
Collaborator

should this be Content-Encoding, the normal HTTP header?

If so, I think we need to update the http spec to say that

Contributor Author

evankanderson commented Aug 21, 2019

#488 for this, and to clarify percent-encoding to be a bit more coherent (the current text does not indicate whether ASCII characters are allowed to be percent-encoded).

Contributor Author

I didn't apply #488 here yet, and the rules in #488 right now would also require percent-encoding "/" and ":" in URLs.

Collaborator

@evankanderson can you update this per the latest stuff in the spec?

id: 123
source: https://cloudevents.io/contenttype
type: io.cloudevents.contenttype-test
datacontentencoding: Base64
Collaborator

@clemensv is there an MQTT equivalent property for datacontentencoding?

Contributor Author

Searching for "encoding" in https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html produces 25 hits, all of which cover UTF-8 or property encoding, rather than payload encoding.

Contributor

@duglin No, there is no equivalent in MQTT. For MQTT 3.1.x, you even need to know the datacontenttype a priori as a convention on the topic. The absence of this particular field is not an issue, though, because the body is always a byte sequence.

"type": "\"😉\u033c.is+Fine火\u0022",

// OPTIONAL field
"subject": "\"😉\u0.4-wipc.is+Fine火\u0022"
Collaborator

the subject value here doesn't match the previous example. Perhaps a global search-n-replace gone wild?

Contributor Author

Whoops, fixed.

@duglin
Collaborator

duglin commented Aug 13, 2019

@evankanderson wanna add some kind of pointer in the main README so people can find this?

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
@duglin
Collaborator

duglin commented Aug 14, 2019

CI failure is ok and will be fixed if/when merged into master

@duglin
Collaborator

duglin commented Aug 19, 2019

@evankanderson rebase needed. And there are some outstanding questions and tweaks needed based on recent merges.

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
cloudevents#488 escaping in URLs as well.

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
Contributor Author

@evankanderson left a comment

Updated -- for URL encoding, I assumed current #488 state for the set of characters that should be percent-encoded in HTTP headers.

ce-id: 123
ce-source: https://cloudevents.io/contenttype
ce-type: io.cloudevents.contenttype-test
ce-datacontentencoding: Base64
Contributor Author

I didn't apply #488 here yet, and the rules in #488 right now would also require percent-encoding "/" and ":" in URLs.

"id": "123",

// Source is a URI-reference, https://tools.ietf.org/html/rfc3986#appendix-A
"source": "/", // relative-ref -> relative-part -> path-absolute
Contributor Author

Updated this to a non-empty URL, "/".

@duglin
Collaborator

duglin commented Aug 22, 2019

@evankanderson do we need to wait for #488 before we can merge this one?

@evankanderson
Contributor Author

I'd like to see #488 go in, but I think (apart from the CE-Datacontentencoding vs Content-Encoding question) this can merge before #488.

@evankanderson
Contributor Author

I've updated these to include percent-encoding all the characters not explicitly allowed in #488, but look at the URL examples and see this comment to weigh in on the character set choice: #488 (comment)

@duglin
Collaborator

duglin commented Sep 14, 2019

@evankanderson I think if you make the last few updates (per the comments) then we can merge this next week.

Signed-off-by: Evan Anderson <evan.k.anderson@gmail.com>
@evankanderson
Contributor Author

Updated again, but this may need further corrections after #505

Also, I'm not sure the HTTP Binary encoding header value rules are completely clear; take a look at the results (the .http files) and see whether they are as expected, for those who care about such things. @n3wscott had strong opinions here at one point.

@n3wscott
Member

It would help to have the user-provided data in the examples, to understand where the data comes from.

@duglin
Collaborator

duglin commented Sep 19, 2019

@evankanderson since #508 was approved, go ahead and make the edits you mentioned that might need to be done. Also, see @n3wscott's suggestion too - it sounded like an interesting idea.
The goal is to try to approve it on next week's call since there were no objections or questions on today's call - the group is just waiting for the final version.

@evankanderson
Contributor Author

I recently noticed https://github.com/CloudEvents/conformance

Given that we've managed to remove map-valued attributes and simplify data, I think the remaining cases can be moved to that repo (which looks like it will need an update for 1.0).


Successfully merging this pull request may close these issues.

Establish a set of canonical "hard" event encodings for testing.
6 participants