v3.2: Sequential/streaming media types (take 2) #4554

handrews · 2025-04-28T00:30:10Z

@ThomasRooney @disintegrator @gregsdennis @philsturgeon @hudlow here is a new attempt at supporting sequential and/or streaming media types.

The fundamental approach in this change was proposed by @karenetheridge as a comment on the previous attempt at solving this problem.

This adds itemSchema as a schema to apply to each entry in a sequential media type instance. It also defines how to map sequential media tyes for use with schema, and explains that schema applies to complete content only.

JSON-based and text/event-stream media types are included.

schema changes are included in this pull request
schema changes are needed for this pull request but not done yet
no schema changes are needed for this pull request

handrews · 2025-04-28T15:04:12Z

@ThomasRooney this does not address your ideas around the Encoding Object and sentinel events, mostly to avoid further complexity. I'm happy to discuss that idea further if you want to start a new thread on the discussion (#4171). We are also talking there about how streaming multipart/mixed would work, which would involve the Encoding Object, so I'll take on Encoding Object-related changes separately. I'm not entirely sold that a "sentinel event" is an encoding, but I might be convinced of it.

src/oas.md

handrews · 2025-04-29T05:28:20Z

@ThomasRooney added more details in this most recent commit from reading that part of the WHATWG spec, including a minimum: 0 on retry because of the ignore-unless-ASCII digits restriction, which I had not previously noticed.

hudlow

Great work on this! I only had time for a partial review — it will take more time to get to the rest of it. Feel free to shoot down suggestions — I'm making them to put options on the table. 🙂

src/oas.md

ThomasRooney

I really like this, and highly appreciate the second iteration.

Structured streaming of data is rapidly growing in popularity (primarily due to the AI API wave), and I think the examples you put in here are going to help a lot of vendors grok and build tooling around these formats.

src/oas.md

handrews · 2025-05-01T04:41:32Z

@hudlow I merged a couple of your suggestions and will sort out the other updates I mentioned soon.

handrews · 2025-05-01T15:40:48Z

@hudlow I think I have addressed all of the comments that I did not mark as resolved in this latest commit. I particularly want to thank you for pointing out the awkwardness of the text/event-stream section. I ended up deleting a lot of it and I think it's both easier to read and much more clear now.

mikekistler

I found one minor wording issue that should be easy to fix.

And I left a few comments that may just reveal my lack of understanding, but I thought I'd make them anyway in case they are helpful.

src/oas.md

mikekistler · 2025-05-02T13:27:45Z

src/oas.md

+###### Streaming Sequential Media Types
+
+The `itemSchema` field is provided to support streaming use cases for sequential media types.
+Unlike `schema`, which is applied to the complete content (treated as an array as described in the [sequential media types](#sequential-media-types) section), `itemSchema` MUST be applied to each item in the stream independently, which supports processing each item as it is read from the stream.


I feel like there is still something important missing here. Maybe it is just missing in my mental model. I don't think that using schema prevents processing content as it comes in -- I believe there are streaming JSON parsers that already do this. And I'm not aware (but will be happy to be informed) of a requirement that if the "complete content" does not satisfy the schema that none of it may be accepted/processed.

So from that perspective (which may be misinformed, and if so I'm happy to be corrected), the purpose of itemSchema is to signal that, not only is it "possible" to process items as they are read, but it is "expected" that each item is completely independent and should be processed regardless of whether past or future items satisfy the itemSchema.

@hudlow this is why I had that awkward caveat about a streaming JSON Schema processor in here.

Every time someone wants me to take something out, someone else wants me to put it in. idk what to do here. We can't put every caveat in the text is so long as it is.

There are streaming JSON parsers, but streaming validation isn't really a thing (I know of one partial implementation). So you don't have to load the whole content to parse it, and in theory you might be able to validate it, but in practice you need to load the whole thing to validate it.

How much do we really need to explain here in the text?

Also some JSON Schema keywords (e.g maxProperties on a root object) do require that you parse the whole document, although at least in theory you can do those without holding the whole document in memory. Although with uniqueItems on a root array, you might end up needing to hold it all in memory, or at least hold a hash of each item... but we are getting extremely deep in the weeds here.

@mikekistler it's not clear to me what the problem is with the current wording. The wording doesn't stop someone from implementing streaming JSON Schema validation, it just says that itemSchema supports processing item-by-item. This is not true of schema because in general JSON documents are not composed of "items" (unless they are arrays, but... do we really need to call that out? It would take a paragraph or two even to set up the distinction)

@mikekistler Yeah, I'm also confused. Do you actually object to anything in this paragraph or do you object to something on lines 1662 or 1663?

I'm not objecting to anything currently here. There is nothing wrong with the current wording except that I think it does not explain the purpose if itemSchema fully. To say that itemSchema "supports processing each item as it is read from the stream" is accurate but does not distinguish it from schema. I think the essential difference that that itemSchema indicates that items are expected to be processed as they are received and the validity of any item, preceding or following, does not change that.

Unlike schema, which is applied to the complete content...

Under this proposal, OpenAPI:

Defines what it means for schema to be applied to the whole of the content in a message (represented as a JSON array).

Defines what it means for itemSchema to be applied to any individual item in a sequence within a message (represented in JSON without an array wrapper representing the sequence itself).

And does not:

Define what it means for a schema to be applied to a subset of the items in a sequence within a message (represented as a JSON array or any other way).

Define what it means for some part of schema to be applied to an individual item (or any subset of the items) in a sequence within a message (represented in JSON—with or without an array wrapper representing the sequence itself—or any other way).

If an implementor wants to unpack schema and apply some kind of reasoning to it to determine that it knows how to safely apply it or part of it (say, its items keyword) to a part of a sequence (e.g., an individual item), this proposal doesn't really go out of its way to stop them from doing that but it certainly doesn't suggest it's a good idea or give any hints as to how to do it.

itemSchema is a way for OpenAPI to give implementors a clear, safe way to apply a schema to individual items in a sequence within a message.

@mikekistler if adding the word "expected" is what's holding up approval, sure, go ahead and add it where you think it helps. I don't understand the distinction you're making but I don't see any problem with putting that word in somewhere if that's what you're trying to get at. (I'm still not sure TBH).

@mikekistler coming back to this, I still do not understand your objection. You say:

To say that itemSchema "supports processing each item as it is read from the stream" is accurate but does not distinguish it from schema.

schema does not support processing as a stream. That is the entire reason this PR exists at all. If schema supported this, we wouldn't need a change here.

@mikekistler we don't want to require processing each item individually for itemSchema. I expect that will be common, and is the easiest option. But implementations might choose to process them in small batches. Or even not as a stream (although why they would do that I don't know). The point is that itemSchema gives implementations options that they do not have with schema.

handrews · 2025-05-07T20:13:49Z

The force-push is to resolve conflicts with #4562. The only conflicts involved git not knowing what order to put the added sections in, as both PRs added in the same place. No text within any commit on this PR was changed.

mikekistler

Looks good. 👍

handrews · 2025-05-08T17:10:43Z

From the TDC call today, @mikekistler and @hudlow have concerns about how much of an expectation for users the OAS sets, which @hudlow notes is a larger topic than just this one PR. They will follow-on with a discussion (#4570) and additional PR as needed. The consensus of the call was that this PR as written does not prevent applications from offering various additional guarantees, it just does not mandate any of them. #4570 will address whether an additional mandate is required here.

The fundamental approach in this change was proposed by Karen Etheridge. This adds "itemSchema" as a schema to apply to each entry in a sequential media type instance. It also defines how to map sequential media tyes for use with "schema", and explains that "schema" applies to complete content only. JSON-based and text/event-stream media types are included. Co-authored-by: Karen Etheridge <ether@cpan.org>

Co-authored-by: Karen Etheridge <ether@cpan.org>

Co-authored-by: Greg Dennis <gregsdennis@yahoo.com>

Add a few more details.

Co-authored-by: Thomas Rooney <thomas@resiliencycollective.com>

Tighten up the wording a bit to ensure the scope of this technique is immediately clear.

Co-authored-by: Dan Hudlow <dan@hudlow.org>

@hudlow

This streamlines the text/event-stream discussion and makes a few other tweaks based on feedback from @hudlow.

When we did not have `itemSchema`, we needed more guidance about improper use of `schema`, but we don't really need it now.

Co-authored-by: Mike Kistler <mikekistler@microsoft.com>

It was wrapped in an array left over from the previous attempt at writing up this feature.

The format registry currently doesn't specify that double works with strings, so pick something that works with strings as of now.

handrews · 2025-05-08T18:06:47Z

Had to fix the format example (new commit) as @hudlow noted that the format registry does not apply float to strings, and I don't have the bandwidth to debate changing it. Use int64 instead.

The subsequent force-push just rebases onto v3.2-dev with the latest syntax highlighting included, which ensures the build succeeds without warnings.

@mikekistler my apologies but you'll have to hit approve again.

handrews · 2025-05-11T00:15:55Z

OK since the change that reset @mikekistler 's approval was trivial (and easily fixed in the unlikely event that I somehow messed it up), I'm going to go ahead and merge this so I can build on it in subsequent PRs.

handrews added the media and encoding Issues regarding media type support and how to encode data (outside of query/path params) label Apr 28, 2025

handrews added this to the v3.2.0 milestone Apr 28, 2025

handrews requested review from a team as code owners April 28, 2025 00:30

handrews changed the title ~~Seq json2~~ v3.2: Sequential/streaming media types (take 2) Apr 28, 2025

This comment was marked as spam.

Sign in to view

gregsdennis reviewed Apr 28, 2025

View reviewed changes