Currently, internal INTERSECT logic does not utilize protocol headers at all, but serializes metadata and the message "payload" (the actual scientific data) into a single JSON blob. So an actual "message payload" is just JSON of one of the following three types:
UserspaceMessage
EventMessage
LifecycleMessage
One immediate problem with this approach arises when we want to send binary data over the wire, since not everything serializes into JSON (quick example: raw PNG images). The common workaround is to encode the binary data as Base64, but this is not viable for larger data, because Base64 inflates the data by roughly a third (every 3 bytes become 4 characters).
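To make the size cost concrete, here is a minimal sketch that measures the Base64 overhead on random bytes standing in for an image. The function name `base64_overhead` is only illustrative and is not part of the INTERSECT-SDK.

```python
import base64
import os


def base64_overhead(n_bytes: int) -> float:
    """Return the fractional size increase from Base64-encoding n_bytes of data."""
    raw = os.urandom(n_bytes)  # random bytes stand in for a PNG or other binary blob
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw) - 1


# Every 3 input bytes become 4 output characters, so the overhead is about one third.
print(f"overhead for 3 MB: {base64_overhead(3_000_000):.1%}")
```

The overhead applies before any JSON string escaping or transport framing, so the real cost on the wire is at least this large.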
This is where allowing users to specify the Content-Type of their events comes in handy. We can directly pass this as a protocol header in actual messages. We can also use it to auto-reject messages when their Content-Type header does not match the expected Content-Type of the request operation.
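A rough sketch of the auto-reject step, assuming headers arrive as a plain `dict` of strings. The names `check_content_type` and `ContentTypeMismatch` are hypothetical, and the comparison rules (case-insensitive, parameters such as `charset` ignored) are an assumption rather than a decided design.

```python
class ContentTypeMismatch(ValueError):
    """Raised when a message's Content-Type does not match the operation's."""


def media_type(header_value: str) -> str:
    """Strip parameters (e.g. '; charset=utf-8') and lower-case the media type."""
    return header_value.split(";", 1)[0].strip().lower()


def check_content_type(headers: dict[str, str], expected: str) -> None:
    """Reject a message whose Content-Type header differs from the expected one."""
    received = headers.get("Content-Type")
    if received is None:
        raise ContentTypeMismatch("message has no Content-Type header")
    if media_type(received) != media_type(expected):
        raise ContentTypeMismatch(
            f"expected {media_type(expected)!r}, got {media_type(received)!r}"
        )


# Passes silently: the media types match once parameters and case are ignored.
check_content_type({"Content-Type": "Application/JSON; charset=utf-8"}, "application/json")
```

Because the check only reads headers, a mismatched message can be rejected before its payload is deserialized at all.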
Note that if the user is sending back a complex object (lists, BaseModels, dataclasses...) which contains binary data, it will probably be necessary to encode the binary value as a base64 string and use the contentMediaType property in the schema to represent the true MIME type (see the official JSON Schema docs). An alternative users could implement for events would be to first emit a custom "metadata" typed INTERSECT event, then emit the associated "data" event; the "metadata" event would need to be able to reference the "data" event. (I think that the base64-encoding approach will probably still be necessary for response messages, though.) There are some Pydantic classes which can help with this.
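As an illustration of the base64-inside-JSON approach, the sketch below pairs a hand-written JSON Schema fragment (using the standard `contentEncoding` and `contentMediaType` keywords) with an encoder and decoder. It deliberately uses only the standard library instead of Pydantic; the field names `label` and `image` and both function names are made up for this example.

```python
import base64
import json

# Schema for a complex object with one binary field. "contentEncoding" says the
# string is base64; "contentMediaType" records the true MIME type of the bytes.
IMAGE_RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "image": {
            "type": "string",
            "contentEncoding": "base64",
            "contentMediaType": "image/png",
        },
    },
    "required": ["label", "image"],
}


def encode_result(label: str, png_bytes: bytes) -> str:
    """Serialize the object to JSON, base64-encoding the binary field."""
    image_b64 = base64.b64encode(png_bytes).decode("ascii")
    return json.dumps({"label": label, "image": image_b64})


def decode_result(payload: str) -> tuple[str, bytes]:
    """Parse the JSON and recover the original bytes of the binary field."""
    obj = json.loads(payload)
    return obj["label"], base64.b64decode(obj["image"], validate=True)
```

The message as a whole stays `application/json`, while the schema still tells consumers what the embedded bytes really are.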
For protocols which support protocol-level headers, it makes sense to first try to use established headers - Content-Type is a common one. If we can't find a common header, we can use an X-Intersect-SDK- prefix for the header name.
Protocols which do NOT support protocol-level headers
Not a complete list, may be inaccurate.
Note that while it's not impossible to communicate across these protocols with non-JSON data, we would need to create our own header encoding logic. Since all metadata in headers can already be expressed as printable UTF-8 strings, the only binary data should be in the message payload itself. We can prohibit non-printable characters in header keys and values, and use specific control characters as separators for the various parts of the message.
MQTT 3.x - user defined properties were introduced in MQTT v5
Redis
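The custom header encoding described above could look roughly like the following sketch. The choice of the ASCII unit, record, and group separators as delimiters is an assumption made here for illustration, as are the function names; nothing in this block is an existing INTERSECT-SDK API.

```python
US = b"\x1f"  # unit separator: between a header key and its value
RS = b"\x1e"  # record separator: between headers
GS = b"\x1d"  # group separator: between the header block and the payload


def _check_header_text(text: str, *, allow_empty: bool) -> None:
    """Headers must be printable, so they can never contain a separator."""
    if (not text and not allow_empty) or not text.isprintable():
        raise ValueError(f"invalid header text: {text!r}")


def encode_message(headers: dict[str, str], payload: bytes) -> bytes:
    """Frame printable UTF-8 headers and an arbitrary binary payload."""
    records = []
    for key, value in headers.items():
        _check_header_text(key, allow_empty=False)
        _check_header_text(value, allow_empty=True)
        records.append(key.encode("utf-8") + US + value.encode("utf-8"))
    return RS.join(records) + GS + bytes(payload)


def decode_message(raw: bytes) -> tuple[dict[str, str], bytes]:
    """Split on the FIRST group separator; the payload may contain any byte."""
    header_block, found, payload = raw.partition(GS)
    if not found:
        raise ValueError("missing header/payload separator")
    headers: dict[str, str] = {}
    if header_block:
        for record in header_block.split(RS):
            key, found, value = record.partition(US)
            if not found:
                raise ValueError("malformed header record")
            headers[key.decode("utf-8")] = value.decode("utf-8")
    return headers, payload
```

Only the first group separator is significant, so the payload needs no escaping and is carried byte-for-byte, which is the point of avoiding JSON for this framing.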
Protocols with limited support for protocol-level headers
Not a complete list, may be inaccurate
HTTP Server-Sent Events - the server would first need to send back a custom event type (i.e. event: metadata). The associated data with this can look like whatever we want. If we want to send the metadata WITH the data, we must use custom encoding logic.
WebSockets - the server will send headers back with the initial handshake, but will not send headers per message.
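For Server-Sent Events, the "metadata event first, data event second" idea can be sketched with plain string formatting. The event name `data` and both function names are hypothetical; `metadata` is the event type suggested above. Since an SSE stream is UTF-8 text, this sketch only covers text payloads.

```python
import json


def sse_event(event: str, data: str) -> str:
    """Format one SSE frame; each line of data gets its own 'data:' field."""
    lines = [f"event: {event}"]
    lines += [f"data: {line}" for line in data.split("\n")]
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


def metadata_then_data(headers: dict[str, str], body: str) -> str:
    """Emit the headers as a 'metadata' event, followed by the data event."""
    return sse_event("metadata", json.dumps(headers)) + sse_event("data", body)


print(metadata_then_data({"Content-Type": "text/csv"}, "a,b\n1,2"), end="")
```

A client would need to hold on to the most recent `metadata` event and apply it to the next data event, which is one reason this counts as limited rather than full header support.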
Proposed action items
Only use Pydantic for serializing and deserializing application/json Content-Types in our own library. Otherwise, we just verify that the output value is in bytes/bytearray format.
Rework how we use Pydantic message classes. These are still okay for validating and serializing protocol-level messages, but there needs to be custom logic for each protocol we support.
Either drop support for protocols which don't support protocol-level headers, or write our own encoder/decoder (do NOT use JSON to do this).
Add some custom validation logic for Content-Types - this is currently the only message header field where we need to allow complete flexibility. I would generally suggest that for any Content-Type other than application/json, we require the input/output fields to be either bytes or bytearray (note that str assumes a UTF-8 encoding, and valid UTF-8 objects should always be serializable as JSON already). Users will need to perform the appropriate conversions with their preferred library - I do not think it would be a good idea to include tons of different libraries in the INTERSECT-SDK for binary formats. This still allows us to generate a valid JSON schema. Do not allow users to specify non-printable characters in any Content-Type definition. (This is an interesting discussion regarding a media type regex, if we want to further restrict Content Types.)
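One possible shape for the Content-Type validation in the last item is sketched below. The regex follows the restricted-name grammar from RFC 6838 and deliberately rejects parameters; it is an assumption for illustration, not the regex from the linked discussion, and the function names are hypothetical.

```python
import re

# RFC 6838 restricted-name: starts with a letter or digit, at most 127 characters.
_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}"
MEDIA_TYPE_RE = re.compile(rf"^{_NAME}/{_NAME}$")


def validate_content_type(content_type: str) -> str:
    """Return the normalized media type, or raise ValueError if it is unusable."""
    if not content_type.isprintable():
        raise ValueError("Content-Type contains non-printable characters")
    if not MEDIA_TYPE_RE.match(content_type):
        raise ValueError(f"not a valid media type: {content_type!r}")
    return content_type.lower()


def validate_payload(content_type: str, value: object) -> None:
    """Require bytes or bytearray for every Content-Type except application/json."""
    if validate_content_type(content_type) == "application/json":
        return  # JSON values are left to the existing Pydantic validation
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError(
            f"{content_type} payloads must be bytes or bytearray, "
            f"got {type(value).__name__}"
        )
```

Keeping the check to a type test means the SDK never has to understand the binary format itself; conversion stays with the user's preferred library.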
Note that these changes should be considered breaking.
Note that most of this discussion applies only to the workflows where we send the data directly through the message instead of MINIO or another data mechanism. However, we still need to address the "Only use Pydantic for serializing and deserializing application/json Content-Types in our own library" bullet point for MINIO or other data mechanisms to fully work.
Protocols
We have a full list of protocols we want to support here. Here is a list of protocols supported by AsyncAPI officially; note that the protocols specification in AsyncAPI is extensible and not limited to their definitions.
Protocols which support protocol-level headers
Not a complete list, may be inaccurate.
AMQP - pika.BasicProperties
MQTT 5 - paho.mqtt.properties.Properties