April 2024, by Clemens Vasters, Microsoft Corp.
- Notational Conventions
- Interoperability issues of the Avro JSON Encoding with common JSON usage
- The "Plain JSON" encoding
The Apache Avro project defines a JSON Encoding, which is optimized for encoding data in JSON, but primarily aimed at exchanging data between implementations of the Apache Avro specification. The choices made for this encoding severely limit the interoperability with other JSON serialization frameworks. This document defines an alternate, additional mode for Avro JSON Encoders, preliminarily named "Plain JSON", that specifically addresses identified interoperability blockers.
While this document is a proposal for a set of new features in Apache Avro, the extensibility of Avro's schema model allows for the implementation of these features separately from the Avro project. Out of the available and popular schema languages for data exchange, Avro schema provides the cleanest foundation for mapping wire representations to programming language types and database tables, which is why interoperability of Avro with the most popular text encoding format for structured data, JSON, is very desirable.
With Avro's strength and focus being its binary encoding, supporting JSON is specifically desireable in interoperability scenarios where either the producer or the consumer of the encoded data is using a different JSON encoding framework, or where JSON is crafted or evaluated directly by the application.
As most JSON document instances can be structurally described by Avro Schema, the interoperability case is for JSON data, described by Avro Schema, to be accepted by an Apache Avro messaging application, and for that data then to be forwarded onwards using Avro binary encoding. Reversely, it needs to be possible for an application to transform an Avro binary encoded data structure into JSON data that is understood by parties that expect to handle JSON. The kinds applications requiring such transformation capabilities are stream processing frameworks, API gateways and (reverse) proxies, and integration brokers.
The intent of this proposal is for the Avro "JsonEncoder" implementations to have a new mode parameter, accepting an enumeration choice out of the options "Avro Json" (AVRO_JSON, AvroJson, etc), which is Avro's default JSON encoding, and "Plain JSON" (PLAIN_JSON, PlainJson, etc). The rules for the "Plain JSON" mode are described herein.
The "Plain JSON" mode is a selector for enabling set of features that are described below. Implementations MAY also choose for these features to be individually selectable for the "Avro JSON" mode, for instance letting the user use the "Avro JSON" mode primarily, but opting into the binary data handling or date-time handling features described here. However, the "Plain JSON" mode that combines these features MUST be implemented to ensure interoperability.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
There are several distinct issues in the Avro JSON Encoding that cause conflicts with common usage of JSON and many serialization frameworks. It needs to be emphasized that none of these issues are conformance issues with the JSON specification (RFC8259), but rather stem from the JSON specification's inherent limitations. JSON does not define binary data, date or time types. JSON also has no concept of a type-hint for data structures (i.e. objects), which would allow serialization frameworks to establish an unambiguous mapping between a data type in a programming language or schema and the encoded type in JSON.
There are, however, commonly used conventions to address these shortcomings of the core JSON specification:
- Binary data: Binary data is commonly encoded using the base64 encoding and stored in string-typed values.
- Date and time data: Date and time data is commonly encoded using the RFC3339 profile of ISO8601 and stored in string-typed values.
- Type hints: In its native type system, JSON value types are distinguished by notation where 'null' values, strings, numbers, booleans, arrays, and objects are identifiable through the syntax. While JSON has no further data type concepts, several serialization frameworks and even some standards leaning on JSON (e.g. OpenAPI) introduce the notion of a "discriminator" property, which is inside the encoded object and unambiguously identifies the type such that the decoding stage can instantiate and populate the correct type in cases where multiple candidate types exist.
On each of these items, the Avro JSON encoding's choices are in direct conflict with predominant practice:
- Binary data: Binary data is encoded in strings using Unicode escape sequences (example: "\u00DE\u00AD\u00BE\u00EF"), which leads to a 500% overhead compared to the encoded bytes vs. a 33% overhead when using Base64.
- Date and time data: Avro handles date and time as logical types, extending either long or int, using the UNIX epoch as the baseline. Durations are expressed using a bespoke data structure. As there are no handling rules for logical types in the JSON encoding, the encoded results are therefore epoch numbers without annotations like time zone offsets.
- Type-hints: Whenever types can be ambiguous in Avro, which is the case with
type unions, the Avro JSON encoding prescribes encoding the value wrapped
inside an object with a single property where the property's name is the type
name, e.g.
"myprop": {"string": "value"}
. 'null' values are encoded as 'null', e.g."myprop": null
. For primitive types, this is in conflict with JSON's native type model that already makes the distinction syntactically. For object types (Avro records), the wrapper is in conflict with standing practice where the discriminator is inlined.
In addition, there are three general limitations of Avro's type and schema model that result in potential interoperability blockers:
- Avro represents decimal numeric types as a logical type annotating
fixed
orbyte
, which results in an encoded byte sequence in the JSON encoding that cannot be interpreted without the Avro schema and is therefore undecipherable for regular JSON consumers. name
fields in Avro are limited to a character set that can be easily mapped to mostly any programming language and database, but JSON object keys are not.- JSON documents may have top-level arrays and maps, while Avro schemas only
allow
record
andenum
as independent types and therefore at the top-level of a schema.
As a consequence of this, the current implementations of the Avro JSON Encoding do not interoperate well with "plain JSON" as input and often do not yield useful plain JSON as output. There is a "happy path" on which the Avro JSON Encoding does line up with common usage, but it's easy to stray off from it.
The Plain JSON encoding mode of Apache Avro consists of a combination of 7 distinct features that are defined in this section. The design is grounded in the relevant IETF RFCs and provides the broadest interoperability with common usage of JSON, while yet preserving type integrity and precision in all cases where the Avro Schema is known to the decoding party.
The features are designed to be orthogonal and can be implemented separately.
- 1: Alternate names
- 2: Avro
binary
andfixed
type data encoding - 3: Avro
decimal
logical type data encoding - 4: Avro time, date, and duration logical types
- 5: Handling unions with primitive type values and enum values
- 6: Handling unions of record values and of maps
- 7: Document root records
Features 2, 3, 4, and 5 are trivial on all platforms and frameworks that handle JSON. Features 1 and 7 are hints for the JSON encoder and decoder to be able to handle JSON data that is not conforming to Avro's naming and structure constraints. Feature 6 provides a mechanism to handle unions of record types that is aligned with common JSON encodation frameworks and JSON Schema's "oneOf" type composition.
JSON objects allow for keys with arbitrary unicode strings, with the only restriction being uniqueness of keys within an object. Uniqueness is a "SHOULD" rule in RFC8259, Section 4, which is interpreted as REQUIRED for this specification since it is common practice.
The character set permitted for Avro names is constrained by the regular
expression [A-Za-z_][A-Za-z0-9_]*
, which poses an interoperability problem
with JSON, especially in scenarios where internationalization is a concern.
While English is the dominant language in most developer scenarios, metadata
might be defined by end-users and in their own language. It's also fairly common
for JSON object keys to contain word-separator characters other than '_' and
keys may quite well start with a number.
As the Avro project will presumably want to avoid introducing schema attributes that are JSON-specific and will want to use new schema constructs for additional needs as they arise, the alternate names feature introduces a map of alternate names of which the plain JSON feature reserves a key:
Wherever Avro Schema requires a name
field, an altnames
map MAY be defined
alongside the name
field, which provides a map of alternate names. Those names
may be local-language identifiers, display names, or names that contain
characters disallowed in Avro. The map key identifies the context in which the
alternate name is used.
This specification reserves the json
key in the altnames
map.
A display-name feature might reserve
display:{IANA-subtag}
as keys. This assumed convention is used in the following example just for illustration of thealtnames
feature.
Assume the following JSON input document with German-language keys that represents a row in commercial order document:
{
"Artikelschlüssel": "1234",
"Stückzahl": 42,
"Größe": "Extragroß"
}
Without the alternate names feature, the Avro schema would not be able to match
the keys in the JSON document since ü
and ß
are not allowed. With the
alternate names feature, the schema can be defined as follows:
{
"type": "record",
"namespace": "com.example",
"name": "Article",
"fields": [
{
"name": "articleKey",
"type": "string",
"altnames": {
"json": "Artikelschlüssel",
"display:de": "Artikelschlüssel",
"display:en": "Article Key"
}
},
{
"name": "quantity",
"type": "int",
"altnames": {
"json": "Stückzahl",
"display:de": "Stückzahl",
"display:en": "Quantity"
}
},
{
"name": "size",
"type": "sizeEnum",
"altnames": {
"json": "Größe",
"display:de": "Größe",
"display:en": "Size"
}
}
]
}
When the JSON decoder (de-)encodes a named item, the encoder MUST use the
value from the altnames
entry with the json
key as the name for the
corresponding JSON element, when present.
The altsymbols
map is a similar feature to altnames
, but it is used for
alternate names of enum symbols. The altsymbols
map provides alternate names
for symbols. As with altnames
, the altsymbols
map key identifies the context
in which the alternate name is used. The values of the altsymbols
map are maps
where the keys are symbols as defined in the symbols
field and the values are
the corresponding alternate names.
Any symbol key present in the altsymbols
map MUST exist in the symbols
field. Symbols in the symbols
field MAY be omitted from the altsymbols
map.
{
"type": "enum",
"name": "sizeEnum",
"symbols": ["S", "M", "L", "XL"],
"altsymbols": {
"json": {
"S": "Klein",
"M": "Mittel",
"L": "Groß",
"XL": "Extragroß"
},
"display:en": {
"S": "Small",
"M": "Medium",
"L": "Large",
"XL": "Extra Large"
}
}
}
When the JSON decoder (de-)encodes an enum symbol, the encoder MUST use the
value from the altsymbols
entry with the json
key as the string representing
the enum value, when present.
When encoding data typed with the Avro binary
or fixed
types, the byte
sequence is encoded into and from Base64 encoded string values, conforming with
IETF RFC4648, Section 4.
When encoding data typed with the Avro logical decimal
type, the numeric
value is encoded into a from a JSON number
value. JSON numbers are represented
as text and do not lose precision as IEEE754 floating points do.
When using a JSON library to implement the encoding, decimal values MUST NOT be converted through an IEEE floating point type (e.g. double or float in most programming languages) but must use the native decimal data type.
When encoding data typed with one of Avro's logical data types for dates and
times, the data is encoded into and from a JSON string
value, which is an
expression as defined in IETF RFC3339.
Specifically, the logical types are mapped to certain grammar elements defined in RFC3339 as defined in the following table:
logicalType | RFC3339 grammar element |
---|---|
date |
RFC3339 5.6. “full-date” |
time-millis |
RFC3339 5.6. “date-time” |
time-micros |
RFC3339 5.6. “partial-time” |
timestamp-millis |
RFC3339 5.6 “date-time” |
timestamp-micros |
RFC3339 5.6 “date-time” |
local-timestamp-millis |
RFC3339 5.6 “date-time”, ignoring offset (note RFC 3339 4.4) |
local-timestamp-micros |
RFC3339 5.6 “date-time” , ignoring offset (note RFC 3339 4.4) |
duration |
RFC3339 Appendix A “duration” |
Unions of primitive types and of enum values are handled through JSON values' (RFC8259, Section 3) ability to reflect variable types.
Given a type union of [string, null]
and a string value "test", a encoded
field named "example" is encoded as "example": null
or "example": "test"
.
For null-valued fields, the JSON encoder MAY omit the field entirely. During
decoding, missing fields are set to null. If a default value is defined for the
field, decoding MUST set the field value to the default value.
For a type union of [string,int]
and string values "2" and the int value 2, a
encoded field named "example" is encoded as "example": "2"
or "example":2
.
For a type union of [null, myEnum]
with myEnum being an enum type having
symbols "test1" and "test2", a encoded field named "example" is encoded as
"example": null
or "example": "test1"
or "example": "test2"
.
Instances of unions of primitive types with arrays and records or maps can also be distinguished through the JSON grammar and type model. Unions of multiple records are discussed in Feature 6 below.
For completeness, these are the updated type mappings of Avro types to JSON types for the plain JSON encoding.
Avro type | JSON type | Notes |
---|---|---|
null | null | The field MAY be omitted |
boolean | boolean | |
int,long | integer | |
float,double | number | |
bytes | string | Base64 string, see Feature 2 |
string | string | |
record | object | |
enum | string | |
array | array | |
map | object | |
fixed | string | Base64 string, see Feature 2 |
date/time | string | See Feature 4 |
UUID | string | |
decimal | number | See Feature 3 |
As discussed in the overview, JSON does not have an inherent concept of a
type-hint that allows distinguishing object data types. Indeed, it has no
concept of constraining and further specifying the object
type, at all.
The JSON Schema project has defined a schema language specifically for JSON data
and provides a type concept for object
. In JSON interoperability scenarios,
JSON Schema, or frameworks that infer their type concepts from JSON Schema, will
often play a role on the producer or consumer side due to its popularity.
JSON Schema is primarily a schema model that serves to validate JSON documents. Its "oneOf" type composition construct is equivalent to Avro's union concept in function. Out of a choice of multiple type options, exactly one option MUST match the JSON element that is being validated, otherwise the validation fails. Any implementation of a JSON Schema validator must therefore be able to test the given JSON element against all available options and then determine the matching type option. Any implementation of a schema driven decoder can use the same strategy to select which type to instantiate and populate.
JSON Schema does not define a type-hint for this purpose, but makes it the
schema designer's task to create type definitions that are structurally distinct
such that the "oneOf" test always yields one of the types when given JSON
element instances. Schema designers then occasionally resort to introducing
their own type-hints by either defining a discriminator property with a
single-value enum
or with a const
value, where the discriminator property
name is the same across the type options, but the values of the enum
or
const
are different. We will lean on this practice in the following.
Consider the following Avro schema with a type union of two record types:
{
"type": "record",
"name": "ContactList",
"fields": [
{
"name": "contacts",
"type": "array",
"items": [
{
"type": "record",
"name": "CustomerRecord",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"}
{"name": "customerId", "type": "string"}
]
},
{
"type": "record",
"name": "EmployeeRecord",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"},
{"name": "employeeId", "type": "string"}
]
}
]
}
]
}
Now consider the following JSON document:
{
"contacts": [
{"name": "Alice", "age": 42, "customerId": "1234"},
{"name": "Bob", "age": 43, "employeeId": "5678"}
]
}
We can clearly distinguish the two record types by the presence of the
respectively required customerId
or employeeId
field.
When decoding a type union, the JSON decoder MUST test the JSON element against all available type options. A JSON element matches if it can be correctly and completely decoded given the type-union candidate schema, including all applicable nested or referenced definitions. If more than one of the options matches, decoding MUST fail. The JSON decoder MUST select the type option that matches the JSON element and instantiate and populate the corresponding type.
For performance reasons, it is highly desirable to avoid having to test a JSON element against all possible type options in a union and instead have a single property that can be tested first and short-circuits the type matching process. We discuss that next.
When we assume the Avro schema to be slightly different, we might end up with an
ambiguity that is not as easy to resolve. Let the employeeId
and customerId
fields
be optional in the schema above, both typed as ["string", "null"]
.
When we now consider the following JSON document, we can't decide on the type and will fail decoding:
{
"contacts": [
{"name": "Alice", "age": 42},
{"name": "Bob", "age": 43}
]
}
To resolve this ambiguity, we can introduce a discriminator property that clearly identifies the type of the record.
Instead of introducing a schema attribute that is specific to JSON, we instead
introduce a new Avro schema attribute const
that defines a constant value for
the field it is defined on.
The value of the const
field must match the field type. The value of the field
MUST always match the const
value. During decoding, decoding MUST fail if the
field value is not equal to the const
value. This rule ensures the function of
const
as a discriminator. The const
field is only allowed on fields of
primitive types and enum types.
Consider this Avro schema:
{
"type": "record",
"name": "ContactList",
"fields": [
{
"name": "contacts",
"type": "array",
"items": [
{
"type": "record",
"name": "CustomerRecord",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"},
{"name": "customerId", "type": ["string", "null"]},
{"name": "type", "type": "string", "const": "customer"}
]
},
{
"type": "record",
"name": "EmployeeRecord",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"},
{"name": "employeeId", "type": ["string", "null"]},
{"name": "type", "type": "string", "const": "employee"}
]
}
]
}
]
}
The JSON document MUST now include the discriminator:
{
"contacts": [
{"name": "Alice", "age": 42, "type": "customer"},
{"name": "Bob", "age": 43, "type": "employee"}
]
}
The const
field MAY otherwise be used for any other purpose. The binary
decoder MAY skip encoding and decoding a field with a const
attribute and
instead always return the constant value for the field similar to how the
default
field is handled. The const
value overrides the default
value.
During encoding, the binary encoder SHOULD check that the field value matches
the const
value and MAY fail encoding if it does not.
Avro schemas are defined as a single record or enum type at the top level or as
a top-level type union. JSON documents, however, may have top-level arrays and
maps. Without changing the fundamental Avro schema model, the plain JSON
encoding mode uses an annotation on array
and map
types defined inside
record
types to allow for top-level arrays and maps in the JSON document.
The annotation is a boolean flag named root
that is set to true
on one
record field's array or map type. The root
flag is only defined for array
and map
types. If the root
flag is present and has the value true
, the
enclosing record
type MUST have exactly this one field.
Given a JSON document with a top-level array like this:
[
{"name": "Alice", "age": 42},
{"name": "Bob", "age": 43}
]
The Avro schema would be defined as follows:
{
"type": "record",
"name": "PersonDocument",
"fields": [
{
"name": "persons",
"type": {
"type": "array",
"root": true,
"items": {
"type": "record",
"name": "PersonRecord",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"}
]
}
}
}
]
}
When the JSON decoder encounters a top-level array or map, it MUST match the
array or map to the field with the root
flag set to true
. When the root
flag is present on a field, the JSON encoder MUST yield the encoding of the
field as the encoding of the entire record. The JSON encoder MUST fail if the
root
flag is set to true
and if there is more than one field in the record.
When such a record type is used as a field type inside another record, it
consequently is always represented equivalent to a map
or array
type in the
JSON document.
In type structure matching scenarios, a set root
on a map
type causes the record type to be a candidate for the type matching
of JSON object
values. The root
flag on an array
type causes the record
type to be a candidate for the type matching of JSON array
values.
The Avro binary encoding is not functionally affected by this feature, but the
structural constraint imposed by the root
flag MAY be enforced by the encoder.