Description
The Problem: "format" is frequently re-implemented using "pattern" because it is unreliable
The "format" keyword is currently defined as an optional feature of JSON Schema. This frees implementations from the relatively burdensome requirements of performing the specified semantic validations, but also intentionally makes the feature unreliable. As a result, schema authors frequently re-define validation schemas for fields that could be completely described with the "format" keyword were its implementation consistent.
This places an undue burden on schema writers who wish to both take advantage of any full implementations and work around any minimal implementations.
Here is an example of a document (written in YAML for human-friendliness) that provides JSON Schemas for ipv4 and ipv6 addresses, for use in other schemas from the same product in place of the "format" keyword:
https://support.riverbed.com/apis/sh.common/1.0/service.yml
The Proposal
JSON Schema can provide a standard "pattern"-based schema for each format value in its meta-schema, which will provide a documented level of purely syntactical validation for instances. This requires only trivial additional work from implementations as shown below under "Mechanism".
Each such schema MUST accept every valid instance of its format. It MAY also accept some invalid instances, either because of the limits of regular expressions or because the JSON Schema standard decides that the full pattern is too complex, or has too much of a performance impact, to support at all.
Mechanism
A "formats" section would be added to the "definitions" within the meta-schema:
{
    "definitions": {
        "formats": {
            "definitions": {
                "ipv4": {
                    "minLength": 7,
                    "maxLength": 15,
                    "pattern": "^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
                },
                "email": {
                    "pattern": "you-get-the-idea"
                }
            }
        }
    }
}
The purpose of the nested "definitions" section is to clearly differentiate between definitions used only for format validation and definitions used to build the actual meta-schema.
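As a quick sanity check of the "ipv4" definition above, here is a minimal Python sketch that applies only its three keywords. The helper name matches_fallback is hypothetical, and Python's re module is only an approximation of the ECMA 262 regular expressions that JSON Schema specifies, so treat this as an illustration rather than a reference validator.

```python
import re

# Fallback definition for "ipv4", copied from the proposed meta-schema fragment.
IPV4_FALLBACK = {
    "minLength": 7,
    "maxLength": 15,
    "pattern": r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
               r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$",
}


def matches_fallback(value, definition=IPV4_FALLBACK):
    """Apply minLength, maxLength and pattern only (hypothetical helper)."""
    if not isinstance(value, str):
        return True  # these keywords constrain strings and ignore other types
    if len(value) < definition.get("minLength", 0):
        return False
    if "maxLength" in definition and len(value) > definition["maxLength"]:
        return False
    pattern = definition.get("pattern")
    # "pattern" is unanchored in JSON Schema, so search() rather than fullmatch().
    return pattern is None or re.search(pattern, value) is not None


if __name__ == "__main__":
    for candidate in ("192.168.0.1", "256.1.1.1", "1.2.3"):
        print(candidate, matches_fallback(candidate))
```

The length limits are redundant with this particular pattern, but they let an implementation reject obviously wrong strings before running the regular expression.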
If an implementation does not handle "format": "ipv4" directly, then the schema:
{
    "$schema": "http://json-schema.org/schema#",
    "type": "string",
    "readOnly": true,
    "format": "ipv4"
}
should be interpreted as:
{
    "$schema": "http://json-schema.org/schema#",
    "allOf": [
        {
            "type": "string",
            "readOnly": true
        },
        { "$ref": "http://json-schema.org/schema#/definitions/formats/definitions/ipv4" }
    ]
}
combining the fallback schema with whatever schema elements beyond "format" were already present.
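The rewrite described above could be sketched in Python roughly as follows. The function name apply_format_fallback and the supported parameter are hypothetical, the sketch handles only a top-level "format" (it does not recurse into subschemas), and it assumes a fallback definition exists in the meta-schema for every format it rewrites.

```python
META_SCHEMA_URI = "http://json-schema.org/schema#"


def apply_format_fallback(schema, supported=frozenset()):
    """Rewrite a schema whose "format" is not handled natively.

    `supported` is the set of formats this (hypothetical) implementation
    validates directly; schemas using those formats are returned untouched.
    """
    fmt = schema.get("format")
    if fmt is None or fmt in supported:
        return schema

    # Everything except "format" and "$schema" moves into the first allOf branch.
    rest = {k: v for k, v in schema.items() if k not in ("format", "$schema")}

    rewritten = {}
    if "$schema" in schema:
        rewritten["$schema"] = schema["$schema"]
    rewritten["allOf"] = [
        rest,
        {"$ref": f"{META_SCHEMA_URI}/definitions/formats/definitions/{fmt}"},
    ]
    return rewritten


if __name__ == "__main__":
    import json

    original = {
        "$schema": "http://json-schema.org/schema#",
        "type": "string",
        "readOnly": True,
        "format": "ipv4",
    }
    print(json.dumps(apply_format_fallback(original), indent=4))
```

Building a new dictionary rather than mutating the input keeps the author's schema intact, which matters if the same schema object is later handed to an implementation that does support the format natively.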
Correctness Concerns
While all of the formats can be at least somewhat validated by regular expressions, several are either extremely complex to fully validate or cannot be entirely validated by a regex. Is this a problem? I argue that it is not: properly implemented, this provides substantial validation assistance that schema authors otherwise write themselves each time. Schema authors may examine the supplied regexes, determine whether they are sufficient for the given application, and re-implement them if they are not. This is no worse than what currently happens.
Performance Concerns
Due to the complexity of the regular expressions involved, the performance impact of using them is a valid concern. However, the "format" specification already states that implementations SHOULD provide an option to disable the keyword. That requirement should be left as-is. Disabling the "format" keyword should disable it entirely, including the fallback validation.
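One way to read that requirement is a single switch consulted before either the native check or the fallback pattern runs. The sketch below is only an assumption about how an implementation might arrange this: check_format, format_enabled, and native_checkers are hypothetical names, and only the "ipv4" fallback from this proposal is included.

```python
import re

# Fallback patterns keyed by format name; only "ipv4" is taken from the proposal.
FALLBACK_PATTERNS = {
    "ipv4": r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
            r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$",
}


def check_format(value, fmt, *, format_enabled=True, native_checkers=None):
    """Return True if `value` passes "format": fmt under the given options."""
    if not format_enabled:
        return True  # disabling "format" disables native and fallback checks alike
    if not isinstance(value, str):
        return True  # the formats sketched here only apply to strings
    native = (native_checkers or {}).get(fmt)
    if native is not None:
        return bool(native(value))  # a full implementation takes precedence
    pattern = FALLBACK_PATTERNS.get(fmt)
    if pattern is None:
        return True  # no fallback defined for this format: nothing to check
    return re.search(pattern, value) is not None


if __name__ == "__main__":
    print(check_format("999.1.1.1", "ipv4"))
    print(check_format("999.1.1.1", "ipv4", format_enabled=False))
```

Putting the switch first means a user who disables "format" for performance reasons never pays for the regular expression either.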