syntactic fallback validation for "format" #54

Closed
handrews opened this issue Sep 17, 2016 · 5 comments
@handrews
Contributor

handrews commented Sep 17, 2016

The Problem: "format" is frequently re-implemented using "pattern" because it is unreliable

The "format" keyword is currently defined as an optional feature of JSON Schema. This frees implementations from the relatively burdensome requirements of performing the specified semantic validations, but also intentionally makes the feature unreliable. As a result, schema authors frequently re-define validation schemas for fields that could be completely described with the "format" keyword were its implementation consistent.

This places an undue burden on schema writers who wish to both take advantage of any full implementations and work around any minimal implementations.

Here is an example of a document (written in YAML for human-friendliness) that provides JSON Schemas for ipv4 and ipv6 addresses, for use in other schemas from the same product in place of the "format" keyword:

https://support.riverbed.com/apis/sh.common/1.0/service.yml

The Proposal

JSON Schema can provide a standard "pattern"-based schema for each format value in its meta-schema, which will provide a documented level of purely syntactical validation for instances. This requires only trivial additional work from implementations as shown below under "Mechanism".

Each such schema MUST accept every valid instance of its format. It MAY also accept some invalid instances, either because of the limits of regular expressions or because the JSON Schema standard decides that the full pattern is too complex or too costly in performance to support.
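
The MUST/MAY split above can be checked mechanically. The following is a minimal sketch, not part of the proposal: `LOOSE_IPV4`, `false_rejections`, and `false_acceptances` are hypothetical names, and the pattern is deliberately looser than the one proposed for "ipv4" so that both outcomes are visible.

```python
import re

# Hypothetical, deliberately loose fallback pattern (an assumption for
# illustration only; it does not range-check the octets).
LOOSE_IPV4 = re.compile(r"^[0-9]{1,3}(\.[0-9]{1,3}){3}$")


def false_rejections(pattern, valid_instances):
    """Valid instances the pattern rejects. MUST be empty for a conforming fallback."""
    return [s for s in valid_instances if pattern.search(s) is None]


def false_acceptances(pattern, invalid_instances):
    """Invalid instances the pattern accepts. These are permitted (MAY)."""
    return [s for s in invalid_instances if pattern.search(s) is not None]


valid = ["0.0.0.0", "192.168.1.1", "255.255.255.255"]
invalid = ["999.1.1.1", "1.2.3", "a.b.c.d"]

print(false_rejections(LOOSE_IPV4, valid))     # nothing valid is rejected
print(false_acceptances(LOOSE_IPV4, invalid))  # "999.1.1.1" slips through
```

A fallback that lets "999.1.1.1" through is still conforming under this rule; one that rejected "192.168.1.1" would not be.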

Mechanism

A "formats" section would be added to the "definitions" within the meta-schema:

{
    "definitions": {
        "formats": {
            "definitions": {
                "ipv4": {
                    "minLength": 7,
                    "maxLength": 15,
                    "pattern": "^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
                },
                "email": {
                    "pattern": "you-get-the-idea"
                }
            }
        }
    }
}

The purpose of the nested "definitions" section is to clearly differentiate between definitions used only for format validation and definitions used to build the actual meta-schema.
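
As a rough illustration of what the "ipv4" fallback above accepts and rejects, here is a sketch that applies its three keywords with Python's `re` module. The helper name `matches_fallback` is hypothetical. JSON Schema recommends the ECMA 262 regex dialect, so Python's `re` is only an approximation, although it behaves the same for this pattern on the inputs shown.

```python
import re

# The "ipv4" fallback schema from the proposal. The doubled backslash in the
# JSON string becomes a single backslash in a Python raw string.
IPV4_FALLBACK = {
    "minLength": 7,
    "maxLength": 15,
    "pattern": r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
               r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$",
}


def matches_fallback(instance, fallback):
    """Apply minLength, maxLength and pattern the way a validator would."""
    if not isinstance(instance, str):
        return True  # these keywords only constrain strings
    if len(instance) < fallback.get("minLength", 0):
        return False
    if "maxLength" in fallback and len(instance) > fallback["maxLength"]:
        return False
    # "pattern" is an unanchored search; the ^ and $ are in the pattern itself.
    return re.search(fallback["pattern"], instance) is not None


for candidate in ["192.168.0.1", "256.1.1.1", "1.2.3", "01.2.3.4"]:
    print(candidate, matches_fallback(candidate, IPV4_FALLBACK))
```

Only the first candidate passes: "256.1.1.1" fails the octet range, "1.2.3" fails both the length and the pattern, and "01.2.3.4" fails because the pattern does not allow leading zeros.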

If an implementation does not handle "format": "ipv4" directly, then the schema:

{
    "$schema": "http://json-schema.org/schema#",
    "type": "string",
    "readOnly": true,
    "format": "ipv4"
}

should be interpreted as:

{
    "$schema": "http://json-schema.org/schema#",
    "allOf": [
        {
            "type": "string",
            "readOnly": true
        },
        { "$ref": "http://json-schema.org/schema#/definitions/formats/definitions/ipv4" }
    ]
}

combining the fallback schema with whatever schema elements beyond "format" were already present.
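
The rewrite described above could be sketched as follows. This is one possible reading of the mechanism, not a reference implementation: `apply_format_fallback`, `natively_supported`, and `fallback_formats` are hypothetical names, and the function only handles a "format" keyword at the top level of the schema it is given.

```python
FALLBACK_BASE = "http://json-schema.org/schema#/definitions/formats/definitions/"


def apply_format_fallback(schema, natively_supported=frozenset(),
                          fallback_formats=frozenset({"ipv4", "email"})):
    """Replace an unsupported "format" with a $ref to its fallback schema.

    Assumptions (not stated in the proposal): formats the implementation
    handles natively are left alone, and formats with no fallback definition
    are left alone as well, so no dangling $ref is produced.
    """
    fmt = schema.get("format")
    if fmt is None or fmt in natively_supported or fmt not in fallback_formats:
        return schema

    # Everything except "format" (and "$schema", which stays at the top level)
    # moves into the first branch of the allOf.
    rest = {k: v for k, v in schema.items() if k not in ("format", "$schema")}

    rewritten = {}
    if "$schema" in schema:
        rewritten["$schema"] = schema["$schema"]
    rewritten["allOf"] = [rest, {"$ref": FALLBACK_BASE + fmt}]
    return rewritten


original = {
    "$schema": "http://json-schema.org/schema#",
    "type": "string",
    "readOnly": True,
    "format": "ipv4",
}
print(apply_format_fallback(original))
```

Applied to the example schema, this yields the same "allOf" structure shown above. Passing `natively_supported={"ipv4"}` returns the schema unchanged.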

Correctness Concerns

While all of the formats can be at least partially validated by regular expressions, several are either extremely complex to validate fully or cannot be validated entirely by a regex. Is this a problem? I argue that it is not: properly implemented, this provides substantial validation assistance that schema authors otherwise write themselves each time. Schema authors may examine the supplied regexes, determine whether they are sufficient for the given application, and write their own if they are not. This is no worse than what currently happens.

Performance Concerns

Due to the complexity of the regular expressions involved, the performance impact of using them is a valid concern. However, the "format" specification already states that implementations SHOULD provide an option to disable the keyword. That requirement should be left as-is. Disabling the "format" keyword should disable it entirely, including the fallback validation.
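
One way to picture the intended precedence is the sketch below, with hypothetical names throughout (`check_format`, `format_enabled`, `native_checkers`, `FALLBACK_PATTERNS`). It assumes that a native checker, when present, takes priority over the fallback, and that a format with neither is simply not checked. Neither assumption is spelled out in the proposal.

```python
import re

# The proposal's "ipv4" pattern, assembled from one octet sub-pattern.
_OCTET = r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
FALLBACK_PATTERNS = {
    "ipv4": rf"^({_OCTET}\.){{3}}{_OCTET}$",
}


def check_format(value, fmt, *, format_enabled=True, native_checkers=None):
    """Decide whether `value` passes the "format" keyword for `fmt`."""
    if not format_enabled:
        # Disabling "format" disables it entirely, fallback included.
        return True
    if not isinstance(value, str):
        return True  # the formats discussed here only apply to strings
    native = (native_checkers or {}).get(fmt)
    if native is not None:
        return native(value)  # assumed to be at least as strict as the fallback
    pattern = FALLBACK_PATTERNS.get(fmt)
    if pattern is None:
        return True  # assumption: no fallback defined means nothing to check
    return re.search(pattern, value) is not None


print(check_format("999.1.1.1", "ipv4"))                        # fallback applies
print(check_format("999.1.1.1", "ipv4", format_enabled=False))  # nothing is checked
```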

@epoberezkin
Member

The performance concern can be addressed by providing several alternative implementations. For example, a 'fast' date validation would accept '33/13/2016' as valid, relying only on the /\d{2}\/\d{2}\/\d{4}/ pattern, while a more thorough validation would also test ranges. That is probably why it was left to implementations, but there is some benefit in standardising these things for cross-platform compatibility.
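
A minimal sketch of the two tiers, assuming the date is meant as DD/MM/YYYY (the comment does not say, and '33/13/2016' is invalid under either reading). The names `fast_date` and `thorough_date` are hypothetical, and the pattern is applied with `fullmatch` so the whole string has to match.

```python
import re
from datetime import datetime

FAST_DATE = re.compile(r"\d{2}/\d{2}/\d{4}")


def fast_date(value):
    """Shape-only check: two digits, slash, two digits, slash, four digits."""
    return FAST_DATE.fullmatch(value) is not None


def thorough_date(value):
    """Shape check plus range check on day and month (DD/MM/YYYY assumed)."""
    if not fast_date(value):
        return False
    try:
        datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        return False
    return True


for candidate in ["33/13/2016", "29/02/2016", "29/02/2017"]:
    print(candidate, fast_date(candidate), thorough_date(candidate))
```

All three candidates pass the fast check. Only "29/02/2016" passes the thorough one, since 2016 is a leap year and 2017 is not.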

@handrews
Contributor Author

> there is some benefit in standardising these things for cross-platform compatibility.

Yes, it just seemed like establishing a consistent minimum level would be good, and could be done without adding much of a burden.

@handrews changed the title from 'v6 validation: syntactic fallback validation for "format"' to 'validation: syntactic fallback validation for "format"' on Nov 24, 2016
@handrews modified the milestone: draft-07 (wright-*-02) on May 16, 2017
@handrews
Contributor Author

handrews commented Sep 8, 2017

This could also be implemented in the meta-schema in such a way that the regex check always occurs, but that seems like a bad idea to impose on either resource-constrained implementations, or on implementations that offer alternative validation mechanisms that may be both more correct and more performant.

@handrews changed the title from 'validation: syntactic fallback validation for "format"' to 'syntactic fallback validation for "format"' on Sep 28, 2017
@handrews
Contributor Author

handrews commented Oct 8, 2017

I don't see this getting addressed in draft-07. If anyone wants to make that happen, please speak up.

@handrews
Contributor Author

This has more or less been addressed by how format is handled as a vocabulary (PR #764), which makes not validating format at all the consistent default behavior.

I'm hoping to push for an alternative approach to the open-ended nature of format which would eliminate this problem entirely, and I don't see the need to keep this old idea around.

4 participants