
format should not change behavior based on its vocabulary value #1020

Closed
gregsdennis opened this issue Nov 18, 2020 · 29 comments · Fixed by #1027

Comments

@gregsdennis
Member

The true or false value of the vocabulary declaration governs the implementation requirements necessary to process a schema that uses "format", and the behaviors on which schema authors can rely.

This is just wrong. It precludes the ability to assert format when it is supported without requiring assertion when it isn't. The spec still allows for configuration of this, but the practical side of that configuration becomes quite difficult and confusing:

  • For <vocab>: false, my configuration needs to say "format should be an assertion."
  • For <vocab>: true, my configuration needs to say "format should be an annotation."

This is hard to do and confusing for clients (users of the implementation). The configuration should always work one direction. If my implementation offers a "format behavior" configuration with values of "assert" and "annotate," setting either value only works for one of the vocab cases. The only way to get the desired behavior is to have a configuration that says "use the non-default behavior," which changes its meaning depending on the schema it's processing.

What we SHOULD have is format as an annotation ALWAYS, but configurable to be assertion.

I agree with @karenetheridge that it'd be nice to have a way for the schema itself to indicate how format should be processed, and I think that changing it based on the vocab value was an attempt at that. But the vocab value and the behavior of format are orthogonal concerns. The spec is conflating them unnecessarily.

@handrews
Contributor

I'm not going to get involved in the resolution of this, but just to make sure we're all on the same page (because everything about or related to format is a confusing dumpster fire):

The egregiously complex interplay of the boolean $vocabulary value for format and the configuration of the implementation processing the schema is 100% the result of trying to preserve the prior behavior (as typically implemented rather than as specified) of format. Specifically, that most implementations up through draft-07 do a best-effort validation on format.

The only combination directly related to that is:

  • implementation only supports best effort, not full syntax validation (whatever full syntax validation even means, as people argue over that)
  • the value for the format vocabulary in $vocabulary is false
  • the implementation is configured to support format validation, which is not the default configuration

This allows for best effort validation of format. It is a weird case where, due to the implementation configuration flag, a false value actually sort-of turns on some functionality. This never comes up for any other vocabulary as nothing else has ever specified an implementation flag for functionality control.
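For reference, here is (roughly) the relevant slice of the default 2019-09 meta-schema's $vocabulary declaration; the false entry for format is what leaves room for that best-effort combination:

```json
{
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/core": true,
    "https://json-schema.org/draft/2019-09/vocab/applicator": true,
    "https://json-schema.org/draft/2019-09/vocab/validation": true,
    "https://json-schema.org/draft/2019-09/vocab/meta-data": true,
    "https://json-schema.org/draft/2019-09/vocab/format": false,
    "https://json-schema.org/draft/2019-09/vocab/content": true
  }
}
```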

What we SHOULD have is format as an annotation ALWAYS, but configurable to be assertion.

@gregsdennis I believe this is the case? The table in the section of the release notes to which I linked above should make that clear, please let me know if there is confusion on this point.

Since we changed the behavior for unrecognized keywords to be "collect as annotation" instead of "ignore", format always behaves as an annotation even if it is not recognized (either not supported, or someone used the keyword without declaring the vocabulary). Declaring the vocabulary with true works like any other vocabulary- in practice it's weird because very few implementations fully support format syntax validation, which is why it's false in the default meta-schema.

Really, if we get rid of the half-ass "best effort" case, then format stops being so damn weird. It would probably also stop being supported, but I would not consider that a bad thing. Either way, it's up to y'all now :-) As everyone's probably tired of hearing by now, I'd have ripped it out entirely and told people to make vocabularies if I thought I could have gotten away with it.

@gregsdennis
Member Author

Declaring the vocabulary with true works like any other vocabulary...

But for format (as it's currently written), it doesn't behave like other vocabularies. format explicitly states that it changes from annotation to validation when the vocabulary is required (value is true).

From https://github.com/json-schema-org/json-schema-spec/blob/master/jsonschema-validation.xml#L569

The true or false value of the vocabulary declaration governs the implementation requirements necessary to process a schema that uses "format", and the behaviors on which schema authors can rely.

From https://github.com/json-schema-org/json-schema-spec/blob/master/jsonschema-validation.xml#L599

The assertion evaluation behavior when the option is not explicitly specified depends on the vocabulary declaration's boolean value.

This is what I have a problem with. This implies that format automatically acts as an assertion when the vocab value is true. This is bad. It should always be an annotation unless configured to act as an assertion.

@gregsdennis
Member Author

gregsdennis commented Nov 19, 2020

Regarding the (2019-09) table, the row that concerns me is:

best-effort/full-syntax | default (off) | true | vocabulary error

If the configuration is off but there is some kind of support, annotations should be generated; the implementation shouldn't error.

@gregsdennis
Member Author

I'll try to rework things to make the "annotation unless configured" behavior a bit more clear.

@handrews
Contributor

But for format (as it's currently written), it doesn't behave like other vocabularies. format explicitly states that it changes from annotation to validation when the vocabulary is required (value is true).

No, it changes from best-effort validation (if available and configured) to complete validation.

If the configuration is off but there is some kind of support, annotations should be generated; the implementation shouldn't error.

The goal was to make true function normally and consistently, so what I wrote is correct.

The config option, which is only present for historical reasons, is only relevant for false. You have to break one or the other: either the config option is ignored for true or format is the one and only vocabulary that doesn't generate an error on true if it is not supported.

In my view ALL of the bizarre inconsistent behavior should be in the backwards compatibility case. Everything else should function normally, so that means that with true the config option is irrelevant. It's just ignored.

Because the semantics of true are "this MUST function as validation or else the schema will not function correctly." So forget the configuration on the implementation side. It's not relevant. You can't turn vocabularies off like that, ever, and it would be a really bad idea to set a precedent allowing that.

Implementation-side configuration options are a bad idea in general because they change the behavior of the schema to something other than how it was written. Don't make that problem worse, please. (although if you do I'm not going to do anything about it).

@handrews
Contributor

Honestly, the simplest option is to get rid of the legacy bullshit- no config option, and you either implement it fully or you don't. But people will complain about that so you have to decide whether you're ok with it. I'm not sure more people would complain about that than complain about format already TBH.

@gregsdennis
Member Author

gregsdennis commented Nov 19, 2020

The config option... is only relevant for false.

My argument is that this is wrong. It's relevant for both true and false.

The problem space has three variables: configuration of the library (assert or annotate), whether the library knows about the vocabulary (yes or no), and the value of the vocabulary entry in $vocabulary (true or false). This gives 8 scenarios.

Here's a new table:

| configuration | vocab known | vocab value | result |
| --- | --- | --- | --- |
| annotate | yes | false | annotation |
| annotate | no | false | annotation |
| annotate | yes | true | annotation |
| annotate | no | true | error |
| assert | yes | false | assertion |
| assert | no | false | assertion |
| assert | yes | true | assertion |
| assert | no | true | error |

Implementations should be configured to annotate by default.

I don't see how this is affected by a best-effort or full validation. The implementation either supports the keyword or it doesn't. The degree to which it implements the keyword is irrelevant.

In my view ALL of the bizarre inconsistent behavior should be in the backwards compatibility case. Everything else should function normally, so that means that with true the config option is irrelevant. It's just ignored.

How can either the true or false case have been an issue of backward compatibility when $vocabulary was being written? There was no precedent for either value because the keyword didn't exist.

Because the semantics of true are this MUST function as validation or else the schema will not function correctly.

Where is this semantic defined? $vocabulary doesn't define this anywhere:

If the value is true, then implementations that do not recognize the vocabulary MUST refuse to process any schemas that declare this meta-schema with "$schema". If the value is false, implementations that do not recognize the vocabulary SHOULD proceed with processing such schemas.

This is the only text that defines what the behavior should be for a vocabulary with value of true (aside from the previously quoted text for format, which is currently under scrutiny).


format is merely an ~~assertion~~ annotation keyword that has a concession that implementations MAY be configured to treat it as an assertion. This configuration is orthogonal to the value of the keyword's vocab in $vocabulary and to whether the implementation knows about the vocab at all.

gregsdennis added a commit that referenced this issue Nov 19, 2020
@Relequestual
Member

format is merely an assertion keyword that
@gregsdennis

I assume you meant "format is merely an ANNOTATION keyword that..."?

@Relequestual
Member

After reading the above three times, and the current preview of the spec's format vocabulary at least four times, I'm leaning more towards keeping it as it currently stands. Hear me out.

99.99% of schema authors will never write their own vocabulary or use a non-standard dialect (bar the OAS dialect).

As such, we should make the upgrade path for previous schemas as easy as possible, and keep the current docs and articles on JSON Schema as little out of date as possible.

I'm not advocating we should just do things as we've always done because "reasons". That would be bad. However, we should provide minimal friction for users, which is what the current state of 2020-11 preview and 2019-09 affords.

As an aside, 2019-09 is over a year old, and NO ONE has asked about this. Not one person has come and asked us, at least.

As such, we SHOULD keep the status quo for now, and tackle format all in one go. In the next draft.

It SOUNDS confusing from an implementation perspective, but from a schema author perspective, it actually makes a lot of sense.


  • format, now (2019-09 and 2020-11-rc-0), is by default an annotation only. If I want SOME best-effort validation, I can enable it using some kind of implementation option.
  • I can post-process the annotations to do validation specific to me if I want.
  • I haven't had to create a new dialect or meta-schema.

No more surprises.

If I want others using my schema to always use format as an assertion, and perform "FULL" validation, I create a new dialect which sets the format vocab to true. An implementation receiving that schema (with a new dialect) now knows it MUST perform "FULL" validation if it can, or throw an error. The config option is moot because the schema author has defined "this is really really required, so just do it or else".
No more surprises.
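A rough sketch of such a dialect meta-schema (the $id is a made-up placeholder; the vocab URIs are the 2019-09 ones, with only the format entry flipped to true):

```json
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://example.com/strict-format/schema",
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/core": true,
    "https://json-schema.org/draft/2019-09/vocab/applicator": true,
    "https://json-schema.org/draft/2019-09/vocab/validation": true,
    "https://json-schema.org/draft/2019-09/vocab/meta-data": true,
    "https://json-schema.org/draft/2019-09/vocab/format": true,
    "https://json-schema.org/draft/2019-09/vocab/content": true
  },
  "allOf": [
    { "$ref": "https://json-schema.org/draft/2019-09/schema" }
  ]
}
```

Schemas that declare "$schema": "https://example.com/strict-format/schema" are then saying "full format validation, or refuse to process".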

The goal is no more surprises (removing ambiguity) and ease of use, and I think that was met in the approach laid out, as per 2019-09, and as per 2020-11 preview.


Remember, the vocabulary object boolean doesn't enable or disable a vocabulary, but asserts whether the implementation MUST understand the vocabulary in order to correctly process the schema. If it's set to false, it MAY still process using the vocabulary if it understands it.

The config option is really a sticky plaster over a problem we need to solve, so, let's assume we're going to remove it at some point.

With the config option removed, a vocabulary object boolean of true for format is clear: do the really good validation, or die. However, if it's false, you MIGHT get just annotations (because that's the default behaviour of all keywords), or you MIGHT get full validation... This is now "ta-da surprise" land, and we've established this is bad for everyone.

You could argue that, in such an event as we properly define format, it should by default be enabled, and YES I think we all agree... buttttt it's super hard to agree on exactly how that sort of validation works. I mean look at regex for email... (Some would argue it's easy, others not).

So, given we can't fully qualify the exact validation for all the format values right now, and we want to avoid surprises, can you agree that the current situation is best-effort given constraints and logically derived?

@gregsdennis
Member Author

gregsdennis commented Nov 19, 2020

99.99 % of schema authors will never write their own vocabulary or use a non-standard dialect (Bar the OAS dialect).

99.99% of schema authors aren't even using 2019-09 (all the questions I see have draft 7 schemas, despite the suggestions to use 2019-09, and only a handful of validators even support it; even VS Code doesn't support it), and the ones that do aren't using format, so they have no set expectations around the vocab being true influencing the behavior of format. For them, vocabs don't exist, and this is a new concept. There are no questions about how format works in 2019-09 because no one is using 2019-09.

We need to fix this now when adoption is at its lowest; it'll be much harder to fix later.

With the config option removed, a vocabulary object boolean of true for format is clear: do the really good validation, or die.

Why do you assume that the default behavior for format should be assertion without the configuration option? My proposal is to leave the vocab true/false meaning intact and maintain format as an annotation, just like any other annotation keyword (e.g. title). The option leaves an opening for implementations to be configured (separately from the schema) to treat it as an assertion completely separately from the vocab value.

If you want the schema to specify whether it should be an assertion...

If I want others using my schema to always use format as an assertion, and perform "FULL" validation, I create a new dialect which sets the format vocab to true.

This is the wrong way to do this. (I described it in the Slack thread.) If you want the schema to be able to specify that format is an assertion, that's fine, but conflating that with the meaning of specifying a vocab as true is the wrong way to do it. (e.g. Another option is the introduction of a new "configuration" type of keyword that can set this behavior.) Furthermore it sets a precedent that this is an okay thing to do, suggesting that other vocab authors may do the same. That is bad.

The current definition conflates two separate issues: vocab value meaning and configurability. These ideas need to stay separate.

The goal is no more surprises (removing ambiguity) and ease of use, and I think that was met in the approach laid out, as per 2019-09, and as per 2020-11 preview.

"No more surprises" would mean that format isn't special. This whole dependent behavior is a hard gotcha. It's the only keyword that behaves this way, and we will be better off (now and in the long run) aligning it with the behavior of the other keywords.

"Ease of use" would mean that this keyword behaves like any other keyword with respect to the vocab value.

you MIGHT get just annotations (because that's the default behaviour of all keywords), or you MIGHT get full validation... This is now "ta-da surprise" land, and we've established this is bad for everyone.

This is what documentation is for. Literally no one is going to have 100% validation on all of these keywords. No one can. There will always be edge cases, no matter how stringently a format validation is coded. You are therefore relegated to mandating that everyone document their level of support.

Secondarily, if you must configure the implementation separately, you'll never have that "ta-da surprise" moment because you have to configure it properly. You'll never get assertions when you want annotation, and vice versa. As it's written, that could happen. As it's written, I can't require format annotations or opt into format assertion. As it's written, I have two options: MAYBE get annotations (for which I can look up format support in the docs) or DEFINITELY get format assertions.

This is the gotcha that we have to remove. My PR does exactly that. With my PR, you can opt in to whether you get assertion or annotation behavior through configuration, and the schema specifies whether format support (at all) is a hard or soft requirement.

@handrews
Contributor

@Relequestual you have captured exactly what I was going for and I agree with every word you wrote about how to move forward.

@gregsdennis

I can't require format annotations

There is no need for this. Unknown keywords are now collected as annotations, so there's no way not to get format as an annotation. If it's recognized, you always get it as an annotation (regardless of assertion behavior- the annotation doesn't go away). If it's unrecognized, you get it as an annotation because that's how unrecognized keywords work now.

So I don't see a use case here that needs addressing. For the rest of it, something will be confusing/surprising and I still think we got that choice right in 2019-09 as explained by @Relequestual. The "what if I only want annotations" case would have been worth addressing, but it is no longer an issue because of the changed behavior of unknown keywords.

@jdesrosiers
Member

jdesrosiers commented Nov 19, 2020

I'm in complete agreement with @gregsdennis. The format vocabulary should work like any other vocabulary. Right now there are two different semantics for format. Allowing $vocabulary to switch between those semantics would make format a special case, which is bad.

@handrews

Unknown keywords are now collected as annotations, so there's no way not to get format as an annotation.

If format is collected as an unknown keyword, it loses its semantics. If I write a library that wants to do something with format annotations, I don't know the difference between format that was intended to have the semantics of the format vocabulary and format that was used as an unknown keyword to mean something completely different. There's no way to have annotations-only and keep the format semantics.

I think the only way to get all the behaviors users expect is to define two format vocabularies: annotation-only and assertion. There would have to be two dialect schemas as those two format vocabularies use the same keyword and therefore can't be in the same dialect. The annotation-only one can be required (it's easy enough, so why not) and the assertion one can be optional providing the behavior most users probably expect.

@gregsdennis
Member Author

gregsdennis commented Nov 19, 2020

I can't require format annotations (me)

There is no need for this. (@handrews)

There's absolutely a need for this.

Suppose I write an application where I require format validation, but I want to provide my own validation. I would need a validator that MUST understand the format vocab (<vocab>: true) so that the keyword is validated (per the meta-schema) but also will return annotations. Currently there's no way to do this.

Conversely, if I don't require validation (<vocab>: false) but would like it if the library supports it (configured on), I can't get that either.

These two cases are valid scenarios that the current state CANNOT support. It is patently incorrect to say that an application will only ever want unvalidated annotations or full semantic validation.

The table I posted above shows how my proposal supports ALL of the valid scenarios. Yes, there may be some overlap given unknown keywords as annotations, but that's worlds better than intentionally not supporting some of them.

If format is collected as an unknown keyword, it loses its semantics. @jdesrosiers

@handrews you argued this in Slack regarding the content* keywords. We validate that title is a string. Why should format be any different?


Still, no one has argued why the configurability of format (annotate or assert) should be conflated with the vocab setting. Again, these are two separate concepts, and they should remain as such. Trying to wrestle the vocab value into somehow configuring the semantics of the keyword is at the root of the problem here. If we keep these ideas separate, all of the other arguments go away, and you're left with my proposal.

@handrews
Contributor

@jdesrosiers

If format is collected as an unknown keyword, it loses its semantics. If I write a library that wants to do something with format annotations, I don't know the difference between format that was intended to have the semantics of the format vocabulary and format that was used as an unknown keyword to mean something completely different. There's no way to have annotations-only and keep the format semantics.

That is not the way this works. By the time an application gets the annotation output it does not have any sort of vocabulary mapping. It just has keyword names and values, and schema and instance locations. There is no detectable difference in these scenarios whatsoever. The whole point is that the application looks at annotations named format and imposes its own semantics on them. Those semantics ought to have something to do with format validation but that's outside of JSON Schema's specification or concern. An application could use them as random number generator seeds for all we care.

This is the point of the "compatible semantics" clause. If you mix format semantics then you get a mess, and the compatible semantics clause ensures that JSON Schema is not responsible for that mess. Applications are responsible for interpreting annotations correctly, period.
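To illustrate (assuming the 2019-09 output format field names), a collected format annotation reaching the application looks something like the unit below. Nothing in it records which vocabulary, if any, defined the keyword; the application only sees the keyword name, the locations, and the value:

```json
{
  "valid": true,
  "keywordLocation": "/properties/email/format",
  "instanceLocation": "/email",
  "annotation": "email"
}
```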


@gregsdennis

Suppose I write an application where I require format validation, but I want to provide my own validation. I would need a validator that MUST understand the format vocab (<vocab>: true) so that the keyword is validated (per the meta-schema) but also will return annotations. Currently there's no way to do this.

Yes there is- if you only care about syntax validation of format you can put that directly in the meta-schema without declaring the format vocab at all. The default semantics for unknown keywords (including those described in the meta-schema but not declared in a vocabulary) is "collect these as annotations and the application will figure out what to do with them." As far as the application is concerned, this looks identical to collecting annotations as a known format vocabulary, just without any validation. You can even still allOf the format vocabulary's meta-schema since including such a meta-schema does not force you to declare the relevant vocabulary. That is one of the reasons that $vocabulary outside of the entry-point object is ignored.

Basically, take the existing default meta-schema, remove https://.../format from $vocabulary but leave it in the allOf and you have exactly what you want.
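A minimal sketch of that, with a made-up $id: the format vocabulary is absent from $vocabulary, but meta/format stays in the allOf so the keyword's syntax is still checked:

```json
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://example.com/annotation-only-format/schema",
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/core": true,
    "https://json-schema.org/draft/2019-09/vocab/applicator": true,
    "https://json-schema.org/draft/2019-09/vocab/validation": true,
    "https://json-schema.org/draft/2019-09/vocab/meta-data": true,
    "https://json-schema.org/draft/2019-09/vocab/content": true
  },
  "allOf": [
    { "$ref": "https://json-schema.org/draft/2019-09/meta/core" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/applicator" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/validation" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/meta-data" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/format" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/content" }
  ]
}
```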

The reason meta-schemas are separate from vocabularies is to handle odd cases like this where semantic and syntactic constraints might not completely align. Whether that's because there are keywords without formal semantics (so they have syntax, but the semantics are entirely application-determined) or because the syntax needs to be more strict (or more loose), or whatever.

@handrews you argued this in Slack regarding the content* keywords. We validate that title is a string. Why should format be any different?

I don't know what this means- perhaps I did not understand what was going on with that content* thread- this is what happens when I'm paged into a 100+-message thread and no one explains why I'm supposed to be responding.

Anyway... the value of title, in the schema, MUST be a string. The instance value ought to be a string, but that is not validated. It's not even a SHOULD in the spec. If your application is Perl you could put literally anything there and Perl would find some way to make it a string when you read the variable (it may or may not be a useful string, and I would not recommend doing this, but goodness knows it would not even make my top 100 ill-advised things I've seen done with Perl).

Conversely, if I don't require validation (<vocab>: false) but would like it if the library supports it (configured on), I can't get that either.

Do you mean full validation or best-effort validation? Here is what true and false mean for ALL VOCABULARIES:

  • false and not recognized/supported: just give me the keywords as annotations (because you won't recognize the keywords, so it would be impossible to separate them from any other unknown keywords anyway)
  • false and recognized/supported: if you know what else to do with these keywords, do that as well
  • true: if you don't know what to do with this, error out

The rationale with the format vocabulary and the configuration option is that false has unpredictable behavior already when the vocabulary in question has assertion behavior. You might get the assertions, you might not, presumably you wrote your schema with care to ensure that that's OK, e.g. you did not condition a oneOf on the assertion behavior in question, and you plan to double-check the validation somehow. I doubt many people will set assertion vocabularies to false, because it's a bizarre thing to do. In most cases, you want deterministic validation.

The false option was really intended for complex annotation vocabularies like hyper-schema, where collecting the annotation is a lot more complicated than just copying the keyword value. Collecting the links annotation means filling out link templates, for example. So it's complex, but also you could process validation with one implementation and then use the default annotation collection output to then feed link instantiation through a separate implementation. That was the main use case I had in mind for false.

true, on the other hand, has absolutely deterministic behavior. If the vocabulary is not known or supported, then the implementation MUST fail. That is the single most important guarantee offered by $vocabulary and nothing should be allowed to change it, at all, ever.

Now, let's consider format.

There are two things to handle with format:

Problem 1: It can't be validated by default for historical reasons. You have to enable the assertion behavior on the implementation. In my view, this is a horrible idea, but we didn't want to change it. So we kept that configuration option. It interacts as follows:

  • If set to the default (off/annotation), then the format vocabulary behaves like a totally normal annotation vocabulary, just like the meta-data vocabulary or the content vocabulary (both of which use the keyword value as the annotation value for all of their keywords). This is extremely important! The format vocabulary behaves like a 100% normal, nothing strange here at all, annotation vocabulary if you use it out-of-the-box without setting any weird config options.

  • If set to on/assertion, and the implementation fully supports format, then the format vocabulary behaves like a totally normal assertion vocabulary, just like the validation vocabulary. This is also extremely important! When the config option is on and format is implemented like the spec actually specifies, then the format vocabulary behaves like a 100% normal, nothing strange here at all, assertion vocabulary if you use it out-of-the-box without setting any weird config options.

Which brings us to problem 2: Many if not most implementations half-ass format, and always have. This has been a source of constant confusion and complaints for as long as I've known of JSON Schema (much less been involved with the spec). It was a big problem for us on the first project where I ever used JSON Schema, because of the unreliability. However, it was pretty clear that if we suddenly tried to force everyone to fully support format (or not support it at all), then there would be howls of anguish from all over.

So I thought "there is one specific case here that is already non-deterministic, and that is when the vocabulary is known and set to false. Let's make that slightly more non-deterministic, in a way that people already experience, because that's what they expect anyway."

In hopefully all other vocabularies ever designed, they are either supported or not. There is no half-assing. But if a vocabulary were to allow half-assing (intentionally or as a practical matter), then the correct thing would be to allow half-assing on the vocabulary set to false. But never when set to true. The point of true is to be deterministic. The point of false is to be forgiving.

To summarize

| supported/known | configuration | vocab behavior |
| --- | --- | --- |
| no, half-assed, or full | off/annotate | standard annotation vocabulary behavior |
| no or full | on/assert | standard assertion+annotation vocabulary behavior |
| half-assed | on/assert | false: best-effort assertion + annotation; true: standard assertion+annotation |

There is exactly one case where half-assing ever makes a difference, and that is when assertion is configured and the $vocabulary value is false. That case was already non-deterministic, it's just even more non-deterministic for format because instead of support or not support being a boolean condition, it's a vague continuum.

In theory you could do the same with content*, I just think it's a horrible idea so I said "let's not" and took it out. But really, if you look at this as two separate problems it makes much more sense:

  • Configuring whether the format vocabulary asserts or not just flips it between a totally normal annotation-only keyword and a totally normal assertion+annotation keyword. Nothing strange aside from the config existing in the first place.
  • The need to allow for half-assed support, which exploits the already non-deterministic behavior of setting the vocabulary to false, so it's not just "it might work, it might not" but "it might work, it might not, or it might kinda-sorta work but not entirely."

As much as I complain about all of this, the way I set it up stays very much within the confines of the intended semantics of $vocabulary. There is no reason to go outside of this and make things more strange.


What all of this keeps coming back to as far as I can tell (and I mean "all of this" as in "this whole family of issues and PRs around format and vocab and content etc.") is that various folks here do not understand:

  • how false vocabularies are supposed to work
  • how the format config option is a legacy piece on top of vocabularies that MUST NOT violate the normal vocabulary behavior (specifically, an unknown/unsupported true vocabulary MUST cause an error regardless of other configurations)
  • the "compatible semantics" clause- why it's there, what it means, what it's used for
  • the separation of syntax and semantics- why it's there, what it means, what it's used for
  • what happens with annotations when there is also other functionality, and how that interacts with the default annotation collection behavior for unknown keywords
  • what information is and is not available to an application for a given collected annotation

I'm not quite sure what to do about this. We could do a call and try to hash this out so we can stop playing whack-a-mole with the latest misunderstanding in this area. I could just leave y'all to discard my intentions on these and come up with your own system- that's a sincere offer, as I can't stay vigilant on this constantly on an ongoing basis, so if we can't get on the same page it would be better for you to come up with your own system than theoretically working with mine but running into problems with much of it.

@jdesrosiers I could see some variation on your two-vocabulary proposal being helpful, but there are several specifics as written that indicate the problems I have listed above, and really I still think that, as I have explained above, the current design works and is as compliant as possible with the normal behavior of $vocabulary.

@gregsdennis
Member Author

gregsdennis commented Nov 20, 2020

@handrews

Vocabularies

Here is what true and false mean for ALL VOCABULARIES:

  • false and not recognized/supported: just give me the keywords as annotations (because you won't recognize the keywords, so it would be impossible to separate them from any other unknown keywords anyway)
  • false and recognized/supported: if you know what else to do with these keywords, do that as well
  • true: if you don't know what to do with this, error out

I agree with you 100% on what the values of vocabularies mean. ✔️

format problem 1

  • If set to the default (off/annotation), then the format vocabulary behaves like a totally normal annotation vocabulary, just like the meta-data vocabulary or the content vocabulary (both of which use the keyword value as the annotation value for all of their keywords). This is extremely important! The format vocabulary behaves like a 100% normal, nothing strange here at all, annotation vocabulary if you use it out-of-the-box without setting any weird config options.

  • If set to on/assertion, and the implementation fully supports format, then the format vocabulary behaves like a totally normal assertion vocabulary, just like the validation vocabulary. This is also extremely important! When the config option is on and format is implemented like the spec actually specifies, then the format vocabulary behaves like a 100% normal, nothing strange here at all, assertion vocabulary if you use it out-of-the-box without setting any weird config options.

Again, I agree with this. ✔️

format problem 2

Many if not most implementations half-ass format, and always have. This has been a source of constant confusion and complaints for as long as I've known of JSON Schema (much less been involved with the spec). It was a big problem for us on the first project where I ever used JSON Schema, because of the unreliability. However, it was pretty clear that if we suddenly tried to force everyone to fully support format (or not support it at all), then there would be howls of anguish from all over.

Still in agreement. ✔️

Your conclusion

But if a vocabulary were to allow half-assing (intentionally or as a practical matter), then the correct thing would be to allow half-assing on the vocabulary set to false. But never when set to true. The point of true is to be deterministic. The point of false is to be forgiving.

I don't follow how you got here. ❌

A library's degree of implementation should be its own concern, not a concern of this spec. In ALL cases, it should document its level of support for any keyword. For example, .NET has a hard limit of 64-bit integers. I literally can't process a schema which declares that an integer must be higher than the 64-bit limit. But this doesn't mean that I have "half-assed" the implementation. It's merely a limitation of my support, and I should document that.

Similarly with format, if I've done the best I can with attempting to implement the various formats, I should document the limitations of what that means.

This is going to be true for ALL implementations. No implementation is going to get it 100% right, so why do we care? Require (SHOULD?) that the implementation documents its deviations and limitations, then let the spec behave as it should.

This

The need to allow for half-assed support, which exploits the already non-deterministic behavior of setting the vocabulary to false, so it's not just "it might work, it might not" but "it might work, it might not, or it might kinda-sorta work but not entirely."

is not our concern. We expect "it might kinda-sorta work but not entirely" because computing has limitations.

Secondarily, I don't see the connection between "half-assing" a feature and the vocabulary requirements.

Review

If an implementation understands a vocabulary, it must process the associated keywords according to the spec, regardless of the $vocabulary value. (We agree on this.)

If an implementation doesn't understand a vocabulary, and the $vocabulary value for it is false, no harm, no foul. It'll be picked up as annotations. If the keyword is an annotation anyway, no change (except for syntactic validation). (We agree on this.)

If an implementation doesn't understand a vocabulary, and the $vocabulary value for it is true, fail. (We agree on this.)

format is an annotation by default. (We agree on this.)

Let's look at the scenarios and see the behavior with only what we agree on so far. I'll include title and maxLength as example keywords for comparison.

| vocab value | vocab known | format | title | maxLength |
| --- | --- | --- | --- | --- |
| false | yes | annotation | annotation | assertion |
| false | no | annotation | annotation | assertion |
| true | yes | annotation | annotation | assertion |
| true | no | error | error | error |

They behave the same! Cool!

Now let's throw in the configuration rule (off for default annotation behavior, and on for assertion):

| vocab value | vocab known | config | format | title | maxLength |
| --- | --- | --- | --- | --- | --- |
| false | yes | off | annotation | annotation | assertion |
| false | no | off | annotation | annotation | assertion |
| true | yes | off | annotation | annotation | assertion |
| true | no | off | error | error | error |
| false | yes | on | assertion | annotation | assertion |
| false | no | on | assertion | annotation | assertion |
| true | yes | on | assertion | annotation | assertion |
| true | no | on | error | error | error |

Still no problems; same behavior between keywords of the same type based on the configuration.

And finally format as currently defined. The default behavior is defined by the spec:

When the vocabulary is declared with a value of false, an implementation:

  • MUST NOT evaluate "format" as an assertion unless it is explicitly configured to do so;

...

When the vocabulary is declared with a value of true, an implementation that supports this form of the vocabulary:

  • MUST evaluate "format" as an assertion unless it is explicitly configured not to do so;

Because of these two rules, the default behavior changes from annotation to assertion based on the $vocabulary value.

| vocab value | vocab known | config | format | title | maxLength |
| --- | --- | --- | --- | --- | --- |
| false | yes | off (default) | annotation | annotation | assertion |
| false | no | off (default) | annotation | annotation | assertion |
| true | yes | off (NOT default) | annotation | annotation | assertion |
| true | no | off (NOT default) | error | error | error |
| false | yes | on (NOT default) | assertion | annotation | assertion |
| false | no | on (NOT default) | assertion | annotation | assertion |
| true | yes | on (default) | assertion | annotation | assertion |
| true | no | on (default) | error | error | error |

I've edited this table since first publish. Behaviors are the same (I think), but the defaults are different. This is the bad thing.

This change in default presents an inconsistent behavior relative to ALL OTHER KEYWORDS. This is a gotcha. It's unexpected. There is no reason for format to behave differently in these two scenarios. It's damn near impossible to implement. And it will confuse whoever uses an implementation that does this.

A default behavior needs to remain constant. An application needs to be able to set "treat format as an assertion" and expect that it works that way always.

Configuring for assertions looks like

options.TreatFormatAsAssertion = true;

Configuring for annotations looks like


because this is the default.

The spec, therefore, is stating that an application must explicitly configure the implementation it's using in order to get predictable behavior, and that the required setting is sometimes the default and sometimes not, based on a value in the schema which, presumably, the application knows nothing about.

I hear you in my head saying that an application should always know what the schema has because the application defines the schema. Not always. I have clients who are reading schemas dynamically from files, network locations, and even databases. There's no guarantee that the schema is going to require or not require any vocabulary. Therefore, they have to be able to set the behavior of format once without knowing the content of the schema in accordance with the requirements of the application.

If you remove this dynamic behavior, format just works like whatever kind of keyword it's configured for, and $vocabulary works exactly like it's intended to.

@handrews
Contributor

@gregsdennis you are changing the requirements from what was agreed for 2019-09 when you dismiss the "half-assed" use case, which was about implementor choice to barely implement some formats, not library limitations. This paragraph:

It is RECOMMENDED that implementations use a common parsing library for each format, or a well-known regular expression. Implementations SHOULD clearly document how and to what degree each format attribute is validated.

I think applies to the case when the $vocabulary value is set to true (this could probably be made more clear, particularly as I can't even remember anymore). Of course there may be platform-specific limitations, just as with regular expressions, even in "full" support.

This part:

a minimal validation is sufficient. For example, an instance string that does not contain an "@" is clearly not a valid email address, and an "email" or "hostname" containing characters outside of 7-bit ASCII is likewise clearly invalid.

is definitely about the "half-assed" part, which I hope is clearly not based on hard limits of libraries or environments.

Of course if you change the requirements, you can do it differently! And better! But I wasn't allowed to do that at the time- that's why it doesn't do what you want. It does what I had to make it do regardless of what I wanted (nothing about format has anything to do with what I wanted, really).

@Relequestual made a comment about not changing 2019-09 for the next draft in this area. I think that's a reasonable position. But since 2019-09 hasn't seen much real use, and format is a mess anyway, I think the proposal to change the requirements is also a reasonable position. It's up to you two (and other stakeholders) to decide that, but it needs to be an explicit decision on what the 2020-NN requirements should be.

The reasons for the 2019-09 requirements are somewhat lost to time. OpenAPI at one point wanted much more continuity around format. I think @Julian has opinions on whether his implementation will ever support all of the formats to a near-complete degree, and what we needed to do about that. Evgeny might have also had objections, but now that he's doing a big rewrite of AJV those objections may no longer apply.

There may have been other people involved but I don't recall. As noted, Evgeny is rewriting everything and trying to hire people for it, so he's probably more likely to be OK with change. OpenAPI seems to be more relaxed about format these days, and are fine with it being an annotation by default. And @Julian can speak for himself if he wants to :-)


If you are going to change the requirements at all to remove the awkward compatibility aspects, I recommend stepping back and making the boldest possible change to bring format into alignment with normal behavior.

From that perspective, while @gregsdennis has good ideas about tidying up the mess a little bit, @jdesrosiers' idea about two vocabularies is probably a stronger proposal. My variation of that would be:

  • No client-side configuration at all
  • Two vocabularies, one that declares annotation semantics and one that declares validation semantics
  • The format-annotations vocabulary is in the standard meta-schema. The format-assertions vocabulary is not, and schema authors will need to write a custom meta-schema for it (which will probably be confusing, but what about format is not confusing?)
  • Like the meta-data and content vocabularies, setting the format-annotations vocabulary to true doesn't have much of an effect because these vocabularies are simple annotations already, but that's just fine.
  • Setting the format-assertions vocabulary to true requires full-as-possible implementations (within documented restrictions due to underlying libraries, but the assumption is that those libraries come pretty close to full validation)
  • You can declare both vocabularies, because their semantics are compatible. I wouldn't recommend it as it's confusing, but it's important to understand that it is not an error.

I don't know where that leaves libraries that do minimal validation of format, but y'all can decide how you want to handle that.

But feel free to choose whatever new requirements you want as far as I'm concerned. @gregsdennis's are more reasonable than what I had to work with in 2019-09. As long as you intentionally choose some new requirements, the question of whether the spec is correct will be one you can answer better than me!


Since we have resolved this to a question of differing requirements, I don't plan to comment on this further as I'll be fine with whatever you choose there. My objections had to do with the 2019-09 requirements only.

@Relequestual
Member

I'm still playing catch up here, and I'll respond to the above properly later, but I mentioned this issue on the OAS call last night.
In terms of validation for format, I got zero interest or concerns. I may not have been able to clearly present the issue, but invited people to come read. People were more concerned about, by default, getting annotations, to be used elsewhere.

@darrelmiller

@gregsdennis

First let me say that I'm coming into this conversation with only a fairly superficial understanding of vocabularies. However, I have fully experienced the pain of the format keyword. Moving to a place where we can be clear whether or not a JSON Schema document requires format to be validated seems like a good place to be. Having a vocabulary say that a keyword must be treated as an assertion but then still leave the door open for libraries to only partially implement the validation seems like we're in the same bad place. That doesn't exist for other assertions, does it?

A library's degree of implementation should be its own concern, not a concern of this spec. In ALL cases, it should document its level of support for any keyword. For example, .NET has a hard limit of 64-bit integers. I literally can't process a schema which declares that an integer must be higher than the 64-bit limit. But this doesn't mean that I have "half-assed" the implementation. It's merely a limitation of my support, and I should document that.

Personally, I don't want to use tooling that was selective about what parts of a spec it chose to implement. The fact that .NET only has native support for 64-bit integers doesn't mean a .NET application can't validate numbers that are bigger than 64 bit when presented with a JSON serialization of those numbers. I don't believe the JSON Schema spec says that validators must map values to the native types of the implementation language. I don't need a .NET email type to validate an email value.

Having the JSON Schema spec say that format may or may not be validated seems consistent with the world as it is today. Having a way for a schema writer to say this format value MUST be respected by tooling is something that we can't do today and selectively it would be useful to have.

@gregsdennis
Member Author

gregsdennis commented Nov 20, 2020

Personally, I don't want to use tooling that was selective about what parts of a spec it chose to implement.

@darrelmiller this is the whole idea! The market is competitive. If you don't want a partial implementation of format, then you (should) have other libraries to choose from. Additionally, it should be open source, so you have the ability to push for wider support or implement that support yourself and submit a pull request.

I (and @handrews) am tired of this partial implementation allowance in the specification, but the reality is that it's impossible for an implementation to get them 100% correct. There will always be edge cases. To that end, the spec should say, "these are the requirements, but we realize that's a tall order, so document the extents of your support." This is literally the best we can do.

What I do in Manatee.Json and JsonSchema.Net (which supersedes Manatee) is provide a default validation but also provide mechanisms for clients to define and use their own validation logic. In my mind, this provides the best of both worlds: I implement what I think is reasonably sufficient for the library, but if clients want more they can have it. My docs

As an aside, JsonSchema.Net uses the System.Text.Json serializer and extracts numbers as decimal to get the best possible precision. However determining if a very large number is an integer is next to impossible due to still finite precision. This is a limitation with the serializer that I've chosen to use. Yes, the native JSON text is capable of representing larger values, but my design decisions have precluded reading them.

Having a way for a schema writer to say this format value MUST be respected by tooling is something that we can't do today and selectively it would be useful to have.

I completely agree with this. But hijacking the vocabulary value is not the way to do it. A couple alternatives have been suggested:

  • create a separate "format as an assertion" vocabulary and allow users to reference that, or
  • create a new "configuration"-type keyword like assertFormat that can be set to true to enable assertion behavior.

I'm open to other solutions as well so long as they're consistent and don't interfere with the definitions of other keywords.


Again, this issue isn't about partial vs full validation logic, it's about changing the default behavior based on the vocabulary.

@handrews
Contributor

I'm really trying to stay out of this but I feel the need to note:

I (and @handrews) am tired of this partial implementation allowance in the specification, but the reality is that it's impossible for an implementation to get them 100% correct. There will always be edge cases. To that end, the spec should say, "these are the requirements, but we realize that's a tall order, so document the extents of your support." This is literally the best we can do.

While it's true I'm tired of the partial implementation allowance, and would love to see it go away, that is separate from whether I think it's feasible. For 2020-NN, I am not offering an opinion and will support whatever emerges.

That said, what @gregsdennis describes is not the problematic partial implementation in my view. Almost any keyword can have limitations passed through from the environment. There are different limitations on numeric size and precision, on regular expression dialects, etc. If the only available library for validating email syntax in a given system doesn't quite support the whole spec, that, to me, is a normal sort of keyword limitation and nothing requiring special treatment at all. Most formats will not have that problem.

The problematic partial implementation case that we determined we needed to support has nothing to do with hard limits. It is about implementors (the humans involved) consciously choosing to not validate the entire spec, including some obvious things that could have been validated. The canonical example is an email address validator that checks for an @ sign and if one is present calls it valid.

Please, please, please keep these two cases clearly separate in this discussion.

  1. limits inherited from the environment- normal. format may be more likely to have vexing ones than anything except pattern and patternProperties, but this is still normal and requires no special handling
  2. limits consciously chosen by implementors who definitely could have implemented more but chose not to

"fixing" the type 1 limitations would at most take the form of how regexps are managed- referencing some other standard that defines a minimum interoperability threshold.

Type 2 limitations are effort+performance vs functionality tradeoffs made by implementors. JSON Schema can either continue to support these essentially arbitrary decisions that were never formally part of the spec, or it can decide to forbid them, in the sense of declaring such implementations to be out of conformance (and enforced via the test suite). Note that when I say "essentially arbitrary" I don't mean capricious or malicious, I just mean that based on discussions over the years, the exact choices made had to do with personal decisions on how much to invest in the keyword rather than hard technical thresholds.

In 2019-09, we decided that we needed to continue to support type 2 limitations, but that we wanted to encourage either not implementing assertion support at all, or implementing full (subject only to type 1 limitations) support.

2020-NN can make a different choice and end support for type 2 limitations, but the question here is whether or not to make that choice. All of the rest of this arguing about decision tables and how conditions interact is irrelevant unless everyone agrees on the requirements:

  • Should type 2 limitations still be supported?
  • Should there still be a configuration option set on the processing implementation?

Those are the legacy requirements, and we said "yes" to both in 2019-09, and the spec as written reflects that.

@Relequestual
Member

I think I see a satisfactory path forward, so I'm going to take some time to review and make a proposal. Please refrain from more walls of text if possible 😅😬
I'm hoping to come back at this tomorrow.

@Relequestual
Member

Relequestual commented Nov 21, 2020

I think the only way to get all the behaviors users expect is to define two format vocabularies: annotation-only and assertion. There would have to be two dialect schemas as those two format vocabularies use the same keyword and therefore can't be in the same dialect. The annotation-only one can be required (it's easy enough, so why not) and the assertion one can be optional providing the behavior most users probably expect. - @jdesrosiers

[This approach is probably OK] - @handrews


I think our only consensus-based path forward here is to do pretty much what @jdesrosiers said above.

A format-annotation vocabulary, true by default, defines the allowed values and known semantics, but only ever results in annotations.

A format-assertion vocabulary, not included in any of our meta-schemas (for now, and for release) and as such not part of our current dialect, requires full-syntax based assertions.

For implementations that support the format-assertion vocabulary, there is NOTHING to stop them also providing a layer, which is off by default, that provides SOME FORM of validation for SOME of the format values, but any such validation MUST NOT impact the validation result of applying the Schema to the Instance. We should add a cref to this effect, in light of implementations' expectations of being able to provide some level of simple validation for format.

My feeling is this approach will match the expectation of the majority of users, schema authors, and implementers, with the minimal level of changes, while also fixing some problems we clearly have.
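Sketched with placeholder URIs (neither vocabulary exists yet), a schema author who wants assertion behavior would write a custom dialect along these lines, while the standard meta-schema would declare only format-annotation:

```json
{
  "$schema": "https://json-schema.org/draft/2020-NN/schema",
  "$id": "https://example.com/format-asserting-dialect",
  "$vocabulary": {
    "https://json-schema.org/draft/2020-NN/vocab/core": true,
    "https://json-schema.org/draft/2020-NN/vocab/applicator": true,
    "https://json-schema.org/draft/2020-NN/vocab/validation": true,
    "https://json-schema.org/draft/2020-NN/vocab/format-annotation": true,
    "https://json-schema.org/draft/2020-NN/vocab/format-assertion": true
  },
  "allOf": [
    { "$ref": "https://json-schema.org/draft/2020-NN/schema" }
  ]
}
```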

@gregsdennis My apologies, I haven't read the updated PR, but I doubt it implements the above suggestion.
If you can agree to this suggestion and can muster some time to make a new PR, then please do ASAP.

If you agree, we have a consensus. A reasonable and logical approach and one of least resistance.

I'm going to address something else in a separate comment...

In retrospect, I don't think the other thing I was going to address even needs to be said if we follow the above.

@darrelmiller

A format-assertion vocabulary, true by default, defines the allowed values and known semantics, but only ever results in annotations.

@Relequestual I'm really hoping that there is a typo in that sentence, or I'm more lost than I thought.

@Relequestual
Member

Yikes. sorry. I'll amend!

@gregsdennis
Member Author

gregsdennis commented Nov 21, 2020

This is acceptable. I'll try to work something up.

Would it make sense for both vocabularies to share the same meta-schema but have different vocab URIs? There's nothing about the meta-schema that changes. It's only the semantics that change.
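Something like this, with placeholder URIs: a dialect declares whichever vocab URI it wants in $vocabulary, but both vocabularies point back at the same syntax meta-schema for the keyword:

```json
{
  "$id": "https://example.com/my-dialect",
  "$vocabulary": {
    "https://json-schema.org/draft/2020-NN/vocab/core": true,
    "https://json-schema.org/draft/2020-NN/vocab/format-assertion": true
  },
  "allOf": [
    { "$ref": "https://json-schema.org/draft/2020-NN/meta/core" },
    { "$ref": "https://json-schema.org/draft/2020-NN/meta/format" }
  ]
}
```

Swap format-assertion for format-annotation and nothing else changes; only the declared semantics differ.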

@Relequestual
Member

Yeah I think that's fine. Have at it =]

@Relequestual
Member

I think there may have been a finer point I probably didn't communicate in my above comment.
I'm happy to do away with the config option for the annotation based vocab with these changes.
Any such further processing a library may or may not offer its users is only of our concern in terms of making sure it doesn't impact JSON Schema validation results.

I'm in the process of a review.

@Relequestual
Member

As an aside on the "language processing limitations": Perl has no native boolean. Seems strange, I know, but there are approaches for processing JSON in Perl so as to differentiate between true and 1 parsed from JSON data. In Perl, you generally use 0 and 1.
We will have to form some tests which detail exactly what's required for format assertion support, which may be quite interesting given some languages' native limitations.

@gregsdennis
Member Author

gregsdennis commented Nov 21, 2020

I think the configuration option should remain. Idealistically, an implementation can deviate from a specification all it wants with options so long as its default behavior adheres to said specification. But given the history around this specific keyword, I think it bears mentioning.


Pertinent to the aside: C/C++ is the same. You generally find

#define false 0
#define true 1

somewhere in the code.

karenetheridge added a commit to karenetheridge/JSON-Schema-Modern that referenced this issue Dec 16, 2020
This reflects new understandings of how "$vocabulary": [ <vocab uri>: false ]
should work, as discussed in
json-schema-org/json-schema-spec#1020 (comment)
karenetheridge added a commit to karenetheridge/JSON-Schema-Modern that referenced this issue Dec 18, 2020
This reflects new understandings of how "$vocabulary": [ <vocab uri>: false ]
should work, as discussed in
json-schema-org/json-schema-spec#1020 (comment)
and json-schema-org/json-schema-spec#1019
@gregsdennis gregsdennis moved this from Closed to Merged in Proposal: `format` update Jul 17, 2024