`format` should not change behavior based on its vocabulary value #1020
I'm not going to get involved in the resolution of this, but just to make sure we're all on the same page (because everything about or related to `format` gets confusing): the egregiously complex interplay of the boolean `$vocabulary` value, the configuration option, and `format` itself is what's in question here. The only combination directly related to that is:
This allows for best effort validation of
@gregsdennis I believe this is the case? The table in the section of the release notes to which I linked above should make that clear; please let me know if there is confusion on this point. Since we changed the behavior for unrecognized keywords to be "collect as annotation" instead of "ignore", the annotation is collected regardless. Really, if we get rid of the half-ass "best effort" case, then the whole thing gets much simpler.
But for `format`...
This is what I have a problem with. This implies that `format` is special and behaves differently from every other keyword.
Regarding the (2019-09) table, the rows that concern me are the following. If the configuration is off but there is some kind of support, annotations should be generated; the implementation shouldn't error.
I'll try to rework things to make the "annotation unless configured" behavior a bit more clear.
No, it changes from best-effort validation (if available and configured) to complete validation.
The goal was to make `format` behave as normally as possible. The config option, which is only present for historical reasons, is only relevant for `format`. In my view ALL of the bizarre inconsistent behavior should be in the backwards-compatibility case. Everything else should function normally.

Implementation-side configuration options are a bad idea in general because they change the behavior of the schema to something other than how it was written. Don't make that problem worse, please (although if you do, I'm not going to do anything about it).
Honestly, the simplest option is to get rid of the legacy bullshit: no config option, and you either implement it fully or you don't. But people will complain about that, so you have to decide whether you're OK with it. I'm not sure more people would complain about that than complain about `format` already, TBH.
My argument is that this is wrong. It's relevant for both values. The problem space has three variables: configuration of the library (assert or annotate), whether the library knows about the vocabulary (yes or no), and the value of the vocabulary entry in `$vocabulary` (true or false). Here's a new table:
Implementations should be configured to annotate by default. I don't see how this is affected by best-effort vs. full validation. The implementation either supports the keyword or it doesn't. The degree to which it implements the keyword is irrelevant.
How can either the
Where is this semantic defined?
This is the only text that defines what the behavior should be for a vocabulary with value of
I assume you meant "format is merely an ANNOTATION keyword that..."?
After reading the above three times, and the current preview of the spec, here's where I land.

99.99% of schema authors will never write their own vocabulary or use a non-standard dialect (bar the OAS dialect). As such, we should make the upgrade path for previous schemas as easy as possible, and keep the current docs and articles on JSON Schema as little out of date as possible. I'm not advocating that we should just do things as we've always done because "reasons". That would be bad. However, we should provide minimal friction for users, which is what the current state of the 2020-11 preview and 2019-09 affords.

As an aside, 2019-09 is over a year old, and NO ONE has asked about this. Not one person has come and asked us, at least. As such, we SHOULD keep the status quo for now and tackle this later. It SOUNDS confusing from an implementation perspective, but from a schema author perspective, it actually makes a lot of sense.
If I want others using my schema to always use `format` as an assertion, I need a way to express that. The goal is no more surprises (removing ambiguity) and ease of use, and I think that was met in the approach laid out, as per 2019-09 and as per the 2020-11 preview.

Remember, the vocabulary object boolean doesn't enable or disable a vocabulary, but asserts whether the implementation MUST understand the vocabulary in order to correctly process the schema. If it's set to `false`, understanding the vocabulary is optional.

The config option is really a sticky plaster over a problem we need to solve, so let's assume we're going to remove it at some point. With the config option removed, a vocabulary object boolean of `true` is the only way to require assertion behaviour. You could argue that in such an event we should properly define what validating each format means. So, given we can't fully qualify the exact validation for all the formats, some flexibility has to remain.
99.99% of schema authors aren't even using 2019-09 (all the questions I see have draft 7 schemas, despite the suggestions to use 2019-09, and only a handful of validators even support it; even VS Code doesn't support it), and the ones that do aren't using `$vocabulary`. We need to fix this now when adoption is at its lowest; it'll be much harder to fix later.
Why do you assume that the default behavior for `format` should change? If you want the schema to specify whether it should be an assertion...
This is the wrong way to do this. (I described it in the Slack thread.) If you want the schema to be able to specify that `format` is an assertion, that needs its own mechanism. The current definition conflates two separate issues: vocab value meaning and configurability. These ideas need to stay separate.
"No more surprises" would mean that "Ease of use" would mean that this keyword behaves like any other keyword with respect to the vocab value.
This is what documentation is for. Literally no one is going to have 100% validation on all of these keywords. No one can. There will always be edge cases, no matter how stringently a format validation is coded. You are therefore relegated to mandating that everyone document their level of support.

Secondarily, if you must configure the implementation separately, you'll never have that "ta-da surprise" moment because you have to configure it properly. You'll never get assertions when you want annotations, and vice versa. As it's written, that could happen. As it's written, I can't require `format` validation and be sure I'll get it. This is the gotcha that we have to remove.

My PR does exactly that. With my PR, you can opt in to whether you get assertion or annotation behavior through configuration, and the schema specifies whether format support (at all) is a hard or soft requirement.
@Relequestual you have captured exactly what I was going for and I agree with every word you wrote about how to move forward.
There is no need for this. Unknown keywords are now collected as annotations, so there's no way not to get the annotation. So I don't see a use case here that needs addressing. For the rest of it, something will be confusing/surprising, and I still think we got that choice right in 2019-09 as explained by @Relequestual. The "what if I only want annotations" case would have been worth addressing, but it is no longer an issue because of the changed behavior of unknown keywords.
I'm in complete agreement with @gregsdennis. The
I think the only way to get all the behaviors users expect is to define two format vocabularies: annotation-only and assertion. There would have to be two dialect schemas, as those two format vocabularies use the same keyword and therefore can't be in the same dialect. The annotation-only one can be required (it's easy enough, so why not) and the assertion one can be optional, providing the behavior most users probably expect.
There's absolutely a need for this. Suppose I write an application where I require `format` validation. Conversely, suppose I don't require validation (annotations only). These two cases are valid scenarios that the current state CANNOT support. It is patently incorrect to say that an application will only ever want unvalidated annotations or full semantic validation. The table I posted above shows how my proposal supports ALL of the valid scenarios. Yes, there may be some overlap given unknown keywords as annotations, but that's worlds better than intentionally not supporting some of them.
@handrews you argued this in Slack. Still, no one has argued why the configurability of `format` should be tied to the vocabulary value.
That is not the way this works. By the time an application gets the annotation output it does not have any sort of vocabulary mapping. It just has keyword names and values, and schema and instance locations. There is no detectable difference in these scenarios whatsoever. The whole point is that the application looks at annotations named `format` and decides what to do with them.

This is the point of the "compatible semantics" clause. If you mix vocabularies that define the same keyword, their semantics have to be compatible.
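To make the annotation-output point concrete, here is a loose sketch (illustrative only, not the exact output structure the spec mandates) of what an application actually receives; nothing in it identifies which vocabulary defined `format`:

```json
{
  "valid": true,
  "keywordLocation": "/properties/email/format",
  "instanceLocation": "/email",
  "annotation": "email"
}
```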
Yes there is: if you only care about syntax validation of `format`, that can be expressed. Basically, take the existing default meta-schema and remove what you don't want.

The reason meta-schemas are separate from vocabularies is to handle odd cases like this where semantic and syntactic constraints might not completely align. Whether that's because there are keywords without formal semantics (so they have syntax, but the semantics are entirely application-determined), or because the syntax needs to be more strict (or more loose), or whatever.
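A rough sketch of one way to read that (my construction, not necessarily the exact meta-schema intended; the `$id` is a placeholder): leave the format vocabulary out of `$vocabulary` entirely, but keep `$ref`-ing the 2019-09 format meta-schema so that `format` occurrences are still constrained to be strings:

```json
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://example.com/meta/syntax-only-format",
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/core": true,
    "https://json-schema.org/draft/2019-09/vocab/applicator": true,
    "https://json-schema.org/draft/2019-09/vocab/validation": true
  },
  "allOf": [
    { "$ref": "https://json-schema.org/draft/2019-09/meta/core" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/applicator" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/validation" },
    { "$ref": "https://json-schema.org/draft/2019-09/meta/format" }
  ]
}
```

Here the syntactic constraint on `format` still applies through the meta-schema, but no vocabulary semantics for it are declared at all.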
I don't know what this means; perhaps I did not understand what was going on with that. Anyway... the value of
Do you mean full validation or best-effort validation? Here is what
The rationale with the format vocabulary and the configuration option is that The
Now, let's consider `format`. There are two things to handle with `format`.

Problem 1: It can't be validated by default, for historical reasons. You have to enable the assertion behavior on the implementation. In my view, this is a horrible idea, but we didn't want to change it. So we kept that configuration option. It interacts as follows:
Which brings us to problem 2: many if not most implementations half-ass `format` support. So I thought "there is one specific case here that is already non-deterministic, and that is when the vocabulary is known and set to `false`". In hopefully all other vocabularies ever designed, they are either supported or not. There is no half-assing. But if a vocabulary were to allow half-assing (intentionally or as a practical matter), then the correct thing would be to allow half-assing when the vocabulary is set to `false` and require full support when it is set to `true`.

To summarize:
There is exactly one case where half-assing ever makes a difference, and that is when assertion is configured and the vocabulary value is `false`. In theory you could do the same with
As much as I complain about all of this, the way I set it up stays very much within the confines of the intended semantics of `$vocabulary`.

What all of this keeps coming back to, as far as I can tell (and I mean "all of this" as in "this whole family of issues and PRs around format and vocab and content etc."), is that various folks here do not understand:
I'm not quite sure what to do about this. We could do a call and try to hash this out so we can stop playing whack-a-mole with the latest misunderstanding in this area. I could just leave y'all to discard my intentions on these and come up with your own system; that's a sincere offer, as I can't stay vigilant on this constantly on an ongoing basis, so if we can't get on the same page it would be better for you to come up with your own system than theoretically working with mine but running into problems with much of it.

@jdesrosiers I could see some variation on your two-vocabulary proposal being helpful, but there are several specifics as written that indicate the problems I have listed above, and really I still think that, as I have explained above, it works and is as compliant as possible with the normal behavior of `$vocabulary`.
Vocabularies
I agree with you 100% on what the values of vocabularies mean. ✔️
| vocab value | vocab known | `format` | `title` | `maxLength` |
|---|---|---|---|---|
| false | yes | annotation | annotation | assertion |
| false | no | annotation | annotation | assertion |
| true | yes | annotation | annotation | assertion |
| true | no | error | error | error |
They behave the same! Cool!
Now let's throw in the configuration rule (off for default annotation behavior, and on for assertion):
| vocab value | vocab known | config | `format` | `title` | `maxLength` |
|---|---|---|---|---|---|
| false | yes | off | annotation | annotation | assertion |
| false | no | off | annotation | annotation | assertion |
| true | yes | off | annotation | annotation | assertion |
| true | no | off | error | error | error |
| false | yes | on | assertion | annotation | assertion |
| false | no | on | assertion | annotation | assertion |
| true | yes | on | assertion | annotation | assertion |
| true | no | on | error | error | error |
Still no problems; same behavior between keywords of the same type based on the configuration.
And finally, `format` as currently defined. The default behavior is defined by the spec:
When the vocabulary is declared with a value of false, an implementation:
- MUST NOT evaluate "format" as an assertion unless it is explicitly configured to do so;
...
When the vocabulary is declared with a value of true, an implementation that supports this form of the vocabulary:
- MUST evaluate "format" as an assertion unless it is explicitly configured not to do so;
Because of these two rules, the default behavior changes from annotation to assertion based on the `$vocabulary` value.
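For reference, the declaration those two rules key off of is the entry for the format vocabulary in the dialect meta-schema's `$vocabulary` object. A minimal sketch (other vocabularies omitted for brevity):

```json
{
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/format": false
  }
}
```

Changing that `false` to `true` is what flips the default for `format` from annotation to assertion, as the table below shows.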
| vocab value | vocab known | config | `format` | `title` | `maxLength` |
|---|---|---|---|---|---|
| false | yes | off (default) | annotation | annotation | assertion |
| false | no | off (default) | annotation | annotation | assertion |
| true | yes | off (NOT default) | annotation | annotation | assertion |
| true | no | off (NOT default) | error | error | error |
| false | yes | on (NOT default) | assertion | annotation | assertion |
| false | no | on (NOT default) | assertion | annotation | assertion |
| true | yes | on (default) | assertion | annotation | assertion |
| true | no | on (default) | error | error | error |
I've edited this table since first publish. Behaviors are the same (I think), but the defaults are different. This is the bad thing.
This change in default presents an inconsistent behavior relative to ALL OTHER KEYWORDS. This is a gotcha. It's unexpected. There is no reason for `format` to behave differently in these two scenarios. It's damn near impossible to implement. And it will confuse whoever uses an implementation that does this.
A default behavior needs to remain constant. An application needs to be able to set "treat `format` as an assertion" and expect that it works that way always.
Configuring for assertions looks like

```csharp
options.TreatFormatAsAnnotation = false;
```

Configuring for annotations looks like nothing at all, because this is the default.
The spec, therefore, is stating that an application must explicitly configure the implementation it's using in order to get the "default" behavior, which is sometimes not the default, based on a value in the schema which, presumably, it knows nothing about.
I hear you in my head saying that an application should always know what the schema has because the application defines the schema. Not always. I have clients who are reading schemas dynamically from files, network locations, and even databases. There's no guarantee that the schema is going to require or not require any vocabulary. Therefore, they have to be able to set the behavior of `format` once, without knowing the content of the schema, in accordance with the requirements of the application.
If you remove this dynamic behavior, `format` just works like whatever kind of keyword it's configured for, and `$vocabulary` works exactly like it's intended to.
@gregsdennis you are changing the requirements from what was agreed for 2019-09 when you dismiss the "half-assed" use case, which was about implementor choice to barely implement some formats, not library limitations. This paragraph:
I think applies to the case when the This part:
is definitely about the "half-assed" part, which I hope is clearly not based on hard limits of libraries or environments.

Of course if you change the requirements, you can do it differently! And better! But I wasn't allowed to do that at the time; that's why it doesn't do what you want. It does what I had to make it do regardless of what I wanted.

@Relequestual made a comment about not changing 2019-09 for the next draft in this area. I think that's a reasonable position. But 2019-09 hasn't seen much real use, and the reasons for the 2019-09 requirements are somewhat lost to time. OpenAPI at one point wanted much more continuity around `format`. There may have been other people involved, but I don't recall. As noted, Evgeny is rewriting everything and trying to hire people for it, so he's probably more likely to be OK with change. OpenAPI seems to be more relaxed about `format` now.

If you are going to change the requirements at all to remove the awkward compatibility aspects, I recommend stepping back and making the boldest possible change to bring `format` in line with everything else. From that perspective, while @gregsdennis has good ideas about tidying up the mess a little bit, @jdesrosiers' idea about two vocabularies is probably a stronger proposal. My variation of that would be:
I don't know where that leaves libraries that do minimal validation of some formats. But feel free to choose whatever new requirements you want as far as I'm concerned. @gregsdennis's are more reasonable than what I had to work with in 2019-09. As long as you intentionally choose some new requirements, the question of whether the spec is correct will be one you can answer better than me!

Since we have resolved this to a question of differing requirements, I don't plan to comment on this further, as I'll be fine with whatever you choose there. My objections had to do with the 2019-09 requirements only.
I'm still playing catch-up here, and I'll respond to the above properly later, but I mentioned this issue on the OAS call last night.
First let me say that I'm coming into this conversation with only a fairly superficial understanding of vocabularies. However, I have fully experienced the pain of the `format` keyword. Moving to a place where we can be clear whether or not a JSON Schema document requires `format` to be validated seems like a good place to be. Having a vocabulary say that a keyword must be treated as an assertion, but then still leaving the door open for libraries to only partially implement the validation, seems like we're in the same bad place. That doesn't exist for other assertions, does it?
Personally, I don't want to use tooling that is selective about what parts of a spec it chooses to implement. The fact that .NET only has native support for 64-bit integers doesn't mean a .NET application can't validate numbers that are bigger than 64 bits when presented with a JSON serialization of those numbers. I don't believe the JSON Schema spec says that validators must map values to the native types of the implementation language. I don't need a .NET email type to validate an email value.

Having the JSON Schema spec say that `format` may or may not be validated seems consistent with the world as it is today. Having a way for a schema writer to say this format value MUST be respected by tooling is something that we can't do today, and it would be useful to have.
@darrelmiller this is the whole idea! The market is competitive. If you don't want a partial implementation of `format`, use a different library.

What I do in Manatee.Json and JsonSchema.Net (which supersedes Manatee) is provide a default validation but also provide mechanisms for clients to define and use their own validation logic. In my mind, this provides the best of both worlds: I implement what I think is reasonably sufficient for the library, but if clients want more, they can have it. My docs describe the level of support.

As an aside, JsonSchema.Net uses the System.Text.Json serializer and extracts numbers as
I completely agree with this. But hijacking the vocabulary value is not the way to do it. A couple alternatives have been suggested:
I'm open to other solutions as well, so long as they're consistent and don't interfere with the definitions of other keywords. Again, this issue isn't about partial vs. full validation logic; it's about changing the default behavior based on the vocabulary.
I'm really trying to stay out of this but I feel the need to note:
While it's true I'm tired of the partial implementation allowance, and would love to see it go away, that is separate from whether I think it's feasible. For 2020-NN, I am not offering an opinion and will support whatever emerges.

That said, what @gregsdennis describes is not the problematic partial implementation in my view. Almost any keyword can have limitations passed through from the environment. There are different limitations on numeric size and precision, on regular expression dialects, etc. If the only available library for validating email syntax in a given system doesn't quite support the whole spec, that, to me, is a normal sort of keyword limitation and nothing requiring special treatment at all. Most formats will not have that problem. Call this hard-limit case "type 1".

The problematic partial implementation case that we determined we needed to support has nothing to do with hard limits. It is about implementors (the humans involved) consciously choosing to not validate the entire spec, including some obvious things that could have been validated. The canonical example is an email address validator that checks for an "@" and little else. Call this intentional case "type 2".

Please, please, please keep these two cases clearly separate in this discussion.

"Fixing" the type 1 limitations would at most take the form of how regexps are managed: referencing some other standard that defines a minimum interoperability threshold. Type 2 limitations are effort + performance vs. functionality tradeoffs made by implementors. JSON Schema can either continue to support these essentially arbitrary decisions that were never formally part of the spec, or it can decide to forbid them, in the sense of declaring such implementations to be out of conformance (and enforcing that via the test suite). Note that when I say "essentially arbitrary" I don't mean capricious or malicious; I just mean that, based on discussions over the years, the exact choices made had to do with personal decisions on how much to invest in the keyword rather than hard technical thresholds.

In 2019-09, we decided that we needed to continue to support type 2 limitations, but that we wanted to encourage either not implementing assertion support at all, or implementing full (subject only to type 1 limitations) support. 2020-NN can make a different choice and end support for type 2 limitations, but the question here is whether or not to make that choice. All of the rest of this arguing about decision tables and how conditions interact is irrelevant unless everyone agrees on the requirements:

1. Do we continue to allow intentionally partial (type 2) assertion implementations?
2. Do we continue to keep assertion behavior opt-in via implementation configuration rather than on by default?

Those are the legacy requirements, and we said "yes" to both in 2019-09, and the spec as written reflects that.
I think I see a satisfactory path forward, so I'm going to take some time to review and make a proposal. Please refrain from more walls of text if possible 😅😬
I think our only consensus-based path forward here is to do pretty much what @jdesrosiers said above.

A format vocabulary that provides annotation behaviour only, which dialects can safely require.

A format vocabulary that provides assertion behaviour, which remains optional to support.

For implementations that support the assertion vocabulary, `format` is evaluated as an assertion.

My feeling is this approach will match the expectation of the majority of users, schema authors, and implementers, with the minimal level of changes, while also fixing some problems we clearly have.

@gregsdennis My apologies, I haven't read the updated PR, but I doubt it implements the above suggestion. If you agree, we have a consensus. A reasonable and logical approach, and one of least resistance.
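To illustrate the split (a sketch only; the `example.com` URIs are placeholders, not published identifiers), a dialect that wants assertion behaviour could declare an assertion-flavored format vocabulary alongside the usual ones:

```json
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://example.com/dialect/format-asserting",
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/core": true,
    "https://json-schema.org/draft/2019-09/vocab/applicator": true,
    "https://json-schema.org/draft/2019-09/vocab/validation": true,
    "https://example.com/vocab/format-assertion": true
  },
  "allOf": [
    { "$ref": "https://json-schema.org/draft/2019-09/schema" }
  ]
}
```

The annotation-only dialect would look the same but declare a format-annotation vocabulary instead, which could safely be marked `true` since annotation-only support is trivial to provide.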
In retrospect, I don't think the other thing I was going to address even needs to be said if we follow the above. |
@Relequestual I'm really hoping that there is a typo in that sentence, or I'm more lost than I thought.
Yikes. Sorry. I'll amend!
This is acceptable. I'll try to work something up. Would it make sense for both vocabularies to share the same meta-schema but have different vocab URIs? There's nothing about the meta-schema that changes. It's only the semantics that change.
Yeah I think that's fine. Have at it =] |
I think there may have been a finer point I probably didn't communicate in my above comment. I'm in the process of a review. |
As an aside on the "language processing limitations": Perl has no native boolean. Seems strange, I know, but there are approaches for processing JSON in Perl so as to differentiate between `true` and `false`.
I think the configuration option should remain. Ideally, an implementation can deviate from a specification all it wants with options, so long as its default behavior adheres to said specification. But given the history around this specific keyword, I think it bears mentioning.

Pertinent to the aside: C/C++ is the same. You generally find

```c
#define false 0
#define true 1
```

somewhere in the code.
This reflects new understandings of how "$vocabulary": [ <vocab uri>: false ] should work, as discussed in json-schema-org/json-schema-spec#1020 (comment)
This reflects new understandings of how "$vocabulary": [ <vocab uri>: false ] should work, as discussed in json-schema-org/json-schema-spec#1020 (comment) and json-schema-org/json-schema-spec#1019
This is just wrong. It precludes the ability to assert `format` when supported but not require assertion when not supported. The spec still allows for configuration of this, but the practical side of that configuration becomes quite difficult and confusing.

This is hard to do and confusing for clients (users of the implementation). The configuration should always work in one direction. If my implementation offers a "format behavior" configuration with values of "assert" and "annotate", setting either value only works for one of the vocab cases. The only way to get the desired behavior is to have a configuration that says "use the non-default behavior", which changes its meaning depending on the schema it's processing.
What we SHOULD have is `format` as an annotation ALWAYS, but configurable to be an assertion.

I agree with @karenetheridge that it'd be nice to have a way for the schema itself to indicate how `format` should be processed, and I think that changing it based on the vocab value was an attempt at that. But the vocab value and the behavior of `format` are orthogonal concerns. The spec is conflating them unnecessarily.