Should we redefine our concept of "keyword independence"? #204

jdesrosiers · 2022-07-22T01:50:16Z

jdesrosiers
Jul 22, 2022
Maintainer

The various discussions lately about the architecture of JSON Schema got me thinking about our principle of "keyword independence", why we value it, and why we seem to constantly violate it.

For those who aren't familiar, the keyword independence principle is the idea that keyword behaviors should be self-contained. You shouldn't need information from other keywords in order to evaluate a keyword. For example, minimum is ok, but additionalProperties is a violation because it depends on properties and patternProperties.

Keywords that break this principle can have negative properties for schema design, but the overwhelming consensus is that the downsides in most cases are minor and better than the alternatives if there are any.

For me, the big win from keyword independence is the simplicity of the processing model it allows. Keywords can be evaluated in any order. No state needs to be maintained and passed around. This means you can evaluate keywords in parallel if the implementation's programming language supports it, which could mean big performance improvements in many cases.

It turns out that as long as the information we need from the depended on keywords can be determined statically (without evaluating the keyword), we can have keyword dependence without losing all those nice properties in the previous paragraph. For example, additionalProperties can look at the keys in the properties and patternProperties objects and have all the information it needs to evaluate against an instance independently. This can even be done in a compile step.
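
As a minimal sketch of that idea (not taken from any implementation discussed here), the following builds an additionalProperties evaluator from the keys of the sibling properties and patternProperties objects alone. The names compile_additional_properties and toy_validate, and the validate callback signature, are assumptions made for illustration.

```python
import re


def compile_additional_properties(schema, validate):
    """Return an evaluator for `additionalProperties` built from static data only.

    `validate(subschema, value) -> bool` is a hypothetical stand-in for the
    rest of a validator.
    """
    # Static step: read the *values* of sibling keywords, never their results.
    known_names = set(schema.get("properties", {}))
    patterns = [re.compile(p) for p in schema.get("patternProperties", {})]
    subschema = schema["additionalProperties"]

    def evaluate(instance):
        if not isinstance(instance, dict):
            return True  # the keyword only applies to objects
        return all(
            validate(subschema, value)
            for name, value in instance.items()
            if name not in known_names
            and not any(p.search(name) for p in patterns)
        )

    return evaluate


def toy_validate(subschema, value):
    # Toy validator for the demo: only boolean schemas are understood.
    return subschema if isinstance(subschema, bool) else True


schema = {
    "properties": {"a": {}},
    "patternProperties": {"^x-": {}},
    "additionalProperties": False,
}
check = compile_additional_properties(schema, toy_validate)
```

Here check({"a": 1, "x-foo": 2}) passes and check({"a": 1, "b": 2}) fails, and neither call evaluates properties or patternProperties against the instance. The static part runs once per schema, which is what makes it a candidate for a compile step.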

So, if we consider keyword independence to be whether a keyword can be evaluated independently of the evaluation of other keywords, there are really only a few problematic keywords. These include the then and else keywords that depend on the evaluation result of if and the unevaluatedProperties and unevaluatedItems keywords that depend on the evaluation of sub-schemas. Technically, these violating keywords can be implemented to be independent, but it would result in evaluating some parts of the schema more than once in some cases.
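
To illustrate the trade-off in the last sentence, here is a rough sketch in which then and else are each written to be independent by evaluating the sibling if themselves. The function names and the toy counting validator are hypothetical; the point is only that if ends up being evaluated twice.

```python
def evaluate_then(schema, instance, validate):
    # Re-evaluates the sibling `if` rather than reading a shared result.
    if "if" not in schema or not validate(schema["if"], instance):
        return True  # `then` has nothing to say
    return validate(schema["then"], instance)


def evaluate_else(schema, instance, validate):
    # Same idea: `else` evaluates `if` again, independently of `then`.
    if "if" not in schema or validate(schema["if"], instance):
        return True
    return validate(schema["else"], instance)


def make_counting_validator():
    """Toy validator supporting `type` and `minLength`; records every call."""
    types = {"string": str, "integer": int}
    calls = []

    def validate(subschema, instance):
        calls.append(subschema)
        expected = types.get(subschema.get("type"))
        if expected is not None and not isinstance(instance, expected):
            return False
        if "minLength" in subschema and isinstance(instance, str):
            return len(instance) >= subschema["minLength"]
        return True

    return validate, calls


schema = {
    "if": {"type": "string"},
    "then": {"minLength": 3},
    "else": {"type": "integer"},
}
validate, calls = make_counting_validator()
results = (
    evaluate_then(schema, "ab", validate),
    evaluate_else(schema, "ab", validate),
)
if_evaluations = sum(1 for s in calls if s is schema["if"])
```

Both keywords can run in any order or in parallel, at the price of the if subschema being evaluated once per dependent keyword.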

I think switching to this definition of keyword independence gives us a more clear and compelling reason for the principle than we had before. It also allows us to stop considering certain keywords in violation that haven't really been problematic. With a clearly defined reason for the principle, we can focus on trying to improve the few keywords that are in violation or explain why we think it's worth the violation.

gregsdennis · 2022-07-22T02:18:46Z

gregsdennis
Jul 22, 2022
Maintainer

additionalProperties can look at the keys in the properties and patternProperties objects and have all the information it needs to evaluate against an instance independently.

This then requires the keyword to have knowledge of the schema in which they reside. While your concept grants that additionalProperties doesn't need the evaluation result of these other keywords, it does depend on their presence (insofar as to say their absence is like they're empty).

This means you can evaluate keywords in parallel if the implementation's programming language supports it, which could mean big performance improvements in many cases.

While you may not be fully processing the dependent keywords in your static analysis, you still need to minimally consider them multiple times, which can subvert gains you get from parallel processing.

My opinion is that we should have keyword independence as a goal, but not a hard-and-fast requirement. Supposing that a keyword can be designed to work in such a way that allows it to be independent of other keywords, it should be. However, we should recognize that sometimes keywords need the results from other keywords, and we need to be okay with this.

Additionally, you still get the benefit of parallel processing if you group by dependency depth. For example, additionalProperties depends on properties and patternProperties, but those two are independent, and unevaluatedProperties depends on all three. So properties and patternProperties have a dependency depth of 0, additionalProperties has a dependency depth of 1, and unevaluatedProperties has a dependency depth of 2 (and that's really as deep as it goes). Grouped like this, all 0-depth keywords can be processed in parallel, followed by all 1-depth keywords, etc.
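
The grouping described above could be sketched roughly as follows. The DEPENDS_ON table only mirrors the example in this comment and is an assumption, not a complete list from the spec; dependency_depth and group_by_depth are hypothetical helper names.

```python
# Assumed dependency table, covering only the keywords named in the example.
DEPENDS_ON = {
    "properties": [],
    "patternProperties": [],
    "additionalProperties": ["properties", "patternProperties"],
    "unevaluatedProperties": [
        "properties",
        "patternProperties",
        "additionalProperties",
    ],
}


def dependency_depth(keyword, depends_on):
    # Keywords missing from the table are treated as fully independent.
    deps = depends_on.get(keyword, [])
    if not deps:
        return 0
    return 1 + max(dependency_depth(dep, depends_on) for dep in deps)


def group_by_depth(keywords, depends_on):
    """Return batches of keywords; each batch only depends on earlier ones."""
    groups = {}
    for keyword in keywords:
        depth = dependency_depth(keyword, depends_on)
        groups.setdefault(depth, []).append(keyword)
    return [groups[depth] for depth in sorted(groups)]


batches = group_by_depth(
    [
        "unevaluatedProperties",
        "properties",
        "additionalProperties",
        "patternProperties",
        "minimum",
    ],
    DEPENDS_ON,
)
```

Every keyword within a batch could be handed to a worker pool, with a barrier between batches so that later batches can read earlier results.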

1 reply

jdesrosiers Jul 22, 2022
Maintainer Author

This then requires the keyword to have knowledge of the schema in which they reside.

Yes, the point of this redefinition is to expand the context that the keyword has access to from just the keyword value to the schema the keyword is in. The input to the keyword handler would include a schema and a pointer to the keyword in the schema. Getting information about other keywords is a matter of manipulating the pointer. Something like looking at a sibling keyword is a trivial operation.
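
A minimal sketch of what "manipulating the pointer" might look like, assuming the pointer is an RFC 6901 JSON Pointer string. The helper names (resolve_pointer, sibling_pointer, sibling_value) are made up for this example and do not come from any particular implementation.

```python
def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against nested dicts and lists."""
    node = document
    if pointer == "":
        return node
    for token in pointer.split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def sibling_pointer(keyword_pointer, sibling_name):
    # Drop the last segment (the keyword itself) and append the sibling name.
    parent, _, _ = keyword_pointer.rpartition("/")
    escaped = sibling_name.replace("~", "~0").replace("/", "~1")
    return f"{parent}/{escaped}"


def sibling_value(schema, keyword_pointer, sibling_name, default=None):
    """Return the value of a sibling keyword, or `default` if it is absent."""
    try:
        return resolve_pointer(schema, sibling_pointer(keyword_pointer, sibling_name))
    except (KeyError, IndexError, ValueError, TypeError):
        return default


schema = {
    "properties": {
        "foo": {
            "properties": {"a": {}},
            "additionalProperties": False,
        }
    }
}
here = "/properties/foo/additionalProperties"
siblings = sibling_value(schema, here, "properties", default={})
missing = sibling_value(schema, here, "patternProperties", default={})
```

With this shape, a handler for additionalProperties receives (schema, here) and treats an absent sibling as an empty object via the default.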

While you may not be fully processing the dependent keywords in your static analysis, you still need to minimally consider them multiple times, which can subvert gains you get from parallel processing.

It depends on the level of static analysis necessary. That would have to be taken into account when designing keywords. Generally I think it's unlikely that it would outweigh the ability to parallelize, but I don't have numbers to prove that. In any case, that's not an issue with implementations that compile schemas because that minimal amount of extra work happens in the compile step and doesn't affect validation cost.

Additionally, you still get the benefit of parallel processing if you group by dependency depth.

Yes, but one of the benefits of the redefinition is to simplify the processing model. This requires adding a concept of dependency depth to the model which isn't necessary otherwise.

Supposing that a keyword can be designed to work in such a way that allows it to be independent of other keywords, it should be.

I agree with this, but as one factor to consider. There's always a way to define keywords to be independent, but there are other factors that might lead us to choose otherwise. I think fully independent is preferred and if that doesn't make sense, then schema independent is preferred. We may still decide to add keywords that aren't even schema independent (like unevaluated*) if we think there is sufficient value in doing so, but we should be cautious in these cases because they will be more difficult to implement.

Therefore, maybe it makes sense to add this concept rather than replace the old one. There's still value to the current definition, it's just not as strong as the value of the proposed new definition. It's possible to have both.

gregsdennis · 2022-07-26T21:07:45Z

gregsdennis
Jul 26, 2022
Maintainer

Related to json-schema-org/json-schema-spec#701

12 replies

jdesrosiers Sep 12, 2022
Maintainer Author

With dynamic references, the question is whether they're hard because dynamic stuff is hard, or hard because they're extra-complicated.

They are extra complicated, but for me the extra-complicated bits weren't what made it hard to implement. It was thinking through how things behaved dynamically during evaluation and figuring out how to represent that. But that's more a "hard to reason about" thing than a "hard to implement" thing. So, maybe it's only hard to implement because it's hard to reason about.

Can you go into a little more detail?

I never had the time to profile it in detail, but here's what I think the problem is. State can be difficult to manage because it's not just one global state. It can grow and shrink as you go down one branch and return to follow another branch. Rather than trying to figure out what state needs to be removed in which cases, I use immutable state objects. When I go down one path and its state needs to be dropped, I abandon the copy of the state that went down that path and use the copy I had before I went down that path. I don't think I explained that very well. I hope it made sense. Anyway, every time I change the state I effectively make a copy because the state is immutable. All that object creation can get expensive. I'm sure I could make it work with a mutable state, but I expect that would be significantly more complicated. It would also make it impossible to implement parallelized evaluation, which is something I want to do at some point.
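
For readers who want something concrete, here is a rough sketch of the immutable-state approach under simplifying assumptions: the state tracks only a set of evaluated property names, and each branch is a hypothetical callable returning (valid, new_state). It is not the actual implementation being described.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    evaluated: frozenset = frozenset()

    def with_evaluated(self, *names):
        # Never mutates: every change allocates a new State object.
        return replace(self, evaluated=self.evaluated | frozenset(names))


def evaluate_any_of(branches, instance, state):
    """anyOf-like evaluation: state from failed branches is simply abandoned."""
    any_valid = False
    merged = state
    for branch in branches:
        # Every branch starts from the same snapshot taken before branching.
        valid, branch_state = branch(instance, state)
        if valid:
            any_valid = True
            merged = merged.with_evaluated(*branch_state.evaluated)
        # If not valid, branch_state is dropped; nothing needs to be undone.
    return any_valid, merged


def integer_a(instance, state):
    state = state.with_evaluated("a")  # recorded before the branch fails
    return isinstance(instance.get("a"), int), state


def string_b(instance, state):
    state = state.with_evaluated("b")
    return isinstance(instance.get("b"), str), state


start = State()
valid, final = evaluate_any_of([integer_a, string_b], {"a": "no", "b": "yes"}, start)
```

The failed branch had already recorded "a", but because it worked on its own copy, final.evaluated contains only "b". The allocation on every with_evaluated call is the cost mentioned above, and the absence of shared mutable data is what keeps parallel branches safe.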

handrews Sep 12, 2022

Thanks, that does make sense to me. I'd actually love to put some real focus on performance for the next release (more on that in a longer reply I'm working on), although the "next release" is kind of overloaded with major questions right now.

Both parallelization and schema compilation are techniques that I also think have been under-considered (certainly by me) over the last several drafts. These arguably fall under the broader umbrella of "performance" but each have their own concerns.

You mention several tradeoffs in that last sentence that are worth exploring in detail in terms of what the spec demands. There are quite a few options here, including defining restricted subsets (whether in terms of keywords, behavior, principles, whatever) for certain well-understood tradeoffs. @karenetheridge's JSON::Schema::Tiny is an inspiration here. @jdesrosiers I know your own implementation is modular, but I haven't looked into how you've broken up the functionality.

Part of what I wanted to do with vocabularies (and the reason unevaluated* got moved to their own vocabulary, and perhaps something similar could/should happen with $dynamic* although its use in meta-schemas makes that more complicated but not impossible) was move away from JSON Schema being everything to everyone. You've raised some significant questions about the granularity of vocabularies vs keywords, but wherever that lands, giving implementers a clear way to express "you can do these things but not those things, and it's not some weird personal decision but a choice that is intentionally accommodated by the spec" is an option I think we should consider.

gregsdennis Sep 12, 2022
Maintainer

Rather than trying to figure out what state needs to be removed in which cases, I use immutable state objects. When I go down one path and its state needs to be dropped, I abandon the copy of the state that went down that path and use the copy I had before I went down that path.

Mine did this as well, until I had a user suggest a mutable context object (my state). Now my state object is built with individual stacks, and I have to manage it. I'm not sure why it makes a significant difference because it's still tracking the same amount of data, but it seems to.
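
For comparison, a mutable context built on a stack might look roughly like the sketch below. This is an illustration under assumptions, not the actual implementation mentioned above: Context, its methods, and the branch callables are all hypothetical, and only one kind of data (evaluated property names) is tracked.

```python
class Context:
    """Mutable evaluation context; frames are pushed and popped by hand."""

    def __init__(self):
        self._frames = [set()]

    def mark(self, name):
        self._frames[-1].add(name)

    def push(self):
        self._frames.append(set())

    def pop(self, keep):
        frame = self._frames.pop()
        if keep:
            # Fold a successful branch's data into the enclosing frame.
            self._frames[-1] |= frame

    @property
    def evaluated(self):
        return set().union(*self._frames)


def evaluate_any_of_mutable(branches, instance, context):
    any_valid = False
    for branch in branches:
        context.push()
        valid = branch(instance, context)
        context.pop(keep=valid)  # the caller must keep push/pop balanced
        any_valid = any_valid or valid
    return any_valid


def integer_a(instance, context):
    context.mark("a")
    return isinstance(instance.get("a"), int)


def string_b(instance, context):
    context.mark("b")
    return isinstance(instance.get("b"), str)


context = Context()
valid = evaluate_any_of_mutable([integer_a, string_b], {"a": "no", "b": "yes"}, context)
```

Only one set is allocated per branch instead of a new state object per change, but the push/pop pairing has to be maintained manually, and sharing one context between parallel branches would not be safe without extra coordination.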

I need to profile mine as well. Corvus (.NET) claims to have a minimal allocation implementation, but I don't know if the test they showcase was specially selected or not.

gregsdennis Sep 12, 2022
Maintainer

maybe it's only hard to implement because it's hard to reason about.

This was it for me. Once I got my head around it, building it was straightforward.

handrews Sep 12, 2022

It was thinking through how things behaved dynamically during evaluation and figuring out how to represent that. But that's more a "hard to reason about" thing than a "hard to implement" thing.

@jdesrosiers @gregsdennis I realize it's very hard to do a hypothetical on something that you've had to work through in a very different way already, but does the following way of framing dynamic lookups/identifiers (using a hypothetical $da and $dr that don't do everything the current $dynamic* keywords do) feel easier to reason about? Both of these keywords only have effect when the schema object that contains them is evaluated.

"$da": "x" associates "x" with the current dynamic scope if and only if "x" is not present in any parent dynamic scope
"$dr": "x" looks for the nearest parent scope with an "x" associated with it by $da, and resolves to the associated schema location

"schema location" here is the URI captured in standard output for any annotations or errors for that dynamic scope, currently as absoluteSchemaLocation, and in the future as just schemaLocation.

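One possible reading of the hypothetical $da and $dr rules above, sketched as a stack of dynamic scopes. Everything here is an assumption for illustration: the DynamicScopes class, the example URIs, and in particular the choice to let $dr see an association made by the schema object currently being evaluated as well as those made by its parents.

```python
class DynamicScopes:
    """Stack of dynamic scopes; each scope maps $da names to schema locations."""

    def __init__(self):
        self._stack = []

    def enter(self, schema_location, schema):
        anchors = {}
        name = schema.get("$da")
        # Checked before pushing, so only parent scopes are consulted:
        # "$da" takes effect only if no parent already associated the name.
        if name is not None and self.resolve_dr(name) is None:
            anchors[name] = schema_location
        self._stack.append(anchors)

    def leave(self):
        self._stack.pop()

    def resolve_dr(self, name):
        # Nearest scope first (assumption: the current scope is included).
        for anchors in reversed(self._stack):
            if name in anchors:
                return anchors[name]
        return None


scopes = DynamicScopes()
scopes.enter("https://example.com/root", {"$da": "x"})
scopes.enter("https://example.com/child", {"$da": "x"})  # ignored: parent has "x"
target = scopes.resolve_dr("x")
unknown = scopes.resolve_dr("y")
```

Under this reading, target is the root location, since the child's $da has no effect once a parent scope has claimed the name, and an unmatched $dr yields None.
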
handrews · 2022-09-12T20:46:11Z

handrews
Sep 12, 2022

@jdesrosiers pulling this out to a top-level comment b/c the other thread has a good discussion going in a slightly different direction:

So, the question is: is the benefit of dynamic behaviors worth that loss [of performance and the ability to do parallel evaluation]? Personally, I think it can be worth it in exceptional cases, but not in the general case.

But what does that "exceptional" vs "general" really mean in practice?

From a framework+modular keywords perspective, dynamic behaviors are either supported or not. There are two main dynamic behaviors: looking up an identifier in parent dynamic scopes, and reading annotations from child dynamic scopes. When it comes to cost, there are three areas where cost is incurred:

framework code (meaning anything shared by multiple keywords, whether your implementation makes a clear division between "framework" and "keyword" or not)
standard keywords
3rd-party extension keywords.

[side note: I'm going to ignore if/then/else because it's only the requirement that no more than one of then or else be evaluated that makes it non-parallelizable. That requirement could easily be relaxed as long as annotations are deleted correctly for the unnecessary branch.]

Framework costs

Framework support for these behaviors is a one-time cost. Unless we eliminate them entirely (and unevaluated* and $dynamic* with them), that cost is factored in. I've always been concerned about the annotation cost, and filed #236 to address that. As noted above, I did not expect parent lookups to be expensive, so I'm interested in what you have to say about that (since it's state-related, it might be relevant to annotations as well: identifiers and annotations basically only differ in direction of usage, parent to child or child to parent).

We can and should work to reduce this cost, but the only way to eliminate it is to eliminate those keywords, and I don't think you or anyone else are proposing that we do that (please correct me if I am wrong).

We should also work to understand how to make these keywords as parallelization-friendly as possible, while acknowledging that they do impose some serial execution requirements when they are used. We should work to ensure that they do not reduce parallelism when they are not in use.

Standard keyword costs

The area of biggest concern would be standard keyword costs, because most implementations will (correctly) feel obligated to support these, and each such keyword can add additional cost and impose additional serial execution requirements. Part of why I filed #236 is to ensure that, as much as possible, standard keywords with dynamic behavior only incur (most of) their cost when they are used.

While there are coarse-grained optimizations that can be done now (e.g. don't collect annotations if unevaluated* are not being used and the caller did not ask for annotations to be collected), we can make that a lot more fine-grained. I would hope that we can do something similar with $dynamic* if you have not already done that.
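
The coarse-grained check mentioned in parentheses could be approximated with a static scan like the one below. This is a simplified sketch: uses_unevaluated and should_collect_annotations are hypothetical names, the walk treats every nested object as if it might be a schema, and it does not follow $ref into other documents, which a real implementation would need to do.

```python
def uses_unevaluated(schema):
    """Return True if an unevaluated* keyword appears anywhere in the document."""
    if isinstance(schema, dict):
        if "unevaluatedProperties" in schema or "unevaluatedItems" in schema:
            return True
        return any(uses_unevaluated(value) for value in schema.values())
    if isinstance(schema, list):
        return any(uses_unevaluated(item) for item in schema)
    return False


def should_collect_annotations(schema, caller_wants_annotations=False):
    # Collect only when something will actually read the annotations.
    return caller_wants_annotations or uses_unevaluated(schema)


plain = {"type": "object", "properties": {"a": {"type": "string"}}}
strict = {
    "allOf": [{"properties": {"a": {"type": "string"}}}],
    "unevaluatedProperties": False,
}
```

Because the scan is purely static, it can run once per schema. Its false positives (for example, a property that happens to be named unevaluatedItems) only cost unnecessary collection, not correctness.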

But as far as I know, there aren't any more dynamic keywords in high demand. I certainly have no plans to propose any, and the ones we have all met much higher bars of demand and/or discussion before they were accepted than most anything else in JSON Schema. I would hold other dynamic keywords to similar standards, and would support (and write if folks would like) an ADR making that high bar very clear.

So if your concern about exceptional vs general case is primarily about standard keywords, I think I agree and I would like to help codify that.

3rd-party extension keyword costs

This is where we should have a general discussion about what our obligation is regarding what people could do with extension keywords. My keyword behaviors proposal is intended to put some boundaries on that so that there's a limited possibility for extensions to kill JSON Schema understandability and performance and damage the overall reputation of the project. Your counter-idea of doing this at a different abstraction level of principles would presumably accomplish the same goal (and I haven't tried to move the keyword behaviors stuff towards anything concrete because I honestly don't know which level will work out better).

Whether we're talking about behaviors or principles, separating those things into "framework costs" (whether there's a literal framework or not) that are tightly constrained by the spec, vs "keyword costs" that might vary substantially, limits some of the possible costs of 3rd-party keywords. Which is obviously a more general concern than just dynamic behaviors, although dynamic behaviors are among the costs we most want to contain.

Beyond these containment strategies, I don't think there's much we can or should do. We could exclude dynamic behaviors from the modular keyword "API" (in quotes b/c I don't mean literally specifying an API, more of a conceptual thing). But that would be hard to enforce and might actually make people think about those behaviors more. Plus, it would preclude valuable explorations of the tradeoffs.

To summarize...

The most intractable costs are framework costs, and we just have to manage those through "Making annotation collection practical" (#236) and similar discussions; this is where specification work is relevant
To the extent that anything should be "discouraged", I think that is best done with an ADR establishing a high bar for accepting new dynamic keywords into the standard JSON Schema Org-maintained vocabularies; this sort of guidance doesn't belong in the spec, but does belong in a formal Org-published document of some sort
Since the ADR in the previous point should make the trade-offs clear and communicate that throwing a bunch of dynamic keyword proposals at the spec team is unlikely to be successful, I don't think anything beyond that should be done about potential 3rd-party dynamic keywords; there's just no upside in making a big deal about it

2 replies

jdesrosiers Sep 12, 2022
Maintainer Author

But what does that "exceptional" vs "general" really mean in practice?

From a framework+modular keywords perspective, dynamic behaviors are either supported or not.

I haven't read the whole comment and I'm too tired to today, but I can explain what I meant here. If something is generally supported it should be part of the framework and have plugin support, etc. If it's not generally supported, it's something that would have to be built into the implementation. Effectively that would mean that we as JSON Schema Org can make exceptions, but there wouldn't be support for third parties to make vocab keywords with dynamic behaviors.

That's probably too extreme. What I had in mind was more like it being optional for implementations to support third-party vocab keywords with dynamic behaviors.

handrews Sep 13, 2022

If something is generally supported it should be part of the framework and have plugin support, etc.

What I had in mind was more like it being optional for implementations to support third-party vocab keywords with dynamic behaviors.

This actually dovetails with some things related to your process proposal that I'm going to open a new discussion on, about framework vs features vs keywords. I'll link it here when I do. Feel free to continue here with anything that feels relevant.
