-
-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"$id": Eliminate base URI shadowing #726
Comments
I'm not sure this improves schema authoring or writing implementations. As far as I can tell, this would mean I'd have to add a condition to implementations saying "Evaluate JSON Pointer fragments until an $id is encountered, so as to make anything under a subschema with $id inaccessible." At that point, it seems like we just want to get rid of JSON Pointer fragments altogether. If I've got this right, why would this be desirable? |
@awwright I found having to manage a stack of in-scope base URIs and the current relative position to each to be one of the most complicated things to implement. But @johandorland can perhaps speak to a more real-world scenario. |
@awwright so far what I have established with recent issues is that there are people who think every possible thing to do or not do with I should probably just ignore all comments/requests/complaints regarding |
Do you mean in
The most important question, I think, is "was this ever intended to be a feature at all?" We added it to examples just in the most recent raft for... um... reasons? Honestly I cannot tell why. It's not clear to me that it was ever intentionally made a feature. I basically filed this issue because I think I made a mistake putting the examples in. Do you remember why that seemed like a good idea at the time? |
The way I look at this shadowed base URIs are not really a feature, but more of an unintended consequence. In the past I've had a hard time explaining to people in the JSON schema slack that a single part of the schema can be addressed by an arbitrary number of Conceptually it'd be nice if every part of a schema only has a single URI it can be @awwright has a very good point though. Currently the way I and I suspect most others to this first pass is to register all different base URIs contained in the JSON and when a So if anyone thinks Option 1 is a good idea, we should just say that the behaviour of shadowed base URIs is undefined. That way an implementation is a bit more free in how it wants to implement behind the scenes, but others do not have to go out of their way to explicitly disable it. Even though I suggested this I'm not actually sure it's a good idea. Even though I don't like shadowed base URIs, the cure might be worse than the problem itself as I'm not sure how we disallow it in a proper manner without making it more complicated. All I have to say about option 2 is that I don't get the mental model described, so I can't really comment on that. |
I think it's odd that if I embed a document using
But that's intentional with a clear use case. There are two defined fragment syntaxes, and the plain name syntax is for labeling schemas as kind of analogous to exported function from a module. They are independent of the schema's location within the document, so you can refer to them from outside the document without worrying if the internal structure gets rearranged somehow. Note that you cannot use plain name fragments across |
👍 for Option 1
I'd be ok with calling this behavior "undefined" as @johandorland suggests. We can let linters check for document boundary violations. It'd be no worse than what JSON does with calling the behavior of duplicate properties "undefined".
I have no idea what you mean by this. I'm sure it can't mean what it sounds like. Option 2I've always found it awkward to put The approach I prefer is that it just doesn't make sense to put an However, the approach that seems to make most sense given the path JSON Schema has chosen, is to consider I think defining an |
Not using JSON Pointer at all, only named schemas. (I'm not sure it's viable to remove outright, but at least for authoring I don't see any reason to use pointer fragments anymore.) |
The major difference between HTML and JSON schema is that HTML is online by default whilst JSON schema is not. It's even discouraged that remote references are downloaded by default. Having identifiers within each JSON document makes it very helpful to keep track of them. In my opinion https://json-schema.org/latest/json-schema-core.html#rfc.section.8.3.1 explains this very well. You can just pass a schema to an implementation and it will sort out all the references and you don't have to worry about telling which schema is identified by what URI. |
I don't think it makes any difference whether it's remote or local. Either way you are retrieving schemas from somewhere.
I'm still not entirely clear, but I'm assuming that by "named schemas" you are referring to the location-independent identifiers as currently defined in the spec. If so, that's interesting. It's a little less efficient than using pointers because you would have to scan the document for all the named schemas before you could process the schema. But if you support named schemas at all, you have to do that anyway. Without pointers, schemas wouldn't be able to arbitrarily reference any part of a schema. Especially with regard to remote references, this might be a good thing. It makes only part of a schema "public" (can be As for removing JSON Pointer support entirely, I'll point out that JSON Pointer (or something similar) is required for my use-case of JSON Reference. My use case doesn't involve JSON Schema, so it's not relevant by itself, but with the introduction of vocabularies we might not want to keep options open. If my use-case requires JSON Pointer, some vocabulary might need it as well. |
Also consider: All parsers have to do some sort of indexing. If you're a JSON parser, you're indexing a byte stream into a set of properties or items. We're just so used to the JSON parser doing it for us that when we add an additional thing to index ("$id"), it seems foreign: We have to take the parsed result, and scan over it again. If you were parsing a JSON Schema document using a SAX-like event stream instead, it wouldn't look like an additional step at all. It would just be an event you handle, just like all the others. |
🐡 Mind blown! (there's no github exploding head emoji, so... blowfish 🤷♂️ ) |
@johandorland wrote:
Yes, this! @jdesrosiers without root Most of my career has been in enterprise infrastructure appliances and/or cloud-hosted services. The appliances are frequently behind firewalls with limited external access, and a strong resistance to opening additional routes through the firewall. With managed service providers and the largest enterprise customers, even with cloud-hosted products they want their own instance in their private cloud with their own security controls. This is an even more stringent requirement when selling to the government or other highly regulated spaces, where the product must be "air-gapped", meaning that it cannot rely on a connection to the outside internet in any way. So for all of these use cases, while the schemas are hosted on the manufacturer's public web site, and their Furthermore, during development, schemas are often written as YAML or JavaScript and loaded from a local filesystem, from files with a .yaml or .js (or .json5, etc.) extension so that editors work properly. In that case as well, it is important that the root Root |
@awwright @johandorland @jdesrosiers my preference at this point would be to make the behavior of shadowed JSON Pointer fragments undefined. The philosophical justification is: It should (SHOULD?) be possible to move any document embedded using This is not possible if you are using shadowed JSON Pointer references, so the behavior of such references is necessarily undefined. This avoids imposing a further burden on implementors, but makes clear why schema authors should not do this regardless of any one implementation's support. For use cases where schemas are sometimes bundled into a single file, this ensures that the same references will work in both bundled and unbundled states. How does that sound? |
@handrews Consider this brief aside for a moment: The W3C TAG published this a long time ago saying (in brief) you shouldn't parse arbitrary data as a certain media type unless you know it was given that media type by the server: https://www.w3.org/2001/tag/doc/mime-respect Well there's a similar thing here, should we know affirmatively that the inner instance being pointed to is, in fact, a schema (and not a random value)? For example, should this be legal (even though the $ref isn't pointing to an item known to be a schema): {
$id: "#",
properties: {
"foo": { const: { type: "string" } },
"bar": { $ref: "#/properties/foo/const" }
}
} |
That's all well and good, but it's completely irrelevant to the point I was making. My point applies to whether schemas are network accessible, or on a local file system, or in memory. It doesn't matter. There's always some sort of locator. Even if the Ultimately, which addressing approach you think is better is going to depend on your use-case. Every use-case I want to use JSON Schema for involves the schema being part of a web-based system (such as a Hypermedia API). It might be a closed network, but it's still web based. Web addressability is essential in these use cases and It's probably best that JSON Schema continue to support and encourage the use of both depending on what makes the most sense for the use case. Therefore, I would argue against ...
|
@awwright You're packaging up several things in an effort to come up with a counter-example. It's a good counter-example, and a good reference drawing a parallel with The way you tell whether the "server" (schema author) as labeled an object or boolean as a schema is by noting whether it is being applied by an applicator keyword, or stored in a schema placeholder keyword ( Obviously, in the case of nested subschemas, they all need to be chained together by applicators / schema placeholders (I just made up that term, it's not great, everyone please roll with it rather than derailing onto the topic of what to call In PR #713 makes it clear that being a There is more discussion on this in issue #687, and there was even more on slack. If we continue in the current direction for another draft, we will formalize how an implementation can recognize extension applicators/placeholders, as that is the only piece missing. Currently, it either needs to be a standard keyword, or one that an implementation understands as a known extension. |
@jdesrosiers I don't know how to respond to an objection of "I don't care about your use case or your entire industry's use case." You're far from being one of the most difficult offenders here, but this week has been a constant stream of people telling me that only their use case matters and everything else should be thrown out, so I'm probably way more angered by this position today than I otherwise would be (in general, not at you personally). At this point, no one is pushing hard to forbid using the retrieval URI in addition to the root |
I'm seriously going to write a code of conduct and it will include THOU SHALT NOT DISPARAGE OR DISREGARD ANYONE ELSE'S USE CASE. 😝 |
@handrews When I said your response was irrelevant, I wasn't saying that your use-case was irrelevant. I was saying that you were making a counter-argument for something that wasn't what I was arguing. It's like I said, "Dogs are great companions" and you responded with, "No way! Carrots taste great and they're good for you too. I eat carrots all the time". That's the kind of irrelevant I was talking about. To attempt again, to clarify, I am not suggesting that JSON Schema should always be network accessible. Networks had nothing to do with what I was talking about.
I never thought anyone was. You mentioned it as an option. It was relevant to the discussion, so I mentioned it. If I came off as defensive or combative, I'm sorry. |
Sigh. I do apologize. It's been a really long week for me with this project and I'm more than a little on edge. And trying to push through anyway because I know at least some people want this draft despite recent assertions by a vocal minority to the contrary. I've been told a lot of stuff I've spent a lot of time and effort trying to get right over the past two years is irrelevant or harmful in the last week, so that was not a fun sentence to read right now. Not your fault. That said, I do not see my response as irrelevant. The presence or absence of a network was not the point I was trying to make. The point is that when you say:
I am responding that, based on my numerous use cases with real, shipping products, there is no usable locator. The problem is not that there is no network-accessible locator. It's that no possible locator is usable in place of |
To me, this is key, and is the main issue various people have been talking about with regard to $ref / $id support, as I understand. It's been created as a dynmaic system to allow flexibility, allowing people to create converluted situations. It's not clear that a specific subset of those situations have a use case, which creates authoring issues, and may create implementation issues. |
I'm in favour of reducing the complexity and possible situations here, and waiting to see if anyone says this would cause issues for them. I would note, I would prefer to avoid any more "...behaviour is undefined" situations. If I understand correctly, that isn't the objective here, but just in case I haven't understood everything quite right. |
I'd like to give one or two more (different) people a chance to comment, before moving this to |
What is the motivation for requiring implementations to error out? It's definitely nicer, but @awwright did raise a concern that that might be harder than just letting it work for some implementations. |
At first glance I'd think sure why not, then I realized by using these tactics you can do some even nastier things. You'd probably break a lot of JSON schema implementations once you start putting It's probably off-topic for this issue, but it might be worth to consider to forbid these kind of references in one way or another. |
My understanding is, we want to make this change because of two main reasons:
I realise I'm not the average JSON Schema author, but the reference examples you pasted from the current spec in your opening issue comment here, seem logical and OK to me. I can see those examples, look at each one, and can understand the resultion reasoning. In fact, this list of examples was a great help in getting me to understand. I'm not an implementer. I don't maintain an implementation, nor am I likely to. I can't speak to etiher of those use cases, but if the reason for the change is because it's confusing, making it "not officially supported" but not likely to "throw an error", feels to me like it would create even more confusion? Maybe because it's so unliekly that anyone is doing this, it won't be a problem (just like how it's uncommon for someone to have duplicate keys in a JSON object). Maybe I'd feel easier about it if we were to add something like. |
OK, I've had a discussion with @handrews about this off github to make sure we're on the same page. I would propose that It looks like @awwright was the only one to propose that this would cause a problem. Can you expand on the reasons for this please? After discussion with @johandorland (and please correct me if I'm wrong), he doesn't feel it would cause an issue for his implementation (just to take an example from the other end). |
I think @KayEss did not implement shadowing at all in his code, so might offer another useful counterpoint. (and yes, I support what @Relequestual wrote above) |
I don't find this very convincing. If I mirror your schema somewhere else then it is available under two URLs. If you mirror the schema inside another schema why would you expect any difference? |
@KayEss, I think you're missing the point. Assume we have the following two schemas.
{
"type": "object",
"properties": {
"aaa": { "$ref": "/bar" }
}
}
{ "type": "string" } Now assume we want to embed the referenced
{
"type": "object",
"properties": {
"aaa": {
"$id": "/bar",
"type": "string"
}
}
} Now consider the URI URI fragments indicate a fragment of the document. If we allow JSON Pointers to "see inside" |
I don't agree with that. I think that the documents are different (in one you have more stuff) and so a user visible change in behaviour is reasonable. I don't think that we should try to specify some sort of "smart" or "multi" document format within JSON. If you want to mirror a schema inside another that opens up aliasing, just as it does if I want to mirror your This shadowing thing should just remain undefined behaviour for now. |
That's what we're trying to do- make it undefined (or possibly an error, but the momentum seems to be with undefined). Right now it is required that all of the aliases work. @jdesrosiers I'm still more or less with you on the embedded document thing, but this issue doesn't decide for or against that model. I wanted to split things up and see if we could even get agreement on this part (which is apparently at least a bit challenging on its own). |
Fair enough, but I feel like the justification for this proposal hinges on the concept of
To be clear, this is far from my goal. This (for me) is all about the principle that fragments are not allowed to reach from one document to another. If you don't accept the
I strongly disagree with this. I think the embedded document model is far more simple in all respects than the base URI change model. The concepts you have to understand are few, simple, and consistent. Which leads to a very simple implementation with few edge cases. |
Leaving aside the embedded document perspective, there is another problem with the base URI change model. It allows one identifier to identify two different documents. Taking this example again.
{ "type": "string" }
{
"type": "object",
"properties": {
"aaa": {
"$id": "/bar",
"type": "string"
}
}
}
|
I guess we both agree that if you replace the embedded schema at I do like the embedded document idea is a metaphor to explain things, but I'm not convinced that the metaphor should suggest this sort of change to the aliasing rules. Full disclosure though, my implementation handles this aliasing just fine, and banning it would be a significant amount of extra code. |
Hence the likelihood of declaring that behavior to be undefined, which is standards-ese for "you really shouldn't use this and definitely can't rely on it but sometimes it might work" Basically, I'd like to discourage its use by schema authors without requiring implementations to rip it out. |
I'm going to write a PR for this, possibly in combination with other |
I don't think I have it in me to get this through in draft-08 |
OK, I read back through all of this and think we all did a really great job of sorting through the issues and settling on the "shadowed URIs have undefined behavior" change. Let's get this in draft-08 after all: I'm writing the PR right now. |
"URI shadowing" (h/t to @johandorland for the term) refers to the scenario when there are multiple
$id
s in a schema, with at least one$id
in a subschema of another object containing$id
(such a the root schema object).When did this become a thing?
Here is the schema and various resolved URI examples, copied and pasted directly from the current spec:
Notice that all of the locations (the list section headers) are done as JSON Pointer fragments, relative to the root of the entire example schema.
Prior to draft-handrews-json-schema-01, the analogous example section did not show any examples with JSON Pointer fragments. Using the exact same schema, it showed these resolved URIs only:
But you can see that the locations were done the same way- as a relative JSON Pointer fragment from the overall document root.
The key point here is that we added examples of URI shadowing in handrews-*-01, which was a clarification of draft-07 (draft-07 was originally published as handrews-*00). So we didn't even introduce these with draft-07. There was some confusion over it (and other aspects of
$id
and$ref
) and we added the examples when we clarified the text.Why did we do this?
Good question. The change was made in PR #550, which was primarily about removing the terminology around "internal references" vs "external references". Which had itself been a great improvement by @awwright over the draft-04 language around "inline resolution" and "canonical resolution". None of these prior approaches directly explained how JSON Pointer fragments work across the presence of
$id
.The issue mentioned by that PR only mentions JSON Pointer fragments once, at the end, after the PR was submitted: #545 (comment)
I know I had questions over how that should be handled when I tried implementing what were then the new draft-06 proposals. And I know that I ended up implementing URI shadowing. I don't remember why.
Quite a few people weighed in on and reviewed #545 and #550, so this isn't something that slipped in by accident.
Did anyone rely on this before handrews-*-01?
No clue.
Did anyone actually implement what's in handrews-*-01?
I think @johandorland did? He commented on the issue/pr. Not sure if anyone else did. @Julian?
Does the test suite test this?
No.
So, really, why did we do this?
I think the rationale was that since the locations in the existing example were given as JSON Pointer fragments, that must mean that those fragments were valid, which would definitely imply that every parent
$id
introduces a base URI that must be tracked throughout all subschemas, even if another$id
appears.Can we just not?
I think so. JSON Pointer fragment evaluation is just defined in terms of JSON document structure, without any thought given to changing base URIs or embedding one document in another. But media types get to specify fragment syntax and semantics, so I think we can reasonably say that it stops when the base URI is reset, and from that point on, the prior base URI no longer applies at all.
That is actually how we interpret plain name fragments. Note in this part of the example:
the
#bar
plain name fragment can be used with the innermost basehttp://example.com/other.json
, but not with the outer basehttp://example.com/root.json
.So restricting the use of JSON Pointer fragments to not cross
$id
boundaries would make their behavior more consistent with how we handle plain name fragments.Similar to #724, this considers
$id
to establish a document boundary, with the new base URI it creates applying to the document within that boundary. Unlike #724, what is proposed here does not make any other changes to$id
's behavior. In particular, the effect of an$id
with a JSON Pointer fragment remains undefined, although we can address that separately later if we want to.So what would this look like?
There are two options WITHIN THE SCOPE OF THIS ISSUE. If you want to talk about other options, file them yourself :-) Comments that go off-topic here will be deleted.
Here's the example schema again so you don't have to scroll so much:
Option 1: Exactly one valid URI involving a JSON Pointer fragment for each location:
This is a lot more straightforward, and all of these addresses work whether the schemas were loaded as one document (as in this example), or if each of
root.json
,other.json
, andinner.json
had been loaded separately, and$ref
-ed each other.Option 2: But what about the URI used to fetch the document?
If it's possible to retrieve this example from
http://example.com/alias.json
(as opposed to or in addition to its declared$id
ofhttp://example.com/root.json
), then should fragments also work from that base?I can see it either way.
One option is to consider fetching this from
alias.json
and discovering that its declared$id
isroot.json
is to conceptualize this as a redirect, in which case the fragment is applied to the redirect target (at least in HTTP). h/t to @jdesrosiers for this mental model. In this sense, anywhere we haveroot.json
we would also require supportingalias.json
On the other hand, JSON Schema repeatedly talks about URIs as identifiers rather than locators, so I think we could rationally say that the
$id
is how the schema needs to be referenced. It's fine if it's fetched from elsewhere outside of the reference process (like being pre-loaded from a local filesystem) but does that mean that we should support those loading URIs when an$id
is provided? (obviously if no$id
, then the loading URI is the only available base).Supporting the loading URI as a shadowed base, but nothing else, might be a bit odd, or it might be more intuitive. I really have no idea at this point 😆
The text was updated successfully, but these errors were encountered: