Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report schema paths in validation errors #199

Closed
Stranger6667 opened this issue Apr 29, 2021 · 10 comments
Closed

Report schema paths in validation errors #199

Stranger6667 opened this issue Apr 29, 2021 · 10 comments

Comments

@Stranger6667
Copy link
Owner

Stranger6667 commented Apr 29, 2021

When a validation error is reported, it contains only instance_path that points to the erroneous part of the input, but there is no information about what part of the schema caused this error.

The gist of the implementation is to keep a stack of strings while the input schema is compiled. The stack should be convertible to paths::JSONPointer.

The process for regular, non-composite validators (more on this below):

  • When the compilation process enters a validator, push its relevant keyword name to the stack (e.g. properties);
  • When the validator struct is created, clone this stack and store it within this struct;
  • When the validator is compiled - pop from the stack

I think that pushing/popping could be implemented similarly to instance_path - a linked list without heap allocations. This way, the actual popping step is not needed at all.

Composite validators are slightly more complicated - they are single structs that implement the logic for multiple different keywords or their specific variants (e.g. this one handles patternProperties and additionalProperties: false) which is done due to performance reasons.

To handle them, the stack should not be pushed during compilation, but rather during validation - if a validation error happens, the stored stack should be cloned, and the exact keyword variant should be pushed to it. And finally, it is moved to ValidationError.

Validators live in keywords.

A lot of work in this direction was done by @gavadinov in gavadinov@4014932, and I think that it could be a great base for implementing this feature.

@gavadinov, let me know if you are OK if somebody can base their work on your commit, or if you want to continue it yourself :)

@DzikiChrzan
Copy link
Contributor

May I take a swing on this?

@Stranger6667
Copy link
Owner Author

Hi @DzikiChrzan!

Absolutely! Let me know if you have any questions on it :)

@DzikiChrzan
Copy link
Contributor

I've been studying the code and I can't fully understand the flow of the data.
First of all, how does PathChunk looks like and what Index in it is?
Do you recommend some end-to-end tests that I could run with gdb to actually see the flow?

Instance is a path to json file, which is read via serde. At start there is called validate on Schema, which then calls crate::validate(self, instance) underneath. This is method from trait and I see that all keywords have those implementations, but I can't find out how validation by keyword is run on whole instance_json.

@Stranger6667
Copy link
Owner Author

First of all, how does PathChunk looks like and what Index in it is?

PathChunk is an enum that represents either a key within a JSON object or an index within a JSON array. Then a sequence of such chunks represents any path within a JSON value. For example, let's take the following JSON data that represents a stored "ls -lh /home" command:

{
   "cmd": ["ls", "-lh", "/home"]
}

If we'd like to extract the "/home" string, then we need to take the following steps when navigating from the document root:

  1. Go to the cmd property - it corresponds to PathChunk::Name("cmd".to_string())
  2. Within the cmd's value, we have an array - ["ls", "-lh", "/home"]. Here, we need to take the element with index 2. It corresponds to PathChunk::Index(2)

When ValidationError is constructed we convert a sequence of such PathChunk that we accumulated during the validation into Vec<PathChunk>. It is done within InstancePath::to_vec and used to create JSONPointer. That JSONPointer struct is stored in ValidationError.instance_path attribute and then later can be converted to a string (which is a valid RFC 6901 JSON Pointer) or to Vec<String>.

The main point of this two-variant enum is to avoid creating heap-allocated strings for indexes within JSON arrays as usize is a Copy type.

I'll update the docs for these structs, and maybe PathChunk and its variants could be renamed to reflect better what they are? E.g. PathComponent with PropertyName and ArrayIndex variants ... what do you think?

Do you recommend some end-to-end tests that I could run with gdb to actually see the flow?

I think that is_valid_another_type or is_valid tests from keywords/mod.rs are relatively small & might be convenient for this purpose.

Instance is a path to json file, which is read via serde. At start there is called validate on Schema, which then calls crate::validate(self, instance) underneath. This is method from trait and I see that all keywords have those implementations, but I can't find out how validation by keyword is run on whole instance_json.

I'll try to describe the main blocks of the compilation process first, as the validation part uses its outcomes.

The schema compilation starts in compilation::options::CompilationOptions::compile, and its goal is to traverse the input schema and build a tree of validators.

This tree is actually a vector of pointers (then it is a forest, I believe) that reference validators that implement the Validate trait. Some of them may have their own vectors of pointers to deeper validators and so on.

Each validator there knows how to process their inputs. For example - if the input is Value::Array, the ItemsObjectValidator validator will iterate over array items and delegate validation of each item to the next level in the validation tree (which is stored in its validators attribute). I.e., validators know how to descend deeper into their inputs.

When the validation process starts inside JSONSchema::validate, it iterates over the top validators level and delegates validation to them. As they know how to process their respective inputs, the instance object is processed recursively in the same way as the tree was build during the compilation step. I.e., all the delegation goes through the validate method, which may call validate methods on "deeper" validators and so on.

I hope that it helps :) If not - I'd be happy to elaborate.

@DzikiChrzan
Copy link
Contributor

DzikiChrzan commented May 1, 2021

E.g. PathComponent with PropertyName and ArrayIndex variants

I think that one's one understands point of PathChunk enum, then naming seems fine. Updating the docs is a good point.

Thank you for thorough explanation. I'll get back at you when I'll have design in mind, probably tommorow.

@DzikiChrzan
Copy link
Contributor

I just noticed, that gavadinov@4014932 is not merged yet.
Should I work on top of his fork or is it going to be one pull request (his changes + this issue)?

@Stranger6667
Copy link
Owner Author

Should I work on top of his fork or is it going to be one pull request (his changes + this issue)?

The best way is to use that commit as inspiration & implement this issue separately. It changes most of the places required to implement this issue, and that approach works.

The original goal of that commit was to track paths within instances, but the implementation was then re-worked in #184. Instances can't be unambiguously tracked by having data only from the input schema. However, the initial take in that commit could be extended to track paths within schemas.

After that point, I changed the way instance paths are tracked - it uses a linked list without allocating a vector for the path components. Probably something similar can be done for paths in schemas as well. Still, I'm happy with a vector-based stack as the compilation step performance is not that important compared to validation.

@DzikiChrzan
Copy link
Contributor

Sorry for long no-response period, but I got some time off and was resting from computer.
I've prepared a draft, so far I've added schema_path as a vector of Strings which are pushed before compilation and poped right after (just for couple of validators yet).
I just wanted to confirm with you that's more or less design you had in mind.
Also, I don't like idea of changing mutability of context but currently I see no other way...

@Stranger6667
Copy link
Owner Author

Hi @DzikiChrzan

Thank you, I'll check it tomorrow :)

Stranger6667 added a commit to DzikiChrzan/jsonschema-rs that referenced this issue Jun 19, 2021
Stranger6667 added a commit that referenced this issue Jun 19, 2021
Ref: #199

Co-authored-by: ueco <ueco@libertymail.net>
Co-authored-by: Dmitry Dygalo <dadygalo@gmail.com>
@Stranger6667
Copy link
Owner Author

Resolved via #226

Thank you, @DzikiChrzan, for taking a step on this :) I added few more things on top in that PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants