-
Notifications
You must be signed in to change notification settings - Fork 599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-6861 Schema Registry: verbose compat checks for JSON #23208
Conversation
Adds a helper for constructing singleton `raw_compatibility_result`'s.
Define `operator==` and `operator<<` for `compatibility_result` to allow using it in `BOOST_REQUIRE_EQ` in tests.
Add all json incompatibility types that we expect to report with their descriptions matching the reference implementation.
I don't think you need to backport this to v24.1 or v23.3, right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for this monumental effort. need to do a second pass to understand better the pr. left some bookmarks for snippets that i'd like to revisit
std::for_each( | ||
newer_it, newer_tuple_schema.end(), [&](json::Value const& n) { | ||
auto item_p = p / "items" / std::to_string(i); | ||
auto sup_res = is_superset(ctx, older_additional_schema, n, item_p); | ||
auto sup_err = sup_res.has_error(); | ||
|
||
if (sup_err) { | ||
res.merge(std::move(sup_res)); | ||
res.emplace<json_incompatibility>( | ||
std::move(item_p), | ||
json_incompatibility_type:: | ||
item_removed_not_covered_by_partially_open_content_model); | ||
} | ||
|
||
++i; | ||
}); | ||
|
||
// Check if older has excess elements that are not compatible with | ||
// newer["additionalItems"] | ||
auto newer_additional_schema = get_object_or_empty( | ||
ctx.newer, newer, get_additional_items_kw(ctx.newer.dialect())); | ||
if (!std::all_of( | ||
older_it, older_tuple_schema.end(), [&](json::Value const& o) { | ||
return is_superset(ctx, o, newer_additional_schema); | ||
})) { | ||
return false; | ||
} | ||
return true; | ||
std::for_each( | ||
older_it, older_tuple_schema.end(), [&](json::Value const& o) { | ||
auto item_p = p / "items" / std::to_string(i); | ||
auto sup_res = is_superset(ctx, o, newer_additional_schema, item_p); | ||
auto sup_err = sup_res.has_error(); | ||
|
||
if (sup_err) { | ||
res.merge(std::move(sup_res)); | ||
res.emplace<json_incompatibility>( | ||
std::move(item_p), | ||
json_incompatibility_type:: | ||
item_added_not_covered_by_partially_open_content_model); | ||
} | ||
|
||
++i; | ||
}); | ||
|
||
return res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this change the interesting difference
is_superset(ctx, older_additional_schema, n, item_p)
vs is_superset(ctx, o, newer_additional_schema, item_p)
is drowned between equal lines.
not sure if factoring the common part could result in more readable code, have you tried it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point. I've refactored this now, I think it made it more readable but let me know what you think.
&& !is_superset(ctx, older_additional_properties, schema)) { | ||
// not compatible | ||
return false; | ||
if (!is_false_schema(older_additional_properties)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to me: not sure why is_false_schema(older_additional_properties) is brought here, need to re-read this part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't change the set of schema pairs that are compatible (because a false schema won't match against any schema anyway). But it's necessary to report the specific error of property_removed_from_closed_content_model
when older_additional_properties
is false. Let me invert the condition here, that seems simpler.
auto older_minus_newer = json_type_list{}; | ||
std::ranges::set_difference( | ||
older_types, newer_types, std::back_inserter(older_minus_newer)); | ||
|
||
if ( | ||
older_minus_newer.empty() | ||
|| (older_minus_newer == json_type_list{json_type::integer} && newer_minus_older == json_type_list{json_type::number})) { | ||
res.emplace<json_incompatibility>( | ||
std::move(p), json_incompatibility_type::type_narrowed); | ||
} else { | ||
res.emplace<json_incompatibility>( | ||
std::move(p), json_incompatibility_type::type_changed); | ||
} | ||
return res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to self: i need to break down on paper this part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment now, maybe that helps, but yeah you might need a paper for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*newer_value)) { | ||
return false; | ||
} | ||
} else if (older_value.has_value() /* && !newer_value.has_value() */) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: commented out code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to have this as a helping comment to say that in this case newer_value
doesn't have a value, but perhaps it's more confusing than helpful. I've removed it now.
context const& ctx, | ||
json::Value const& older, | ||
json::Value const& newer, | ||
std::filesystem::path p); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: should this be a const&
to avoid possible copies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's probably pointless, since at every level we need to generate a new path anyway
Pure refactor, no behaviour change intended. Make a small change to the method to make the diff in the following commit simpler to read and reason about.
Pure refactor, no behaviour change intended. Make a small change to the method to make the diff in the following commit simpler to read and reason about.
Pure refactor, no behaviour change intended. Make a small change to the method to make the diff in the following commit simpler to read and reason about. New tests are also added (they passed before too).
6d86671
to
3c88300
Compare
return false; | ||
if (is_false_schema(older_additional_properties)) { | ||
res.emplace<json_incompatibility>( | ||
std::move(prop_path), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird it triggers use-after-move but it seems safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I went with an lambda here to avoid the move now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some other comments, but overall seems good. waiting for the walkthrough before approving
if (res.has_error()) { | ||
return res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we returning at the first json_type is_superset error, instead of merging the errors of the other is_[type]_superset and enum/combinators? is it to match behavior?
I think the final result will depend on the ordering of the "type": [...] array.
Not that this is highly critical, since I didn't see much use of the "type": [...] feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed this now. It makes sense to continue to return as many errors as possible.
} else { | ||
throw as_exception(invalid_schema(fmt::format( | ||
"dependencies can only be an array or an object for valid " | ||
"schemas but it was: {}", | ||
pj{o}))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: json schema validation should prevent reaching this block, but at the same time it's not harmful to leave it here.
auto path_dep = p / "dependencies" / as_string_view(older_dep.name); | ||
auto const& o = older_dep.value; | ||
auto n_it = newer_p.FindMember(older_dep.name); | ||
|
||
if (o.IsObject()) { | ||
if (n_it == newer_p.MemberEnd()) { | ||
res.emplace<json_incompatibility>( | ||
std::move(path_dep), | ||
json_incompatibility_type::dependency_schema_added); | ||
return; | ||
} | ||
|
||
if (!n.IsObject()) { | ||
return false; | ||
} | ||
auto const& n = n_it->value; | ||
|
||
// schemas: o and n needs to be compatible | ||
return is_superset(ctx, o, n); | ||
} else if (o.IsArray()) { | ||
if (n_it == newer_p.MemberEnd()) { | ||
return false; | ||
} | ||
if (!n.IsObject()) { | ||
res.emplace<json_incompatibility>( | ||
std::move(path_dep), | ||
json_incompatibility_type::dependency_schema_added); | ||
return; | ||
} | ||
|
||
auto const& n = n_it->value; | ||
// schemas: o and n needs to be compatible | ||
res.merge(is_superset(ctx, o, n, std::move(path_dep))); | ||
} else if (o.IsArray()) { | ||
if (n_it == newer_p.MemberEnd()) { | ||
res.emplace<json_incompatibility>( | ||
std::move(path_dep), | ||
json_incompatibility_type::dependency_array_added); | ||
return; | ||
} | ||
|
||
if (!n.IsArray()) { | ||
return false; | ||
} | ||
// string array: n needs to be a a superset of o | ||
// TODO: n^2 search | ||
bool n_superset_of_o = std::ranges::all_of( | ||
o.GetArray(), [n_array = n.GetArray()](json::Value const& p) { | ||
return std::ranges::find(n_array, p) != n_array.End(); | ||
}); | ||
if (!n_superset_of_o) { | ||
return false; | ||
} | ||
return true; | ||
} else { | ||
throw as_exception(invalid_schema(fmt::format( | ||
"dependencies can only be an array or an object for valid " | ||
"schemas but it was: {}", | ||
pj{o}))); | ||
} | ||
}); | ||
auto const& n = n_it->value; | ||
|
||
if (!n.IsArray()) { | ||
res.emplace<json_incompatibility>( | ||
std::move(path_dep), | ||
json_incompatibility_type::dependency_array_added); | ||
return; | ||
} | ||
// string array: n needs to be a a superset of o | ||
// TODO: n^2 search | ||
bool n_superset_of_o = std::ranges::all_of( | ||
o.GetArray(), [n_array = n.GetArray()](json::Value const& p) { | ||
return std::ranges::find(n_array, p) != n_array.End(); | ||
}); | ||
if (!n_superset_of_o) { | ||
bool o_superset_of_n = std::ranges::all_of( | ||
n.GetArray(), [o_array = o.GetArray()](json::Value const& p) { | ||
return std::ranges::find(o_array, p) != o_array.End(); | ||
}); | ||
if (o_superset_of_n) { | ||
res.emplace<json_incompatibility>( | ||
std::move(path_dep), | ||
json_incompatibility_type::dependency_array_extended); | ||
} else { | ||
res.emplace<json_incompatibility>( | ||
std::move(path_dep), | ||
json_incompatibility_type::dependency_array_changed); | ||
} | ||
} | ||
return; | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this diff could be less jarring if the lambda was pulled outside before changing all_of to (for_each+json_compatibility_result), but overall the code seems ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this as is. You're right, I could move the lambda out and that would make the diff much nicer, but I would want to move it back to its place afterwards so I think I am going to leave this as is to reduce code motion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waiting for a fix for use-after-free (possibly a false positive), but looks overall good
Adapt the json compatibility check code to report the exact reason of the incompatibility that is causing the compatibility check to fail. Tests are adapted to now make checks explicitly on the reported incompatibilities instead of just the boolean of whether the change is compatible or not. Note that because our json compatibility checks are expressed as forward compatibility checks and not backward compatibility checks, the names of the error types are always the inverse of what the surrounding code and comments express. For example, we report the `additional_items_removed` error when `newer` has `additionalItems` but `older` does not. Some general patterns in this change are: * return false --> add a specific error * return true --> return a non-error result * return is_superset(...) --> depending on how the reference implementation works one of the following is chosen: - Forward the result: return is_superset(...) - Merge the result and continue to gather more: res.merge(is_superset(...)) - Merge the result and add a more specific error: res.merge(is_superset(...)); res.emplace(...) - Only add a specific error if the result has an error: res.emplace(...)
Add a smoketest to further test that multiple (as many as possible) errors are reported for a complex schema. This is checked in both the forward and backward direction.
Extend the verbose compatibility message test to include json as well.
3c88300
to
6b16be3
Compare
Force-push: fix (incorrect) use after move clang tidy error, address more code review feedback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I think you missed one of my comments about appending "items"
to the path.
nit? There are a few cases that don't have tests.
I responded to the "items" comment now. I thought I fixed it but it was just a similar line close to your comment: #23208 (comment)
Are there any in particular that you have in mind? I tried increasing test coverage with this PR, covering the existing tests and adding new ones for all interesting verbose error types that hadn't been covered. For some very similar error types (eg. maximum decreased vs minimum increased, etc.) I didn't bother adding new tests but I tried to have at least 1 test for each error type otherwise. |
/backport v24.2.x |
This implements verbose compatibility reporting for json schema registry. This has already been implemented for Protobuf and Avro, and with this patch it is going to be supported for json as well.
This PR only changes what error message is being reported when there is an incompatibility but the logic of "what sets of schemas are considered compatible" is intended to be unchanged. The changes start with a few commits that introduce helpers and do non-logic-changing refactoring to make the main commit (47d543b) easier to review. While the main commit is large, the changes in it are trivial, most of its complexity lies in verifying that the reported compatibility error type matches the reference implementation. The existing compatibility test suite is extended to capture not just the (non)compatibility of the schemas but also the reason for the incompatibility by checking the verbose error reported. This helps us achieve reasonable coverage of this patch.
Note that because our json compatibility checks are expressed as forward compatibility checks and not backward compatibility checks, the names of the error types are always the inverse of what the surrounding code and comments express. For example, we report the
additional_items_removed
error whennewer
hasadditionalItems
butolder
does not.Known discrepancies
There are some
Note
s andTODO
s left in the unit tests where I noticed that the error type we report is different to what the reference implementation reports.Note
s highlight differences I don't intend to change because I consider them bugs in the reference implementationTODO
s I intend to make follow up tickets forFixes https://redpandadata.atlassian.net/browse/CORE-6861
Backports Required
Release Notes
Features