Make large, recursive schemas diff-able by deferring computation of diffs #249
Why
My employer has an API schema with a lot of deeply nested and
recursively referenced objects. We wanted to validate that changes made
by developers are backwards compatible. Unfortunately the OpenAPI-Diff
tool would run practically forever. There were too many situations
where it would have to recompute a diff and could not use the cached
result.
I implemented an approach that defers computing schema diffs until
the entire structure of the API schema has been parsed. This prevents
recursive schema definitions from being diffed over and over again
during parsing, and ensures that each diff is computed exactly once.
This brings the time needed to diff this big, recursive schema down to
a manageable amount.
Test case
I have created a test case, `LargeSchemaTest`, that generates a schema
similar to the one my employer uses. (Unfortunately our schema is for
an internal system and I can't share it.)
It will generate similar, but incompatible schemas. These schemas each
have:
recursively referencing other schemas defined in #/components/schemas.
ResponseBody.
When this test is run against the `master` branch of openapi-diff, it
will not complete. When you profile it, you will find that the time is
spent in `Changed.isChanged()`, which recursively calls `isChanged()` on
other instances of `Changed`. The deep recursion causes an exponential
explosion in the number of calls required to compute the changed state
for the whole model.
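To illustrate why uncached recursion blows up, here is a minimal sketch. It is not openapi-diff code: the class `RecursionCostSketch` and its methods are hypothetical, and the model is a simple chain in which each node references the next node twice (for example, two properties of the same type). It only counts calls, comparing naive recursion with a version that reuses results.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not openapi-diff code: node i holds two references to node i + 1.
class RecursionCostSketch {
    static long calls;

    // Naive: every reference triggers a fresh recursive evaluation.
    static boolean naiveIsChanged(int node, int depth) {
        calls++;
        if (node == depth - 1) {
            return false; // leaf: nothing to recurse into
        }
        boolean first = naiveIsChanged(node + 1, depth);
        boolean second = naiveIsChanged(node + 1, depth);
        return first || second;
    }

    // Memoized: each node is evaluated once and the result is reused.
    static boolean memoIsChanged(int node, int depth, Map<Integer, Boolean> memo) {
        Boolean cached = memo.get(node);
        if (cached != null) {
            return cached;
        }
        calls++;
        boolean result = false;
        if (node < depth - 1) {
            boolean first = memoIsChanged(node + 1, depth, memo);
            boolean second = memoIsChanged(node + 1, depth, memo);
            result = first || second;
        }
        memo.put(node, result);
        return result;
    }

    static long countNaive(int depth) {
        calls = 0;
        naiveIsChanged(0, depth);
        return calls;
    }

    static long countMemo(int depth) {
        calls = 0;
        memoIsChanged(0, depth, new HashMap<>());
        return calls;
    }

    public static void main(String[] args) {
        System.out.println("naive evaluations at depth 20:    " + countNaive(20));
        System.out.println("memoized evaluations at depth 20: " + countMemo(20));
    }
}
```

In this toy model the naive version performs 2^depth - 1 evaluations, while the memoized version performs one per node. Real schemas are also cyclic, which this chain deliberately leaves out.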
The solution: Deferring computation of diffs
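The solution is to break the diff into a two-step process: first, parse
the entire structure of the API schema, recording which schema diffs are
needed without computing them; second, compute the actual differences,
avoiding recomputation of recursive differences.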
This is implemented in `OpenApiDiff.compare()`.
Implementing this was a relatively small change to the code.
The `DeferredSchemaCache` holds the cache of SchemaDiffs. It is able to
distinguish multiple requests for the same differences. This is the
key to avoiding recomputing the same difference multiple times.
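The idea of requesting diffs first and computing them later can be sketched roughly as follows. This is not the `DeferredSchemaCache` from this pull request: `TwoPhaseDiffSketch`, its nested `Schema` class, and the method names are hypothetical, and the sketch assumes both versions define the same schema names and references, differing only in type. Phase one walks the (possibly cyclic) references and records each requested schema once; phase two computes each local diff once and then propagates "changed" along references.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a two-phase diff; not the code in this pull request.
class TwoPhaseDiffSketch {
    // Simplified schema: a type name plus references to other schemas by name.
    static final class Schema {
        final String type;
        final List<String> refs;

        Schema(String type, List<String> refs) {
            this.type = type;
            this.refs = refs;
        }
    }

    private final Map<String, Schema> oldSchemas;
    private final Map<String, Schema> newSchemas;
    private final Set<String> pending = new LinkedHashSet<>();
    private final Map<String, Boolean> changed = new HashMap<>();
    int computations = 0;

    TwoPhaseDiffSketch(Map<String, Schema> oldSchemas, Map<String, Schema> newSchemas) {
        this.oldSchemas = oldSchemas;
        this.newSchemas = newSchemas;
    }

    // Phase 1: record that a diff is needed. A repeated request (including one
    // that arrives through a cycle) is recognized and ignored.
    void request(String name) {
        if (!pending.add(name)) {
            return;
        }
        for (String ref : oldSchemas.get(name).refs) {
            request(ref);
        }
    }

    // Phase 2: compute each local diff exactly once, then propagate "changed"
    // from referenced schemas to their referrers until nothing more changes.
    void computeAll() {
        for (String name : pending) {
            computations++;
            changed.put(name, !oldSchemas.get(name).type.equals(newSchemas.get(name).type));
        }
        boolean updated = true;
        while (updated) {
            updated = false;
            for (String name : pending) {
                if (changed.get(name)) {
                    continue;
                }
                for (String ref : oldSchemas.get(name).refs) {
                    if (changed.get(ref)) {
                        changed.put(name, true);
                        updated = true;
                        break;
                    }
                }
            }
        }
    }

    boolean isChanged(String name) {
        return changed.get(name);
    }

    // Demo: "Node" references itself and "Leaf"; only the type of "Leaf" differs.
    static TwoPhaseDiffSketch demo() {
        Map<String, Schema> oldS = new HashMap<>();
        oldS.put("Node", new Schema("object", List.of("Node", "Leaf")));
        oldS.put("Leaf", new Schema("string", List.of()));
        Map<String, Schema> newS = new HashMap<>();
        newS.put("Node", new Schema("object", List.of("Node", "Leaf")));
        newS.put("Leaf", new Schema("integer", List.of()));
        TwoPhaseDiffSketch diff = new TwoPhaseDiffSketch(oldS, newS);
        diff.request("Node");
        diff.computeAll();
        return diff;
    }

    public static void main(String[] args) {
        TwoPhaseDiffSketch diff = demo();
        System.out.println("Node changed: " + diff.isChanged("Node"));
        System.out.println("local diffs computed: " + diff.computations);
    }
}
```

Because the self-reference in `Node` is detected during the request phase, the walk terminates, and the number of local diff computations equals the number of distinct schemas requested.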
I replaced all the `Optional<?> diff(...)` methods with
`DeferredChanged<?> diff(...)`, and chose an interface for
`DeferredChanged` that matched the `Optional` interface. This minimized
the lines of code changed, making it easier to review.
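As a rough illustration of what an `Optional`-shaped deferred type can look like, here is a minimal sketch. It is not the `DeferredChanged` interface from this pull request: `DeferredValue`, `whenResolved`, and `resolve` are hypothetical names, and a `null` value stands in for "no change" (the analogue of `Optional.empty()`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch; not the DeferredChanged interface from this pull request.
class DeferredValue<T> {
    private final List<Consumer<T>> listeners = new ArrayList<>();
    private boolean resolved;
    private T value; // null means "no change", like Optional.empty()

    // Like Optional.ifPresent, except the action may run later, on resolution.
    void ifPresent(Consumer<T> action) {
        whenResolved(v -> {
            if (v != null) {
                action.accept(v);
            }
        });
    }

    // Like Optional.map: returns a new deferred value resolved from this one.
    <R> DeferredValue<R> map(Function<T, R> mapper) {
        DeferredValue<R> result = new DeferredValue<>();
        whenResolved(v -> result.resolve(v == null ? null : mapper.apply(v)));
        return result;
    }

    // Runs the listener now if the value is known, otherwise queues it.
    void whenResolved(Consumer<T> listener) {
        if (resolved) {
            listener.accept(value);
        } else {
            listeners.add(listener);
        }
    }

    // Supplies the value and fires the queued listeners in registration order.
    void resolve(T newValue) {
        if (resolved) {
            throw new IllegalStateException("already resolved");
        }
        resolved = true;
        value = newValue;
        for (Consumer<T> listener : listeners) {
            listener.accept(newValue);
        }
        listeners.clear();
    }
}

class DeferredValueDemo {
    static String demo() {
        DeferredValue<String> diff = new DeferredValue<>();
        StringBuilder log = new StringBuilder();
        // Reads like Optional code, but nothing runs until resolve() is called.
        diff.map(String::toUpperCase).ifPresent(s -> log.append(s));
        log.append("[registered]");
        diff.resolve("changed");
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Call sites written against `Optional` (`map`, `ifPresent`) keep the same shape with a type like this, which is consistent with the small diff described above.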
Finally, I created a helper object called `DeferredBuilder`, which
simplifies the task of collecting a bunch of `DeferredChanged` instances
together, making it easier to program and read code that composes a change.
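The collecting pattern can be sketched with the JDK's `CompletableFuture` standing in for the deferred type. This is an assumption made for illustration only: `DeferredCollector`, `add`, `build`, and `OperationChange` are hypothetical names, and the real `DeferredBuilder` and `DeferredChanged` in this pull request are separate types with their own API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch; CompletableFuture stands in for the deferred type.
class DeferredCollector {
    private final List<CompletableFuture<?>> parts = new ArrayList<>();

    // Register a deferred part; once it resolves with a value, pass it to the setter.
    <T> DeferredCollector add(CompletableFuture<Optional<T>> part, Consumer<T> setter) {
        parts.add(part.thenAccept(opt -> opt.ifPresent(setter)));
        return this;
    }

    // Completes with the composed result once every registered part has resolved.
    <R> CompletableFuture<R> build(Supplier<R> result) {
        return CompletableFuture.allOf(parts.toArray(new CompletableFuture<?>[0]))
                .thenApply(ignored -> result.get());
    }
}

class DeferredCollectorDemo {
    // Hypothetical composite change made of two deferred parts.
    static class OperationChange {
        String request;
        String response;
    }

    static String demo() {
        CompletableFuture<Optional<String>> requestDiff = new CompletableFuture<>();
        CompletableFuture<Optional<String>> responseDiff = new CompletableFuture<>();
        OperationChange change = new OperationChange();

        CompletableFuture<OperationChange> composed = new DeferredCollector()
                .add(requestDiff, v -> change.request = v)
                .add(responseDiff, v -> change.response = v)
                .build(() -> change);

        boolean doneEarly = composed.isDone(); // parts are not resolved yet
        requestDiff.complete(Optional.of("body changed"));
        responseDiff.complete(Optional.empty());
        OperationChange result = composed.join();
        return doneEarly + "|" + result.request + "|" + result.response;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The benefit of a helper like this is that the composing code lists its parts in one fluent chain instead of nesting callbacks for each deferred part.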