Make large, recursive schemas diff-able by deferring computation of diffs #249

Merged 2 commits on Sep 30, 2021

solarhess
Contributor

Why

My employer has an API schema with a lot of deeply nested and
recursively referenced objects. We wanted to validate that changes made
by developers are backwards compatible. Unfortunately the OpenAPI-Diff
tool would run practically forever. There were too many situations
where it would have to recompute a diff and could not use the cached
result.

I implemented an approach that defers computing schema diffs until
the entire structure of the API schema has been parsed. This prevents
recursive schema definitions from being diffed over and over again
during parsing, and ensures that each diff is computed exactly once.

This brings the time needed to diff this big, recursive schema down
to a manageable level.

Test case

I have created a test case, LargeSchemaTest, that generates a schema
similar to the one my employer uses. (Unfortunately our schema is for
an internal system and I can't share it.)

It generates similar but incompatible schemas. These schemas each
have:

  • 250 schemas defined in #/components/schemas, each with 5 properties
    recursively referencing other schemas defined in #/components/schemas.
  • 100 API endpoints that use those schemas in the RequestBody or
    ResponseBody.
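
To give a feel for the shape of such a schema, here is a minimal sketch of a generator. It is not the code of LargeSchemaTest; the class name SchemaGraphSketch, the ModelN naming, and the fixed seed are assumptions made for illustration. It only builds the reference graph (schema name to referenced schema names), not a full OpenAPI document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration only: builds the reference graph of a large schema.
class SchemaGraphSketch {

    // Returns schema name -> the $ref targets of its properties.
    static Map<String, List<String>> generate(int schemaCount, int refsPerSchema, long seed) {
        Random random = new Random(seed); // fixed seed keeps the output reproducible
        Map<String, List<String>> schemas = new LinkedHashMap<>();
        for (int i = 0; i < schemaCount; i++) {
            List<String> refs = new ArrayList<>();
            for (int p = 0; p < refsPerSchema; p++) {
                // Any schema may be referenced, including earlier ones, so cycles are likely.
                refs.add("#/components/schemas/Model" + random.nextInt(schemaCount));
            }
            schemas.put("Model" + i, refs);
        }
        return schemas;
    }

    public static void main(String[] args) {
        Map<String, List<String>> schemas = generate(250, 5, 42L);
        System.out.println(schemas.size() + " schemas; Model0 references " + schemas.get("Model0"));
    }
}
```

Because every schema can reference any other, the graph is dense and cyclic, which is the situation that makes naive recursive diffing expensive.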

When this test is run against the master branch of openapi-diff, it
does not complete. Profiling shows that the time is spent
in Changed.isChanged(), which recursively calls into other instances of
Changed. The deep recursion causes an exponential explosion in the
number of calls required to compute the changed state for the whole model.
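
The growth can be reproduced with a toy model. The sketch below is an assumption-laden simplification, not openapi-diff code: schema i references schemas i+1 and i+2 (no cycles), and a call counter stands in for the cost of isChanged(). Without caching, the number of calls grows like the Fibonacci numbers; with a per-schema cache, each schema is evaluated once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of recursive change detection; not the library's code.
class RecursionCostSketch {

    // Schema i references schemas i+1 and i+2 when they exist.
    // calls[0] counts how many times the method body runs.
    static boolean naiveIsChanged(int schema, int count, long[] calls) {
        calls[0]++;
        boolean changed = schema == count - 1; // pretend only the last schema differs
        if (schema + 1 < count) changed |= naiveIsChanged(schema + 1, count, calls);
        if (schema + 2 < count) changed |= naiveIsChanged(schema + 2, count, calls);
        return changed;
    }

    // Same traversal, but each schema's result is stored and reused.
    static boolean cachedIsChanged(int schema, int count, Map<Integer, Boolean> cache, long[] calls) {
        Boolean known = cache.get(schema);
        if (known != null) return known;
        calls[0]++;
        boolean changed = schema == count - 1;
        if (schema + 1 < count) changed |= cachedIsChanged(schema + 1, count, cache, calls);
        if (schema + 2 < count) changed |= cachedIsChanged(schema + 2, count, cache, calls);
        cache.put(schema, changed);
        return changed;
    }

    public static void main(String[] args) {
        long[] naive = {0};
        long[] cached = {0};
        naiveIsChanged(0, 25, naive);
        cachedIsChanged(0, 25, new HashMap<>(), cached);
        System.out.println("naive calls: " + naive[0] + ", cached calls: " + cached[0]);
    }
}
```

With only 2 references per schema and 25 schemas the naive count is already in the hundreds of thousands; 5 references per schema across 250 schemas is far worse.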

The solution: Deferring computation of diffs

The solution is to break the diff into a two-step process:

  • Step 1: Read the schema and align all the diff computations, deferring
    computation of actual differences, and avoiding recursive differences.
  • Step 2: Compute all the differences, avoiding recomputing the
    recursive differences.

This is implemented in OpenApiDiff.compare() (core/src/main/java/org/openapitools/openapidiff/core/compare/OpenApiDiff.java#L89-L127).

Implementing this was a relatively small change to the code.
The DeferredSchemaCache holds the cache of SchemaDiffs. It is able to
distinguish multiple requests for the same differences. This is the
key to avoiding recomputing the same difference multiple times.
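
The two-step idea and the role of the cache can be sketched as follows. This is a simplified stand-in, not the actual DeferredSchemaCache: the class DeferredCacheSketch, its Pending entries, and the string key built from the two $ref values are all assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of a deferred diff cache; not the PR's implementation.
class DeferredCacheSketch {

    // One requested comparison; 'changed' stays null until step 2 fills it in.
    static final class Pending {
        final String left;
        final String right;
        Boolean changed;

        Pending(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    private final Map<String, Pending> byKey = new LinkedHashMap<>();
    private final Deque<Pending> queue = new ArrayDeque<>();
    int computations = 0;

    // Step 1: register a comparison. Repeated requests for the same pair share one entry,
    // so a schema that references itself does not trigger another computation.
    Pending request(String leftRef, String rightRef) {
        String key = leftRef + " -> " + rightRef;
        Pending existing = byKey.get(key);
        if (existing != null) return existing;
        Pending pending = new Pending(leftRef, rightRef);
        byKey.put(key, pending);
        queue.add(pending);
        return pending;
    }

    // Step 2: drain the queue. The compare function may call request() again;
    // new pairs are appended to the queue, known pairs are reused.
    void process(BiFunction<String, String, Boolean> compare) {
        while (!queue.isEmpty()) {
            Pending pending = queue.poll();
            pending.changed = compare.apply(pending.left, pending.right);
            computations++;
        }
    }

    public static void main(String[] args) {
        DeferredCacheSketch cache = new DeferredCacheSketch();
        Pending root = cache.request("#/components/schemas/Node", "#/components/schemas/Node");
        cache.request("#/components/schemas/Node", "#/components/schemas/Node"); // duplicate request
        cache.process((left, right) -> {
            cache.request(left, right); // a property pointing back at the same schema
            return !left.equals(right);
        });
        System.out.println("computations: " + cache.computations + ", changed: " + root.changed);
    }
}
```

The point of the design is that request() never computes anything, so recursion during parsing cannot blow up; all real work happens in process(), once per distinct pair.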

I replaced all the Optional<?> diff(...) with DeferredChanged<?> diff(...),
and chose an interface for DeferredChanged that matched the Optional
interface. This minimized the lines of code changed, making it easier
to review.
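
To illustrate why an Optional-like interface keeps call sites almost unchanged, here is a minimal sketch of a deferred value with ifPresent and map. It is not the real DeferredChanged; the class DeferredValueSketch and its resolve method are hypothetical, and only two of Optional's methods are mirrored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of an Optional-like deferred value; not the PR's DeferredChanged.
class DeferredValueSketch<T> {
    private Optional<T> value; // null means "not resolved yet"
    private final List<Consumer<Optional<T>>> listeners = new ArrayList<>();

    // Supplies the result and runs every callback registered so far.
    void resolve(Optional<T> result) {
        if (value != null) throw new IllegalStateException("already resolved");
        value = result;
        for (Consumer<Optional<T>> listener : listeners) listener.accept(result);
        listeners.clear();
    }

    private void whenResolved(Consumer<Optional<T>> listener) {
        if (value != null) listener.accept(value);
        else listeners.add(listener);
    }

    // Mirrors Optional.ifPresent, but the action may run later.
    void ifPresent(Consumer<T> action) {
        whenResolved(result -> result.ifPresent(action));
    }

    // Mirrors Optional.map: the mapped value resolves when this one does.
    <R> DeferredValueSketch<R> map(Function<T, R> fn) {
        DeferredValueSketch<R> mapped = new DeferredValueSketch<>();
        whenResolved(result -> mapped.resolve(result.map(fn)));
        return mapped;
    }

    public static void main(String[] args) {
        DeferredValueSketch<String> diff = new DeferredValueSketch<>();
        // Reads like Optional code, but nothing runs until resolve() is called.
        diff.map(String::length).ifPresent(n -> System.out.println("length: " + n));
        diff.resolve(Optional.of("changed"));
    }
}
```

A call site that previously chained map and ifPresent on an Optional can keep the same shape, which is presumably why the change stayed small.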

Finally, I created a helper object called DeferredBuilder, which
simplifies the task of collecting a bunch of DeferredChanged instances
together, making the code that composes a change easier to write and read.
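
The collecting pattern can be sketched with the JDK's CompletableFuture as a stand-in for the deferred type. This is an assumption made to keep the example self-contained: the PR uses its own DeferredChanged and DeferredBuilder, and the names DeferredCollectorSketch and OperationDiff below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch: collects several deferred parts and assembles a result when all are done.
class DeferredCollectorSketch {

    // Illustrative composite result with two independently computed parts.
    static final class OperationDiff {
        boolean requestChanged;
        boolean responseChanged;
    }

    private final List<CompletableFuture<Void>> parts = new ArrayList<>();

    // Registers one deferred part together with the action that stores its value.
    // Tracking the thenAccept stage guarantees the action has run before build() fires.
    <T> DeferredCollectorSketch with(CompletableFuture<T> part, Consumer<T> store) {
        parts.add(part.thenAccept(store));
        return this;
    }

    // Completes with the assembled value once every registered part has been stored.
    <R> CompletableFuture<R> build(Supplier<R> assemble) {
        return CompletableFuture.allOf(parts.toArray(new CompletableFuture<?>[0]))
                .thenApply(ignored -> assemble.get());
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> request = new CompletableFuture<>();
        CompletableFuture<Boolean> response = new CompletableFuture<>();
        OperationDiff diff = new OperationDiff();

        CompletableFuture<OperationDiff> built = new DeferredCollectorSketch()
                .with(request, changed -> diff.requestChanged = changed)
                .with(response, changed -> diff.responseChanged = changed)
                .build(() -> diff);

        request.complete(true);   // parts resolve later, in any order
        response.complete(false);
        System.out.println("done: " + built.isDone()
                + ", request changed: " + built.join().requestChanged);
    }
}
```

The caller lists the parts it depends on and says how to assemble them, without writing nested callbacks for each part.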

I857847 added 2 commits July 19, 2021 10:40
…erconnected schema.

This new test generates a large, interconnected api schema with 200 endpoints,
250 model schemas, and references between the models. This generated schema is similar to a real-world
schema that we use in production at my job, that failed to diff because
it never completed its diff computation.
Contributor

joschi left a comment

@solarhess Awesome, thanks a lot for your contribution.

While I haven't reviewed every single detail of this PR, it generally looks good and still successfully builds against the latest master branch. 👍

joschi merged commit ca9c6ab into OpenAPITools:master Sep 30, 2021
joschi added this to the Release 2.0.0 milestone Sep 30, 2021