Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

anyOf/oneOf with recursive schemas causes out of memory exception #703

Closed
43081j opened this issue Feb 12, 2018 · 10 comments
Closed

anyOf/oneOf with recursive schemas causes out of memory exception #703

43081j opened this issue Feb 12, 2018 · 10 comments

Comments

@43081j
Copy link

43081j commented Feb 12, 2018

I'm using the latest version of ajv:

new Ajv({ allErrors: true })

Essentially it seems there's a huge inefficiency somewhere when dealing with recursive schemas.

I have a fairly large schema so I'm unable to extract an example of this happening (I'll keep trying to narrow it down). But essentially I have something along the lines of:

{
  "anyOf": [ { "$ref": "#/definitions/generic" } ],
  "definitions": {
    "generic": {
      "type": "object",
      "anyOf": [
        { "$ref": "#/definitions/model1" },
        { "$ref": "#/definitions/model2" }
      ]
    },
    "genericChildren": {
      "type": "array",
      "items": { "$ref": "#/definitions/generic" },
      "additionalItems": false
    },
    "model1": {
      "type": "object",
      "properties": {
        "type": {
          "type": "string",
          "enum": [ "model1" ]
        },
        "model1_a": { "type": "string" },
        "model1_b": { "type": "string" },
        "children": { "$ref": "#/definitions/genericChildren" }
      },
      "required": [ "type" ],
      "additionalProperties": false
    },
    "model2": {
      "type": "object",
      "properties": {
        "type": {
          "type": "string",
          "enum": [ "model2" ]
        },
        "children": { "$ref": "#/definitions/genericChildren" }
      },
      "required": [ "type" ],
      "additionalProperties": false
    }
  }
}

If you have data like:

{
  "type": "model1",
  "model1_a": "foo",
  "model1_b": "bar",
  "children": [
    {
      "type": "model2",
      "non-existent-property": null
    }
  ]
}

As you can see, model2 will be invalid, as expected. Because additionalProperties is false.

This does validate correctly (assuming what I wrote above off the top of my head is syntactically correct).

However, it produces a huge amount of errors internally. In my case, I get 2,900+ errors. This seems directly proportional to how many schemas exist in the anyOf of generic.

In cases where I have a large object, deeply nested, I often get 1 million or more errors from ajv. Most of these errors look like duplicates or at least very similar to each other.

I've stepped through ajv's generated code for some time now and didn't really get anywhere of use. All I found out was:

  • If an object is invalid, it will have the sum of all errors from all anyOf entries (which is already a lot, but not the thousands I saw)
  • The parent will no longer match any entries of the anyOf (due to its children not being matched either), so it too generates the same amount of errors, if not more
  • Using oneOf causes performance to decrease massively because, unlike anyOf, it won't quit early once it matches
  • Even if an object matches one of the anyOf entries, it seems it will produce errors for all other entries (even if they may not be used in the end)

This causes node to throw an out of memory exception due to the max heap size being reached. In the browser, it will crash the page (along with dev tools). Essentially the vErrors array grows to such a large size that it can use 2GB+ of memory in my case.

@43081j
Copy link
Author

43081j commented Feb 12, 2018

ok here's a JS reproduction:

      const schema = {
        "$schema": "http://json-schema.org/draft-06/schema#",
        anyOf: [ { $ref: '#/definitions/generic' } ],
        definitions: {
          generic: {
            type: 'object',
            anyOf: [
              { $ref: '#/definitions/model1' }
            ]
          },
          genericChildren: {
            type: 'array',
            items: { $ref: '#/definitions/generic' },
            additionalItems: false
          },
          model1: {
            type: 'object',
            properties: {
              type: {
                type: 'string',
                enum: [ 'model1']
              },
              model1_a: { type: 'string' },
              model1_b: { type: 'string' },
              children: { $ref: '#/definitions/genericChildren' }
            },
            required: ['type'],
            additionalProperties: false
          }
        }
      };
      for (let i = 2; i < 20; i++) {
        schema.definitions[`model${i}`] = {
          type: 'object',
          properties: {
            type: {
              type: 'string',
              enum: [`model${i}`]
            },
            children: { $ref: '#/definitions/genericChildren' }
          },
          required: ['type'],
          additionalProperties: false
        };
        schema.definitions.generic.anyOf.push({
          $ref: `#/definitions/model${i}`
        });
      }
      const json = {
        type: 'model1',
        model1_a: 'foo',
        model1_b: 'bar',
        children: [{
          type: 'model2',
          nonExistent: null
        }]
      };

      //editor.set(json);
      const ajv = new Ajv({
        allErrors: true
      });
      ajv.addMetaSchema(await getJSON('node_modules/ajv/lib/refs/json-schema-draft-06.json'));
      const valid = ajv.validate(schema, json);

nonExistent: null invalidates the whole tree, and we then end up with 778 errors in ajv.errors.

If we instead pass the following data:

{
  "type": "model1",
  "model1_a": "foo",
  "model1_b": "bar",
  "children": [{
    "type": "model2",
    "children": [{
      "type": "model2",
      "nonExistent": null
    }]
  }]
}

It is one level deeper, we now get 14,135 errors in ajv.errors.

As you can see, in this schema, if a very deep child is invalid we will easily reach chrome (or node's) maximum memory allocation.

@epoberezkin
Copy link
Member

epoberezkin commented Feb 12, 2018

You specifically instructed ajv to collect all possible errors (allErrors: true), so that's exactly what it does. There is no easy generic way to decide which branch is "closer" to being valid, so ajv validates data against all branches and includes all errors. You may have a smaller number of errors without allErrors: true, but because ajv still doesn't know which branch is less invalid it would include errors from all branches of at least one "anyOf".

To have a smaller number of more useful errors you need to specifically tell ajv which branch it should and which branch it shouldn't try to validate against. draft-07 introduced conditional if/then/else to make it possible. So instead of using:

{
  "anyOf": [
    {"$ref": "#/definitions/model1"},
    {"$ref": "#/definitions/model2"}
  ]
}

you could use something like

{
  "properties": {"type": {"enum": ["model1", "model2"]}},
  "allOf": [
    {
      "if": {"properties": {"type": {"const": "model1"}}},
      "then": {"$ref": "#/definitions/model1"},
    },
    {
      "if": {"properties": {"type": {"const": "model2"}}},
      "then": {"$ref": "#/definitions/model2"},
    }
  ]
}

You obviously can do it without allOf, using only if/then/else, but if you have many possible types the above way seems better to me - it allows to avoid deep nesting of if/then/else.

EDIT: On another hand, with if/then/else only, without allOf, it can be faster, as in the schema above would have to compare "type" property with every possible type value, while with if/then/else it would stop as soon as the first one matches...

{
  "if": {"properties": {"type": {"const": "model1"}}},
  "then": {"$ref": "#/definitions/model1"},
  "else": {
    "if": {"properties": {"type": {"const": "model2"}}},
    "then": {"$ref": "#/definitions/model2"},
    "else": false
  }
}

@43081j
Copy link
Author

43081j commented Feb 12, 2018

yup I figured allErrors is what causes the explosion in error count, but you're right I do still get a rather large amount without it.

I understand why it does it (because as you said, it can't know which of the anyOf it should be validating against). However, I didn't realise the if/then was in the spec now (assumed its still a proposal of some sort).

I'll give that a go.

@epoberezkin
Copy link
Member

If the problem is indeed stack size because of deeply nested recursive data (it should be really deep for it to be a problem though), you can consider using async schemas - it should help with the stack size (negatively affecting performance though. But as I understood, your problem is the large size of error array.

You can also try not using verbose option - it also increases the size of error object (data and schema can be large strings that are copied by value, not by reference).

@epoberezkin
Copy link
Member

There is a proposal for select that makes above even more concise. It depends on $data proposal, so it's unlikely that select makes into the next draft. select is available in ajv-keywords. With select you can:

{
  "select": {"$data": "0/type"},
  "selectCases": {
    "model1": { "$ref": "#/definitions/model1" },
    "model2": { "$ref": "#/definitions/model2" }
  },
  "selectDefault": false
}

But bear in mind that it has some limitations (they aren't critical though - see issues there).

@43081j
Copy link
Author

43081j commented Feb 12, 2018

no worries, i'll try with allErrors: false, verbose: false.

what spec is if/then in? is that draft 7?

also, just to confirm i understand this. allErrors: false will mean per object we return after the first error? and is that the first error of each of anyOf?

i do get quite a few errors still but far, far less than with allErrors: true.

so lets say i have 4 entries in my anyOf, i will get 1 error per entry for a given object? or i will get 1 error that nothing in the anyOf was matched?

@epoberezkin
Copy link
Member

what spec is if/then in? is that draft 7?

yes

allErrors: false will mean per object we return after the first error?

yes, if this error means failure. That's the default.

and is that the first error of each of anyOf

each branch of anyOf will stop validating on the first error in each branch, but all branches must be validated as some branch may pass. All errors that were collected will be reported. "anyOf" always stops validating on the first passing branch. "allOf" stops validating on the first failing branch, but in allErrors mode it will always validate against all branches. "oneOf" stops validating on the second passing branch, otherwise it will always validate against all branches (as only one branch allowed to be valid) - that's why anyOf generally better than oneOf.

@epoberezkin
Copy link
Member

In all cases without allErrors it will stop as early as possible and with allErrors collect as many errors as possible.

@43081j
Copy link
Author

43081j commented Feb 12, 2018

yup sounds good to me.

i did switch from oneOf to anyOf for performance reasons (better to quit early than have to test every branch, especially since all entries have a different type so we can't match two anyway).

I totally didn't realise if/then was in one of the drafts already so i'll give that a go.

Seems most of my problems are solved by allErrors: false, fortunately.

Thanks for the quick responses, i figured it would be schema related or config rather than a bug but wasn't sure.

The whole idea of branch selection is really what we need. To say "based on the type, use this associated schema".

@epoberezkin
Copy link
Member

epoberezkin commented Feb 12, 2018

Try select too - I'm sure it'll get into the spec some day :) See above if you missed. Actually, the more people use it and support it in JSON-Schema-org the more likely it to become the standard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants