
json.match_schema performance #7011

Closed
lcarva opened this issue Sep 10, 2024 · 4 comments · Fixed by #7081

@lcarva
Contributor

lcarva commented Sep 10, 2024

Short description

The json.match_schema built-in takes much longer to evaluate when the JSON schema is large.

I created a simple reproducer here: https://github.com/lcarva/opa-json-schema-perf
(The schema is too large for the Rego playground.)

The reproducer validates a small object against the CycloneDX SBOM JSON Schema (about 5k lines long).
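For reference, the hot call is a rule along these lines. This is a hypothetical sketch of what main.rego:10 does, not copied from the reproducer; the object literal and the data.cyclonedx_schema reference are assumptions:

```rego
package main

import rego.v1

# Assumed input object and schema references, for illustration only.
obj := {"bomFormat": "CycloneDX", "specVersion": "1.5"}

results := [
	json.match_schema(obj, data.cyclonedx_schema),  # the ~5k-line schema (main.rego:10)
	json.match_schema(obj, {"type": "object"})      # a trivially small schema (main.rego:12)
]
```

Each json.match_schema call returns a `[matched, errors]` pair, which matches the `[true, []]` entries in the output below.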

$ opa eval --data . 'data.main.results' --profile --format=pretty
[
  [
    true,
    []
  ],
  [
    true,
    []
  ]
]
+------------------------------+-----------+
|            METRIC            |   VALUE   |
+------------------------------+-----------+
| timer_rego_load_files_ns     | 26615260  |
| timer_rego_module_compile_ns | 24751477  |
| timer_rego_module_parse_ns   | 26171100  |
| timer_rego_query_compile_ns  | 38743     |
| timer_rego_query_eval_ns     | 288952134 |
| timer_rego_query_parse_ns    | 37035     |
+------------------------------+-----------+
+--------------+----------+----------+--------------+-------------------+
|     TIME     | NUM EVAL | NUM REDO | NUM GEN EXPR |     LOCATION      |
+--------------+----------+----------+--------------+-------------------+
| 288.827434ms | 4        | 4        | 4            | main.rego:10      |
| 63.714µs     | 4        | 4        | 4            | main.rego:12      |
| 32.994µs     | 4        | 4        | 4            | main.rego:14      |
| 15.366µs     | 1        | 1        | 1            | data.main.results |
| 3.809µs      | 1        | 1        | 1            | main.rego:5       |
| 3.181µs      | 1        | 1        | 1            | schema.rego:3     |
| 2.385µs      | 1        | 1        | 1            | schema.rego:36    |
+--------------+----------+----------+--------------+-------------------+

main.rego:10 is the json.match_schema call that uses the CycloneDX schema; main.rego:12 uses a much smaller schema. That's roughly 288,827 µs vs 64 µs, a difference of more than 4,000×.

Version: 0.68.0
Build Commit: db53d77c482676fadd53bc67a10cf75b3d0ce00b
Build Timestamp: 2024-08-29T15:23:19Z
Build Hostname: 3aae2b82a15f
Go Version: go1.22.5
Platform: linux/amd64
WebAssembly: available

Steps To Reproduce

See description.

Expected behavior

Validation of object should not take longer than 1ms.

@lcarva lcarva added the bug label Sep 10, 2024
@anderseknert
Member

Hi there! And thanks for filing this issue.

Looking into this briefly, almost all of that time is spent loading the JSON schema, not actually validating. The loading isn't cached either, so each call repeats it. Using a cached schema makes things... faster, to say the least. Notice the use of gojsonschema.NewSchema(sl) below, where the returned schema is reused:

package main

import (
	"fmt"
	"os"
	"time"

	"github.com/xeipuuv/gojsonschema"
)

func main() {
	now := time.Now()

	bs, err := os.ReadFile("schema.json")
	if err != nil {
		panic(err)
	}

	sl := gojsonschema.NewBytesLoader(bs)

	schema, err := gojsonschema.NewSchema(sl)
	if err != nil {
		panic(err)
	}

	dl := gojsonschema.NewStringLoader(`{"name": "John", "age": 30}`)

	result, err := schema.Validate(dl)
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Valid())
	fmt.Println(time.Since(now))

	now = time.Now()

	dl = gojsonschema.NewStringLoader(`{"another": "object", "x": 1}`)

	result, err = schema.Validate(dl)
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Valid())
	fmt.Println(time.Since(now))
}

Output

false
637.524583ms
false
14.709µs

I guess using the inter-query cache for this built-in, storing loaded schemas across decisions, would be the way to go.

It wouldn't make your single opa eval call any faster though, as the first invocation would still need to load the schema.
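The caching idea can be sketched with stdlib primitives alone. This is a hypothetical illustration of the pattern, not OPA's actual inter-query cache implementation: compile once per distinct schema, keyed by a hash of its content, and reuse the compiled value across calls.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// compiledSchema stands in for the expensive product of schema compilation.
type compiledSchema struct{ key [32]byte }

var schemaCache sync.Map // map[[32]byte]*compiledSchema

// getSchema returns a cached compiled schema, compiling only on first use.
// The cache key is a hash of the raw schema bytes, so identical schemas
// passed from different call sites share one compiled instance.
func getSchema(raw []byte) *compiledSchema {
	key := sha256.Sum256(raw)
	if v, ok := schemaCache.Load(key); ok {
		return v.(*compiledSchema)
	}
	s := &compiledSchema{key: key} // the expensive compile would happen here
	v, _ := schemaCache.LoadOrStore(key, s)
	return v.(*compiledSchema)
}

func main() {
	a := getSchema([]byte(`{"type": "object"}`))
	b := getSchema([]byte(`{"type": "object"}`))
	fmt.Println(a == b) // same pointer: compiled once, reused
}
```

LoadOrStore keeps the cache consistent under concurrent first calls: whichever goroutine stores first wins, and all callers get that instance.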

anderseknert added a commit to anderseknert/opa that referenced this issue Oct 1, 2024
I figured I'd test this out anyway, and this seemed like
a good case given that there was an actual issue on this.

Testing response times with OPA running as a server, and
the first request is ~800 ms while the following ones are
~10 ms.

Fixes open-policy-agent#7011

Signed-off-by: Anders Eknert <anders@styra.com>
@anderseknert
Member

Caching this now as described above. Note that, like I mentioned, the first hit will still be expensive, as the schema must be loaded at some point, but subsequent requests are near-instantaneous.

@lcarva
Contributor Author

lcarva commented Oct 11, 2024

@anderseknert, I accidentally figured out why loading takes so long. The CycloneDX schema has external $refs that cause additional schemas to be fetched at runtime. Removing those, or bundling them, makes the call 100 times faster.

I think caching is still useful in the cases where remote references must be used. I just wanted to share this new finding as it may help others in the future.

@anderseknert
Member

Ah, yeah, that certainly explains a lot. Thanks for letting me know! Being able to cache the schema is a good change either way, as recomputing it per request just wastes resources 🙂
