
json.match_schema performance #7011

Closed
lcarva opened this issue Sep 10, 2024 · 4 comments · Fixed by #7081

@lcarva
Contributor

lcarva commented Sep 10, 2024

Short description

The json.match_schema built-in takes much longer to evaluate when the JSON schema is large.

I created a simple reproducer here: https://github.com/lcarva/opa-json-schema-perf
(The schema is too large for the Rego playground.)

The reproducer validates a small object against the CycloneDX SBOM JSON Schema (about 5k lines long).
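For reference, the hot call is a rule along these lines. This is a hypothetical sketch of what main.rego:10 does, not copied from the reproducer; the object literal and the data.cyclonedx_schema reference are assumptions:

```rego
package main

import rego.v1

# Assumed input object and schema references, for illustration only.
obj := {"bomFormat": "CycloneDX", "specVersion": "1.5"}

results := [
	json.match_schema(obj, data.cyclonedx_schema),  # the ~5k-line schema (main.rego:10)
	json.match_schema(obj, {"type": "object"})      # a trivially small schema (main.rego:12)
]
```

Each json.match_schema call returns a `[matched, errors]` pair, which matches the `[true, []]` entries in the output below.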

$ opa eval --data . 'data.main.results' --profile --format=pretty
[
  [
    true,
    []
  ],
  [
    true,
    []
  ]
]
+------------------------------+-----------+
|            METRIC            |   VALUE   |
+------------------------------+-----------+
| timer_rego_load_files_ns     | 26615260  |
| timer_rego_module_compile_ns | 24751477  |
| timer_rego_module_parse_ns   | 26171100  |
| timer_rego_query_compile_ns  | 38743     |
| timer_rego_query_eval_ns     | 288952134 |
| timer_rego_query_parse_ns    | 37035     |
+------------------------------+-----------+
+--------------+----------+----------+--------------+-------------------+
|     TIME     | NUM EVAL | NUM REDO | NUM GEN EXPR |     LOCATION      |
+--------------+----------+----------+--------------+-------------------+
| 288.827434ms | 4        | 4        | 4            | main.rego:10      |
| 63.714µs     | 4        | 4        | 4            | main.rego:12      |
| 32.994µs     | 4        | 4        | 4            | main.rego:14      |
| 15.366µs     | 1        | 1        | 1            | data.main.results |
| 3.809µs      | 1        | 1        | 1            | main.rego:5       |
| 3.181µs      | 1        | 1        | 1            | schema.rego:3     |
| 2.385µs      | 1        | 1        | 1            | schema.rego:36    |
+--------------+----------+----------+--------------+-------------------+

main.rego:10 is the json.match_schema call that uses the CycloneDX schema; main.rego:12 uses a much smaller schema. That's roughly 288,827 µs vs 64 µs, a difference of more than 4,000×.

Version: 0.68.0
Build Commit: db53d77c482676fadd53bc67a10cf75b3d0ce00b
Build Timestamp: 2024-08-29T15:23:19Z
Build Hostname: 3aae2b82a15f
Go Version: go1.22.5
Platform: linux/amd64
WebAssembly: available

Steps To Reproduce

See description.

Expected behavior

Validation of object should not take longer than 1ms.

@lcarva lcarva added the bug label Sep 10, 2024
@anderseknert
Member

Hi there! And thanks for filing this issue.

Looking into this briefly, almost all of that time is spent loading the JSON schema, not actually validating. The loading isn't cached either, so each call repeats it. Using a cached schema makes things... faster, to say the least. Notice the use of gojsonschema.NewSchema(sl) below, where the returned schema is reused:

package main

import (
	"fmt"
	"os"
	"time"

	"github.com/xeipuuv/gojsonschema"
)

func main() {
	now := time.Now()

	bs, err := os.ReadFile("schema.json")
	if err != nil {
		panic(err)
	}

	sl := gojsonschema.NewBytesLoader(bs)

	schema, err := gojsonschema.NewSchema(sl)
	if err != nil {
		panic(err)
	}

	dl := gojsonschema.NewStringLoader(`{"name": "John", "age": 30}`)

	result, err := schema.Validate(dl)
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Valid())
	fmt.Println(time.Since(now))

	now = time.Now()

	dl = gojsonschema.NewStringLoader(`{"another": "object", "x": 1}`)

	result, err = schema.Validate(dl)
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Valid())
	fmt.Println(time.Since(now))
}

Output

false
637.524583ms
false
14.709µs

I guess using the inter-query cache for this built-in, storing loaded schemas across decisions, would be the way to go.

It wouldn't make your single opa eval call any faster though, as the first invocation would still need to load the schema.
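The caching idea can be sketched with stdlib primitives alone. This is a hypothetical illustration of the pattern, not OPA's actual inter-query cache implementation: compile once per distinct schema, keyed by a hash of its content, and reuse the compiled value across calls.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// compiledSchema stands in for the expensive product of schema compilation.
type compiledSchema struct{ key [32]byte }

var schemaCache sync.Map // map[[32]byte]*compiledSchema

// getSchema returns a cached compiled schema, compiling only on first use.
// The cache key is a hash of the raw schema bytes, so identical schemas
// passed from different call sites share one compiled instance.
func getSchema(raw []byte) *compiledSchema {
	key := sha256.Sum256(raw)
	if v, ok := schemaCache.Load(key); ok {
		return v.(*compiledSchema)
	}
	s := &compiledSchema{key: key} // the expensive compile would happen here
	v, _ := schemaCache.LoadOrStore(key, s)
	return v.(*compiledSchema)
}

func main() {
	a := getSchema([]byte(`{"type": "object"}`))
	b := getSchema([]byte(`{"type": "object"}`))
	fmt.Println(a == b) // same pointer: compiled once, reused
}
```

LoadOrStore keeps the cache consistent under concurrent first calls: whichever goroutine stores first wins, and all callers get that instance.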

anderseknert added a commit to anderseknert/opa that referenced this issue Oct 1, 2024
I figured I'd test this out anyway, and this seemed like
a good case given that there was an actual issue on this.

Testing response times with OPA running as a server, and
the first request is ~800 ms while the following ones are
~10 ms.

Fixes open-policy-agent#7011

Signed-off-by: Anders Eknert <anders@styra.com>
@anderseknert
Member

Caching this now as described above. Note that, like I mentioned, the first hit will still be expensive, as the schema must be loaded at some point, but subsequent requests are near-instantaneous.

@lcarva
Contributor Author

lcarva commented Oct 11, 2024

@anderseknert, I accidentally figured out why loading takes so long. The CycloneDX schema has external $refs that cause additional schemas to be fetched at runtime. Removing those, or bundling them, makes the call 100 times faster.

I think caching is still useful in the cases where remote references must be used. I just wanted to share this new finding as it may help others in the future.

@anderseknert
Member

Ah, yeah, that certainly explains a lot. Thanks for letting me know! Being able to cache the schema is a good change either way, as recomputing it per request just wastes resources 🙂
