
rego policy has poor performance #2281

Closed
WeiZhang555 opened this issue Apr 8, 2020 · 12 comments
Assignees
Labels
investigating Issues being actively investigated ir/wasm


@WeiZhang555

Expected Behavior

We're trying to implement some network traffic routing functions in our system with Rego policies. For the best performance, we chose to compile the Rego to WASM format and run it with the wasmer project.

We chose WASM because our preliminary tests showed that it has better performance. We want to use Rego to match HTTP headers/cookies/queries etc., and we expect it to handle 5000 rules in less than 1ms, but the performance can't reach our expectation.

Actual Behavior

  1. The policy WASM file handles 5000 rules in 2740ms, which is far more than 1ms.
  2. Even stranger, time opa eval ... only costs 1229ms, which is faster than the WASM version. So I think some bug or problem is introduced while generating the WASM binary.

Steps to Reproduce the Problem

I'll paste the rego file here:

===========================
control.rego:

package control

result = output {
	# Select the specs for which the input matches all of the conditions.
	getSpecs[specs]
	satSpecs := [specs[i] | not unsatisfyConds(specs[i])]

	# Get the highest priority (the lowest number) of the selected specs.
	highPriority := min([ prio | prio := satSpecs[j].priority])

	output := [{
		"targets": satSpecs[k].traffic_target,
		"labels": getLabels(satSpecs[k]),
	} |
		# Choose the specs with the highest priority.
		highPriority == satSpecs[k].priority
	]
}

getSpecs[output] {
	# NOTICE: we can't modify this piece of code to such format:
	# output = data.exact[input.host]
	# issue: https://github.com/open-policy-agent/opa/issues/2260
	exact := data.exact
	output = exact[input.host]
}

getLabels(spec) = output {
	not spec.labels
	output := {}
}

getLabels(spec) = output {
	labels := spec.labels
	labelsFromSpec := getLabelsFromSpec(labels)
	labelsFromNoneList := getLabelsFromNoneList(labels)
	labelsFromList := getLabelsFromList(labels)
	output := object.union(object.union(labelsFromSpec, labelsFromNoneList), labelsFromList)
}

getLabelsFromSpec(labels) = output {
	output := {k: v |
		k := labels[i].key
		v := labels[i].value
		k != ""
		v != ""
	}
}

# Get labels from input none list field.
getLabelsFromNoneList(labels) = output {
	output := {k: v |
		k := labels[i].key
		k != ""
		scope := labels[i].value_from.match_scope
		v := input[scope]
		labels[i].value_from.key = scope
	}
}

# Get labels from input list field.
getLabelsFromList(labels) = output {
	output := {k: v |
		k := labels[i].key
		k != ""
		scope := labels[i].value_from.match_scope
		v := input[scope][j].value
		key := labels[i].value_from.key
		key != scope
		input[scope][j].key = key
	}
}

unsatisfyConds(trafficSpec) {
	c := trafficSpec.conditions[_]
	not check(c)
}

##################################
# Check inclusive conditions.    #
##################################

check(condition) {
	c := condition.inclusive[_]
	checkPathInclusive(c)
	checkHostInclusive(c)
	checkMethodInclusive(c)
	checkQueryInclusive(c)
	checkCookieInclusive(c)
	checkHeaderInclusive(c)
}

##################################
# Check exclusive conditions.    #
##################################

check(condition) {
	c := condition.exclusive[_]
	checkPathExclusive(c)
	checkHostExclusive(c)
	checkMethodExclusive(c)
	checkQueryExclusive(c)
	checkCookieExclusive(c)
	checkHeaderExclusive(c)
}

##################################
# Check path conditions.         #
##################################

anyPathMatch(condition) {
	condition.key = "path"
	condition.value = input.path
}

checkPathInclusive(condition) {
	not condition.match_scope = "path"
}

else {
	condition.match_scope = "path"
	condition.match_type = "exact"
	anyPathMatch(condition)
}

checkPathExclusive(condition) {
	not condition.match_scope = "path"
}

else {
	condition.match_scope = "path"
	condition.match_type = "exact"
	not anyPathMatch(condition)
}

##################################
# Check host conditions.         #
##################################

anyHostMatch(condition) {
	condition.key = "host"
	condition.value = input.host
}

checkHostInclusive(condition) {
	not condition.match_scope = "host"
}

else {
	condition.match_scope = "host"
	condition.match_type = "exact"
	anyHostMatch(condition)
}

checkHostExclusive(condition) {
	not condition.match_scope = "host"
}

else {
	condition.match_scope = "host"
	condition.match_type = "exact"
	not anyHostMatch(condition)
}

##################################
# Check method conditions.       #
##################################

anyMethodMatch(condition) {
	condition.key = "method"
	condition.value = input.method
}

checkMethodInclusive(condition) {
	not condition.match_scope = "method"
}

else {
	condition.match_scope = "method"
	condition.match_type = "exact"
	anyMethodMatch(condition)
}

checkMethodExclusive(condition) {
	not condition.match_scope = "method"
}

else {
	condition.match_scope = "method"
	condition.match_type = "exact"
	not anyMethodMatch(condition)
}

##################################
# Check query conditions.        #
##################################

anyQueryMatch(condition) {
	condition.value = input.query[condition.key]
}

checkQueryInclusive(condition) {
	not condition.match_scope = "query"
}

else {
	condition.match_scope = "query"
	condition.match_type = "exact"
	anyQueryMatch(condition)
}

checkQueryExclusive(condition) {
	not condition.match_scope = "query"
}

else {
	condition.match_scope = "query"
	condition.match_type = "exact"
	not anyQueryMatch(condition)
}

##################################
# Check cookie conditions.       #
##################################

anyCookieMatch(condition) {
	condition.value = input.cookie[condition.key]
}

checkCookieInclusive(condition) {
	not condition.match_scope = "cookie"
}

else {
	condition.match_scope = "cookie"
	condition.match_type = "exact"
	anyCookieMatch(condition)
}

checkCookieExclusive(condition) {
	not condition.match_scope = "cookie"
}

else {
	condition.match_scope = "cookie"
	condition.match_type = "exact"

	# not any == FOR ALL
	not anyCookieMatch(condition)
}

##################################
# Check header conditions.       #
##################################

anyHeaderMatch(condition) {
	condition.value = input.header[condition.key]
}

checkHeaderInclusive(condition) {
	not condition.match_scope = "header"
}

else {
	condition.match_scope = "header"
	condition.match_type = "exact"
	anyHeaderMatch(condition)
}

checkHeaderExclusive(condition) {
	not condition.match_scope = "header"
}

else {
	condition.match_scope = "header"
	condition.match_type = "exact"

	# not any == FOR ALL
	# be careful when writing "Universal Quantification":
	# https://www.openpolicyagent.org/docs/latest/policy-language/#universal-quantification-for-all
	not anyHeaderMatch(condition)
}

===========================
data.json:

data.json.txt

===========================
input.json:

{
	"path": "/hello/world",
	"host": "test1.com",
	"method": "GET",
	"query": {
		"key1": "value1"
	},
	"cookie": {
		"key1": "value1"
	},
	"header": {
		"key1": "value1",
		"key2": "value2",
		"key3": "value3"
	}
}

The wasmer integration code is fairly long, so I won't paste it here unless it's really necessary.

I want to know whether it's possible to finish evaluation in less than 1ms, or whether that's actually impossible. Is it a bug that WASM runs slower than opa eval ...?

@WeiZhang555
Author

The data.json format can be described with this piece of Go code:

package main

// TrafficSpecs are policy configuration data
type TrafficSpecs struct {
	// map from exact hostname to TrafficSpec list
	ExactMatch map[string][]TrafficSpec `json:"exact"`
	// map from regexp hostname to Traffic list
	RegexMatch map[string][]TrafficSpec `json:"regexp"`
}

type TrafficSpec struct {
	Priority       uint        `json:"priority"`
	Conditions     []Condition `json:"conditions"`
	TrafficTargets []string    `json:"traffic_target"`
	Labels         []Label     `json:"labels,omitempty"`
}

type Condition struct {
	IC []KVMap `json:"inclusive,omitempty"`
	EC []KVMap `json:"exclusive,omitempty"`
}

type KVMap struct {
	Key        string `json:"key"`
	MatchScope string `json:"match_scope"`
	Value      string `json:"value"`
	MatchType  string `json:"match_type"`
}

type Label struct {
	Key       string   `json:"key"`
	Value     string   `json:"value,omitempty"`
	ValueFrom ScopeKey `json:"value_from,omitempty"`
}

type ScopeKey struct {
	Key        string `json:"key"`
	MatchScope string `json:"match_scope"`
}

// =================================================
// Input is the query input data; it contains field data from the HTTP request
type Input struct {
	Path   string            `json:"path,omitempty"`
	Host   string            `json:"host,omitempty"`
	Method string            `json:"method,omitempty"`
	Query  map[string]string `json:"query,omitempty"`
	Cookie map[string]string `json:"cookie,omitempty"`
	Header map[string]string `json:"header,omitempty"`
}

// =================================================
type Output struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

@WeiZhang555
Author

ping @tsandall for help

@tsandall
Member

tsandall commented Apr 8, 2020

@WeiZhang555 Hi! Thanks for filing this with info to help reproduce. One comment and one question right off the bat:

  • We don't support rule indexing for Wasm-compiled policies today, so latency will grow linearly with the size of the ruleset (e.g., 10K allow rules will take ~2x longer to evaluate than 5K allow rules). This is NOT the case with the Go implementation. The Go implementation indexes rulesets so that policy evaluation ONLY executes rules that could match the input. If there are 5K allow rules and only 10 of them could match (e.g., because the other 4,990 don't have matching HTTP method/path values), then only those 10 will be evaluated. In your example, what does the 5K number relate to in your policy? Is it the size of data.exact[input.host]?

  • Since you're specifying specs and conditions in data, partial evaluation could help here. I'll take a look and see if it's able to unroll anything and trigger indexing.

@tsandall tsandall added the investigating Issues being actively investigated label Apr 8, 2020
@WeiZhang555
Author

@tsandall Thanks very much for your quick response!

In your example, what does the 5K number relate to in your policy? Is it the size of data.exact[input.host]?

Nope, in the data.json I provided above,

ExactMatch map[string][]TrafficSpec `json:"exact"`

The ExactMatch field uses a map (in Rego, an object) and it has 2 elements; since map lookup is quite efficient, this won't be the bottleneck.

For each element in the map, the []TrafficSpec slice has 100 elements.
For each TrafficSpec, the Conditions []Condition field contains 50 KVMap entries in total.

So each ExactMatch[key] has 100*50 = 5000 KVMap entries to evaluate; the "5000 rules" figure actually means 5000 KVMap entries.

In the whole data.json, that's 5000 * 2 = 10000 KVMap entries.

========
More details describing the policy:

For each condition:

type Condition struct {
	IC []KVMap `json:"inclusive,omitempty"`
	EC []KVMap `json:"exclusive,omitempty"`
}

and example json:

{
    "inclusive": [a, b],
    "exclusive": [c, d],
}

It means (a OR b OR (NOT c) OR (NOT d)), and the entries of Conditions []Condition are combined with AND.
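To make these semantics concrete, here is a minimal Go sketch (my own illustration, not code from the thread; the flattened "scope:key" input map and the matches helper are hypothetical stand-ins for the per-scope exact-match checks in the policy):

```go
package main

import "fmt"

// KVMap and Condition mirror the structs from the data model above.
type KVMap struct {
	Key, MatchScope, Value, MatchType string
}

type Condition struct {
	IC []KVMap // inclusive
	EC []KVMap // exclusive
}

// matches is a stand-in for the per-scope exact-match checks
// (path/host/method/query/cookie/header). The input is flattened
// to "scope:key" -> value for brevity.
func matches(kv KVMap, input map[string]string) bool {
	return input[kv.MatchScope+":"+kv.Key] == kv.Value
}

// checkCondition implements (a OR b OR (NOT c) OR (NOT d)): any
// inclusive match, or any exclusive non-match, satisfies the condition.
func checkCondition(c Condition, input map[string]string) bool {
	for _, kv := range c.IC {
		if matches(kv, input) {
			return true
		}
	}
	for _, kv := range c.EC {
		if !matches(kv, input) {
			return true
		}
	}
	return false
}

// satisfied ANDs all conditions of a spec together.
func satisfied(conds []Condition, input map[string]string) bool {
	for _, c := range conds {
		if !checkCondition(c, input) {
			return false
		}
	}
	return true
}

func main() {
	input := map[string]string{"header:key1": "value1"}
	conds := []Condition{
		{IC: []KVMap{{Key: "key1", MatchScope: "header", Value: "value1"}}},
	}
	fmt.Println(satisfied(conds, input)) // prints "true"
}
```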

=====
Not sure if partial eval could help us; in my test, OPA server mode always cost more than xx ms.

@WeiZhang555
Author

We don't support rule indexing for Wasm-compiled policies today so latency will grow linearly with the size of the ruleset

Is this problem solvable? If so, do you have plans to improve WASM-compiled policy performance? @tsandall

@tsandall
Member

tsandall commented Apr 9, 2020

The current eval latency on my machine with provided policy/data/input is ~145ms...
so ~145x higher than the requirement.

$ opa bench -d control.rego -d data.json -i input.json 'data.control.result=x'
+-------------------------------------------+---------------+
| samples                                   |             8 |
| ns/op                                     |     145485461 |
| B/op                                      |     119011042 |
| allocs/op                                 |       1858467 |
| histogram_timer_rego_query_eval_ns_75%    |     150153361 |
| histogram_timer_rego_query_eval_ns_90%    |     158051397 |
| histogram_timer_rego_query_eval_ns_95%    |     158051397 |
| histogram_timer_rego_query_eval_ns_99%    |     158051397 |
| histogram_timer_rego_query_eval_ns_99.9%  |     158051397 |
| histogram_timer_rego_query_eval_ns_99.99% |     158051397 |
| histogram_timer_rego_query_eval_ns_count  |          8.00 |
| histogram_timer_rego_query_eval_ns_max    |     158051397 |
| histogram_timer_rego_query_eval_ns_mean   |     145463348 |
| histogram_timer_rego_query_eval_ns_median |     143215456 |
| histogram_timer_rego_query_eval_ns_min    |     138385930 |
| histogram_timer_rego_query_eval_ns_stddev |       6092078 |
+-------------------------------------------+---------------+

A lot of this just comes from the interpretive overhead. Simply iterating over all
of the conditions takes ~10ms:

# $ opa bench -d control.rego -d data.json -i input.json 'data.control.search'
# +-------------------------------------------+------------+
# | samples                                   |        100 |
# | ns/op                                     |   10699421 |
# | B/op                                      |    6713139 |
# | allocs/op                                 |     205417 |
# | histogram_timer_rego_query_eval_ns_75%    |   11261398 |
# | histogram_timer_rego_query_eval_ns_90%    |   11844537 |
# | histogram_timer_rego_query_eval_ns_95%    |   12487026 |
# | histogram_timer_rego_query_eval_ns_99%    |   14215295 |
# | histogram_timer_rego_query_eval_ns_99.9%  |   14227405 |
# | histogram_timer_rego_query_eval_ns_99.99% |   14227405 |
# | histogram_timer_rego_query_eval_ns_count  |        100 |
# | histogram_timer_rego_query_eval_ns_max    |   14227405 |
# | histogram_timer_rego_query_eval_ns_mean   |   10686602 |
# | histogram_timer_rego_query_eval_ns_median |   10957291 |
# | histogram_timer_rego_query_eval_ns_min    |    9358892 |
# | histogram_timer_rego_query_eval_ns_stddev |     986429 |
# +-------------------------------------------+------------+
search {
	data.exact["test0.com"][i].conditions[j]
}

Adding inclusive and exclusive statements on conditions increases the latency
to ~40ms. So by iterating over everything on each query, the budget is already
blown by 40x. The conditions still need to be evaluated. As written each condition
takes ~60us (microseconds) to evaluate. I reduced this to ~45us with some simple
refactoring:

$ opa bench -d control.rego -d data.json -i input.json 'data.control.check(data.exact["test0.com"][0].conditions[0])'
+-------------------------------------------+------------+
| samples                                   |      26326 |
| ns/op                                     |      45388 |
| B/op                                      |      28818 |
| allocs/op                                 |        439 |
| histogram_timer_rego_query_eval_ns_75%    |      38902 |
| histogram_timer_rego_query_eval_ns_90%    |      46016 |
| histogram_timer_rego_query_eval_ns_95%    |      55258 |
| histogram_timer_rego_query_eval_ns_99%    |      98401 |
| histogram_timer_rego_query_eval_ns_99.9%  |     290903 |
| histogram_timer_rego_query_eval_ns_99.99% |     291079 |
| histogram_timer_rego_query_eval_ns_count  |      26326 |
| histogram_timer_rego_query_eval_ns_max    |     291079 |
| histogram_timer_rego_query_eval_ns_mean   |      39843 |
| histogram_timer_rego_query_eval_ns_median |      36016 |
| histogram_timer_rego_query_eval_ns_min    |      31515 |
| histogram_timer_rego_query_eval_ns_stddev |      16414 |
+-------------------------------------------+------------+

However, with 5,000 conditions to check, we're looking at roughly 225ms to check
all of the conditions. This doesn't include ranking that must happen at the end
according to the policy. Note that checking 5,000 conditions in <= 1ms leaves an
average budget of 200ns (nanoseconds) per condition.

In theory, Wasm could help here; however, the older benchmarks only showed a speedup
of ~20-40x for this kind of use case (i.e., interpreting a bunch of boolean conditions
encoded in data by performing a linear scan). That's still not enough, without doing
a bunch of optimization work, to get the >=100x speedup you're looking for.

Given this, any solution with OPA will have to be able to leverage rule indexing.
This assumes there is a relatively small number of conditions that could match a given request
(e.g., on the order of 10-20 if each condition takes ~50us). In the dataset that
you provided I noticed that all of the specs for "test0.com" match the provided input.
Is this expected in your use case?

$ opa eval -d control_original.rego -d data.json -i input.json 'count(data.exact[input.host])' -f pretty
100
$ opa eval -d control_original.rego -d data.json -i input.json 'count([s|data.exact[input.host][_] = s; not data.control.unsatisfyConds(s)])' -f pretty
100

Note that rule indexing relies on the conditions being encoded in the policy, not inside
the data. There are two options here:

  1. Write the policy in a way that partial evaluation can process. Currently the policy
    makes heavy use of ordering (else) and aggregates that prevent partial evaluation from
    working well. The ordering can be removed relatively easily however the aggregates
    used to pick the highest priority spec are more difficult. I don't know if we can address
    that with partial evaluation in the short term.

  2. Write some code that codegens Rego from your data model. This would give you
    full control over the Rego.

How many conditions do you expect to have to load into a single OPA instance? E.g.,
you have 2 hosts in your example data set but how many would you expect in practice?
If you have more than a few hundred, I think you'll start to run into issues with
partial evaluation or policy compiles taking too long (even though they can happen
in the background, I'd expect them to take many seconds if you have 100K-1M rules
which could occur if you have thousands of rules and hundreds of hosts.)

Thanks for the detailed writeup and for providing insight into the use case. If you
can share a few more details about the use case we could probably answer the feasibility
questions that I've raised above.

@WeiZhang555
Author

I appreciate your time on this issue so much! @tsandall

In the dataset that
you provided I noticed that all of the specs for "test0.com" match the provided input.
Is this expected in your use case?

Nope, this is the worst case I believe; in the real world, I guess only a few rules would match.

How many conditions do you expect to have to load into a single OPA instance? E.g.,
you have 2 hosts in your example data set but how many would you expect in practice?

In our production system, we have more than 5000 hosts. In my previous test, if I put 5000 hosts in a single data file, the file would be too large and the data loading and JSON parsing would take too long. So we decided to put only 1 host in each data.json, and we will have 5000 data.json files in separate packages.

If you
can share a few more details about the use case we could probably answer the feasibility
questions that I've raised above.

Recently I've been trying to introduce OPA to our internal system; this is our first attempt to push OPA into production. We have several nginx servers acting as ingress network traffic routers between the public internet and our web service, which is very sensitive to security and performance. The performance requirement is that >90% of requests must be handled in 1ms, so this actually leaves less than 1ms (maybe 300us) for OPA.

In my preliminary test, a WASM-compiled Rego policy can handle 10~20 rules in less than 100us, which is encouraging, but as the rule count increases, its performance degrades very quickly.

So now we're really stuck at the performance issue 😢

@WeiZhang555
Author

Today I got some progress: after I increased the initial instance memory of the policy WASM to a much bigger number, the latency quickly decreased from 1.6s to 20ms. By contrast, opa eval costs about 170ms, which is 8.5x slower than WASM.

It's closer to our target now 😄

@tsandall
Member

@WeiZhang555 that sounds promising! Let us know if the performance is good enough with the larger heap size. If you need to reduce the time further, there is probably quite a bit of performance optimization that could happen in the query planner, wasm backend, and C code that would help. We'd have to do some profiling and analysis first.

@WeiZhang555
Author

@tsandall I will keep trying to decrease the latency to 1ms. Thanks for your suggestion; I will read the code and investigate more. 😄

@tsandall tsandall added ir/wasm investigating Issues being actively investigated and removed investigating Issues being actively investigated labels Apr 13, 2020
@tsandall tsandall self-assigned this Apr 15, 2020
@tsandall
Member

@WeiZhang555 I'm going to close this for now. Also, in #2328 we improved the set/object implementations to use hashtables (they were using linked lists before), so the performance should be more in line with expectations (e.g., object lookup is constant-time now, as it should be).

@WeiZhang555
Author

@tsandall OK. Thanks for your help anyway 😄
