Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decision log should include values from non-deterministic built-in functions like time.now_ns #1514

Closed
tsandall opened this issue Jun 21, 2019 · 16 comments · Fixed by #4926 or #5147
Closed

Comments

@tsandall
Copy link
Member

Decision log events should include all of the information necessary to re-execute policy evaluation. Today if policies depend on time.now_ns() or other (custom) non-deterministic built-in functions, it is not easy to accurately re-execute policy evaluation. We should extend the decision log event structure to include additional values supplied to or computed during policy evaluation that are not contained in the input, policy, or contextual data (e.g., time.now_ns() result, opa.runtime() result, http.send() result, etc.)

@philipaconrad
Copy link
Contributor

philipaconrad commented Jul 18, 2022

Built-in Functions with non-deterministic outputs

Based on a cursory look through the built-ins list, these are the functions I found that have obvious non-deterministic behavior. The rest should be deterministic.

@srenatus
Copy link
Contributor

FWIW the RNG source should come from the BuiltinContext. We've had to bend io.jwt.encode_sign and _raw to use it, though: e732b0b. But controlling that alone doesn't give us completely reproducible signatures, since there's an uncontrollable-random step that will either discard or use the first byte. (See this comment) The code path for this is Sign -> randutil.MaybeReadByte

@anderseknert
Copy link
Member

Interesting. I guess it makes sense for consistency, but do users really want reproducible token signing? Would that not entail having their private keys logged with each decision? 🤔

@srenatus
Copy link
Contributor

Hah, good point. I guess it would make sense to not worry about the io.jwt.encode_sign[_raw] builtins too much.

But thinking about this again, we also can't ever control http.send and net.lookup_ip_addr's inputs (those being outside servers) -- so, reading the top issue again, the proper way to go here might be to record the outcomes, and provide ways to stub those out for decision log replay. For that, the RNG doesn't matter as much -- we need to know what the result was -- and then io.jwt.encode_sign[_raw] aren't so different from the other builtins @philipaconrad has collected there.

@anderseknert
Copy link
Member

Agreed, persisting the output of those rather than the input seems like a better approach 👍

@philipaconrad
Copy link
Contributor

I'm considering a simple K/V cache for implementing aspects of this. Do we want a max size restriction on the cache? I'd assume that if someone enables recording non-deterministic builtin results that memory consumption isn't exactly a great concern, since you wouldn't turn it on unless you wanted to get a potentially large pile of results. 🤔

@philipaconrad
Copy link
Contributor

Also, because this feature will potentially result in massive additional memory consumption (and equally massive decision logs in some cases!), I am planning to gate this feature of decision logging behind a config file option, like decision.record_non_deterministic_builtin_results = true/false, for example.

@srenatus
Copy link
Contributor

Would it be that much, though? One time ns integer if time us used, one string for each encode_sign call... I suppose it would go into the decision logs entry, so we don't need to hold onto it anywhere else. 🤔

IIRC decision logs are flushed when their buffer is full, and that would then occur more often...?

@srenatus
Copy link
Contributor

Hmm http.send is different, though. Those response payloads can be very large.

@philipaconrad
Copy link
Contributor

Hmm http.send is different, though. Those response payloads can be very large.

Right, I'm thinking of the cases where a user hits several unique webpages.

IIRC decision logs are flushed when their buffer is full, and that would then occur more often...?

This would definitely increase how often those buffers flush, to such a degree that I think it's worth feature-gating behind a config option, with the default being to leave it turned off. 😰

@philipaconrad
Copy link
Contributor

philipaconrad commented Jul 21, 2022

My current plan (I'm copying this over from the design doc) is to only cache the first result from a non-deterministic builtin. That at least bounds how much memory we're burning to cache stuff. The main builtins that might be affected by this behavior are probably http.send, net.lookup_ip_addr, and possibly opa.runtime(), since those all could potentially have non-idempotent results.

I think this is a reasonable, scope-limiting behavior decision because if a single webpage gives back different results each time it's queried with http.send, there's no way to replicate that in a decision replay down the road unless you're recording every result from http.send, and then we're basically just doing selective tracing with more steps, and will still have to account for cases where the user queries that webpage one more time than we have recorded results for. 🙁

@srenatus
Copy link
Contributor

because if a single webpage gives back different results each time it's queried with http.send,

I don't think we'd notice 🙃 We shouldn't do two requests for

p {
    obj := {"method": "GET", "url": "https://example.com"}
    http.send(obj)
    http.send(obj)
}

so intra-query caching should make us not send two requests here (even if it was different rules etc; same input to http.send implies only one result, only one request sent. -- here's the code, I think

Same for net.lookup_ip_addr, see this.

@philipaconrad
Copy link
Contributor

Well, that's perfect then! Since we already have caching in place for these things, it's mostly going to be an exercise in plumbing work to get those cached values up to where they need to go for decision logging.

philipaconrad added a commit to philipaconrad/opa that referenced this issue Sep 1, 2022
This commit includes evaluator support for an opt-in, non-deterministic
builtins caching system, designed to help with future replay of decision
logs.

The cache allows early-exit in the evaluator if the builtin is
non-deterministic, and has already cached a result. Since the cache can
be pre-populated by `rego` module users, this should make offline policy
testing and future work around decision replay more straightforward.

Fixes: open-policy-agent#1514

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
philipaconrad added a commit that referenced this issue Sep 1, 2022
This commit includes evaluator support for an opt-in, non-deterministic
builtins caching system, designed to help with future replay of decision
logs.

The cache allows early-exit in the evaluator if the builtin is
non-deterministic, and has already cached a result. Since the cache can
be pre-populated by `rego` module users, this should make offline policy
testing and future work around decision replay more straightforward.

Fixes: #1514

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
@srenatus
Copy link
Contributor

srenatus commented Sep 1, 2022

Is this fixed already? I thought the merged PR was the first step....

@philipaconrad
Copy link
Contributor

Ah, nuts. I accidentally left in the Fixes #1514 in the PR, because when it was first written, the plan had been to go all the way to decision logging integration. Yeah, let's reopen this one.

@philipaconrad philipaconrad reopened this Sep 1, 2022
@philipaconrad
Copy link
Contributor

We'll put this zombie issue back to rest when decision logging integrates the non-deterministic builtin cache work from #4926. 🧟

philipaconrad added a commit to philipaconrad/opa that referenced this issue Oct 6, 2022
This commit integrates the ND builtins caching system into decision
logging, both in the server and sdk packages. Some reworking of the
NDBCache's serialization format were required to accommodate this.
The feature is disabled by default, and must be opted into by user
configuration.

The feature can be enabled via a top-level config key:

    nd_builtin_cache=true

The NDBCache is exposed to the masking system under the
`/nd_builtin_cache` path, which allows masking or dropping sensitive
values from decision logs selectively.

Note: If a decision log event exceeds the `upload_size_limit_bytes`
value for the OPA instance, OPA will reattempt uploading it, after
dropping the NDBCache from the event. This behavior will trigger a log
error, and will increment the `decision_logs_nd_builtin_cache_dropped`
metrics counter.

Fixes: open-policy-agent#1514

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
philipaconrad added a commit that referenced this issue Oct 6, 2022
This commit integrates the non-deterministic builtins caching system
into decision logging, both in the server and sdk packages. Some
reworking of the NDBCache's serialization format were required to
accommodate this. The feature is disabled by default, and must be
opted into by user configuration.

The feature can be enabled via a top-level config key:

    nd_builtin_cache=true

The NDBCache is exposed to the masking system under the
`/nd_builtin_cache` path, which allows masking or dropping sensitive
values from decision logs selectively.

Note: If a decision log event exceeds the `upload_size_limit_bytes`
value for the OPA instance, OPA will reattempt uploading it, after
dropping the NDBCache from the event. This behavior will trigger a log
error, and will increment the `decision_logs_nd_builtin_cache_dropped`
metrics counter.

Fixes: #1514

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
byronic pushed a commit to byronic/opa that referenced this issue Oct 17, 2022
…olicy-agent#5147)

This commit integrates the non-deterministic builtins caching system
into decision logging, both in the server and sdk packages. Some
reworking of the NDBCache's serialization format were required to
accommodate this. The feature is disabled by default, and must be
opted into by user configuration.

The feature can be enabled via a top-level config key:

    nd_builtin_cache=true

The NDBCache is exposed to the masking system under the
`/nd_builtin_cache` path, which allows masking or dropping sensitive
values from decision logs selectively.

Note: If a decision log event exceeds the `upload_size_limit_bytes`
value for the OPA instance, OPA will reattempt uploading it, after
dropping the NDBCache from the event. This behavior will trigger a log
error, and will increment the `decision_logs_nd_builtin_cache_dropped`
metrics counter.

Fixes: open-policy-agent#1514

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
Signed-off-by: Byron Lagrone <byron.lagrone@seqster.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
4 participants