Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock when bundle and decision logging enabled #279

Merged

Conversation

ashutosh-narkar
Copy link
Member

This commit attempts to fix the deadlock that happens
when bundle and decision logging are both enabled.

The opa-envoy plugin creates a new transaction during
query evaluation and closes it once eval is complete.
Then when it attempts to log the decision, the decision
log plugin grabs mask mutex and calls the PrepareForEval
function in the rego package which tries to open a new
read transaction on the store since the log plugin does
not provide one. This call gets blocked
if concurrently the bundle plugin has a write transaction
open on the store. This write invokes the decision log plugin's callback
and tries to grab the mask mutex. This call gets blocked because
the decision log plugin is already holding onto it for the mask query.

To avoid this, we keep the transaction open in the opa-envoy plugin
till we log the decision.

Fixes: open-policy-agent/opa#3722

Signed-off-by: Ashutosh Narkar anarkar4387@gmail.com

@ashutosh-narkar
Copy link
Member Author

This change was tested by running the opa-envoy plugin with both the decision log and bundle plugin enabled. The max upload frequency and the max bundle download frequency was 1 sec. The test ran close to 6 hours. No deadlock was observed.

@ashutosh-narkar
Copy link
Member Author

The original deadlock as documented in open-policy-agent/opa#3722 was reproduced as below:

  1. Run the opa-envoy plugin with both the decision log and bundle plugin enabled. The max upload frequency and the max bundle download frequency was 1 sec.

  2. Adding a sleep after the decision log plugin grabs the mask mutex here.

The decision log plugin would grab the mask mutex and sleep. Concurrently the bundle plugin would open a write transaction on the store and lock the read mutex. It would then try to grab the mask mutex which the decision log plugin is holding on to and get blocked. The decision log plugin would wake up and then try to open a read transaction on the store and get blocked.

With the fix in this PR, this deadlock was not observed.

if txnErr != nil {
p.Store().Abort(ctx, txn)
} else {
err = p.Store().Commit(ctx, txn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any reason to commit here; the check handler is read-only so we can always just abort the transaction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

return &er, stop, nil
}

// WithTxn sets the transaction on the evaluation result
func (result *EvalResult) WithTxn(txn storage.Transaction) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably just drop this function since the Txn is exported on the EvalResult anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

}

result.TxnID = txn.ID()
err := getRevision(ctx, evalContext.Store(), result.Txn, result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the Eval() function is exported now (instead of being hidden under internal/), someone could be calling. If someone is calling it, this change is going to break them because the function doesn't open a transaction anymore. I could see two approaches here:

  1. Change the function signature so that at least callers break and can't compile
  2. Make the change backwards compatible so that the code opens a txn if needed

I think that (2) wouldn't be too difficult with a bit of refactoring to avoid duplication, but I'd have to write the code myself to find out. Could you see if it's doable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the code a bit to make changes backwards compatible. LMKWYT.

This commit attempts to fix the deadlock that happens
when bundle and decision logging are both enabled.

The opa-envoy plugin creates a new transaction during
query evaluation and closes it once eval is complete.
Then when it attempts to log the decision, the decision
log plugin grabs mask mutex and calls the PrepareForEval
function in the rego package which tries to open a new
read transaction on the store since the log plugin does
not provide one. This call gets blocked
if concurrently the bundle plugin has a write transaction
open on the store. This write invokes the decision log plugin's callback
and tries to grab the mask mutex. This call gets blocked because
the decision log plugin is already holding onto it for the mask query.

To avoid this, we keep the transaction open in the opa-envoy plugin
till we log the decision.

Fixes: open-policy-agent/opa#3722

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
@ashutosh-narkar ashutosh-narkar merged commit 50b620e into open-policy-agent:main Sep 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OPA deadlock with decision logs enabled
2 participants