Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPA deadlock with decision logs enabled #3722

Closed
amitkatyal opened this issue Aug 9, 2021 · 11 comments · Fixed by open-policy-agent/opa-envoy-plugin#279
Closed

OPA deadlock with decision logs enabled #3722

amitkatyal opened this issue Aug 9, 2021 · 11 comments · Fixed by open-policy-agent/opa-envoy-plugin#279
Assignees
Labels
bug int-envoy Issues related to the opa-envoy-plugin investigating Issues being actively investigated

Comments

@amitkatyal
Copy link

amitkatyal commented Aug 9, 2021

We are running openpolicyagent/opa version 0.28.0-envoy in the k8s cluster and are facing a deadlock issue after enabling the decision logs.

Steps to reproduce

  1. Enable the decision logs
  2. Configure the policy bundle download every X seconds
  3. Run the policy evaluation test

After a couple of days of stress test, we are seeing that OPA is getting stuck. We captured the call stack of all the goroutines and figured out that there is a deadlock between the policy bundle download and policy evaluation goroutine.

  1. Policy evaluation goroutine acquires the maskMutex inside the log plugin and attempts to take (rmu) read lock inside the inmem storage plugin but gets blocked as rmu mutex is already acquired by the downloaded goroutine.

  2. Download goroutine acquires the (rmu) inside inmem Commit() and tries to acquire maskMutex inside compilerUpdated of log plugin but get blocked as maskMutex is already acquired by the policy evaluation goroutine.

The above sequence of events results in the deadlock.

Download Goroutine call stack

#0  runtime.gopark (unlockf={void (runtime.g *, void *, bool *)} 0xc000543470, lock=0x2b71ae0 <runtime.semtable+5440>, reason=18 '\022', traceEv=25 '\031', traceskip=5) at /usr/local/go/src/runtime/proc.go:307
#1  0x0000000000619825 in runtime.goparkunlock (traceskip=0, lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>) at /usr/local/go/src/runtime/proc.go:312
#2  runtime.semacquire1 (addr=0xc0002b20b4, lifo=false, profile=3, skipframes=1) at /usr/local/go/src/runtime/sema.go:144
#3  0x000000000063ac87 in sync.runtime_SemacquireMutex (addr=0xc0002b20b4, lifo=false, skipframes=1) at /usr/local/go/src/runtime/sema.go:71
#4  0x0000000000655d65 in sync.(*Mutex).lockSlow (m=0xc0002b20b0) at /usr/local/go/src/sync/mutex.go:138
#5  0x0000000000f674b5 in sync.(*Mutex).Lock (m=0xc0002b20b0) at /usr/local/go/src/sync/mutex.go:81
#6  github.com/open-policy-agent/opa/plugins/logs.(*Plugin).compilerUpdated (p=0xc0002b2000, txn=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/logs/plugin.go:548
#7  0x0000000000f6a83e in github.com/open-policy-agent/opa/plugins/logs.(*Plugin).compilerUpdated-fm (txn=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/logs/plugin.go:547
#8  0x0000000000bd37d3 in github.com/open-policy-agent/opa/plugins.(*Manager).onCommit (m=0xc0000c49a0, ctx=..., txn=..., event=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/plugins.go:635
#9  0x0000000000bd5a70 in github.com/open-policy-agent/opa/plugins.(*Manager).onCommit-fm (ctx=..., txn=..., event=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/plugins.go:620
#10 0x0000000000bc5bf2 in github.com/open-policy-agent/opa/storage/inmem.(*store).runOnCommitTriggers (db=0xc000464700, ctx=..., txn=..., event=...) at /src/vendor/github.com/open-policy-agent/opa/storage/inmem/inmem.go:215
#11 0x0000000000bc4f26 in github.com/open-policy-agent/opa/storage/inmem.(*store).Commit (db=0xc000464700, ctx=..., txn=..., ~r2=...) at /src/vendor/github.com/open-policy-agent/opa/storage/inmem/inmem.go:103
#12 0x0000000000acc336 in github.com/open-policy-agent/opa/storage.Txn (ctx=..., store=..., params=..., f={void (github.com/open-policy-agent/opa/storage.Transaction, error *)} 0xc000543950, ~r4=...) at /src/vendor/github.com/open-policy-agent/opa/storage/storage.go:100
#13 0x0000000000f3f5aa in github.com/open-policy-agent/opa/plugins/bundle.(*Plugin).activate (p=0xc0000de790, ctx=..., name="authz", b=0xc0000fb340, ~r3=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/bundle/plugin.go:427
#14 0x0000000000f3e675 in github.com/open-policy-agent/opa/plugins/bundle.(*Plugin).process (p=0xc0000de790, ctx=..., name="authz", u=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/bundle/plugin.go:362
#15 0x0000000000f3dd3e in github.com/open-policy-agent/opa/plugins/bundle.(*Plugin).oneShot (p=0xc0000de790, ctx=..., name="authz", u=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/bundle/plugin.go:318
#16 0x0000000000f41488 in github.com/open-policy-agent/opa/plugins/bundle.(*Plugin).newDownloader.func1 (ctx=..., u=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/bundle/plugin.go:306
#17 0x0000000000f37815 in github.com/open-policy-agent/opa/download.(*Downloader).oneShot (d=0xc0002a18c0, ctx=..., ~r1=...) at /src/vendor/github.com/open-policy-agent/opa/download/download.go:153
#18 0x0000000000f37390 in github.com/open-policy-agent/opa/download.(*Downloader).loop (d=0xc0002a18c0) at /src/vendor/github.com/open-policy-agent/opa/download/download.go:117
#19 0x000000000063ecc1 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1374
#20 0x000000c0002a18c0 in ?? ()
#21 0x0000000000000000 in ?? ()

Policy Evaluation Goroutine call stack

#0  runtime.gopark (unlockf={void (runtime.g *, void *, bool *)} 0xc00074e908, lock=0x2b72120 <runtime.semtable+7040>, reason=18 '\022', traceEv=25 '\031', traceskip=4) at /usr/local/go/src/runtime/proc.go:307
#1  0x0000000000619825 in runtime.goparkunlock (traceskip=0, lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>) at /usr/local/go/src/runtime/proc.go:312
#2  runtime.semacquire1 (addr=0xc00046470c, lifo=false, profile=3, skipframes=0) at /usr/local/go/src/runtime/sema.go:144
#3  0x000000000063ac87 in sync.runtime_SemacquireMutex (addr=0xc00046470c, lifo=false, skipframes=0) at /usr/local/go/src/runtime/sema.go:71
#4  0x0000000000bc4d6f in sync.(*RWMutex).RLock (rw=<optimized out>) at /usr/local/go/src/sync/rwmutex.go:50
#5  github.com/open-policy-agent/opa/storage/inmem.(*store).NewTransaction (db=0xc000464700, ctx=..., params=[]github.com/open-policy-agent/opa/storage.TransactionParams, ~r2=..., ~r3=...) at /src/vendor/github.com/open-policy-agent/opa/storage/inmem/inmem.go:90
#6  0x0000000000ee6dc2 in github.com/open-policy-agent/opa/rego.(*Rego).getTxn (r=0xc0004ef400, ctx=..., ~r1=..., ~r2=<optimized out>, ~r3=...) at /src/vendor/github.com/open-policy-agent/opa/rego/rego.go:2297
#7  0x0000000000edea10 in github.com/open-policy-agent/opa/rego.(*Rego).PrepareForEval (r=0xc0004ef400, ctx=..., opts=[]github.com/open-policy-agent/opa/rego.PrepareOption, ~r2=..., ~r3=...) at /src/vendor/github.com/open-policy-agent/opa/rego/rego.go:1408
#8  0x0000000000f69c45 in github.com/open-policy-agent/opa/plugins/logs.(*Plugin).maskEvent.func1 (p=0xc0002b2000, txn=..., ~r0=..., ~r1=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/logs/plugin.go:729
#9  0x0000000000f6883a in github.com/open-policy-agent/opa/plugins/logs.(*Plugin).maskEvent (p=0xc0002b2000, ctx=..., txn=..., event=0xc00074f728, ~r3=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/logs/plugin.go:738
#10 0x0000000000f66e35 in github.com/open-policy-agent/opa/plugins/logs.(*Plugin).Log (p=0xc0002b2000, ctx=..., decision=0xc00087f948, ~r2=...) at /src/vendor/github.com/open-policy-agent/opa/plugins/logs/plugin.go:500
#11 0x0000000000f6abdb in github.com/open-policy-agent/opa-envoy-plugin/opa/decisionlog.LogDecision (ctx=..., manager=0xc0000c49a0, info=0xc00087f948, result=0xc00082c820, err=..., ~r5=...) at /src/opa/decisionlog/decision_log.go:65
#12 0x0000000000f6d652 in github.com/open-policy-agent/opa-envoy-plugin/internal.(*envoyExtAuthzGrpcServer).log (p=0xc0000ddae0, ctx=..., input=..., result=0xc00082c820, err=..., ~r4=...) at /src/internal/internal.go:371
#13 0x0000000000f6e276 in github.com/open-policy-agent/opa-envoy-plugin/internal.(*envoyExtAuthzGrpcServer).check.func2 (~r0=<optimized out>) at /src/internal/internal.go:253
#14 0x0000000000f6c1a7 in github.com/open-policy-agent/opa-envoy-plugin/internal.(*envoyExtAuthzGrpcServer).Check (p=0xc0000ddae0, ctx=..., req=0xc000737380, ~r2=<optimized out>, ~r3=...) at /src/internal/internal.go:232
#15 0x0000000000e74f54 in github.com/envoyproxy/go-control-plane/envoy/service/auth/v3._Authorization_Check_Handler (srv=..., ctx=..., dec={void (interface {}, error *)} 0xc00074fbd0, interceptor={void (context.Context, interface {}, google.golang.org/grpc.UnaryServerInfo *, google.golang.org/grpc.UnaryHandler, interface {} *, error *)} 0xc00074fbd8, ~r4=..., ~r5=...) at /src/vendor/github.com/envoyproxy/go-control-plane/envoy/service/auth/v3/external_auth.pb.go:638
#16 0x0000000000e546c2 in google.golang.org/grpc.(*Server).processUnaryRPC (s=0xc0002b6000, t=..., stream=0xc0008b0600, info=0xc0002b0420, md=0x2ac5730 <github.com/envoyproxy/go-control-plane/envoy/service/auth/v3..stmp_14>, trInfo=0x0, err=...) at /src/vendor/google.golang.org/grpc/server.go:1210
#17 0x0000000000e58845 in google.golang.org/grpc.(*Server).handleStream (s=0xc0002b6000, t=..., stream=0xc0008b0600, trInfo=0x0) at /src/vendor/google.golang.org/grpc/server.go:1533
#18 0x0000000000e65665 in google.golang.org/grpc.(*Server).serveStreams.func1.2 (&wg=0xc00048fe04, s=0xc0002b6000, st=..., stream=0xc0008b0600) at /src/vendor/google.golang.org/grpc/server.go:871
#19 0x000000000063ecc1 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1374
#20 0x000000c00048fe04 in ?? ()
#21 0x000000c0002b6000 in ?? ()
#22 0x000000000203c060 in github.com/open-policy-agent/opa/topdown..stmp_8 ()
#23 0x000000c000442c00 in ?? ()
#24 0x000000c0008b0600 in ?? ()
#25 0x0000000000000000 in ?? ()
@srenatus srenatus added bug int-envoy Issues related to the opa-envoy-plugin labels Aug 9, 2021
@srenatus
Copy link
Contributor

srenatus commented Aug 9, 2021

Thanks for reporting this!

I'll try to have a closer look this week, but maybe someone beats me to it.

🤔 0.28.0 is a bit old -- would you be able to repeat your tests easily with a more recent version? (Anyhow, skimming the release notes, no fixed deadlock springs at me, though; only one related to bundle downloads and shutdown...)

@amitkatyal
Copy link
Author

Thanks, @srenatus!
Yes, we looked at the release notes and couldn't find any fix for this deadlock.
We were earlier running version 29 but had to revert back to 28 due to a memory leak problem after enabling the caching.
Will wait for your findings. Thanks!

@srenatus
Copy link
Contributor

srenatus commented Aug 9, 2021

We were earlier running version 29 but had to revert back to 28 due to a memory leak problem after enabling the caching.

Have you had a chance to bring these up here? There have been cache-related fixes since 0.29.0, perhaps that issue has even been resolved?

Outside of this concrete problem reported here, it would be unfortunate to be stuck on an old version.

@srenatus srenatus added the investigating Issues being actively investigated label Aug 9, 2021
@amitkatyal
Copy link
Author

amitkatyal commented Aug 10, 2021

Yes, there is already an open ticket for the same. #2978

@tsandall
Copy link
Member

@amitkatyal thanks for filing a detailed issue! It does indeed look like there's a bug in the opa-envoy-plugin. I believe the issue is that the opa-envoy-plugin isn't keeping the policy query transaction open when it logs the decision. The bundle plugin should not be able to commit until the decision is logged.

Assuming this is correct, we'll get this fixed ASAP.

@ashutosh-narkar ashutosh-narkar self-assigned this Aug 13, 2021
@amitkatyal
Copy link
Author

@tsandall thanks for confirming the root cause. Will for the fix.

@ashutosh-narkar
Copy link
Member

@amitkatyal we are investigating this and we should have a fix out by next week.

@amitkatyal
Copy link
Author

@ashutosh-narkar thanks for the update.

@amitkatyal
Copy link
Author

@ashutosh-narkar, Any update?

@ashutosh-narkar
Copy link
Member

Hey @amitkatyal we've started to look into this in this change. We should be able to make progress on it this week. Thanks for your patience.

@amitkatyal
Copy link
Author

@ashutosh-narkar Thanks for the update!

ashutosh-narkar added a commit to ashutosh-narkar/opa-istio-plugin that referenced this issue Aug 30, 2021
This commit attempts to fix the deadlock that happens
when bundle and decision logging are both enabled.

The opa-envoy plugin creates a new transaction during
query evaluation and closes it once eval is complete.
Then when it attempts to log the decision, the decision
log plugin grabs mask mutex and calls the PrepareForEval
function in the rego package which tries to open a new
read transaction on the store since the log plugin does
not provide one. This call gets blocked
if concurrently the bundle plugin has a write transaction
open on the store. This write invokes the decision log plugin's callback
and tries to grab the mask mutex. This call gets blocked because
the decision log plugin is already holding onto it for the mask query.

To avoid this, we keep the transaction open in the opa-envoy plugin
till we log the decision.

Fixes: open-policy-agent/opa#3722

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
ashutosh-narkar added a commit to ashutosh-narkar/opa-istio-plugin that referenced this issue Sep 14, 2021
This commit attempts to fix the deadlock that happens
when bundle and decision logging are both enabled.

The opa-envoy plugin creates a new transaction during
query evaluation and closes it once eval is complete.
Then when it attempts to log the decision, the decision
log plugin grabs mask mutex and calls the PrepareForEval
function in the rego package which tries to open a new
read transaction on the store since the log plugin does
not provide one. This call gets blocked
if concurrently the bundle plugin has a write transaction
open on the store. This write invokes the decision log plugin's callback
and tries to grab the mask mutex. This call gets blocked because
the decision log plugin is already holding onto it for the mask query.

To avoid this, we keep the transaction open in the opa-envoy plugin
till we log the decision.

Fixes: open-policy-agent/opa#3722

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
ashutosh-narkar added a commit to open-policy-agent/opa-envoy-plugin that referenced this issue Sep 21, 2021
This commit attempts to fix the deadlock that happens
when bundle and decision logging are both enabled.

The opa-envoy plugin creates a new transaction during
query evaluation and closes it once eval is complete.
Then when it attempts to log the decision, the decision
log plugin grabs mask mutex and calls the PrepareForEval
function in the rego package which tries to open a new
read transaction on the store since the log plugin does
not provide one. This call gets blocked
if concurrently the bundle plugin has a write transaction
open on the store. This write invokes the decision log plugin's callback
and tries to grab the mask mutex. This call gets blocked because
the decision log plugin is already holding onto it for the mask query.

To avoid this, we keep the transaction open in the opa-envoy plugin
till we log the decision.

Fixes: open-policy-agent/opa#3722

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug int-envoy Issues related to the opa-envoy-plugin investigating Issues being actively investigated
Projects
Archived in project
4 participants