Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/reduced decision log locking #6859

Merged

Conversation

johanfylling
Copy link
Contributor

Reducing amount of work performed inside global lock in decision log plugin.

Local tests show a marginal improvement (~3 percentage points) in contention time during load (~500req/s).

Copy link

netlify bot commented Jul 5, 2024

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit 58b83b5
🔍 Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/668bff32ef2de00008914f12
😎 Deploy Preview https://deploy-preview-6859--openpolicyagent.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@johanfylling
Copy link
Contributor Author

johanfylling commented Jul 7, 2024

There seems to be a race issue here. Will dig into it on Monday.

  • Fix failing race test

Signed-off-by: Johan Fylling <johan.dev@fylling.se>
Signed-off-by: Johan Fylling <johan.dev@fylling.se>
This reverts commit c502cf9.

Signed-off-by: Johan Fylling <johan.dev@fylling.se>
Signed-off-by: Johan Fylling <johan.dev@fylling.se>
@johanfylling johanfylling force-pushed the fix/decision_log_locking_2 branch from e449f5f to 58b83b5 Compare July 8, 2024 15:01
// Make a local copy of the plugins's status. This is needed as locking the status for
// the status upload duration will block policy evaluation and result in
// increased latency for OPA clients
p.mtx.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still valid, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status update is no longer protected by the same mutex that is used for logging, so I don't think policy evaluation is affected anymore.

@@ -892,12 +890,9 @@ func (p *Plugin) oneShot(ctx context.Context) (ok bool, err error) {
continue
}

p.mtx.Lock()
for _, event := range events {
p.encodeAndBufferEvent(event)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encodeAndBufferEvent is called in the Log method as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. encodeAndBufferEvent now does it's own locking, which is why we shouldn't also lock here.

Copy link
Member

@ashutosh-narkar ashutosh-narkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johanfylling this seems fine. The thing we need to be sure about is previously encodeAndBufferEvent was an atomic operation and now it's not. We just need to ensure this doesn't have any unintended consequences and also the benefits outweigh any potential risk.

@johanfylling
Copy link
Contributor Author

I think it should be fine that encodeAndBufferEvent isn't atomic anymore. The buffer and encoder are the component's that needs to be protected. I don't think splitting a single, costly, block into a number of smaller blocks will have ill effects. The worst case scenario I can think of is that a couple of events might get their ND-builtins-caches dropped, and once they're pushed to the chunk it has already been cleared by the consumer-side, and now these events are pushed to a new chunk that would have had the capacity to hold the ND caches.

All tests pass, and I've done a fair bit of stress testing without seeing anything amiss. Do you have any specific additional tests we can run, or possible side effects to look out for in mind?

@ashutosh-narkar
Copy link
Member

All tests pass, and I've done a fair bit of stress testing without seeing anything amiss. Do you have any specific additional tests we can run, or possible side effects to look out for in mind?

Thanks! Nothing particular. Good to know about the testing that's gone behind these changes.

@johanfylling johanfylling merged commit 3209910 into open-policy-agent:main Jul 9, 2024
28 checks passed
@johanfylling johanfylling deleted the fix/decision_log_locking_2 branch July 9, 2024 08:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants