Replace CGO in the critical path #42

Open
itaysk opened this issue Jul 26, 2021 · 9 comments

@itaysk
Collaborator

itaysk commented Jul 26, 2021

To improve performance, we can bypass libbpf and cgo in the critical path (the libbpf callback).

@itaysk
Collaborator Author

itaysk commented Sep 30, 2021

from #80

It is well known that cgo is slow when calling C code from Go, and even slower when C code calls back into Go (see, for example, https://about.sourcegraph.com/go/gophercon-2018-adventures-in-cgo-performance/).
This is not a problem for most uses of libbpfgo, such as loading and attaching a program or updating a map, because those operations are infrequent. It can become a problem, however, when we poll for events coming from the kernel through one of the perf/ring buffers, because those are far more frequent.
Suggestion: implement buffer polling (for both perf and ring buffers) in pure Go. These functions can be added as an alternative API alongside the existing functions, offering higher performance where it is needed.
Specifically, these are the libbpf functions that we would need to reimplement: C.perf_buffer__poll() and C.ring_buffer__poll() (called by the PerfBuffer and RingBuffer poll() functions).

@yanivagman
Collaborator

yanivagman commented Oct 3, 2021

After performing some local tests, it seems that the bottleneck in tracee is the printer, not cgo.
Below is a pprof output where I used a very noisy event (sched_switch) and the gob printer.
The conclusion is the same for other workloads (e.g. the default event set on an idle system) and other printers (table/json): the printer path is always more expensive than the cgo path.

[Image: cgo_vs_printer]

@yanivagman
Collaborator

yanivagman commented Oct 3, 2021

With table printer:

[Image: cgo_vs_printer2]

@itaysk
Collaborator Author

itaysk commented Oct 3, 2021

What does this mean for this issue? Isn't it still something we should do?

@yanivagman
Collaborator

I still need to find a way to compare a prototype I have of a pure Go implementation against the current cgo implementation.
With pprof I can only see where the bottleneck is, but I can't quantitatively compare the two implementations.
Any suggestions for how to do that?

@yanivagman
Collaborator

yanivagman commented Oct 19, 2021

The performance of C-to-Go calls has improved in recent Go versions: golang/go#42469 (comment)

If there is no strong evidence that this is still an issue for libbpfgo, we can probably close this one for now.

@simar7
Member

simar7 commented Oct 19, 2021

> The performance of c to go calls has improved in recent go versions: golang/go#42469 (comment)
>
> If there is no strong evidence that this is still an issue for libbpfgo, we may probably close this one for now

It's also worth noting that we pass pointers around in our cgo code. As of Go 1.17, that can be done much more safely and efficiently using cgo Handles: https://pkg.go.dev/runtime/cgo#Handle

@guyarb
Contributor

guyarb commented Dec 8, 2021

Hey @yanivagman
I saw issue #80 and wondered what your plan was there. By implementing the polling in pure Go, do you mean just calling epoll_wait from Go instead of through cgo? Or were you talking about implementing more of the functionality inside perf_buffer__poll?
In my project I'm handling performance issues that are caused mainly by the cgo implementation of perf_buffer__poll, and I'm trying to find a solution.

As I understand it, unless we can somehow trigger the perf callback in pure Go, there is still going to be massive CPU consumption, as C-to-Go is the most expensive direction in cgo.

@yanivagman
Collaborator

> Hey @yanivagman I saw issue #80 and wondered what your plan was there. By implementing the polling in pure Go, do you mean just calling epoll_wait from Go instead of through cgo? Or were you talking about implementing more of the functionality inside perf_buffer__poll? In my project I'm handling performance issues that are caused mainly by the cgo implementation of perf_buffer__poll, and I'm trying to find a solution.
>
> As I understand it, unless we can somehow trigger the perf callback in pure Go, there is still going to be massive CPU consumption, as C-to-Go is the most expensive direction in cgo.

Hi @guyarb,
To implement the polling in pure Go, the following changes are required:

  1. Change the InitPerfBuf function to stop using C.init_perf_buf. Instead, the Go function should create a perf buffer for each CPU in the system (using perf_event_open()) and mmap() it into memory. The new fds can then be added to epoll.
  2. Change the Start function to call a pure Go polling function. This function should wait for events on the epoll fd and, for each ring buffer that received new events, read the data from it.

A reference implementation is Cilium's ebpf library (which is written in pure Go).
When playing with this code and trying to measure differences using pprof, I didn't see major improvements, as described above. It might be that pprof is not the right tool for this task.

In your project, how did you find that the performance issues were caused mainly by the cgo perf_buffer_poll call?
