
new(modern_bpf): [EXPERIMENTAL] support variable number of ring buffers and online CPUs #820

Merged
merged 6 commits into from
Jan 16, 2023

Conversation

Andreagit97
Member

@Andreagit97 Andreagit97 commented Jan 10, 2023

I want to highlight that these features are experimental, which means they could change across libs releases. More in general, the whole modern probe is still experimental, so don't expect a stable interface at least for this release :)

What type of PR is this?

/kind feature

Any specific area of the project related to this PR?

/area driver-modern-bpf

/area libscap-engine-modern-bpf

/area libpman

/area tests

Does this PR require a change in the driver versions?

No

What this PR does / why we need it:

The way in which the modern probe references memory is a little different from the old BPF probe.
Usually, we ask our drivers to open an 8 MB buffer for every CPU, so the kernel allocates 8 MB, and then userspace maps this memory twice in order to efficiently read collected events. This is not exactly what happens in the modern BPF probe. When we ask to allocate 8 MB, the kernel under the hood maps this area twice (https://github.com/torvalds/linux/blob/5a41237ad1d4b62008f93163af1d9b1da90729d8/kernel/bpf/ringbuf.c#L107-L123); in this way both the kernel and user-space implementations are simplified and also more efficient. Userspace then maps the entire 16 MB exposed by the kernel. So, in the end, the virtual address space of the process is the same with the modern and old BPF probes; what changes is the referenced memory, since in one case we have only 8 MB while in the other we have 16 MB!

Here is an example just to be more explicit: we have 8 available CPUs and we allocate a 1 GB ring buffer for each one:

PID    USER      PR  NI    VIRT    RES    SHR  S     %CPU       %MEM     TIME+ COMMAND   
32257 root      20   0     16,0g  16,0g  16,0g S      3,0       51,6     0:01.31 scap-open (with modern bpf) 
32647 root      20   0     16,1g   8,1g   8,1g S      1,7       26,2     0:01.73 scap-open (with old probe)

This could become an issue if we have many CPUs or if we want buffers greater than 8 MB. This is the reason behind this patch!
Now the modern bpf engine exposes 2 new params:

  • cpus_for_each_buffer allows users to specify how many CPUs they want to associate with a ring buffer. For example, cpus_for_each_buffer = 1 means that we want a ring buffer for every CPU, as today; cpus_for_each_buffer = 2 means that we want a ring buffer every 2 CPUs, so a ring buffer will be shared between 2 CPUs. 0 is a special value and means that we want only one ring buffer shared between all the CPUs.
    The rule is: cpus_for_each_buffer must be >= 0 and <= max_possible_CPUs_in_the_system.
    cpus_for_each_buffer = 0 and cpus_for_each_buffer = max_possible_CPUs_in_the_system do exactly the same thing; 0 is just a simpler way to specify it without knowing the number of available CPUs on our system.
  • allocate_online_only allows users to allocate ring buffers only for online CPUs. Before this patch, the modern probe allocated a ring buffer for every available CPU, not only for online CPUs! This param can be used in combination with cpus_for_each_buffer, so also in this case we can associate more than one CPU with a ring buffer.

I think that these 2 new parameters offer great flexibility to the end user.
Let's consider a final example to clarify the solution. Imagine we have a system with 8 CPUs. With the old probe, we will have an 8 MB buffer for every CPU (therefore a total of 64 MB). With the modern probe, we have different ways to obtain the same referenced memory:

  • a 4 MB ring buffer for every CPU
  • an 8 MB ring buffer for every CPU pair
  • a 32 MB ring buffer shared between all the 8 CPUs

Which solution is best really depends on your system load, but now you have the power to find the optimal configuration for your deployment.

Which issue(s) this PR fixes:

Special notes for your reviewer:

The PR seems huge but most of the lines are due to tests or comments :)

Does this PR introduce a user-facing change?:

new(modern_bpf): support variable number of ring buffers and online CPUs

Signed-off-by: Andrea Terzolo <andrea.terzolo@polito.it>
@FedeDP
Contributor

FedeDP commented Jan 10, 2023

/milestone 0.10.1

@poiana poiana added this to the 0.10.1 milestone Jan 10, 2023
@incertum
Contributor

❤️ I like this approach, and by the way thanks for fixing this. During tests, the extra memory allocated made a huge difference, for instance on machines with 64 processors...

@FedeDP FedeDP mentioned this pull request Jan 11, 2023
FedeDP
FedeDP previously approved these changes Jan 11, 2023
Contributor

@FedeDP FedeDP left a comment


This is beautiful!
/approve

@poiana
Contributor

poiana commented Jan 11, 2023

LGTM label has been added.

Git tree hash: 93d3a3537791e593df0ec9889a7b8ccc39f8b0a0

@hbrueckner
Contributor

@Andreagit97 Not sure if this question is a bit out of scope... but what will the behavior be in case of CPU hot-plugging/un-plugging?

@Andreagit97
Member Author

Andreagit97 commented Jan 11, 2023

@Andreagit97 Not sure if this question is a bit out of scope... but what will the behavior be in case of CPU hot-plugging/un-plugging?

This is a good question indeed. The modern probe, unlike the old drivers, is able to manage hot-plug because it opens a ring buffer for every possible CPU in the system, and this concept is kept in this patch too. So let's consider an example:

(X) means offline CPU

CPU 0 (X) \
           RING BUF 0
CPU 1 (X) /

CPU 2  \
           RING BUF 1
CPU 3  /

CPU 4 (X) \
           RING BUF 2
CPU 5     /

In this case we have a ring buffer for every CPU pair, but the rationale is the same for all other cases. As you can notice, both CPU 0 and CPU 1 are offline but the ring buffer is still allocated for this pair, because these CPUs are available in the system and may become online in the future. Same for CPU 4: the CPU is linked to RING BUF 2, and when it comes online it will have the ring buffer already in place for sending events.

Not sure this answers your question; if not, feel free to ask :)

@Andreagit97
Member Author

/hold

@hbrueckner
Contributor

This is a good question indeed. The modern probe, unlike the old drivers, is able to manage hot-plug because it opens a ring buffer for every possible CPU in the system, and this concept is kept in this patch too. So let's consider an example:

Not sure this answers your question; if not, feel free to ask :)

It helped, and I have seen that the PR uses sysconf(_SC_NPROCESSORS_CONF) to set up the ring buffers for configured CPUs. There is the corner case of attaching additional CPUs and configuring them later. In that case, the number of configured CPUs would increase over time. Here is a small example:

# lscpu |grep -A2 '^CPU(s):'
CPU(s):                          3
On-line CPU(s) list:             0,1
Off-line CPU(s) list:            2

# vmcp def cpu 04
CPU 04 defined

# chcpu -r
Triggered rescan of CPUs

# lscpu |grep -A2 '^CPU(s):'
CPU(s):                          4
On-line CPU(s) list:             0,1
Off-line CPU(s) list:            2,3

@Andreagit97
Member Author

It helped, and I have seen that the PR uses sysconf(_SC_NPROCESSORS_CONF) to set up the ring buffers for configured CPUs. There is the corner case of attaching additional CPUs and configuring them later. In that case, the number of configured CPUs would increase over time. Here is a small example:

Thank you very much for this suggestion! libbpf uses /sys/devices/system/cpu/possible under the hood (https://github.com/libbpf/libbpf/blob/3423d5e7cdab356d115aef7f987b4a1098ede448/src/libbpf.c#L12188). But I think it is still vulnerable to your corner case (https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu):

		possible: cpus that have been allocated resources and can be
		brought online if they are present.

Let's say the final solution that I see here is: allocate a ring buffer array with entries for all possible CPUs (not sure what the reliable way to do that is 🤔 /sys/devices/system/cpu/kernel_max is too big unfortunately) and then, according to what happens in the system, remove and add ring buffers at run-time following the various hot-plugs. When a CPU starts and doesn't have a ringbuf, we can receive a notification in userspace in order to create a dedicated buffer. We can try to do that in the future!

The other hidden dream is to use a single ring buffer for all possible CPUs (so cpus_for_each_buffer = 0); this would be amazing, but I don't think we can really support it unless we reduce the number of events we send to userspace. Right now we have a huge throughput of data :(

Let's say this PR is a sort of: let's provide users with many configurations since we are in an experimental phase, let's gather some useful info about winning deployment solutions, and then release something really production-ready! In this sense, I would like to add to this PR the support for online-only CPUs; this could be a game changer in huge environments with many CPUs disabled.

Signed-off-by: Andrea Terzolo <andrea.terzolo@polito.it>
Signed-off-by: Andrea Terzolo <andrea.terzolo@polito.it>
@Andreagit97 Andreagit97 changed the title new(modern_bpf): support variable number of ring buffers new(modern_bpf): support variable number of ring buffers and online CPUs Jan 13, 2023
@poiana poiana removed the lgtm label Jan 13, 2023
@poiana poiana requested a review from FedeDP January 13, 2023 10:02
@Andreagit97
Member Author

I added the online CPUs feature. Sorry for the rebase, but since we need to cherry-pick it without tests it was better to do it like that.

Signed-off-by: Andrea Terzolo <andrea.terzolo@polito.it>
Signed-off-by: Andrea Terzolo <andrea.terzolo@polito.it>
@Andreagit97 Andreagit97 changed the title new(modern_bpf): support variable number of ring buffers and online CPUs new(modern_bpf): [EXPERIMENTAL] support variable number of ring buffers and online CPUs Jan 13, 2023
ringubuf_array_fd = bpf_map__fd(g_state.skel->maps.ringbuf_maps);
if(ringubuf_array_fd <= 0)
/* CPU 0 is always online */
if(cpu_id == 0)
Contributor

@hbrueckner hbrueckner Jan 13, 2023


Hmm... I would say this assumption is true on a single-CPU system. Otherwise, CPU 0 could be set offline, e.g., chcpu -d 0

CPU(s):                  4
  On-line CPU(s) list:   1,2
  Off-line CPU(s) list:  0,3

Member Author


That's true. The real issue here is that cpu0 does not have the /sys/devices/system/cpu/cpu0/online file that we use here... same for the old BPF https://github.com/falcosecurity/libs/blob/master/userspace/libscap/engine/bpf/scap_bpf.c#L1471

Contributor

@hbrueckner hbrueckner Jan 13, 2023


might have changed these days:

# cat /sys/devices/system/cpu/cpu0/online 
1

However, fine if that's a problem on older systems. Thanks for clarification.

Contributor


Mmmh cpu0/online does not exist for me either, on

uname -r
6.1.4-arch1-1

Perhaps some CPUs allow cpu0 to be disabled, and the kernel then exposes its online file too?

Contributor


Could be or it could be also related to CONFIG_HOTPLUG_CPU kernel config.

Member Author


this is interesting, I have the CONFIG_HOTPLUG_CPU=y config but no online file for cpu0

Perhaps some cpus allow the cpu0 to be disabled and the kernel then exposes their online file too?

It could be 🤔

@@ -194,7 +194,7 @@ void open_engine(sinsp& inspector)
}
else if(!engine_string.compare(MODERN_BPF_ENGINE))
{
inspector.open_modern_bpf(buffer_bytes_dim, ppm_sc, tp_set);
inspector.open_modern_bpf(buffer_bytes_dim, DEFAULT_CPU_FOR_EACH_BUFFER, true, ppm_sc, tp_set);
Contributor


If I see this right, in scap-open the default is false whereas for libsinsp it is true. I'd just like to confirm this difference.

Member Author


Yeah, IMHO the default should be online CPUs only, just to be compliant with the other 2 drivers. But this is a good point, I can change it in scap-open, thank you :)

Contributor


You are welcome. Thank you!

Signed-off-by: Andrea Terzolo <andrea.terzolo@polito.it>
Co-authored-by: Hendrik Brueckner <brueckner@de.ibm.com>
Contributor

@FedeDP FedeDP left a comment


/approve

@poiana poiana added the lgtm label Jan 16, 2023
@poiana
Contributor

poiana commented Jan 16, 2023

LGTM label has been added.

Git tree hash: b929d88e7e6919dda9f0c17be7e3c9e8f06b2510

Contributor

@hbrueckner hbrueckner left a comment


Thanks a lot!
/approve

@poiana
Contributor

poiana commented Jan 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Andreagit97, FedeDP, hbrueckner

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Andreagit97
Member Author

/unhold
