Bpf ring over writable #17

Open
wants to merge 4 commits into
base: francis/main

eiffel-fl

From 4b900b4 Mon Sep 17 00:00:00 2001
From: Francis Laniel
Date: Fri, 26 Aug 2022 18:49:30 +0200
Subject: [RFC PATCH v2 0/3] Make BPF ring buffer over writable

Hi.

First, I hope you are fine and the same for your relatives.

Normally, when a BPF ring buffer is full, producers cannot write anymore and
need to wait for the consumer to read some data.
As a consequence, calling bpf_ringbuf_reserve() from eBPF code returns NULL.

This contribution adds a new flag to make a BPF ring buffer over writable.
When the buffer is full, producers overwrite the oldest data.
So, calling bpf_ringbuf_reserve() on an over writable BPF ring buffer never
returns NULL, but the consumer will lose the oldest data.
This flag can be used to monitor lots of events, like all the syscalls done on
a given machine.

The self test added in the last patch was tested and validated in a VM:
you@vm# ./linux/tools/testing/selftests/bpf/test_progs -t ringbuf_over
Can't find bpf_testmod.ko kernel module: -2
WARNING! Selftests relying on bpf_testmod.ko will be skipped.
#135 ringbuf_over_writable:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

If you see any way to improve this contribution, feel free to share.

Changes since:
v1:

  • Write from the end of the buffer like the perf ring buffer; this permits
    handling records that do not all have the same size.

Francis Laniel (3):
bpf: Make ring buffer overwritable.
libbpf: Add flag to create over writable ring buffer.
selftests: Add BPF over writable ring buffer self tests.

include/uapi/linux/bpf.h | 3 +
kernel/bpf/ringbuf.c | 57 +++++--
tools/include/uapi/linux/bpf.h | 3 +
tools/testing/selftests/bpf/Makefile | 5 +-
.../bpf/prog_tests/ringbuf_over_writable.c | 156 ++++++++++++++++++
.../bpf/progs/test_ringbuf_over_writable.c | 61 +++++++
6 files changed, 271 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf_over_writable.c
create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_over_writable.c

Best regards and thank you in advance.

2.25.1

@eiffel-fl
Author

If you want to test it, you can use:
https://gitlab.com/eiffel/qemu-scripts

Follow the README.md and just run the following commands before compiling the kernel:

$ cat tools/testing/selftests/bpf/config >> .config
$ make menuconfig
# You can directly exit the menu config, this is just intended to clean the config.

@eiffel-fl
Author

I had an idea regarding the libbpf implementation: I will need to add a previous_prod_pos field to the userspace ring buffer structure.
When *_consume() is called, it will check that the current prod_pos is bigger than previous_prod_pos. This avoids consuming data we already consumed.
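
The check described above could look roughly like the following sketch. It is not the libbpf code: the structure prod_pos_tracker and the function tracker_has_new_data() are hypothetical names used for illustration, and wrap-around of the position counter is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the userspace ring buffer state. */
struct prod_pos_tracker {
	unsigned long prod_pos;          /* position currently published by producers */
	unsigned long previous_prod_pos; /* position seen by the previous consume call */
};

/*
 * Returns true when producers advanced since the last call, and remembers
 * the new position so the same data is not reported twice.
 * Assumption: positions only grow and never wrap in this sketch.
 */
static bool tracker_has_new_data(struct prod_pos_tracker *t)
{
	assert(t != NULL);

	if (t->prod_pos > t->previous_prod_pos) {
		t->previous_prod_pos = t->prod_pos;
		return true;
	}

	return false; /* nothing new: the consume call would do nothing */
}
```

A consume function would call tracker_has_new_data() first and return early when it yields false.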

@alban
Member

alban commented Aug 27, 2022

This flag can be used to monitor lots of events, like all the syscalls done on a given machine.

I think the use case should be described more precisely. I suggest the following for the cover letter:

Overwritable ring buffers are useful in BPF programs that are permanently enabled but rarely read, only on-demand, for example in case of a user request to investigate problems. We would like to use this in the Traceloop project (https://github.com/kinvolk/traceloop) presented at LPC 2020 (https://lpc.events/event/7/contributions/667/).

Perf ring buffers already implement an option to be overwritable. In order to avoid data corruption, the data is written backward, see commit 9ecda41 ("perf/core: Add ::write_backward attribute to perf event"). This patch series re-uses the same idea from perf ring buffers but in bpf ring buffers.


I think the first commit message, 'bpf: Make ring buffer overwritable.', should also explain how it avoids memory corruption with the backward method, and reference commit 9ecda41.
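
To illustrate why writing backward helps with variable-size records, here is a toy sketch. It is not the kernel or perf code: the 64-byte buffer, the 4-byte length header, and the names bw_ring, bw_write() and bw_count_intact() are assumptions made for illustration. The newest record always starts at head, so a reader can begin from a valid header and walk toward older records until it leaves the window of bytes that are still intact.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BW_SIZE 64u             /* toy buffer size, a power of two */
#define BW_MASK (BW_SIZE - 1)

struct bw_ring {
	uint8_t data[BW_SIZE];
	uint64_t head;          /* decreases on each write; newest record starts here */
	uint64_t written;       /* total number of bytes ever written */
};

/* Copy len bytes into the ring starting at position pos, wrapping as needed. */
static void bw_copy_in(struct bw_ring *r, uint64_t pos, const void *src, uint32_t len)
{
	const uint8_t *s = src;

	for (uint32_t i = 0; i < len; i++)
		r->data[(pos + i) & BW_MASK] = s[i];
}

/* Write one record backward: a 4-byte length followed by the payload. */
static void bw_write(struct bw_ring *r, const void *payload, uint32_t len)
{
	uint32_t total = (uint32_t)sizeof(len) + len;

	assert(total <= BW_SIZE);
	r->head -= total;       /* move head toward lower positions first */
	bw_copy_in(r, r->head, &len, sizeof(len));
	bw_copy_in(r, r->head + sizeof(len), payload, len);
	r->written += total;
}

/* Walk from the newest record toward older ones and count the intact ones. */
static unsigned int bw_count_intact(const struct bw_ring *r)
{
	uint64_t valid = r->written < BW_SIZE ? r->written : BW_SIZE;
	uint64_t off = 0;
	unsigned int n = 0;

	while (off + sizeof(uint32_t) <= valid) {
		uint8_t raw[sizeof(uint32_t)];
		uint32_t len;

		for (size_t i = 0; i < sizeof(raw); i++)
			raw[i] = r->data[(r->head + off + i) & BW_MASK];
		memcpy(&len, raw, sizeof(len));

		if (off + sizeof(len) + len > valid)
			break;  /* this older record was partly overwritten */
		off += sizeof(len) + len;
		n++;
	}

	return n;
}
```

With forward writes, the reader would instead have to guess where the oldest intact record starts, which is not possible once its neighbour's header has been overwritten.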


Documentation:

The overwritable mode is documented as follows for the perf ring buffer:

The ring buffer can be used in either an overwrite mode or in producer/consumer mode.
Producer/consumer mode is where if the producer were to fill up the buffer before the consumer could free up anything, the producer will stop writing to the buffer. This will lose most recent events.
Overwrite mode is where if the producer were to fill up the buffer before the consumer could free up anything, the producer will overwrite the older data. This will lose the oldest events.

I think similar or the same text can be reused for documenting this feature for the bpf ring buffer.

And for consistency with the perf ring buffer, the cover letter can adopt the same terminology (overwrite mode and producer/consumer mode).
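
As a rough illustration of the two modes quoted above, the following sketch models a ring with eight fixed-size slots. It is only a model of the semantics, not the kernel implementation; the names toy_mode_ring, toy_mode_push() and toy_mode_oldest() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

struct toy_mode_ring {
	int slots[TOY_SLOTS];
	size_t count;   /* number of stored events, at most TOY_SLOTS */
	size_t next;    /* index of the slot the next write goes to */
	bool overwrite; /* true: overwrite mode, false: producer/consumer mode */
};

/* Returns false when the event is dropped, which only happens in
 * producer/consumer mode once the ring is full. */
static bool toy_mode_push(struct toy_mode_ring *r, int event)
{
	assert(r != NULL);

	if (r->count == TOY_SLOTS && !r->overwrite)
		return false;              /* full: the most recent event is lost */

	r->slots[r->next] = event;         /* in overwrite mode this may replace the oldest event */
	r->next = (r->next + 1) % TOY_SLOTS;
	if (r->count < TOY_SLOTS)
		r->count++;

	return true;
}

/* Oldest event still stored; the ring must not be empty. */
static int toy_mode_oldest(const struct toy_mode_ring *r)
{
	assert(r != NULL && r->count > 0);

	return r->slots[(r->next + TOY_SLOTS - r->count) % TOY_SLOTS];
}
```

Pushing events 1 to 16 without any consumer leaves 9 as the oldest stored event in overwrite mode, while in producer/consumer mode events 9 to 16 are rejected and 1 remains the oldest.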

Review thread on include/uapi/linux/bpf.h (outdated, resolved)
@eiffel-fl eiffel-fl force-pushed the bpf-ring-over-writable branch from 4b900b4 to b5a8c44 Compare August 29, 2022 14:08
@eiffel-fl
Author

From 4082b21fcf2a1723c63eaeaa523b4f5cd2b19204 Mon Sep 17 00:00:00 2001
From: Francis Laniel
Date: Mon, 29 Aug 2022 16:01:36 +0200
Subject: [RFC PATCH v2 0/5] Make BPF ring buffer overwritable

Hi.

First, I hope you are fine and the same for your relatives.

Normally, when a BPF ring buffer is full, producers cannot write anymore and
need to wait for the consumer to read some data.
As a consequence, calling bpf_ringbuf_reserve() from eBPF code returns NULL.

This contribution adds a new flag to make a BPF ring buffer overwritable.
Perf ring buffers already implement an option to be overwritable. In order to
avoid data corruption, the data is written backward, see
commit 9ecda41 ("perf/core: Add ::write_backward attribute to perf event").
This patch series re-uses the same idea from perf ring buffers but in BPF ring
buffers.
So, calling bpf_ringbuf_reserve() on an overwritable BPF ring buffer never
returns NULL.
As a consequence, the oldest data is overwritten by the newest, so the consumer
will lose the oldest data.

Overwritable ring buffers are useful in BPF programs that are permanently
enabled but rarely read, only on-demand, for example in case of a user request
to investigate problems. We would like to use this in the Traceloop project [1].

The self test added in this series was tested and validated in a VM:
you@vm# ./share/linux/tools/testing/selftests/bpf/test_progs -t ringbuf_over
Can't find bpf_testmod.ko kernel module: -2
WARNING! Selftests relying on bpf_testmod.ko will be skipped.
#135 ringbuf_over_writable:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

You can also test the libbpf implementation by using the last patch of this
series which should be applied to iovisor/bcc:
you@home$ cd /path/to/iovisor/bcc
you@home$ git apply 0006-for-test-purpose-only-Add-toy-to-play-with-BPF-ring-.patch
you@home$ cd /path/to/linux/tools/lib/bpf
you@home$ make -j$(nproc)
you@home$ cp libbpf.a /path/to/iovisor/bcc/libbpf-tools/.output
you@home$ cd /path/to/iovisor/bcc/libbpf-tools/
you@home$ make -j toy

Start your VM and copy the toy executable into it.

root@vm-amd64:~# ./share/toy &
[1] 287
root@vm-amd64:~# for i in {1..16}; do ls > /dev/null; done
16
15
14
13
12
11
10
9
root@vm-amd64:~# ls > /dev/null && ls > /dev/null
18
17

As you can see, the first eight events are overwritten.

If you see any way to improve this contribution, feel free to share.

Francis Laniel (5):
bpf: Make ring buffer overwritable.
selftests: Add BPF overwritable ring buffer self tests.
docs/bpf: Add documentation for overwritable ring buffer.
libbpf: Add implementation to consume overwritable BPF ring buffer.
do not merge: Temporary fix for is_power_of_2.

Documentation/bpf/ringbuf.rst | 18 +-
include/uapi/linux/bpf.h | 3 +
kernel/bpf/ringbuf.c | 57 +++++--
tools/include/uapi/linux/bpf.h | 3 +
tools/lib/bpf/libbpf.c | 2 +-
tools/lib/bpf/ringbuf.c | 106 ++++++++++++
tools/testing/selftests/bpf/Makefile | 5 +-
.../bpf/prog_tests/ringbuf_overwritable.c | 158 ++++++++++++++++++
.../bpf/progs/test_ringbuf_overwritable.c | 61 +++++++
9 files changed, 397 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf_overwritable.c
create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwritable.c

Best regards and thank you in advance.

[1] https://github.com/kinvolk/traceloop
Traceloop was presented at LPC 2020 (https://lpc.events/event/7/contributions/667/)

2.25.1

@eiffel-fl
Author

You can also test the libbpf implementation by using the last patch of this
series which should be applied to iovisor/bcc:

From ae61d5fff31bc047649ca5747c3797c98f6e3200 Mon Sep 17 00:00:00 2001
From: Francis Laniel
Date: Tue, 9 Aug 2022 18:18:53 +0200
Subject: [PATCH] for test purpose only: Add toy to play with BPF ring buffer.

Signed-off-by: Francis Laniel <francis.laniel@amarulasolutions.com>
---
 libbpf-tools/Makefile  |  1 +
 libbpf-tools/toy.bpf.c | 29 +++++++++++++++++++
 libbpf-tools/toy.c     | 65 ++++++++++++++++++++++++++++++++++++++++++
 libbpf-tools/toy.h     |  4 +++
 4 files changed, 99 insertions(+)
 create mode 100644 libbpf-tools/toy.bpf.c
 create mode 100644 libbpf-tools/toy.c
 create mode 100644 libbpf-tools/toy.h

diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
index c3bbac27..904e7712 100644
--- a/libbpf-tools/Makefile
+++ b/libbpf-tools/Makefile
@@ -62,6 +62,7 @@ APPS = \
 	tcplife \
 	tcprtt \
 	tcpsynbl \
+	toy \
 	vfsstat \
 	#
 
diff --git a/libbpf-tools/toy.bpf.c b/libbpf-tools/toy.bpf.c
new file mode 100644
index 00000000..3c28a20b
--- /dev/null
+++ b/libbpf-tools/toy.bpf.c
@@ -0,0 +1,29 @@
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/bpf.h>
+#include "toy.h"
+
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 4096);
+	__uint(map_flags, 1U << 13);
+} buffer SEC(".maps");
+
+static __u32 count = 0;
+
+SEC("tracepoint/syscalls/sys_enter_execve")
+int sys_enter_execve(void) {
+	count++;
+	struct event *event = bpf_ringbuf_reserve(&buffer, sizeof(struct event), 0);
+	if (!event) {
+		return 1;
+	}
+
+	event->count = count;
+	bpf_ringbuf_submit(event, 0);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/libbpf-tools/toy.c b/libbpf-tools/toy.c
new file mode 100644
index 00000000..4cd8b588
--- /dev/null
+++ b/libbpf-tools/toy.c
@@ -0,0 +1,65 @@
+#include <bpf/libbpf.h>
+#include <stdio.h>
+#include <unistd.h>
+#include "toy.h"
+#include "toy.skel.h"
+#include "btf_helpers.h"
+
+
+static int buf_process_sample(void *ctx, void *data, size_t len) {
+	struct event *evt = (struct event *)data;
+
+	printf("%d\n", evt->count);
+
+	return 0;
+}
+
+int main(void) {
+	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
+	int buffer_map_fd = -1;
+	struct toy_bpf *obj;
+	int err;
+
+	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+	err = ensure_core_btf(&open_opts);
+	if (err) {
+		fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
+		return 1;
+	}
+
+	obj = toy_bpf__open_opts(&open_opts);
+	if (!obj) {
+		fprintf(stderr, "failed to open BPF object\n");
+		return 1;
+	}
+
+	err = toy_bpf__load(obj);
+	if (err) {
+		fprintf(stderr, "failed to load BPF object: %d\n", err);
+		return 1;
+	}
+
+	struct ring_buffer *ring_buffer;
+
+	buffer_map_fd = bpf_object__find_map_fd_by_name(obj->obj, "buffer");
+	ring_buffer = ring_buffer__new(buffer_map_fd, buf_process_sample, NULL, NULL);
+
+	if(!ring_buffer) {
+		fprintf(stderr, "failed to create ring buffer\n");
+		return 1;
+	}
+
+	err = toy_bpf__attach(obj);
+	if (err) {
+		fprintf(stderr, "failed to attach BPF programs\n");
+		return 1;
+	}
+
+	for (;;) {
+		ring_buffer__consume(ring_buffer);
+		sleep(1);
+	}
+
+	return 0;
+}
diff --git a/libbpf-tools/toy.h b/libbpf-tools/toy.h
new file mode 100644
index 00000000..ebfedf06
--- /dev/null
+++ b/libbpf-tools/toy.h
@@ -0,0 +1,4 @@
+struct event {
+	__u32 count;
+	char filler[4096 / 8 - sizeof(__u32) - 8];
+};
-- 
2.25.1
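
The size of struct event in toy.h explains why exactly eight events survive in the test run. The sketch below redoes that arithmetic. It assumes the overwritable mode keeps the usual BPF ring buffer record layout, with an 8-byte header per record (BPF_RINGBUF_HDR_SZ in the UAPI header) and records rounded up to 8 bytes; the names toy_event, toy_record_size() and toy_events_that_fit() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RINGBUF_SIZE  4096u /* max_entries of the "buffer" map in toy.bpf.c */
#define TOY_RECORD_HDR_SZ 8u    /* assumed per-record header size (BPF_RINGBUF_HDR_SZ) */

/* Same layout as struct event in toy.h, with uint32_t standing in for __u32. */
struct toy_event {
	uint32_t count;
	char filler[4096 / 8 - sizeof(uint32_t) - 8];
};

/* Bytes one event occupies in the ring: header plus payload, rounded up to 8. */
static size_t toy_record_size(void)
{
	size_t raw = TOY_RECORD_HDR_SZ + sizeof(struct toy_event);

	return (raw + 7) & ~(size_t)7;
}

/* Number of whole records that fit in the data area of the ring. */
static size_t toy_events_that_fit(void)
{
	size_t rec = toy_record_size();

	assert(rec > 0);
	return TOY_RINGBUF_SIZE / rec;
}
```

Under these assumptions each record takes 512 bytes, so the 4096-byte buffer holds eight records, which matches the toy output where only events 9 to 16 remain after 16 execve calls.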

Review thread on Documentation/bpf/ringbuf.rst (outdated, resolved)
Review thread on Documentation/bpf/ringbuf.rst (outdated, resolved)
Review thread on kernel/bpf/ringbuf.c (outdated, resolved)
@eiffel-fl eiffel-fl force-pushed the bpf-ring-over-writable branch 2 times, most recently from e8b50e4 to 1c0a50f Compare August 30, 2022 09:17
@eiffel-fl
Author

From 1c0a50f Mon Sep 17 00:00:00 2001
From: Francis Laniel
Date: Tue, 30 Aug 2022 11:17:47 +0200
Subject: [RFC PATCH v2 0/5] Make BPF ring buffer overwritable

Hi.

First, I hope you are fine and the same for your relatives.

Normally, when a BPF ring buffer is full, producers cannot write anymore and
need to wait for the consumer to read some data.
As a consequence, calling bpf_ringbuf_reserve() from eBPF code returns NULL.

This contribution adds a new flag to make a BPF ring buffer overwritable.
Perf ring buffers already implement an option to be overwritable. In order to
avoid data corruption, the data is written backward, see
commit 9ecda41 ("perf/core: Add ::write_backward attribute to perf event").
This patch series re-uses the same idea from perf ring buffers but in BPF ring
buffers.
So, calling bpf_ringbuf_reserve() on an overwritable BPF ring buffer never
returns NULL.
As a consequence, the oldest data is overwritten by the newest, so the consumer
will lose the oldest data.

Overwritable ring buffers are useful in BPF programs that are permanently
enabled but rarely read, only on-demand, for example in case of a user request
to investigate problems. We would like to use this in the Traceloop project [1].

The self test added in this series was tested and validated in a VM:
you@vm# ./share/linux/tools/testing/selftests/bpf/test_progs -t ringbuf_over
Can't find bpf_testmod.ko kernel module: -2
WARNING! Selftests relying on bpf_testmod.ko will be skipped.
#135 ringbuf_over_writable:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

You can also test the libbpf implementation by using the last patch of this
series which should be applied to iovisor/bcc:
you@home$ cd /path/to/iovisor/bcc
you@home$ git apply 0006-for-test-purpose-only-Add-toy-to-play-with-BPF-ring-.patch
you@home$ cd /path/to/linux/tools/lib/bpf
you@home$ make -j$(nproc)
you@home$ cp libbpf.a /path/to/iovisor/bcc/libbpf-tools/.output
you@home$ cd /path/to/iovisor/bcc/libbpf-tools/
you@home$ make -j toy

Start your VM and copy the toy executable into it.

root@vm-amd64:~# ./share/toy &
[1] 287
root@vm-amd64:~# for i in {1..16}; do ls > /dev/null; done
16
15
14
13
12
11
10
9
root@vm-amd64:~# ls > /dev/null && ls > /dev/null
18
17

As you can see, the first eight events are overwritten.

If you see any way to improve this contribution, feel free to share.

Changes since:
v1:

  • Made producers write backward, like the perf ring buffer, to avoid
    memory corruption.
  • Added libbpf implementation to consume all events available.
  • Added selftest.
  • Added documentation.

Francis Laniel (5):
bpf: Make ring buffer overwritable.
selftests: Add BPF overwritable ring buffer self tests.
docs/bpf: Add documentation for overwritable ring buffer.
libbpf: Add implementation to consume overwritable BPF ring buffer.
do not merge: Temporary fix for is_power_of_2.

Documentation/bpf/ringbuf.rst | 18 +-
include/uapi/linux/bpf.h | 3 +
kernel/bpf/ringbuf.c | 43 +++--
tools/include/uapi/linux/bpf.h | 3 +
tools/lib/bpf/libbpf.c | 2 +-
tools/lib/bpf/ringbuf.c | 106 ++++++++++++
tools/testing/selftests/bpf/Makefile | 5 +-
.../bpf/prog_tests/ringbuf_overwritable.c | 158 ++++++++++++++++++
.../bpf/progs/test_ringbuf_overwritable.c | 61 +++++++
9 files changed, 385 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf_overwritable.c
create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwritable.c

Best regards and thank you in advance.

[1] https://github.com/kinvolk/traceloop
Traceloop was presented at LPC 2020 (https://lpc.events/event/7/contributions/667/)

2.25.1

By default, BPF ring buffers are size bounded: once producers have filled the
buffer, they need to wait for the consumer to read that data before adding new
data.
In terms of API, bpf_ringbuf_reserve() returns NULL if the buffer is full.

This patch permits making a BPF ring buffer overwritable.
Once producers have written as much data as the buffer size, they begin to
overwrite existing data, so the oldest data is replaced.
As a result, bpf_ringbuf_reserve() never returns NULL.

To avoid memory corruption, this patch writes data backward like the overwritable
perf ring buffer added in
commit 9ecda41 ("perf/core: Add ::write_backward attribute to perf event").

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>

Add tests to confirm the behavior of the overwritable BPF ring buffer,
particularly that the oldest data is overwritten by the newest.

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>

Add documentation describing the behavior of the overwritable BPF ring buffer
compared to conventional ones.

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>

If the BPF ring buffer is overwritable, ringbuf_process_overwritable_ring() is
called to handle data consumption.
All available data is consumed, subject to the following checks:
* we do not read data we already read; if there is no new data, nothing
  happens.
* we do not read more than the buffer size.
* we do not read invalid data: each record must fit within the buffer size.

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
@eiffel-fl eiffel-fl force-pushed the bpf-ring-over-writable branch from 1c0a50f to 93911af Compare September 6, 2022 12:22
alban pushed a commit that referenced this pull request Nov 8, 2022
When doing slub_debug test, kfence's 'test_memcache_typesafe_by_rcu'
kunit test case cause a use-after-free error:

  BUG: KASAN: use-after-free in kobject_del+0x14/0x30
  Read of size 8 at addr ffff888007679090 by task kunit_try_catch/261

  CPU: 1 PID: 261 Comm: kunit_try_catch Tainted: G    B            N 6.0.0-rc5-next-20220916 #17
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x48
   print_address_description.constprop.0+0x87/0x2a5
   print_report+0x103/0x1ed
   kasan_report+0xb7/0x140
   kobject_del+0x14/0x30
   kmem_cache_destroy+0x130/0x170
   test_exit+0x1a/0x30
   kunit_try_run_case+0xad/0xc0
   kunit_generic_run_threadfn_adapter+0x26/0x50
   kthread+0x17b/0x1b0
   </TASK>

The cause is inside kmem_cache_destroy():

kmem_cache_destroy
    acquire lock/mutex
    shutdown_cache
        schedule_work(kmem_cache_release) (if RCU flag set)
    release lock/mutex
    kmem_cache_release (if RCU flag not set)

In some certain timing, the scheduled work could be run before
the next RCU flag checking, which can then get a wrong value
and lead to double kmem_cache_release().

Fix it by caching the RCU flag inside protected area, just like 'refcnt'

Fixes: 0495e33 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock")
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>