weighttp-test on 1KB file with mptcp.org-client in "ndiffports"-mode #33

cpaasch · 2020-06-01T17:09:27Z

Running the test "simple_abndiff" with mptcp.org-client and netnext server [1], I see netnext "stalling".

The test https://github.com/multipath-tcp/mptcp-scripts/blob/master/testing/testing.py#L1725, creates 100 concurrent clients for 100000 requests of 1KB files. mptcp.org is configured with "ndiffports" 8, which means it will create 8 subflows to the primary IP-address without waiting for ADD_ADDR.

netnext reports at the beginning of the test SYN-flooding:
server login: [ 165.274087] TCP: request_sock_subflow: Possible SYN flooding on port 80. Dropping request. Check SNMP counters.

weighttp is reporting connect-timeouts on the mptcp.org-client:
error: connect() failed on port 49932: Connection timed out (110).

Logging in to the netnext-server, I see all apache2 processes spinning at 90% CPU.
Also, I see sshd at 90% :-/

The text was updated successfully, but these errors were encountered:

cpaasch · 2020-06-15T16:13:12Z

Still reproduces on top of:

350b5b59fe9e ("mptcp: Enable MPTCP when IPPROTO_MPTCP is set")  (HEAD -> netnext_mptcp_all) (10 minutes ago) <Christoph Paasch>
a36694f83b38 ("[DO-NOT-MERGE] mptcp: enabled by default")  (tag: export/20200615T152912, mptcp_net-next/export) (44 minutes ago) <Matthieu Baerts>
cef5d2ef00d6 ("[DO-NOT-MERGE] mptcp: use kmalloc on kasan build")  (43 minutes ago) <Paolo Abeni>
b320877ca134 ("mptcp: close poll() races")  (44 minutes ago) <Paolo Abeni>
f9b4521e6eab ("mptcp: add receive buffer auto-tuning")  (44 minutes ago) <Florian Westphal>
2039c773569c ("selftests: mptcp: add option to specify size of file to transfer")  (44 minutes ago) <Florian Westphal>
957e36dad053 ("mptcp: fallback in case of simultaneous connect")  (44 minutes ago) <Davide Caratti>
6606a88aa255 ("net: mptcp: improve fallback to TCP")  (44 minutes ago) <Davide Caratti>
a542d3c6a014 ("mptcp: introduce token KUNIT self-tests")  (44 minutes ago) <Paolo Abeni>
6df1b61785b8 ("mptcp: move crypto test to KUNIT")  (44 minutes ago) <Paolo Abeni>
f5fab672501d ("mptcp: do nonce initialization at subflow creation time")  (44 minutes ago) <Paolo Abeni>
bd361b0588c7 ("mptcp: refactor token container")  (44 minutes ago) <Paolo Abeni>
42618ca6c437 ("mptcp: add __init annotation on setup functions")  (44 minutes ago) <Paolo Abeni>
4bc2c4fceb24 ("mptcp: drop MP_JOIN request sock on syn cookies")  (44 minutes ago) <Paolo Abeni>
107030f6b56d ("mptcp: cache msk on MP_JOIN init_req")  (44 minutes ago) <Paolo Abeni>
96144c58abe7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")  (mptcp_net-next/net-next) (2 days ago) <Linus Torvalds>

pabeni · 2020-06-18T16:43:19Z

could you please collect 'ss -enit' output on both the client and the server? plus:

perf top -a/ perf record -ag ; perf report --stdio

on the server if, nginx is still spinning a lot

cpaasch · 2020-06-22T18:32:15Z

Here is the ss/perf output:

perf_output.txt
ss_client.txt
ss_server.txt

pabeni · 2020-06-26T15:32:57Z

On client side there are:
304 ESTABLISHED
415 SYN-SENT
47 LAST-ACK
11 CLOSING

while on the server side:
50 FIN-WAIT-1
4 ESTABLISHED

There are a lot of sockets on the client side without any suitable counter-part on the server side. A possible explanation is as follow:

due to syncookie we enter the subflow_syn_recv_sock() error path for (many) MP_JOIN requests and tcp_send_active_reset() fails (e.g. due to memory allocation failure and/or device xmit queue full). The child socket is closed and the request is dropped, but the reset do not reach the client. The client keeps the relevant socket open, and eventually re-transmit, but such re-transmissions are ignored from the server side.

If the above held, the following change may help:

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 0f0fa1ba57a8..9aaf3d66283f 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -514,7 +514,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 dispose_child:
        subflow_drop_ctx(child);
        tcp_rsk(req)->drop_req = true;
-       tcp_send_active_reset(child, GFP_ATOMIC);
+       req->rsk_ops->send_reset(sk, skb);
        inet_csk_prepare_for_destroy_sock(child);
        tcp_done(child);

as req->rsk_ops->send_reset() should be more robust to this kind of issues.

cpaasch · 2020-06-26T21:26:29Z

The patch does help quite a bit! It makes weighttp finish within a reasonable timeframe. It is definitely an improvement.

Having digged more into this, the problem in general seems to be that plenty of subflows are being created all the way down to mptcp_finish_join (which happens after the sock-creation), just for then being deleted again. That puts a huge CPU-burden on the host. IMO we should try to move the checks for subflow-creation (msk-state, max_subflows,...) earlier in the code-path in subflow_syn_recv_sock and ideally into subflow_v4_conn_request. That would allow us to avoid spending the CPU-cycles on msk's that are already on their way of being closed.

pabeni · 2020-06-29T13:37:42Z

The patch does help quite a bit! It makes weighttp finish within a reasonable timeframe. It is definitely an improvement.

That sounds good...

Having digged more into this, the problem in general seems to be that plenty of subflows are being created all the way down to mptcp_finish_join (which happens after the sock-creation), just for then being deleted again. That puts a huge CPU-burden on the host. IMO we should try to move the checks for subflow-creation (msk-state, max_subflows,...) earlier in the code-path in subflow_syn_recv_sock and ideally into subflow_v4_conn_request. That would allow us to avoid spending the CPU-cycles on msk's that are already on their way of being closed.

... but there is a "but" ;)

Moving some checks at connection request creation time should be quite easy - using the req->route_req() op.
I'm unsure how much that will help in this specific scenario: tcp_conn_request() happens ~1RTT before tcp_check_req(), the msk could be still established.

Moving the syn_recv_sock() earlier is less trivial and possibly requires some change to the current hooking inside the TCP code. It looks like a complete solution to this issue will require more complex changes.

cpaasch · 2020-07-03T00:34:58Z

starting benchmark...
spawning thread #1: 25 concurrent requests, 25000 total requests
spawning thread #2: 25 concurrent requests, 25000 total requests
spawning thread #3: 25 concurrent requests, 25000 total requests
spawning thread #4: 25 concurrent requests, 25000 total requests
progress:  10% done
progress:  20% done
progress:  30% done
progress:  40% done
progress:  50% done
progress:  60% done
progress:  70% done
progress:  80% done
progress:  90% done
progress: 100% done

finished in 6 sec, 44 millisec and 628 microsec, 16543 req/s, 7318 kbyte/s

And with KASAN enabled:

starting benchmark...
spawning thread #1: 25 concurrent requests, 25000 total requests
spawning thread #2: 25 concurrent requests, 25000 total requests
spawning thread #3: 25 concurrent requests, 25000 total requests
spawning thread #4: 25 concurrent requests, 25000 total requests
progress:  10% done
progress:  20% done
progress:  30% done
progress:  40% done
progress:  50% done
progress:  60% done
progress:  70% done
progress:  80% done
progress:  90% done
progress: 100% done

finished in 18 sec, 337 millisec and 220 microsec, 5453 req/s, 2412 kbyte/s

Looking really good! :)

The current number of KVM_IRQCHIP_NUM_PINS results in an order 3 allocation (32kb) for each guest start/restart. This can result in OOM killer activity even with free swap when the memory is fragmented enough: kernel: qemu-system-s39 invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=3, oom_score_adj=0 kernel: CPU: 1 PID: 357274 Comm: qemu-system-s39 Kdump: loaded Not tainted 5.4.0-29-generic #33-Ubuntu kernel: Hardware name: IBM 8562 T02 Z06 (LPAR) kernel: Call Trace: kernel: ([<00000001f848fe2a>] show_stack+0x7a/0xc0) kernel: [<00000001f8d3437a>] dump_stack+0x8a/0xc0 kernel: [<00000001f8687032>] dump_header+0x62/0x258 kernel: [<00000001f8686122>] oom_kill_process+0x172/0x180 kernel: [<00000001f8686abe>] out_of_memory+0xee/0x580 kernel: [<00000001f86e66b8>] __alloc_pages_slowpath+0xd18/0xe90 kernel: [<00000001f86e6ad4>] __alloc_pages_nodemask+0x2a4/0x320 kernel: [<00000001f86b1ab4>] kmalloc_order+0x34/0xb0 kernel: [<00000001f86b1b62>] kmalloc_order_trace+0x32/0xe0 kernel: [<00000001f84bb806>] kvm_set_irq_routing+0xa6/0x2e0 kernel: [<00000001f84c99a4>] kvm_arch_vm_ioctl+0x544/0x9e0 kernel: [<00000001f84b8936>] kvm_vm_ioctl+0x396/0x760 kernel: [<00000001f875df66>] do_vfs_ioctl+0x376/0x690 kernel: [<00000001f875e304>] ksys_ioctl+0x84/0xb0 kernel: [<00000001f875e39a>] __s390x_sys_ioctl+0x2a/0x40 kernel: [<00000001f8d55424>] system_call+0xd8/0x2c8 As far as I can tell s390x does not use the iopins as we bail our for anything other than KVM_IRQ_ROUTING_S390_ADAPTER and the chip/pin is only used for KVM_IRQ_ROUTING_IRQCHIP. So let us use a small number to reduce the memory footprint. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200617083620.5409-1-borntraeger@de.ibm.com

When multiple adapters are present in the system, pci hot-removing second adapter leads to the following warning as both the adapters registered thermal zone device with same thermal zone name/type. Therefore, use unique thermal zone name during thermal zone device initialization. Also mark thermal zone dev NULL once unregistered. [ 414.370143] ------------[ cut here ]------------ [ 414.370944] sysfs group 'power' not found for kobject 'hwmon0' [ 414.371747] WARNING: CPU: 9 PID: 2661 at fs/sysfs/group.c:281 sysfs_remove_group+0x76/0x80 [ 414.382550] CPU: 9 PID: 2661 Comm: bash Not tainted 5.8.0-rc6+ #33 [ 414.383593] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0a 06/23/2016 [ 414.384669] RIP: 0010:sysfs_remove_group+0x76/0x80 [ 414.385738] Code: 48 89 df 5b 5d 41 5c e9 d8 b5 ff ff 48 89 df e8 60 b0 ff ff eb cb 49 8b 14 24 48 8b 75 00 48 c7 c7 90 ae 13 bb e8 6a 27 d0 ff <0f> 0b 5b 5d 41 5c c3 0f 1f 00 0f 1f 44 00 00 48 85 f6 74 31 41 54 [ 414.388404] RSP: 0018:ffffa22bc080fcb0 EFLAGS: 00010286 [ 414.389638] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 414.390829] RDX: 0000000000000001 RSI: ffff8ee2de3e9510 RDI: ffff8ee2de3e9510 [ 414.392064] RBP: ffffffffbaef2ee0 R08: 0000000000000000 R09: 0000000000000000 [ 414.393224] R10: 0000000000000000 R11: 000000002b30006c R12: ffff8ee260720008 [ 414.394388] R13: ffff8ee25e0a40e8 R14: ffffa22bc080ff08 R15: ffff8ee2c3be5020 [ 414.395661] FS: 00007fd2a7171740(0000) GS:ffff8ee2de200000(0000) knlGS:0000000000000000 [ 414.396825] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 414.398011] CR2: 00007f178ffe5020 CR3: 000000084c5cc003 CR4: 00000000003606e0 [ 414.399172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 414.400352] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 414.401473] Call Trace: [ 414.402685] device_del+0x89/0x400 [ 414.403819] device_unregister+0x16/0x60 [ 414.405024] hwmon_device_unregister+0x44/0xa0 [ 414.406112] thermal_remove_hwmon_sysfs+0x196/0x200 [ 414.407256] thermal_zone_device_unregister+0x1b5/0x1f0 [ 414.408415] cxgb4_thermal_remove+0x3c/0x4f [cxgb4] [ 414.409668] remove_one+0x212/0x290 [cxgb4] [ 414.410875] pci_device_remove+0x36/0xb0 [ 414.412004] device_release_driver_internal+0xe2/0x1c0 [ 414.413276] pci_stop_bus_device+0x64/0x90 [ 414.414433] pci_stop_and_remove_bus_device_locked+0x16/0x30 [ 414.415609] remove_store+0x75/0x90 [ 414.416790] kernfs_fop_write+0x114/0x1b0 [ 414.417930] vfs_write+0xcf/0x210 [ 414.419059] ksys_write+0xa7/0xe0 [ 414.420120] do_syscall_64+0x4c/0xa0 [ 414.421278] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 414.422335] RIP: 0033:0x7fd2a686afd0 [ 414.423396] Code: Bad RIP value. [ 414.424549] RSP: 002b:00007fffc1446148 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 414.425638] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd2a686afd0 [ 414.426830] RDX: 0000000000000002 RSI: 00007fd2a7196000 RDI: 0000000000000001 [ 414.427927] RBP: 00007fd2a7196000 R08: 000000000000000a R09: 00007fd2a7171740 [ 414.428923] R10: 00007fd2a7171740 R11: 0000000000000246 R12: 00007fd2a6b43400 [ 414.430082] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000 [ 414.431027] irq event stamp: 76300 [ 414.435678] ---[ end trace 13865acb4d5ab00f ]--- Fixes: b187191 ("cxgb4: Add thermal zone support") Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>

This mostly reverts commit 99bca61 ("MIPS: pci-legacy: use generic pci_enable_resources"). Fixes regressions such as: ata_piix 0000:00:0a.1: can't enable device: BAR 0 [io 0x01f0-0x01f7] not claimed ata_piix: probe of 0000:00:0a.1 failed with error -22 The only changes from the strict revert are to fix checkpatch errors: ERROR: spaces required around that '=' (ctx:VxV) #33: FILE: arch/mips/pci/pci-legacy.c:252: + for (idx=0; idx < PCI_NUM_RESOURCES; idx++) { ^ ERROR: do not use assignment in if condition #67: FILE: arch/mips/pci/pci-legacy.c:284: + if ((err = pcibios_enable_resources(dev, mask)) < 0) Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

Do not schedule hw full reset if the device is not fully initialized (e.g if the channel has not been configured yet). This patch fixes the kernel crash reported below [ 44.440266] mt7921e 0000:01:00.0: chip reset failed [ 44.527575] Unable to handle kernel paging request at virtual address ffffffc02f3e0000 [ 44.535771] Mem abort info: [ 44.538646] ESR = 0x96000006 [ 44.541792] EC = 0x25: DABT (current EL), IL = 32 bits [ 44.547268] SET = 0, FnV = 0 [ 44.550413] EA = 0, S1PTW = 0 [ 44.553648] Data abort info: [ 44.556613] ISV = 0, ISS = 0x00000006 [ 44.560563] CM = 0, WnR = 0 [ 44.563619] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000000955000 [ 44.570530] [ffffffc02f3e0000] pgd=100000003ffff003, p4d=100000003ffff003, pud=100000003ffff003, pmd=0000000000000000 [ 44.581489] Internal error: Oops: 96000006 [#1] SMP [ 44.606406] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 5.13.0-rc1-espressobin-12875-g6dc7f82ebc26 #33 [ 44.617264] Hardware name: Globalscale Marvell ESPRESSOBin Board (DT) [ 44.623905] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--) [ 44.630100] pc : __queue_work+0x1f0/0x500 [ 44.634249] lr : __queue_work+0x1e8/0x500 [ 44.638384] sp : ffffffc010003d70 [ 44.641798] x29: ffffffc010003d70 x28: 0000000000000000 x27: ffffff8003989200 [ 44.649166] x26: ffffffc010c08510 x25: 0000000000000002 x24: ffffffc010ad90b0 [ 44.656533] x23: ffffffc010c08508 x22: 0000000000000012 x21: 0000000000000000 [ 44.663899] x20: ffffff8006385238 x19: ffffffc02f3e0000 x18: 00000000000003c9 [ 44.671266] x17: 0000000000000000 x16: 0000000000000000 x15: 000009b1a8a3bf90 [ 44.678632] x14: 0098968000000000 x13: 0000000000000000 x12: 0000000000000325 [ 44.685998] x11: ffffff803fda1928 x10: 0000000000000001 x9 : ffffffc010003e98 [ 44.693365] x8 : 0000000000000032 x7 : fff8000000000000 x6 : 0000000000000035 [ 44.700732] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc010adf700 [ 44.708098] x2 : ffffff8006385238 x1 : 000000007fffffff x0 : 0000000000000000 [ 44.715465] Call trace: [ 44.717982] __queue_work+0x1f0/0x500 [ 44.721760] delayed_work_timer_fn+0x18/0x20 [ 44.726167] call_timer_fn+0x2c/0x178 [ 44.729947] run_timer_softirq+0x488/0x5c8 [ 44.734172] _stext+0x11c/0x378 [ 44.737411] irq_exit+0x100/0x108 [ 44.740830] __handle_domain_irq+0x60/0xb0 [ 44.745059] gic_handle_irq+0x70/0x2b4 [ 44.748929] el1_irq+0xb8/0x13c [ 44.752167] arch_cpu_idle+0x14/0x30 [ 44.755858] default_idle_call+0x38/0x168 [ 44.759994] do_idle+0x1fc/0x210 [ 44.763325] cpu_startup_entry+0x20/0x58 [ 44.767372] rest_init+0xb8/0xc8 [ 44.770703] arch_call_rest_init+0xc/0x14 [ 44.774841] start_kernel+0x408/0x424 [ 44.778623] Code: aa1403e0 97fff54f aa0003f5 b5fff500 (f9400275) [ 44.784907] ---[ end trace be73c3142d8c36a9 ]--- [ 44.789668] Kernel panic - not syncing: Oops: Fatal exception in interrupt Fixes: 0c1ce98 ("mt76: mt7921: add wifi reset support") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Felix Fietkau <nbd@nbd.name>

The kmemleak_*_phys() apis do not check the address for lowmem's min boundary, while the caller may pass an address below lowmem, which will trigger an oops: # echo scan > /sys/kernel/debug/kmemleak Unable to handle kernel paging request at virtual address ff5fffffffe00000 Oops [#1] Modules linked in: CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33 Hardware name: riscv-virtio,qemu (DT) epc : scan_block+0x74/0x15c ra : scan_block+0x72/0x15c epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30 gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200 t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90 s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000 a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001 a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005 s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90 s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0 s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000 s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f t5 : 0000000000000001 t6 : 0000000000000000 status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d scan_gray_list+0x12e/0x1a6 kmemleak_scan+0x2aa/0x57e kmemleak_write+0x32a/0x40c full_proxy_write+0x56/0x82 vfs_write+0xa6/0x2a6 ksys_write+0x6c/0xe2 sys_write+0x22/0x2a ret_from_syscall+0x0/0x2 The callers may not quite know the actual address they pass(e.g. from devicetree). So the kmemleak_*_phys() apis should guarantee the address they finally use is in lowmem range, so check the address for lowmem's min boundary. Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

This patch fixes a build error reported in the link. [0] unix_connect.c: In function ‘unix_connect_test’: unix_connect.c:115:55: error: expected identifier before ‘(’ token #define offsetof(type, member) ((size_t)&((type *)0)->(member)) ^ unix_connect.c:128:12: note: in expansion of macro ‘offsetof’ addrlen = offsetof(struct sockaddr_un, sun_path) + variant->len; ^~~~~~~~ We can fix this by removing () around member, but checkpatch will complain about it, and the root cause of the build failure is that I followed the warning and fixed this in the v2 -> v3 change of the blamed commit. [1] CHECK: Macro argument 'member' may be better as '(member)' to avoid precedence issues #33: FILE: tools/testing/selftests/net/af_unix/unix_connect.c:115: +#define offsetof(type, member) ((size_t)&((type *)0)->member) To avoid this warning, let's use offsetof() defined in stddef.h instead. [0]: https://lore.kernel.org/linux-mm/202207182205.FrkMeDZT-lkp@intel.com/ [1]: https://lore.kernel.org/netdev/20220702154818.66761-1-kuniyu@amazon.com/ Fixes: e95ab1d ("selftests: net: af_unix: Test connect() with different netns.") Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20220720005750.16600-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

By keep sending L2CAP_CONF_REQ packets, chan->num_conf_rsp increases multiple times and eventually it will wrap around the maximum number (i.e., 255). This patch prevents this by adding a boundary check with L2CAP_MAX_CONF_RSP Btmon log: Bluetooth monitor ver 5.64 = Note: Linux version 6.1.0-rc2 (x86_64) 0.264594 = Note: Bluetooth subsystem version 2.22 0.264636 @ MGMT Open: btmon (privileged) version 1.22 {0x0001} 0.272191 = New Index: 00:00:00:00:00:00 (Primary,Virtual,hci0) [hci0] 13.877604 @ RAW Open: 9496 (privileged) version 2.22 {0x0002} 13.890741 = Open Index: 00:00:00:00:00:00 [hci0] 13.900426 (...) > ACL Data RX: Handle 200 flags 0x00 dlen 1033 #32 [hci0] 14.273106 invalid packet size (12 != 1033) 08 00 01 00 02 01 04 00 01 10 ff ff ............ > ACL Data RX: Handle 200 flags 0x00 dlen 1547 #33 [hci0] 14.273561 invalid packet size (14 != 1547) 0a 00 01 00 04 01 06 00 40 00 00 00 00 00 ........@..... > ACL Data RX: Handle 200 flags 0x00 dlen 2061 #34 [hci0] 14.274390 invalid packet size (16 != 2061) 0c 00 01 00 04 01 08 00 40 00 00 00 00 00 00 04 ........@....... > ACL Data RX: Handle 200 flags 0x00 dlen 2061 #35 [hci0] 14.274932 invalid packet size (16 != 2061) 0c 00 01 00 04 01 08 00 40 00 00 00 07 00 03 00 ........@....... = bluetoothd: Bluetooth daemon 5.43 14.401828 > ACL Data RX: Handle 200 flags 0x00 dlen 1033 #36 [hci0] 14.275753 invalid packet size (12 != 1033) 08 00 01 00 04 01 04 00 40 00 00 00 ........@... Signed-off-by: Sungwoo Kim <iam@sung-woo.kim> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

When employed within a sleepable program not under RCU protection, the use of 'bpf_task_under_cgroup()' may trigger a warning in the kernel log, particularly when CONFIG_PROVE_RCU is enabled: [ 1259.662357] WARNING: suspicious RCU usage [ 1259.662358] 6.5.0+ #33 Not tainted [ 1259.662360] ----------------------------- [ 1259.662361] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage! Other info that might help to debug this: [ 1259.662366] rcu_scheduler_active = 2, debug_locks = 1 [ 1259.662368] 1 lock held by trace/72954: [ 1259.662369] #0: ffffffffb5e3eda0 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xb0 Stack backtrace: [ 1259.662385] CPU: 50 PID: 72954 Comm: trace Kdump: loaded Not tainted 6.5.0+ #33 [ 1259.662391] Call Trace: [ 1259.662393] <TASK> [ 1259.662395] dump_stack_lvl+0x6e/0x90 [ 1259.662401] dump_stack+0x10/0x20 [ 1259.662404] lockdep_rcu_suspicious+0x163/0x1b0 [ 1259.662412] task_css_set.part.0+0x23/0x30 [ 1259.662417] bpf_task_under_cgroup+0xe7/0xf0 [ 1259.662422] bpf_prog_7fffba481a3bcf88_lsm_run+0x5c/0x93 [ 1259.662431] bpf_trampoline_6442505574+0x60/0x1000 [ 1259.662439] bpf_lsm_bpf+0x5/0x20 [ 1259.662443] ? security_bpf+0x32/0x50 [ 1259.662452] __sys_bpf+0xe6/0xdd0 [ 1259.662463] __x64_sys_bpf+0x1a/0x30 [ 1259.662467] do_syscall_64+0x38/0x90 [ 1259.662472] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1259.662479] RIP: 0033:0x7f487baf8e29 [...] [ 1259.662504] </TASK> This issue can be reproduced by executing a straightforward program, as demonstrated below: SEC("lsm.s/bpf") int BPF_PROG(lsm_run, int cmd, union bpf_attr *attr, unsigned int size) { struct cgroup *cgrp = NULL; struct task_struct *task; int ret = 0; if (cmd != BPF_LINK_CREATE) return 0; // The cgroup2 should be mounted first cgrp = bpf_cgroup_from_id(1); if (!cgrp) goto out; task = bpf_get_current_task_btf(); if (bpf_task_under_cgroup(task, cgrp)) ret = -1; bpf_cgroup_release(cgrp); out: return ret; } After running the program, if you subsequently execute another BPF program, you will encounter the warning. It's worth noting that task_under_cgroup_hierarchy() is also utilized by bpf_current_task_under_cgroup(). However, bpf_current_task_under_cgroup() doesn't exhibit this issue because it cannot be used in sleepable BPF programs. Fixes: b5ad4cd ("bpf: Add bpf_task_under_cgroup() kfunc") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Cc: Feng Zhou <zhoufeng.zf@bytedance.com> Cc: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/20231007135945.4306-1-laoar.shao@gmail.com

Recently, I frequently hit the following test failure: [root@arch-fb-vm1 bpf]# ./test_progs -n 33/1 test_lookup_update:PASS:skel_open 0 nsec [...] test_lookup_update:PASS:sync_rcu 0 nsec test_lookup_update:FAIL:map1_leak inner_map1 leaked! #33/1 btf_map_in_map/lookup_update:FAIL #33 btf_map_in_map:FAIL In the test, after map is closed and then after two rcu grace periods, it is assumed that map_id is not available to user space. But the above assumption cannot be guaranteed. After zero or one or two rcu grace periods in different siturations, the actual freeing-map-work is put into a workqueue. Later on, when the work is dequeued, the map will be actually freed. See bpf_map_put() in kernel/bpf/syscall.c. By using workqueue, there is no ganrantee that map will be actually freed after a couple of rcu grace periods. This patch removed such map leak detection and then the test can pass consistently. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240322061353.632136-1-yonghong.song@linux.dev

This will add eBPF JIT support to the 32-bit ARCv2 processors. The implementation is qualified by running the BPF tests on a Synopsys HSDK board with "ARC HS38 v2.1c at 500 MHz" as the 4-core CPU. The test_bpf.ko reports 2-10 fold improvements in execution time of its tests. For instance: test_bpf: #33 tcpdump port 22 jited:0 704 1766 2104 PASS test_bpf: #33 tcpdump port 22 jited:1 120 224 260 PASS test_bpf: #141 ALU_DIV_X: 4294967295 / 4294967295 = 1 jited:0 238 PASS test_bpf: #141 ALU_DIV_X: 4294967295 / 4294967295 = 1 jited:1 23 PASS test_bpf: #776 JMP32_JGE_K: all ... magnitudes jited:0 2034681 PASS test_bpf: #776 JMP32_JGE_K: all ... magnitudes jited:1 1020022 PASS Deployment and structure ------------------------ The related codes are added to "arch/arc/net": - bpf_jit.h -- The interface that a back-end translator must provide - bpf_jit_core.c -- Knows how to handle the input eBPF byte stream - bpf_jit_arcv2.c -- The back-end code that knows the translation logic The bpf_int_jit_compile() at the end of bpf_jit_core.c is the entrance to the whole process. Normally, the translation is done in one pass, namely the "normal pass". In case some relocations are not known during this pass, some data (arc_jit_data) is allocated for the next pass to come. This possible next (and last) pass is called the "extra pass". 1. Normal pass # The necessary pass 1a. Dry run # Get the whole JIT length, epilogue offset, etc. 1b. Emit phase # Allocate memory and start emitting instructions 2. Extra pass # Only needed if there are relocations to be fixed 2a. Patch relocations Support status -------------- The JIT compiler supports BPF instructions up to "cpu=v4". However, it does not yet provide support for: - Tail calls - Atomic operations - 64-bit division/remainder - BPF_PROBE_MEM* (exception table) The result of "test_bpf" test suite on an HSDK board is: hsdk-lnx# insmod test_bpf.ko test_suite=test_bpf test_bpf: Summary: 863 PASSED, 186 FAILED, [851/851 JIT'ed] All the failing test cases are due to the ones that were not JIT'ed. Categorically, they can be represented as: .-----------.------------.-------------. | test type | opcodes | # of cases | |-----------+------------+-------------| | atomic | 0xC3, 0xDB | 149 | | div64 | 0x37, 0x3F | 22 | | mod64 | 0x97, 0x9F | 15 | `-----------^------------+-------------| | (total) 186 | `-------------' Setup: build config ------------------- The following configs must be set to have a working JIT test: CONFIG_BPF_JIT=y CONFIG_BPF_JIT_ALWAYS_ON=y CONFIG_TEST_BPF=m The following options are not necessary for the tests module, but are good to have: CONFIG_DEBUG_INFO=y # prerequisite for below CONFIG_DEBUG_INFO_BTF=y # so bpftool can generate vmlinux.h CONFIG_FTRACE=y # CONFIG_BPF_SYSCALL=y # all these options lead to CONFIG_KPROBE_EVENTS=y # having CONFIG_BPF_EVENTS=y CONFIG_PERF_EVENTS=y # Some BPF programs provide data through /sys/kernel/debug: CONFIG_DEBUG_FS=y arc# mount -t debugfs debugfs /sys/kernel/debug Setup: elfutils --------------- The libdw.{so,a} library that is used by pahole for processing the final binary must come from elfutils 0.189 or newer. The support for ARCv2 [1] has been added since that version. [1] https://sourceware.org/git/?p=elfutils.git;a=commit;h=de3d46b3e7 Setup: pahole ------------- The line below in linux/scripts/Makefile.btf must be commented out: pahole-flags-$(call test-ge, $(pahole-ver), 121) += --btf_gen_floats Or else, the build will fail: $ make V=1 ... BTF .btf.vmlinux.bin.o pahole -J --btf_gen_floats \ -j --lang_exclude=rust \ --skip_encoding_btf_inconsistent_proto \ --btf_gen_optimized .tmp_vmlinux.btf Complex, interval and imaginary float types are not supported Encountered error while encoding BTF. ... BTFIDS vmlinux ./tools/bpf/resolve_btfids/resolve_btfids vmlinux libbpf: failed to find '.BTF' ELF section in vmlinux FAILED: load BTF from vmlinux: No data available This is due to the fact that the ARC toolchains generate "complex float" DIE entries in libgcc and at the moment, pahole can't handle such entries. Running the tests ----------------- host$ scp /bld/linux/lib/test_bpf.ko arc: arc # sysctl net.core.bpf_jit_enable=1 arc # insmod test_bpf.ko test_suite=test_bpf ... test_bpf: #1048 Staggered jumps: JMP32_JSLE_X jited:1 697811 PASS test_bpf: Summary: 863 PASSED, 186 FAILED, [851/851 JIT'ed] Acknowledgments --------------- - Claudiu Zissulescu for his unwavering support - Yuriy Kolerov for testing and troubleshooting - Vladimir Isaev for the pahole workaround - Sergey Matyukevich for paving the road by adding the interpreter support Signed-off-by: Shahab Vahedi <shahab@synopsys.com> Link: https://lore.kernel.org/r/20240430145604.38592-1-list+bpf@vahedi.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>

During the stress testing of the jffs2 file system,the following abnormal printouts were found: [ 2430.649000] Unable to handle kernel paging request at virtual address 0069696969696948 [ 2430.649622] Mem abort info: [ 2430.649829] ESR = 0x96000004 [ 2430.650115] EC = 0x25: DABT (current EL), IL = 32 bits [ 2430.650564] SET = 0, FnV = 0 [ 2430.650795] EA = 0, S1PTW = 0 [ 2430.651032] FSC = 0x04: level 0 translation fault [ 2430.651446] Data abort info: [ 2430.651683] ISV = 0, ISS = 0x00000004 [ 2430.652001] CM = 0, WnR = 0 [ 2430.652558] [0069696969696948] address between user and kernel address ranges [ 2430.653265] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 2430.654512] CPU: 2 PID: 20919 Comm: cat Not tainted 5.15.25-g512f31242bf6 #33 [ 2430.655008] Hardware name: linux,dummy-virt (DT) [ 2430.655517] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2430.656142] pc : kfree+0x78/0x348 [ 2430.656630] lr : jffs2_free_inode+0x24/0x48 [ 2430.657051] sp : ffff800009eebd10 [ 2430.657355] x29: ffff800009eebd10 x28: 0000000000000001 x27: 0000000000000000 [ 2430.658327] x26: ffff000038f09d80 x25: 0080000000000000 x24: ffff800009d38000 [ 2430.658919] x23: 5a5a5a5a5a5a5a5a x22: ffff000038f09d80 x21: ffff8000084f0d14 [ 2430.659434] x20: ffff0000bf9a6ac0 x19: 0169696969696940 x18: 0000000000000000 [ 2430.659969] x17: ffff8000b6506000 x16: ffff800009eec000 x15: 0000000000004000 [ 2430.660637] x14: 0000000000000000 x13: 00000001000820a1 x12: 00000000000d1b19 [ 2430.661345] x11: 0004000800000000 x10: 0000000000000001 x9 : ffff8000084f0d14 [ 2430.662025] x8 : ffff0000bf9a6b40 x7 : ffff0000bf9a6b48 x6 : 0000000003470302 [ 2430.662695] x5 : ffff00002e41dcc0 x4 : ffff0000bf9aa3b0 x3 : 0000000003470342 [ 2430.663486] x2 : 0000000000000000 x1 : ffff8000084f0d14 x0 : fffffc0000000000 [ 2430.664217] Call trace: [ 2430.664528] kfree+0x78/0x348 [ 2430.664855] jffs2_free_inode+0x24/0x48 [ 2430.665233] i_callback+0x24/0x50 [ 2430.665528] rcu_do_batch+0x1ac/0x448 [ 2430.665892] rcu_core+0x28c/0x3c8 [ 2430.666151] rcu_core_si+0x18/0x28 [ 2430.666473] __do_softirq+0x138/0x3cc [ 2430.666781] irq_exit+0xf0/0x110 [ 2430.667065] handle_domain_irq+0x6c/0x98 [ 2430.667447] gic_handle_irq+0xac/0xe8 [ 2430.667739] call_on_irq_stack+0x28/0x54 The parameter passed to kfree was 5a5a5a5a, which corresponds to the target field of the jffs_inode_info structure. It was found that all variables in the jffs_inode_info structure were 5a5a5a5a, except for the first member sem. It is suspected that these variables are not initialized because they were set to 5a5a5a5a during memory testing, which is meant to detect uninitialized memory.The sem variable is initialized in the function jffs2_i_init_once, while other members are initialized in the function jffs2_init_inode_info. The function jffs2_init_inode_info is called after iget_locked, but in the iget_locked function, the destroy_inode process is triggered, which releases the inode and consequently, the target member of the inode is not initialized.In concurrent high pressure scenarios, iget_locked may enter the destroy_inode branch as described in the code. Since the destroy_inode functionality of jffs2 only releases the target, the fix method is to set target to NULL in jffs2_i_init_once. Signed-off-by: Wang Yong <wang.yong12@zte.com.cn> Reviewed-by: Lu Zhongjun <lu.zhongjun@zte.com.cn> Reviewed-by: Yang Tao <yang.tao172@zte.com.cn> Cc: Xu Xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Richard Weinberger <richard@nod.at>

Accessing `mr_table->mfc_cache_list` is protected by an RCU lock. In the following code flow, the RCU read lock is not held, causing the following error when `RCU_PROVE` is not held. The same problem might show up in the IPv6 code path. 6.12.0-rc5-kbuilder-01145-gbac17284bdcb #33 Tainted: G E N ----------------------------- net/ipv4/ipmr_base.c:313 RCU-list traversed in non-reader section!! rcu_scheduler_active = 2, debug_locks = 1 2 locks held by RetransmitAggre/3519: #0: ffff88816188c6c0 (nlk_cb_mutex-ROUTE){+.+.}-{3:3}, at: __netlink_dump_start+0x8a/0x290 #1: ffffffff83fcf7a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_dumpit+0x6b/0x90 stack backtrace: lockdep_rcu_suspicious mr_table_dump ipmr_rtm_dumproute rtnl_dump_all rtnl_dumpit netlink_dump __netlink_dump_start rtnetlink_rcv_msg netlink_rcv_skb netlink_unicast netlink_sendmsg This is not a problem per see, since the RTNL lock is held here, so, it is safe to iterate in the list without the RCU read lock, as suggested by Eric. To alleviate the concern, modify the code to use list_for_each_entry_rcu() with the RTNL-held argument. The annotation will raise an error only if RTNL or RCU read lock are missing during iteration, signaling a legitimate problem, otherwise it will avoid this false positive. This will solve the IPv6 case as well, since ip6mr_rtm_dumproute() calls this function as well. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241108-ipmr_rcu-v2-1-c718998e209b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cpaasch closed this as completed Jul 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

weighttp-test on 1KB file with mptcp.org-client in "ndiffports"-mode #33

weighttp-test on 1KB file with mptcp.org-client in "ndiffports"-mode #33

cpaasch commented Jun 1, 2020

cpaasch commented Jun 15, 2020

pabeni commented Jun 18, 2020

cpaasch commented Jun 22, 2020

pabeni commented Jun 26, 2020

cpaasch commented Jun 26, 2020

pabeni commented Jun 29, 2020 •

edited

Loading

cpaasch commented Jul 3, 2020

weighttp-test on 1KB file with mptcp.org-client in "ndiffports"-mode #33

weighttp-test on 1KB file with mptcp.org-client in "ndiffports"-mode #33

Comments

cpaasch commented Jun 1, 2020

cpaasch commented Jun 15, 2020

pabeni commented Jun 18, 2020

cpaasch commented Jun 22, 2020

pabeni commented Jun 26, 2020

cpaasch commented Jun 26, 2020

pabeni commented Jun 29, 2020 • edited Loading

cpaasch commented Jul 3, 2020

pabeni commented Jun 29, 2020 •

edited

Loading