bpf: Cancel the running bpf_timer through kworker
During the update procedure, when overwriting an element in a
pre-allocated htab, the freeing of the old_element is protected by the
bucket lock. The bucket lock is necessary because the old_element has
already been stashed in htab->extra_elems when alloc_htab_elem()
returns. If the old_element were freed after the bucket lock is
released, the stashed element could be reused by a concurrent update
procedure, and the freeing of the old_element would run concurrently
with the reuse of that element. However, the invocation of
check_and_free_fields() may acquire a spin lock, which violates the
lockdep rule because its caller already holds a raw spin lock (the
bucket lock). The following warning is reported when such a race
happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of the per-cpu
extra_elems into two independent parts: reuse the per-cpu extra_elems
with the bucket lock held, and refill the old_element as the per-cpu
extra_elems after the bucket lock is released. However, that would make
concurrent overwrite procedures on the same CPU return an unexpected
-E2BIG error when the map is full.

Therefore, the patch fixes the lock problem by breaking the
cancellation of the bpf_timer into two steps:

1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is still running, use hrtimer_cancel() through a
   kworker to cancel it again

Considering that the current implementation of hrtimer_cancel() either
spins on the current CPU or acquires an already-held
softirq_expiry_lock while the timer callback is running, the two steps
above are reasonable. The approach also has a downside: when the timer
is running, its cancellation is delayed when the last map uref is
released. The delay is also fixable (e.g., by breaking the cancellation
of the bpf_timer into two parts: one in the locked scope and another in
the unlocked scope), so it can be revised later if necessary.
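The following is a minimal sketch of the two-step cancellation, not the
literal patch: the delete_work member, its placement, and the helper
names are assumptions made for illustration; the actual patch may wire
these up differently:

  /* Sketch only: the delete_work member and the function names are
   * illustrative assumptions, not necessarily the exact upstream code.
   */
  #include <linux/hrtimer.h>
  #include <linux/workqueue.h>

  struct bpf_hrtimer_sketch {
  	struct hrtimer timer;
  	struct work_struct delete_work;	/* INIT_WORK() at timer setup */
  };

  /* Step 2: runs in kworker (process) context with no raw spin lock
   * held, so hrtimer_cancel() may safely wait for a running callback,
   * including taking softirq_expiry_lock on PREEMPT_RT.
   */
  static void timer_delete_work_fn(struct work_struct *work)
  {
  	struct bpf_hrtimer_sketch *t =
  		container_of(work, struct bpf_hrtimer_sketch, delete_work);

  	hrtimer_cancel(&t->timer);
  	/* free t afterwards, e.g. via kfree_rcu() */
  }

  /* Step 1: called with the raw spin lock (bucket lock) held. */
  static void timer_cancel_and_free_sketch(struct bpf_hrtimer_sketch *t)
  {
  	/* hrtimer_try_to_cancel() never blocks; it returns -1 when the
  	 * timer callback is currently executing.
  	 */
  	if (hrtimer_try_to_cancel(&t->timer) >= 0)
  		return;	/* cancelled (1) or inactive (0): safe to free t */

  	/* The callback is running: defer the blocking hrtimer_cancel()
  	 * to a kworker instead of waiting under the raw spin lock.
  	 */
  	queue_work(system_unbound_wq, &t->delete_work);
  }

Deferring to a kworker keeps step 1 non-blocking under the raw spin
lock, at the cost of the delayed cancellation noted above.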
It is a bit hard to decide the right Fixes tag. One reason is that the
problem depends on PREEMPT_RT, which was enabled in v6.12. Considering
that softirq_expiry_lock has existed since v5.4 and bpf_timer was
introduced in v5.15, the bpf_timer commit is used in the Fixes tag and
an extra Depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12 with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de
Signed-off-by: Hou Tao <houtao1@huawei.com>