Skip to content
This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

scx: fix NULL pointer dereference with scx_exit_info #129

Merged
merged 1 commit into from
Jan 28, 2024

Conversation

arighi
Copy link
Collaborator

@arighi arighi commented Jan 26, 2024

Trying to load multiple schedulers at the same time can trigger the following NULL pointer dereference:

[22053.942960] BUG: kernel NULL pointer dereference, address: 0000000000000000 [22053.942966] #PF: supervisor write access in kernel mode [22053.942968] #PF: error_code(0x0002) - not-present page [22053.942969] PGD 0 P4D 0
[22053.942972] Oops: 0002 [#1] PREEMPT SMP NOPTI
[22053.942976] CPU: 6 PID: 3550 Comm: sched_ext_ops_h Tainted: G O 6.7.0-4-generic #4+scx3-Ubuntu
[22053.942978] Hardware name: GPD G1621-02/G1621-02, BIOS 2.04 09/01/2022
[22053.942980] Sched_ext: central (enabled+all)
[22053.942981] RIP: 0010:scx_ops_disable_workfn+0x85/0x610
[22053.942987] Code: 89 df f3 48 ab b9 01 00 00 00 83 fa 01 0f 86 c3 00 00 00 89 d0 f0 0f b1 0d 60 ff 11 02 0f 85 10 05 00 00 48 8b 85 90 fe ff ff <89> 10 81 fa 00 04 00 00 0f 84 ef 04 00 00 0f 8e cb 00 00 00 48 c7
[22053.942989] RSP: 0018:ffffad420257fd30 EFLAGS: 00010246
[22053.942991] RAX: 0000000000000000 RBX: ffffad420257fd58 RCX: 0000000000000001
[22053.942993] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffad420257fd98
[22053.942994] RBP: ffffad420257fea8 R08: 0000000000000000 R09: 0000000000000000
[22053.942996] R10: 0000000000000000 R11: 0000000000000000 R12: ffffad420257fd98
[22053.942998] R13: ffff96a4d07eddc0 R14: ffff96a4d07eddc4 R15: ffffffff8897c3b0
[22053.942999] FS: 0000000000000000(0000) GS:ffff96a7dfb00000(0000) knlGS:0000000000000000
[22053.943002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22053.943003] CR2: 0000000000000000 CR3: 00000001f6a3c001 CR4: 0000000000f70ef0
[22053.943005] PKRU: 55555554
[22053.943006] Call Trace:
[22053.943008]
[22053.943012] ? show_regs+0x6d/0x80
[22053.943016] ? __die+0x24/0x80
[22053.943018] ? page_fault_oops+0x99/0x1b0
[22053.943021] ? do_user_addr_fault+0x2ee/0x6b0
[22053.943024] ? exc_page_fault+0x83/0x1b0
[22053.943028] ? asm_exc_page_fault+0x27/0x30
[22053.943030] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943034] ? scx_ops_disable_workfn+0x85/0x610
[22053.943036] ? asm_sysvec_irq_work+0x1b/0x20
[22053.943042] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943043] kthread_worker_fn+0x9e/0x230
[22053.943048] ? __pfx_kthread_worker_fn+0x10/0x10
[22053.943050] kthread+0xef/0x120
[22053.943053] ? __pfx_kthread+0x10/0x10
[22053.943056] ret_from_fork+0x44/0x70
[22053.943058] ? __pfx_kthread+0x10/0x10
[22053.943061] ret_from_fork_asm+0x1b/0x30
[22053.943064]

This happens because in scx_ops_enable(), if a scheduler is already running, we are freeing scx_exit_info, that is still owned by the running scheduler.

Therefore, as soon as we stop the running scheduler we can hit the NULL pointer dereference.

Reproducer:

  • start any scheduler
  • try to start another scheduler
  • stop the running scheduler
  • BUG

Fix this by not freeing scx_exit_info in error path of scx_ops_enable() when there is a running scheduler.

Moreover, also make sure to access scx_exit_info with the scx_ops_enable_mutex held in scx_ops_disable_workfn(), to prevent other potential race conditions.

Fixes: 26ae1b0 ("scx: Make scx_exit_info fields dynamically allocated")

Copy link
Collaborator

@Byte-Lab Byte-Lab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding this! The scx_ops_enable() part LGTM. Left a couple comments on the disable path to discuss.

ei = scx_exit_info;
if (!ei) {
mutex_unlock(&scx_ops_enable_mutex);
goto done;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move the mutex acquisition above here, we'll also need to drop the lock on line 3324.

Comment on lines 3305 to 3306
ei = scx_exit_info;
if (!ei) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this should never happen, right? If we're able to swap out scx_exit_kind, it means that we were previously successfully running, and were able to successfully allocate scx_exit_info when the scheduler was loaded. AFAICT, the only fix we should need to add is the one you added below in scx_ops_enable() where we don't incorrectly clear the exit info.

Note that we can't race with someone enabling a scheduler either because if we were able to get here, it means that scx_ops_enable_state() will not be SCX_OPS_DISABLED.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right. Also, with the fix to scx_ops_enable() in place, there's no way to have multiple concurrent scx_ops_disable_workfn() running at the same time, so we don't need any extra protection there.

I've updated the PR to include just the fix for the NULL pointer deref in scx_ops_enable(). Thanks!

Trying to load multiple schedulers at the same time can trigger the
following NULL pointer dereference:

[22053.942960] BUG: kernel NULL pointer dereference, address: 0000000000000000
[22053.942966] #PF: supervisor write access in kernel mode
[22053.942968] #PF: error_code(0x0002) - not-present page
[22053.942969] PGD 0 P4D 0
[22053.942972] Oops: 0002 [#1] PREEMPT SMP NOPTI
[22053.942976] CPU: 6 PID: 3550 Comm: sched_ext_ops_h Tainted: G           O       6.7.0-4-generic #4+scx3-Ubuntu
[22053.942978] Hardware name: GPD G1621-02/G1621-02, BIOS 2.04 09/01/2022
[22053.942980] Sched_ext: central (enabled+all)
[22053.942981] RIP: 0010:scx_ops_disable_workfn+0x85/0x610
[22053.942987] Code: 89 df f3 48 ab b9 01 00 00 00 83 fa 01 0f 86 c3 00 00 00 89 d0 f0 0f b1 0d 60 ff 11 02 0f 85 10 05 00 00 48 8b 85 90 fe ff ff <89> 10 81 fa 00 04 00 00 0f 84 ef 04 00 00 0f 8e cb 00 00 00 48 c7
[22053.942989] RSP: 0018:ffffad420257fd30 EFLAGS: 00010246
[22053.942991] RAX: 0000000000000000 RBX: ffffad420257fd58 RCX: 0000000000000001
[22053.942993] RDX: 0000000000000040 RSI: 0000000000000000 RDI: ffffad420257fd98
[22053.942994] RBP: ffffad420257fea8 R08: 0000000000000000 R09: 0000000000000000
[22053.942996] R10: 0000000000000000 R11: 0000000000000000 R12: ffffad420257fd98
[22053.942998] R13: ffff96a4d07eddc0 R14: ffff96a4d07eddc4 R15: ffffffff8897c3b0
[22053.942999] FS:  0000000000000000(0000) GS:ffff96a7dfb00000(0000) knlGS:0000000000000000
[22053.943002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22053.943003] CR2: 0000000000000000 CR3: 00000001f6a3c001 CR4: 0000000000f70ef0
[22053.943005] PKRU: 55555554
[22053.943006] Call Trace:
[22053.943008]  <TASK>
[22053.943012]  ? show_regs+0x6d/0x80
[22053.943016]  ? __die+0x24/0x80
[22053.943018]  ? page_fault_oops+0x99/0x1b0
[22053.943021]  ? do_user_addr_fault+0x2ee/0x6b0
[22053.943024]  ? exc_page_fault+0x83/0x1b0
[22053.943028]  ? asm_exc_page_fault+0x27/0x30
[22053.943030]  ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943034]  ? scx_ops_disable_workfn+0x85/0x610
[22053.943036]  ? asm_sysvec_irq_work+0x1b/0x20
[22053.943042]  ? __pfx_scx_ops_disable_workfn+0x10/0x10
[22053.943043]  kthread_worker_fn+0x9e/0x230
[22053.943048]  ? __pfx_kthread_worker_fn+0x10/0x10
[22053.943050]  kthread+0xef/0x120
[22053.943053]  ? __pfx_kthread+0x10/0x10
[22053.943056]  ret_from_fork+0x44/0x70
[22053.943058]  ? __pfx_kthread+0x10/0x10
[22053.943061]  ret_from_fork_asm+0x1b/0x30
[22053.943064]  </TASK>

This happens because in scx_ops_enable(), if a scheduler is already
running, we are freeing scx_exit_info, that is still owned by the
running scheduler.

Therefore, as soon as we stop the running scheduler we can hit the NULL
pointer dereference.

Reproducer:
 - start any scheduler
 - try to start another scheduler
 - stop the running scheduler
 - BUG

Fix this by not freeing scx_exit_info in error path of scx_ops_enable()
when there is a running scheduler.

Fixes: 26ae1b0 ("scx: Make scx_exit_info fields dynamically allocated")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi arighi force-pushed the fix-null-scx_exit_info branch from 57d57d8 to fa75b1f Compare January 27, 2024 07:13
@Byte-Lab Byte-Lab merged commit 65c2651 into sched_ext Jan 28, 2024
2 checks passed
Byte-Lab added a commit that referenced this pull request Jan 28, 2024
In #129, Andrea fixed a bug
where we trip over a NULL pointer deref by trying to load multiple
schedulers at a time. Let's add a stress test that tries to load
multiple schedulers in a tight loop at the same time.

Additionally, do a bit of cleanup in the build system to have testcases
take all BPF progs as dependencies. We don't really gain anything by
artificially coupling the name of testcases to the BPF progs they use.

Signed-off-by: David Vernet <void@manifault.com>
Byte-Lab added a commit that referenced this pull request Jan 28, 2024
In #129, Andrea fixed a bug
where we trip over a NULL pointer deref by trying to load multiple
schedulers at a time. Let's add a stress test that tries to load
multiple schedulers in a tight loop at the same time.

Additionally, do a bit of cleanup in the build system to have testcases
take all BPF progs as dependencies. We don't really gain anything by
artificially coupling the name of testcases to the BPF progs they use.

Signed-off-by: David Vernet <void@manifault.com>
Byte-Lab added a commit that referenced this pull request Jan 28, 2024
In #129, Andrea fixed a bug
where we trip over a NULL pointer deref by trying to load multiple
schedulers at a time. Let's add a stress test that tries to load
multiple schedulers in a tight loop at the same time.

Additionally, do a bit of cleanup in the build system to have testcases
take all BPF progs as dependencies. We don't really gain anything by
artificially coupling the name of testcases to the BPF progs they use.

Signed-off-by: David Vernet <void@manifault.com>
Byte-Lab added a commit that referenced this pull request Jan 29, 2024
In #129, Andrea fixed a bug
where we trip over a NULL pointer deref by trying to load multiple
schedulers at a time. Let's add a stress test that tries to load
multiple schedulers in a tight loop at the same time.

Additionally, do a bit of cleanup in the build system to have testcases
take all BPF progs as dependencies. We don't really gain anything by
artificially coupling the name of testcases to the BPF progs they use.

Signed-off-by: David Vernet <void@manifault.com>
Byte-Lab added a commit that referenced this pull request Jan 29, 2024
In #129, Andrea fixed a bug
where we trip over a NULL pointer deref by trying to load multiple
schedulers at a time. Let's add a stress test that tries to load
multiple schedulers in a tight loop at the same time.

Additionally, do a bit of cleanup in the build system to have testcases
take all BPF progs as dependencies. We don't really gain anything by
artificially coupling the name of testcases to the BPF progs they use.

Signed-off-by: David Vernet <void@manifault.com>
@Byte-Lab Byte-Lab deleted the fix-null-scx_exit_info branch March 14, 2024 18:52
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants