Skip to content

[LTS 9.2] net: sched: sch_qfq: Fix UAF in qfq_dequeue() #120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

pvts-mat
Copy link
Contributor

CVE-2023-4921
VULN-6733

Problem

https://www.cve.org/CVERecord?id=CVE-2023-4921

A use-after-free vulnerability in the Linux kernel's net/sched: sch_qfq component can be exploited to achieve local privilege escalation. When the plug qdisc is used as a class of the qfq qdisc, sending network packets triggers use-after-free in qfq_dequeue() due to the incorrect .peek handler of sch_plug and lack of error checking in agg_dequeue()

Solution

A single commit was identified as a fix for this issue: 8fc134fee27f2263988ae38920bc03da416b03d8

kABI check: passed

python3 /mnt/code/kernel-dist-git/SOURCES/check-kabi \
        -k /mnt/code/kernel-dist-git/SOURCES/Module.kabi_$(uname -m) \
        -s /mnt/build_files/kernel-src-tree-ciqlts9_2-CVE-2023-4921/Module.symvers; echo $?

0

kernel-dist-git state:

Switched to branch 'el-9.2'
Your branch is up to date with 'origin/el-9.2'.

Boot test: passed

See Specific tests for implied boot test passing.

Kselftests: (in progress)

Specific tests: passed

Bug replication

The bug can be replicated with the following commands, as mention in the commit's message:

tc qdisc add dev lo root handle 1: qfq
tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
tc qdisc add dev lo parent 1:1 handle 2: plug
tc filter add dev lo parent 1: basic classid 1:1
ping -c1 127.0.0.1

The tests were performed on the referential and patched kernel.

Prerequisites

The tc commands above require the following kernel options to be enabled: CONFIG_NET_SCHED, CONFIG_NET_SCH_INGRESS, CONFIG_NET_SCH_QFQ, CONFIG_NET_SCH_PLUG, CONFIG_NET_CLS_BASIC.

This required change in the default configs/kernel-5.14.0-x86_64.config configuration file for the tested x86_64 platform.

CONFIG_NET_SCHED=y
CONFIG_NET_SCH_INGRESS=m
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_PLUG is not set
# CONFIG_NET_CLS_BASIC is not set

The following values were used:

CONFIG_NET_SCH_QFQ=y
CONFIG_NET_SCH_PLUG=y
CONFIG_NET_CLS_BASIC=y

Reference ciqlts9_2 (26e535ef9bf864de514c667b4f98e75354af4bde with necessary config changes)

Bug replicated successfully. Kernel crashed and machine automatically rebooted.

[root@ciqlts92 pvts]# tc qdisc add dev lo root handle 1: qfq
tc qdisc add dev lo root handle 1: qfq
[root@ciqlts92 pvts]# tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
[root@ciqlts92 pvts]# tc qdisc add dev lo parent 1:1 handle 2: plug
tc qdisc add dev lo parent 1:1 handle 2: plug
[root@ciqlts92 pvts]# tc filter add dev lo parent 1: basic classid 1:1
tc filter add dev lo parent 1: basic classid 1:1
[root@ciqlts92 pvts]# ping -c1 127.0.0.1
ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
[   34.471432] ------------[ cut here ]------------
[   34.473552] kernel BUG at include/linux/skbuff.h:2420!
[   34.475764] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[   34.477888] CPU: 9 PID: 1566 Comm: ping Kdump: loaded Not tainted 5.14.0-ciqlts9_2 #5
[   34.481156] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014
[   34.483633] RIP: 0010:eth_type_trans+0xd3/0x140
[   34.484938] Code: 80 00 00 00 eb c1 8b 47 70 2b 47 74 48 8b 97 c8 00 00 00 83 f8 01 7e 1b 48 85 d2 74 06 66 83 3a ff 74 09 b8 00 04 00 00 eb a5 <0f> 0b b8 00 01 00 00 eb 9c 48 85 ff 74 eb 31 f6 b9 02 00 00 00 48
[   34.490165] RSP: 0018:ffff9d5ac0280e80 EFLAGS: 00010a87
[   34.491652] RAX: 0000000054488239 RBX: ffff8f7fc10e4000 RCX: ffff9d5ac0280f04
[   34.493273] RDX: ffff8f7fc888ee64 RSI: ffff8f7fc10e4000 RDI: ffff8f7fc10a0100
[   34.494690] RBP: ffff8f7fc10a0100 R08: ffff8f7fc47420ac R09: 0000000000000001
[   34.496115] R10: 0000000000000004 R11: ffffffffa1a060c0 R12: 0000000000000000
[   34.497543] R13: 0000000000000000 R14: ffff8f7fc10e4000 R15: ffff8f7fc10a0100
[   34.498970] FS:  00007fead6210b80(0000) GS:ffff8f871fc80000(0000) knlGS:0000000000000000
[   34.500592] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.501746] CR2: 00007ffcaa98e000 CR3: 00000001081ea005 CR4: 0000000000370ee0
[   34.503062] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   34.504154] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   34.505247] Call Trace:
[   34.505637]  <IRQ>
[   34.505963]  loopback_xmit+0xbb/0x120
[   34.506548]  dev_hard_start_xmit+0xc4/0x210
[   34.507206]  sch_direct_xmit+0x9e/0x370
[   34.507801]  __qdisc_run+0x65/0xd0
[   34.508367]  net_tx_action+0x1c0/0x270
[   34.508959]  __do_softirq+0xc7/0x2ac
[   34.509526]  do_softirq+0x63/0x90
[   34.510047]  </IRQ>
[   34.510382]  <TASK>
[   34.510719]  __local_bh_enable_ip+0x5a/0x70
[   34.511382]  ip_finish_output2+0x188/0x420
[   34.512032]  ip_push_pending_frames+0xa2/0xb0
[   34.512744]  ping_v4_sendmsg+0x435/0x800
[   34.513376]  ? find_next_iomem_res+0xb2/0x110
[   34.514067]  ? release_sock+0x40/0x90
[   34.514647]  ? sock_sendmsg+0x5b/0x70
[   34.515226]  sock_sendmsg+0x5b/0x70
[   34.515776]  __sys_sendto+0xf0/0x160
[   34.516357]  ? security_file_ioctl+0x2f/0x50
[   34.517040]  __x64_sys_sendto+0x20/0x30
[   34.517637]  do_syscall_64+0x59/0x90
[   34.518205]  ? handle_mm_fault+0xc5/0x2a0
[   34.518837]  ? do_user_addr_fault+0x1dd/0x6b0
[   34.519536]  ? exc_page_fault+0x62/0x150
[   34.520152]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   34.520946] RIP: 0033:0x7fead614fb1a
[   34.521517] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[   34.524411] RSP: 002b:00007ffcaa8c9f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   34.525601] RAX: ffffffffffffffda RBX: 00007ffcaa8cb710 RCX: 00007fead614fb1a
[   34.526705] RDX: 0000000000000040 RSI: 00005574cac78e00 RDI: 0000000000000003
[   34.527827] RBP: 00005574cac78e00 R08: 00007ffcaa8cd98c R09: 0000000000000010
[   34.528939] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
[   34.530047] R13: 00007ffcaa8cb688 R14: 00007ffcaa8c9f30 R15: 00007ffcaa8cb710
[   34.531159]  </TASK>
[   34.531513] Modules linked in: rfkill vfat fat intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm kvm_intel iTCO_wdt kvm iTCO_vendor_support virtio_gpu irqbypass rapl virtio_dma_buf drm_shmem_helper drm_kms_helper i2c_i801 pcspkr syscopyarea i2c_smbus lpc_ich sysfillrect virtio_balloon sysimgblt fb_sys_fops joydev drm xfs libcrc32c sr_mod cdrom sg ahci libahci crct10dif_pclmul crc32_pclmul virtio_blk crc32c_intel virtio_console virtio_net net_failover ghash_clmulni_intel libata failover virtiofs serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod fuse
[    0.000000] Linux version 5.14.0-ciqlts9_2 (pvts@ciqlts92) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) #5 SMP PREEMPT_DYNAMIC Mon Feb 10 22:07:16 UTC 2025
[    0.000000] The list of certified hardware and cloud instances for Enterprise Linux 9 can be viewed at the Red Hat Ecosystem Catalog, https://catalog.redhat.com.
[    0.000000] Command line: elfcorehdr=0x6f000000 BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-ciqlts9_2 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0  irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr hest_disable novmcoredd cma=0 hugetlb_cma=0 disable_cpu_apicid=0 iTCO_wdt.pretimeout=0
...

Full log:
bug-replication–ciqlts9_2.log

Patch

Patch efficacy verified successfully. Repeating the steps doesn't result in a crash and the machine doesn't reboot.

Full log:
bug-replication–ciqlts9_2-CVE-2023-4921.log

A warning can be observed

qfq_dequeue: non-workconserving leaf

issued at sch_qfq.c. The work-conserving queue discipline is a qdisc which never leaves the outbound interface in idle unless the queue is empty, while non-work-conserving qdiscs may delay packets, for example to shape the traffic1. The plug qdisc, being the leaf in the hierarchical qdiscs configuration used, is non-work-conserving, as it is able to suspend the packet flow. Multiple types of qdiscs are apparently not designed to work with non-work-conserving child qdiscs and issue similar warnings on skb == NULL condition: ets, hfsc, htb, drr. See also the message in b00355db3f88d96810a60011a30cfb2c3469409d

Patrick McHardy <kaber@trash.net> suggested:
> How about making this flag and the warning message (in a out-of-line
> function) globally available? Other qdiscs (f.i. HFSC) can't deal with
> inner non-work-conserving qdiscs as well.

and in 6d25d1dc76bf5943a5c1f4bb74d66d5eac58eb77, which includes qfq to the group above

A helper function for printing non-work-conserving alarms is added in
commit b00355db3f88 ("pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving
 warning handler."). In this commit, use qdisc_warn_nonwc() instead of
WARN_ONCE() to handle the non-work-conserving warning in qfq Qdisc.

The point is: the warning is related to the way the qdisc hierarchy has been defined in the bug replication script and not to the introduced changes.

Footnotes

1 https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.qdisc.terminology.html

jira VULN-6733
cve CVE-2023-4921
commit-author valis <sec@valis.email>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ ctrliq#4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <sec@valis.email>
	Signed-off-by: valis <sec@valis.email>
	Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
(cherry picked from commit 8fc134f)
	Signed-off-by: Marcin Wcisło <marcin.wcislo@conclusive.pl>
@pvts-mat pvts-mat marked this pull request as draft February 12, 2025 03:02
@pvts-mat pvts-mat marked this pull request as ready for review February 12, 2025 21:21
@PlaidCat
Copy link
Collaborator

PlaidCat commented Feb 12, 2025

I didn't catch this when assigning work but the CVE is not impacted on Rocky9 / RHEL9
The code is contained in CONFIG Flags that are not built by the default kernel
https://access.redhat.com/security/cve/CVE-2023-4921

@PlaidCat PlaidCat closed this Feb 12, 2025
github-actions bot pushed a commit that referenced this pull request May 1, 2025
JIRA: https://issues.redhat.com/browse/RHEL-68940

commit 031af50
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Wed, 4 Jan 2023 15:16:26 +0000

    arm64: cmpxchg_double*: hazard against entire exchange variable

    The inline assembly for arm64's cmpxchg_double*() implementations use a
    +Q constraint to hazard against other accesses to the memory location
    being exchanged. However, the pointer passed to the constraint is a
    pointer to unsigned long, and thus the hazard only applies to the first
    8 bytes of the location.

    GCC can take advantage of this, assuming that other portions of the
    location are unchanged, leading to a number of potential problems.

    This is similar to what we fixed back in commit:

      fee960b ("arm64: xchg: hazard against entire exchange variable")

    ... but we forgot to adjust cmpxchg_double*() similarly at the same
    time.

    The same problem applies, as demonstrated with the following test:

    | struct big {
    |         u64 lo, hi;
    | } __aligned(128);
    |
    | unsigned long foo(struct big *b)
    | {
    |         u64 hi_old, hi_new;
    |
    |         hi_old = b->hi;
    |         cmpxchg_double_local(&b->lo, &b->hi, 0x12, 0x34, 0x56, 0x78);
    |         hi_new = b->hi;
    |
    |         return hi_old ^ hi_new;
    | }

    ... which GCC 12.1.0 compiles as:

    | 0000000000000000 <foo>:
    |    0:   d503233f        paciasp
    |    4:   aa0003e4        mov     x4, x0
    |    8:   1400000e        b       40 <foo+0x40>
    |    c:   d2800240        mov     x0, #0x12                       // #18
    |   10:   d2800681        mov     x1, #0x34                       // #52
    |   14:   aa0003e5        mov     x5, x0
    |   18:   aa0103e6        mov     x6, x1
    |   1c:   d2800ac2        mov     x2, #0x56                       // #86
    |   20:   d2800f03        mov     x3, #0x78                       // #120
    |   24:   48207c82        casp    x0, x1, x2, x3, [x4]
    |   28:   ca050000        eor     x0, x0, x5
    |   2c:   ca060021        eor     x1, x1, x6
    |   30:   aa010000        orr     x0, x0, x1
    |   34:   d2800000        mov     x0, #0x0                        // #0    <--- BANG
    |   38:   d50323bf        autiasp
    |   3c:   d65f03c0        ret
    |   40:   d2800240        mov     x0, #0x12                       // #18
    |   44:   d2800681        mov     x1, #0x34                       // #52
    |   48:   d2800ac2        mov     x2, #0x56                       // #86
    |   4c:   d2800f03        mov     x3, #0x78                       // #120
    |   50:   f9800091        prfm    pstl1strm, [x4]
    |   54:   c87f1885        ldxp    x5, x6, [x4]
    |   58:   ca0000a5        eor     x5, x5, x0
    |   5c:   ca0100c6        eor     x6, x6, x1
    |   60:   aa0600a6        orr     x6, x5, x6
    |   64:   b5000066        cbnz    x6, 70 <foo+0x70>
    |   68:   c8250c82        stxp    w5, x2, x3, [x4]
    |   6c:   35ffff45        cbnz    w5, 54 <foo+0x54>
    |   70:   d2800000        mov     x0, #0x0                        // #0     <--- BANG
    |   74:   d50323bf        autiasp
    |   78:   d65f03c0        ret

    Notice that at the lines with "BANG" comments, GCC has assumed that the
    higher 8 bytes are unchanged by the cmpxchg_double() call, and that
    `hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and
    LL/SC versions of cmpxchg_double().

    This patch fixes the issue by passing a pointer to __uint128_t into the
    +Q constraint, ensuring that the compiler hazards against the entire 16
    bytes being modified.

    With this change, GCC 12.1.0 compiles the above test as:

    | 0000000000000000 <foo>:
    |    0:   f9400407        ldr     x7, [x0, #8]
    |    4:   d503233f        paciasp
    |    8:   aa0003e4        mov     x4, x0
    |    c:   1400000f        b       48 <foo+0x48>
    |   10:   d2800240        mov     x0, #0x12                       // #18
    |   14:   d2800681        mov     x1, #0x34                       // #52
    |   18:   aa0003e5        mov     x5, x0
    |   1c:   aa0103e6        mov     x6, x1
    |   20:   d2800ac2        mov     x2, #0x56                       // #86
    |   24:   d2800f03        mov     x3, #0x78                       // #120
    |   28:   48207c82        casp    x0, x1, x2, x3, [x4]
    |   2c:   ca050000        eor     x0, x0, x5
    |   30:   ca060021        eor     x1, x1, x6
    |   34:   aa010000        orr     x0, x0, x1
    |   38:   f9400480        ldr     x0, [x4, #8]
    |   3c:   d50323bf        autiasp
    |   40:   ca0000e0        eor     x0, x7, x0
    |   44:   d65f03c0        ret
    |   48:   d2800240        mov     x0, #0x12                       // #18
    |   4c:   d2800681        mov     x1, #0x34                       // #52
    |   50:   d2800ac2        mov     x2, #0x56                       // #86
    |   54:   d2800f03        mov     x3, #0x78                       // #120
    |   58:   f9800091        prfm    pstl1strm, [x4]
    |   5c:   c87f1885        ldxp    x5, x6, [x4]
    |   60:   ca0000a5        eor     x5, x5, x0
    |   64:   ca0100c6        eor     x6, x6, x1
    |   68:   aa0600a6        orr     x6, x5, x6
    |   6c:   b5000066        cbnz    x6, 78 <foo+0x78>
    |   70:   c8250c82        stxp    w5, x2, x3, [x4]
    |   74:   35ffff45        cbnz    w5, 5c <foo+0x5c>
    |   78:   f9400480        ldr     x0, [x4, #8]
    |   7c:   d50323bf        autiasp
    |   80:   ca0000e0        eor     x0, x7, x0
    |   84:   d65f03c0        ret

    ... sampling the high 8 bytes before and after the cmpxchg, and
    performing an EOR, as we'd expect.

    For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note
    that linux-4.9.y is oldest currently supported stable release, and
    mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run
    on my machines due to library incompatibilities.

    I've also used a standalone test to check that we can use a __uint128_t
    pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM
    3.9.1.

    Fixes: 5284e1b ("arm64: xchg: Implement cmpxchg_double")
    Fixes: e9a4b79 ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU")
    Reported-by: Boqun Feng <boqun.feng@gmail.com>
    Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/
    Reported-by: Peter Zijlstra <peterz@infradead.org>
    Link: https://lore.kernel.org/lkml/Y6GXoO4qmH9OIZ5Q@hirez.programming.kicks-ass.net/
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Cc: stable@vger.kernel.org
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Steve Capper <steve.capper@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20230104151626.3262137-1-mark.rutland@arm.com
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants