forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
4.14 2.0.x imx #50
Merged
otavio
merged 718 commits into
Freescale:4.14-2.0.x-imx
from
MaxKrummenacher:4.14-2.0.x-imx
Jul 15, 2019
Merged
4.14 2.0.x imx #50
otavio
merged 718 commits into
Freescale:4.14-2.0.x-imx
from
MaxKrummenacher:4.14-2.0.x-imx
Jul 15, 2019
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[ Upstream commit df1d80a ] For devices from the SigmaDelta family we need to keep CS low when doing a conversion, since the device will use the MISO line as a interrupt to indicate that the conversion is complete. This is why the driver locks the SPI bus and when the SPI bus is locked keeps as long as a conversion is going on. The current implementation gets one small detail wrong though. CS is only de-asserted after the SPI bus is unlocked. This means it is possible for a different SPI device on the same bus to send a message which would be wrongfully be addressed to the SigmaDelta device as well. Make sure that the last SPI transfer that is done while holding the SPI bus lock de-asserts the CS signal. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Alexandru Ardelean <Alexandru.Ardelean@analog.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 536cc27 ] devm_regmap_init_i2c may fail and return NULL. The fix returns the error when it fails. Signed-off-by: Kangjie Lu <kjlu@umn.edu> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
…ocess_data [ Upstream commit 6f9ca1d ] When building with -Wsometimes-uninitialized, Clang warns: drivers/iio/common/ssp_sensors/ssp_iio.c:95:6: warning: variable 'calculated_time' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] While it isn't wrong, this will never be a problem because iio_push_to_buffers_with_timestamp only uses calculated_time on the same condition that it is assigned (when scan_timestamp is not zero). While iio_push_to_buffers_with_timestamp is marked as inline, Clang does inlining in the optimization stage, which happens after the semantic analysis phase (plus inline is merely a hint to the compiler). Fix this by just zero initializing calculated_time. Link: ClangBuiltLinux#394 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7659762 ] In case alloc_workqueue fails, the fix reports the error and returns to avoid NULL pointer dereference. Signed-off-by: Kangjie Lu <kjlu@umn.edu> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 003b686 ] 'hostcmd' is alloced by kzalloc, should be freed before leaving from the error handling cases, otherwise it will cause mem leak. Fixes: 3935ccc ("mwifiex: add cfg80211 testmode support") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 46953f9 ] In case kmemdup fails, the fix sets conn_info->req_ie_len and conn_info->resp_ie_len to zero to avoid buffer overflows. Signed-off-by: Kangjie Lu <kjlu@umn.edu> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d825db3 ] Clang warns about what is clearly a case of passing an uninitalized variable into a static function: drivers/net/wireless/broadcom/b43/phy_lp.c:1852:23: error: variable 'gains' is uninitialized when used here [-Werror,-Wuninitialized] lpphy_papd_cal(dev, gains, 0, 1, 30); ^~~~~ drivers/net/wireless/broadcom/b43/phy_lp.c:1838:2: note: variable 'gains' is declared here struct lpphy_tx_gains gains, oldgains; ^ 1 error generated. However, this function is empty, and its arguments are never evaluated, so gcc in contrast does not warn here. Both compilers behave in a reasonable way as far as I can tell, so we should change the code to avoid the warning everywhere. We could just eliminate the lpphy_papd_cal() function entirely, given that it has had the TODO comment in it for 10 years now and is rather unlikely to ever get done. I'm doing a simpler change here, and just pass the 'oldgains' variable in that has been initialized, based on the guess that this is what was originally meant. Fixes: 2c0d610 ("b43: LP-PHY: Begin implementing calibration & software RFKILL support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Larry Finger <Larry.Finger@lwfinger.net> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a9fd095 ] Leaving dev_init_lock mutex locked in probe causes BUG and a WARNING when kernel is compiled with CONFIG_PROVE_LOCKING. Convert mutex to completion which silences those warnings and improves code readability. Fix below errors when connecting the USB WiFi dongle: brcmfmac: brcmf_fw_alloc_request: using brcm/brcmfmac43143 for chip BCM43143/2 BUG: workqueue leaked lock or atomic: kworker/0:2/0x00000000/434 last function: hub_event 1 lock held by kworker/0:2/434: #0: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac] CPU: 0 PID: 434 Comm: kworker/0:2 Not tainted 4.19.23-00084-g454a789-dirty Freescale#123 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Workqueue: usb_hub_wq hub_event [<8011237c>] (unwind_backtrace) from [<8010d74c>] (show_stack+0x10/0x14) [<8010d74c>] (show_stack) from [<809c4324>] (dump_stack+0xa8/0xd4) [<809c4324>] (dump_stack) from [<8014195c>] (process_one_work+0x710/0x808) [<8014195c>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564) [<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c) [<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20) Exception stack(0xed1d9fb0 to 0xed1d9ff8) 9fa0: 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 ====================================================== WARNING: possible circular locking dependency detected 4.19.23-00084-g454a789-dirty Freescale#123 Not tainted ------------------------------------------------------ kworker/0:2/434 is trying to acquire lock: e29cf799 ((wq_completion)"events"){+.+.}, at: process_one_work+0x174/0x808 but task is already holding lock: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> Freescale#2 (&devinfo->dev_init_lock){+.+.}: mutex_lock_nested+0x1c/0x24 brcmf_usb_probe+0x78/0x550 [brcmfmac] usb_probe_interface+0xc0/0x1bc really_probe+0x228/0x2c0 __driver_attach+0xe4/0xe8 bus_for_each_dev+0x68/0xb4 bus_add_driver+0x19c/0x214 driver_register+0x78/0x110 usb_register_driver+0x84/0x148 process_one_work+0x228/0x808 worker_thread+0x2c/0x564 kthread+0x13c/0x16c ret_from_fork+0x14/0x20 (null) -> Freescale#1 (brcmf_driver_work){+.+.}: worker_thread+0x2c/0x564 kthread+0x13c/0x16c ret_from_fork+0x14/0x20 (null) -> #0 ((wq_completion)"events"){+.+.}: process_one_work+0x1b8/0x808 worker_thread+0x2c/0x564 kthread+0x13c/0x16c ret_from_fork+0x14/0x20 (null) other info that might help us debug this: Chain exists of: (wq_completion)"events" --> brcmf_driver_work --> &devinfo->dev_init_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&devinfo->dev_init_lock); lock(brcmf_driver_work); lock(&devinfo->dev_init_lock); lock((wq_completion)"events"); *** DEADLOCK *** 1 lock held by kworker/0:2/434: #0: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac] stack backtrace: CPU: 0 PID: 434 Comm: kworker/0:2 Not tainted 4.19.23-00084-g454a789-dirty Freescale#123 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Workqueue: events request_firmware_work_func [<8011237c>] (unwind_backtrace) from [<8010d74c>] (show_stack+0x10/0x14) [<8010d74c>] (show_stack) from [<809c4324>] (dump_stack+0xa8/0xd4) [<809c4324>] (dump_stack) from [<80172838>] (print_circular_bug+0x210/0x330) [<80172838>] (print_circular_bug) from [<80175940>] (__lock_acquire+0x160c/0x1a30) [<80175940>] (__lock_acquire) from [<8017671c>] (lock_acquire+0xe0/0x268) [<8017671c>] (lock_acquire) from [<80141404>] (process_one_work+0x1b8/0x808) [<80141404>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564) [<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c) [<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20) Exception stack(0xed1d9fb0 to 0xed1d9ff8) 9fa0: 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 Signed-off-by: Piotr Figiel <p.figiel@camlintechnologies.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c80d26e ] brcmu_pkt_buf_free_skb emits WARNING when attempting to free a sk_buff which is part of any queue. After USB disconnect this may have happened when brcmf_fws_hanger_cleanup() is called as per-interface psq was never cleaned when removing the interface. Change brcmf_fws_macdesc_cleanup() in a way that it removes the corresponding packets from hanger table (to avoid double-free when brcmf_fws_hanger_cleanup() is called) and add a call to clean-up the interface specific packet queue. Below is a WARNING during USB disconnect with Raspberry Pi WiFi dongle running in AP mode. This was reproducible when the interface was transmitting during the disconnect and is fixed with this commit. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1171 at drivers/net/wireless/broadcom/brcm80211/brcmutil/utils.c:49 brcmu_pkt_buf_free_skb+0x3c/0x40 Modules linked in: nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_mangle xt_connmark xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables usb_f_mass_storage usb_f_rndis u_ether cdc_acm smsc95xx usbnet ci_hdrc_imx ci_hdrc ulpi usbmisc_imx 8250_exar 8250_pci 8250 8250_base libcomposite configfs udc_core CPU: 0 PID: 1171 Comm: kworker/0:0 Not tainted 4.19.23-00075-gde33ed8 Freescale#99 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Workqueue: usb_hub_wq hub_event [<8010ff84>] (unwind_backtrace) from [<8010bb64>] (show_stack+0x10/0x14) [<8010bb64>] (show_stack) from [<80840278>] (dump_stack+0x88/0x9c) [<80840278>] (dump_stack) from [<8011f5ec>] (__warn+0xfc/0x114) [<8011f5ec>] (__warn) from [<8011f71c>] (warn_slowpath_null+0x40/0x48) [<8011f71c>] (warn_slowpath_null) from [<805a476c>] (brcmu_pkt_buf_free_skb+0x3c/0x40) [<805a476c>] (brcmu_pkt_buf_free_skb) from [<805bb6c4>] (brcmf_fws_cleanup+0x1e4/0x22c) [<805bb6c4>] (brcmf_fws_cleanup) from [<805bc854>] (brcmf_fws_del_interface+0x58/0x68) [<805bc854>] (brcmf_fws_del_interface) from [<805b66ac>] (brcmf_remove_interface+0x40/0x150) [<805b66ac>] (brcmf_remove_interface) from [<805b6870>] (brcmf_detach+0x6c/0xb0) [<805b6870>] (brcmf_detach) from [<805bdbb8>] (brcmf_usb_disconnect+0x30/0x4c) [<805bdbb8>] (brcmf_usb_disconnect) from [<805e5d64>] (usb_unbind_interface+0x5c/0x1e0) [<805e5d64>] (usb_unbind_interface) from [<804aab10>] (device_release_driver_internal+0x154/0x1ec) [<804aab10>] (device_release_driver_internal) from [<804a97f4>] (bus_remove_device+0xcc/0xf8) [<804a97f4>] (bus_remove_device) from [<804a6fc0>] (device_del+0x118/0x308) [<804a6fc0>] (device_del) from [<805e488c>] (usb_disable_device+0xa0/0x1c8) [<805e488c>] (usb_disable_device) from [<805dcf98>] (usb_disconnect+0x70/0x1d8) [<805dcf98>] (usb_disconnect) from [<805ddd84>] (hub_event+0x464/0xf50) [<805ddd84>] (hub_event) from [<80135a70>] (process_one_work+0x138/0x3f8) [<80135a70>] (process_one_work) from [<80135d5c>] (worker_thread+0x2c/0x554) [<80135d5c>] (worker_thread) from [<8013b1a0>] (kthread+0x124/0x154) [<8013b1a0>] (kthread) from [<801010e8>] (ret_from_fork+0x14/0x2c) Exception stack(0xecf8dfb0 to 0xecf8dff8) dfa0: 00000000 00000000 00000000 00000000 dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 ---[ end trace 38d234018e9e2a90 ]--- ------------[ cut here ]------------ Signed-off-by: Piotr Figiel <p.figiel@camlintechnologies.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit db3b9e2 ] It was observed that rarely during USB disconnect happening shortly after connect (before full initialization completes) usb_hub_wq would wait forever for the dev_init_lock to be unlocked. dev_init_lock would remain locked though because of infinite wait during usb_kill_urb: [ 2730.656472] kworker/0:2 D 0 260 2 0x00000000 [ 2730.660700] Workqueue: events request_firmware_work_func [ 2730.664807] [<809dca20>] (__schedule) from [<809dd164>] (schedule+0x4c/0xac) [ 2730.670587] [<809dd164>] (schedule) from [<8069af44>] (usb_kill_urb+0xdc/0x114) [ 2730.676815] [<8069af44>] (usb_kill_urb) from [<7f258b50>] (brcmf_usb_free_q+0x34/0xa8 [brcmfmac]) [ 2730.684833] [<7f258b50>] (brcmf_usb_free_q [brcmfmac]) from [<7f2517d4>] (brcmf_detach+0xa0/0xb8 [brcmfmac]) [ 2730.693557] [<7f2517d4>] (brcmf_detach [brcmfmac]) from [<7f251a34>] (brcmf_attach+0xac/0x3d8 [brcmfmac]) [ 2730.702094] [<7f251a34>] (brcmf_attach [brcmfmac]) from [<7f2587ac>] (brcmf_usb_probe_phase2+0x468/0x4a0 [brcmfmac]) [ 2730.711601] [<7f2587ac>] (brcmf_usb_probe_phase2 [brcmfmac]) from [<7f252888>] (brcmf_fw_request_done+0x194/0x220 [brcmfmac]) [ 2730.721795] [<7f252888>] (brcmf_fw_request_done [brcmfmac]) from [<805748e4>] (request_firmware_work_func+0x4c/0x88) [ 2730.731125] [<805748e4>] (request_firmware_work_func) from [<80141474>] (process_one_work+0x228/0x808) [ 2730.739223] [<80141474>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564) [ 2730.746105] [<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c) [ 2730.752227] [<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20) [ 2733.099695] kworker/0:3 D 0 1065 2 0x00000000 [ 2733.103926] Workqueue: usb_hub_wq hub_event [ 2733.106914] [<809dca20>] (__schedule) from [<809dd164>] (schedule+0x4c/0xac) [ 2733.112693] [<809dd164>] (schedule) from [<809e2a8c>] (schedule_timeout+0x214/0x3e4) [ 2733.119621] [<809e2a8c>] (schedule_timeout) from [<809dde2c>] (wait_for_common+0xc4/0x1c0) [ 2733.126810] [<809dde2c>] (wait_for_common) from [<7f258d00>] (brcmf_usb_disconnect+0x1c/0x4c [brcmfmac]) [ 2733.135206] [<7f258d00>] (brcmf_usb_disconnect [brcmfmac]) from [<8069e0c8>] (usb_unbind_interface+0x5c/0x1e4) [ 2733.143943] [<8069e0c8>] (usb_unbind_interface) from [<8056d3e8>] (device_release_driver_internal+0x164/0x1fc) [ 2733.152769] [<8056d3e8>] (device_release_driver_internal) from [<8056c078>] (bus_remove_device+0xd0/0xfc) [ 2733.161138] [<8056c078>] (bus_remove_device) from [<8056977c>] (device_del+0x11c/0x310) [ 2733.167939] [<8056977c>] (device_del) from [<8069cba8>] (usb_disable_device+0xa0/0x1cc) [ 2733.174743] [<8069cba8>] (usb_disable_device) from [<8069507c>] (usb_disconnect+0x74/0x1dc) [ 2733.181823] [<8069507c>] (usb_disconnect) from [<80695e88>] (hub_event+0x478/0xf88) [ 2733.188278] [<80695e88>] (hub_event) from [<80141474>] (process_one_work+0x228/0x808) [ 2733.194905] [<80141474>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564) [ 2733.201724] [<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c) [ 2733.207913] [<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20) It was traced down to a case where usb_kill_urb would be called on an URB structure containing more or less random data, including large number in its use_count. During the debugging it appeared that in brcmf_usb_free_q() the traversal over URBs' lists is not synchronized with operations on those lists in brcmf_usb_rx_complete() leading to handling brcmf_usbdev_info structure (holding lists' head) as lists' element and in result causing above problem. Fix it by walking through all URBs during brcmf_cancel_all_urbs using the arrays of requests instead of linked lists. Signed-off-by: Piotr Figiel <p.figiel@camlintechnologies.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 24d413a ] Fix a race which leads to an Oops with NULL pointer dereference. The dereference is in brcmf_config_dongle() when cfg_to_ndev() attempts to get net_device structure of interface with index 0 via if2bss mapping. This shouldn't fail because of check for bus being ready in brcmf_netdev_open(), but it's not synchronised with USB disconnect and there is a race: after the check the bus can be marked down and the mapping for interface 0 may be gone. Solve this by modifying disconnect handling so that the removal of mapping of ifidx to brcmf_if structure happens after netdev removal (which is synchronous with brcmf_netdev_open() thanks to rtln being locked in devinet_ioctl()). This assures brcmf_netdev_open() returns before the mapping is removed during disconnect. Unable to handle kernel NULL pointer dereference at virtual address 00000008 pgd = bcae2612 [00000008] *pgd=8be73831 Internal error: Oops: 17 [Freescale#1] PREEMPT SMP ARM Modules linked in: brcmfmac brcmutil nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_mangle xt_connmark xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables usb_f_mass_storage usb_f_rndis u_ether usb_serial_simple usbserial cdc_acm smsc95xx usbnet ci_hdrc_imx ci_hdrc usbmisc_imx ulpi 8250_exar 8250_pci 8250 8250_base libcomposite configfs udc_core [last unloaded: brcmutil] CPU: 2 PID: 24478 Comm: ifconfig Not tainted 4.19.23-00078-ga62866d-dirty Freescale#115 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) PC is at brcmf_cfg80211_up+0x94/0x29c [brcmfmac] LR is at brcmf_cfg80211_up+0x8c/0x29c [brcmfmac] pc : [<7f26a91c>] lr : [<7f26a914>] psr: a0070013 sp : eca99d28 ip : 00000000 fp : ee9c6c00 r10: 00000036 r9 : 00000000 r8 : ece4002c r7 : edb5b800 r6 : 00000000 r5 : 80f08448 r4 : edb5b968 r3 : ffffffff r2 : 00000000 r1 : 00000002 r0 : 00000000 Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 7ca0c04a DAC: 00000051 Process ifconfig (pid: 24478, stack limit = 0xd9e85a0e) Stack: (0xeca99d28 to 0xeca9a000) 9d20: 00000000 80f873b0 0000000d 80f08448 eca99d68 50d45f32 9d40: 7f27de94 ece40000 80f08448 80f08448 7f27de94 ece4002c 00000000 00000036 9d60: ee9c6c00 7f27262c 00001002 50d45f32 ece40000 00000000 80f08448 80772008 9d80: 00000001 00001043 00001002 ece40000 00000000 50d45f32 ece40000 00000001 9da0: 80f08448 00001043 00001002 807723d0 00000000 50d45f32 80f08448 eca99e58 9dc0: 80f87113 50d45f32 80f08448 ece40000 ece40138 00001002 80f08448 00000000 9de0: 00000000 80772434 edbd5380 eca99e58 edbd5380 80f08448 ee9c6c0c 80805f70 9e00: 00000000 ede08e00 00008914 ece40000 00000014 ee9c6c0c 600c0013 00001043 9e20: 0208a8c0 ffffffff 00000000 50d45f32 eca98000 80f08448 7ee9fc38 00008914 9e40: 80f68e40 00000051 eca98000 00000036 00000003 80808b9c 6e616c77 00000030 9e60: 00000000 00000000 00001043 0208a8c0 ffffffff 00000000 80f08448 00000000 9e80: 00000000 816d8b20 600c0013 00000001 ede09320 801763d4 00000000 50d45f32 9ea0: eca98000 80f08448 7ee9fc38 50d45f32 00008914 80f08448 7ee9fc38 80f68e40 9ec0: ed531540 8074721c 00000800 00000001 00000000 6e616c77 00000030 00000000 9ee0: 00000000 00001002 0208a8c0 ffffffff 00000000 50d45f32 80f08448 7ee9fc38 9f00: ed531560 ec8fc900 80285a6c 80285138 edb910c0 00000000 ecd91008 ede08e00 9f20: 80f08448 00000000 00000000 816d8b20 600c0013 00000001 ede09320 801763d4 9f40: 00000000 50d45f32 00021000 edb91118 edb910c0 80f08448 01b29000 edb91118 9f60: eca99f7c 50d45f32 00021000 ec8fc900 00000003 ec8fc900 00008914 7ee9fc38 9f80: eca98000 00000036 00000003 80285a6c 00086364 7ee9fe1c 000000c3 00000036 9fa0: 801011c4 80101000 00086364 7ee9fe1c 00000003 00008914 7ee9fc38 00086364 9fc0: 00086364 7ee9fe1c 000000c3 00000036 0008630c 7ee9fe1c 7ee9fc38 00000003 9fe0: 000a42b8 7ee9fbd4 00019914 76e09acc 600c0010 00000003 00000000 00000000 [<7f26a91c>] (brcmf_cfg80211_up [brcmfmac]) from [<7f27262c>] (brcmf_netdev_open+0x74/0xe8 [brcmfmac]) [<7f27262c>] (brcmf_netdev_open [brcmfmac]) from [<80772008>] (__dev_open+0xcc/0x150) [<80772008>] (__dev_open) from [<807723d0>] (__dev_change_flags+0x168/0x1b4) [<807723d0>] (__dev_change_flags) from [<80772434>] (dev_change_flags+0x18/0x48) [<80772434>] (dev_change_flags) from [<80805f70>] (devinet_ioctl+0x67c/0x79c) [<80805f70>] (devinet_ioctl) from [<80808b9c>] (inet_ioctl+0x210/0x3d4) [<80808b9c>] (inet_ioctl) from [<8074721c>] (sock_ioctl+0x350/0x524) [<8074721c>] (sock_ioctl) from [<80285138>] (do_vfs_ioctl+0xb0/0x9b0) [<80285138>] (do_vfs_ioctl) from [<80285a6c>] (ksys_ioctl+0x34/0x5c) [<80285a6c>] (ksys_ioctl) from [<80101000>] (ret_fast_syscall+0x0/0x28) Exception stack(0xeca99fa8 to 0xeca99ff0) 9fa0: 00086364 7ee9fe1c 00000003 00008914 7ee9fc38 00086364 9fc0: 00086364 7ee9fe1c 000000c3 00000036 0008630c 7ee9fe1c 7ee9fc38 00000003 9fe0: 000a42b8 7ee9fbd4 00019914 76e09acc Code: e5970328 eb002021 e1a02006 e3a01002 (e5909008) ---[ end trace 5cbac2333f3ac5df ]--- Signed-off-by: Piotr Figiel <p.figiel@camlintechnologies.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a652e00 ] The IRQ is requested before the struct rtc is allocated and registered, but this struct is used in the IRQ handler. This may lead to a NULL pointer dereference. Switch to devm_rtc_allocate_device/rtc_register_device to allocate the rtc struct before requesting the IRQ. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60209d4 ] In case dev_alloc_skb fails, the fix safely returns to avoid potential NULL pointer dereference. Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0487fff ] Currently if a regulator has "<name>-fixed-regulator" property in device tree, it will skip current limit initialization. This lead to a zero "max_uA" value in struct ufs_vreg. However, "regulator_set_load" operation shall be required on regulators which have valid current limits, otherwise a zero "max_uA" set by "regulator_set_load" may cause unexpected behavior when this regulator is enabled or set as high power mode. Similarly, in device's icc_level configuration flow, the target icc_level shall be updated if regulator also has valid current limit, otherwise a wrong icc_level will be calculated by zero "max_uA" and thus causes unexpected results after it is written to device. Signed-off-by: Stanley Chu <stanley.chu@mediatek.com> Reviewed-by: Avri Altman <avri.altman@wdc.com> Acked-by: Alim Akhtar <alim.akhtar@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3b141e8 ] For regulators used by UFS, vcc, vccq and vccq2 will have voltage range initialized by ufshcd_populate_vreg(), however other regulators may have undefined voltage range if dt-bindings have no such definition. In above undefined case, both "min_uV" and "max_uV" fields in ufs_vreg struct will be zero values and these values will be configured on regulators in different power modes. Currently this may have no harm if both "min_uV" and "max_uV" always keep "zero values" because regulator_set_voltage() will always bypass such invalid values and return "good" results. However improper values shall be fixed to avoid potential bugs. Simply bypass voltage configuration if voltage range is not defined. Signed-off-by: Stanley Chu <stanley.chu@mediatek.com> Reviewed-by: Avri Altman <avri.altman@wdc.com> Acked-by: Alim Akhtar <alim.akhtar@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 92606ec ] The call to of_get_next_child returns a node pointer with refcount incremented thus it must be explicitly decremented after the last usage. Detected by coccinelle with the following warnings: ./arch/arm64/kernel/cpu_ops.c:102:1-7: ERROR: missing of_node_put; acquired a node pointer with refcount incremented on line 69, but without a corresponding object release within this function. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4a6c91f ] For CONFIG_TRACE_BRANCH_PROFILING=y the likely/unlikely things get overloaded and generate callouts to this code, and thus also when AC=1. Make it safe. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 88e4718 ] Occasionally GCC is less agressive with inlining and the following is observed: arch/x86/kernel/signal.o: warning: objtool: restore_sigcontext()+0x3cc: call to force_valid_ss.isra.5() with UACCESS enabled arch/x86/kernel/signal.o: warning: objtool: do_signal()+0x384: call to frame_uc_flags.isra.0() with UACCESS enabled Cure this by moving this code out of the AC=1 region, since it really isn't needed for the user access. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 67a0514 ] Objtool spotted that we call native_load_gs_index() with AC set. Re-arrange the code to avoid that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit de36e16 ] Current overlap checking cannot correctly handle a case which is baseminor < existing baseminor && baseminor + minorct > existing baseminor + minorct. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6734b29 ] port_pd is treated as le32 in declaration and read, fix assignment to be in le32 too. This change fixes the following compilation warnings. drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: warning: incorrect type in assignment (different base types) drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: expected restricted __le32 [usertype] port_pd drivers/infiniband/hw/hns/hns_roce_ah.c:67:24: got restricted __be32 [usertype] Fixes: 9a44353 ("IB/hns: Add driver files for hns RoCE driver") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Lijun Ou <ouliun@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 58e7515 ] As seen on some USB wireless keyboards manufactured by Primax, the HID parser was using some assumptions that are not always true. In this case it's s the fact that, inside the scope of a main item, an Usage Page will always precede an Usage. The spec is not pretty clear as 6.2.2.7 states "Any usage that follows is interpreted as a Usage ID and concatenated with the Usage Page". While 6.2.2.8 states "When the parser encounters a main item it concatenates the last declared Usage Page with a Usage to form a complete usage value." Being somewhat contradictory it was decided to match Window's implementation, which follows 6.2.2.8. In summary, the patch moves the Usage Page concatenation from the local item parsing function to the main item parsing function. Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Reviewed-by: Terry Junge <terry.junge@poly.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
…_put [ Upstream commit b820d52 ] The call to of_parse_phandle returns a node pointer with refcount incremented thus it must be explicitly decremented after the last usage. Detected by coccinelle with the following warnings: ./sound/soc/fsl/eukrea-tlv320.c:121:3-9: ERROR: missing of_node_put; acquired a node pointer with refcount incremented on line 102, but without a correspo nding object release within this function. ./sound/soc/fsl/eukrea-tlv320.c:127:3-9: ERROR: missing of_node_put; acquired a node pointer with refcount incremented on line 102, but without a correspo nding object release within this function. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: alsa-devel@alsa-project.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c705247 ] The call to of_parse_phandle returns a node pointer with refcount incremented thus it must be explicitly decremented after the last usage. Detected by coccinelle with the following warnings: ./sound/soc/fsl/fsl_utils.c:74:2-8: ERROR: missing of_node_put; acquired a node pointer with refcount incremented on line 38, but without a corresponding object release within this function. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Cc: Timur Tabi <timur@kernel.org> Cc: Nicolin Chen <nicoleotsuka@gmail.com> Cc: Xiubo Li <Xiubo.Lee@gmail.com> Cc: Fabio Estevam <festevam@gmail.com> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: alsa-devel@alsa-project.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7649773 ] The use of zero-sized array causes undefined behaviour when it is not the last member in a structure. As it happens to be in this case. Also, the current code makes use of a language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as this one is a flexible array member, introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last. Which is beneficial to cultivate a high-quality code. Fixes: e48f129 ("[SCSI] cxgb3i: convert cdev->l2opt to use rcu to prevent NULL dereference") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
… percent [ Upstream commit 1f87b0c ] According to hidpp20_batterylevel_get_battery_info my Logitech K270 keyboard reports only 2 battery levels. This matches with what I've seen after testing with batteries at varying level of fullness, it always reports either 5% or 30%. Windows reports "battery good" for the 30% level. I've captured an USB trace of Windows reading the battery and it is getting the same info as the Linux hidpp code gets. Now that Linux handles these devices as hidpp devices, it reports the battery as being low as it treats anything under 31% as low, this leads to the user constantly getting a "Keyboard battery is low" warning from GNOME3, which is very annoying. This commit fixes this by changing the low threshold to anything under 30%, which I assume is what Windows does. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0191949 ] Fixes: SPI driver can be built as module so perform SPI controller reset on probe to make sure it is in valid state before initiating transfer. Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c03a0fd ] syzbot is hitting use-after-free bug in uinput module [1]. This is because kobject_uevent(KOBJ_REMOVE) is called again due to commit 0f4dafc ("Kobject: auto-cleanup on final unref") after memory allocation fault injection made kobject_uevent(KOBJ_REMOVE) from device_del() from input_unregister_device() fail, while uinput_destroy_device() is expecting that kobject_uevent(KOBJ_REMOVE) is not called after device_del() from input_unregister_device() completed. That commit intended to catch cases where nobody even attempted to send "remove" uevents. But there is no guarantee that an event will ultimately be sent. We are at the point of no return as far as the rest of the kernel is concerned; there are no repeats or do-overs. Also, it is not clear whether some subsystem depends on that commit. If no subsystem depends on that commit, it will be better to remove the state_{add,remove}_uevent_sent logic. But we don't want to risk a regression (in a patch which will be backported) by trying to remove that logic. Therefore, as a first step, let's avoid the use-after-free bug by making sure that kobject_uevent(KOBJ_REMOVE) won't be triggered twice. [1] https://syzkaller.appspot.com/bug?id=8b17c134fe938bbddd75a45afaa9e68af43a362d Reported-by: syzbot <syzbot+f648cfb7e0b52bf7ae32@syzkaller.appspotmail.com> Analyzed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Fixes: 0f4dafc ("Kobject: auto-cleanup on final unref") Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit aeb0d0f ] devm_kcalloc may fail and return a null pointer. The fix returns -ENOMEM upon failures to avoid null pointer dereferences. Signed-off-by: Kangjie Lu <kjlu@umn.edu> Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9c2ccc3 ] Smatch marks skb->data as untrusted so it warns that "evt_hdr->dlen" can copy up to 255 bytes and we only have room for two bytes. Even if this comes from the firmware and we trust it, the new policy generally is just to fix it as kernel hardenning. I can't test this code so I tried to be very conservative. I considered not allowing "evt_hdr->dlen == 1" because it doesn't initialize the whole variable but in the end I decided to allow it and manually initialized "asic_id" and "asic_ver" to zero. Fixes: e8454ff ("[media] drivers:media:radio: wl128x: FM Driver Common sources") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 338aa10 ] Fix the warning produced by gpiochip_set_irq_hooks() by allocating a dedicated IRQ chip per GPIO chip/port. Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Bartosz Golaszewski <bgolaszewski@baylibre.com> Cc: Chris Healy <cphealy@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Fabio Estevam <festevam@gmail.com> Cc: linux-gpio@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-imx@nxp.com Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8c43004 ] pcpu_find_block_fit() guarantees that a fit is found within PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it compares against the block's contig_hint. This can lead to incorrectly scanning past the end of the bitmap. The behavior was okay given the check after for bit_off >= end and the correctness of the hints from pcpu_find_block_fit(). This patch fixes this by bounding the end offset by the number of bits in a chunk. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
…R connections" This reverts commit 2fa7a15 which is commit d5bb334 upstream. Lots of people have reported issues with this patch, and as there does not seem to be a fix going into Linus's kernel tree any time soon, revert the commit in the stable trees so as to get people's machines working properly again. Reported-by: Vasily Khoruzhick <anarsoul@gmail.com> Reported-by: Hans de Goede <hdegoede@redhat.com> Cc: Jeremy Cline <jeremy@jcline.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…ntexts. (v3)" This reverts commit 140ae65 which is commit b30a43a upstream. Sven reports: Commit 1e07d63 ("drm/nouveau: add kconfig option to turn off nouveau legacy contexts. (v3)") has caused a build failure for me when I actually tried that option (CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT=n): ,---- | Kernel: arch/x86/boot/bzImage is ready (Freescale#1) | Building modules, stage 2. | MODPOST 290 modules | ERROR: "drm_legacy_mmap" [drivers/gpu/drm/nouveau/nouveau.ko] undefined! | scripts/Makefile.modpost:91: recipe for target '__modpost' failed `---- Upstream does not have that problem, as commit bed2dd8 ("drm/ttm: Quick-test mmap offset in ttm_bo_mmap()") has removed the use of drm_legacy_mmap from nouveau_ttm.c. Unfortunately that commit does not apply in 5.1.9. The ensuing discussion proposed a number of one-off patches, but no solid agreement was made, so just revert the commit for now to get people's systems building again. Reported-by: Sven Joachim <svenjoac@gmx.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Airlie <airlied@redhat.com> Cc: Thomas Backlund <tmb@mageia.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 89a4aac upstream. In the case of a normal sync update, the preparation of framebuffers (be it calling drm_atomic_helper_prepare_planes() or doing setups with drm_framebuffer_get()) are performed in the new_state and the respective cleanups are performed in the old_state. In the case of async updates, the preparation is also done in the new_state but the cleanups are done in the new_state (because updates are performed in place, i.e. in the current state). The current code blocks async udpates when the fb is changed, turning async updates into sync updates, slowing down cursor updates and introducing regressions in igt tests with errors of type: "CRITICAL: completed 97 cursor updated in a period of 30 flips, we expect to complete approximately 15360 updates, with the threshold set at 7680" Fb changes in async updates were prevented to avoid the following scenario: - Async update, oldfb = NULL, newfb = fb1, prepare fb1, cleanup fb1 - Async update, oldfb = fb1, newfb = fb2, prepare fb2, cleanup fb2 - Non-async commit, oldfb = fb2, newfb = fb1, prepare fb1, cleanup fb2 (wrong) Where we have a single call to prepare fb2 but double cleanup call to fb2. To solve the above problems, instead of blocking async fb changes, we place the old framebuffer in the new_state object, so when the code performs cleanups in the new_state it will cleanup the old_fb and we will have the following scenario instead: - Async update, oldfb = NULL, newfb = fb1, prepare fb1, no cleanup - Async update, oldfb = fb1, newfb = fb2, prepare fb2, cleanup fb1 - Non-async commit, oldfb = fb2, newfb = fb1, prepare fb1, cleanup fb2 Where calls to prepare/cleanup are balanced. Cc: <stable@vger.kernel.org> # v4.14+ Fixes: 25dc194 ("drm: Block fb changes for async plane updates") Suggested-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Helen Koike <helen.koike@collabora.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190603165610.24614-6-helen.koike@collabora.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c32ae3 upstream. The call of unsubscribe_port() which manages the group count and module refcount from delete_and_unsubscribe_port() looks racy; it's not covered by the group list lock, and it's likely a cause of the reported unbalance at port deletion. Let's move the call inside the group list_mutex to plug the hole. Reported-by: syzbot+e4c8abb920efa77bace9@syzkaller.appspotmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is the 4.14.126 stable release Signed-off-by: Max Krummenacher <max.krummenacher@toradex.com> Conflicts: drivers/gpio/gpio-vf610.c: Follow commit 338aa10 gpio: vf610: Do not share irq_chip drivers/gpu/drm/bridge/adv7511/adv7511_drv.c: Follow commit 67793bd drm/bridge: adv7511: Fix low refresh rate selection Use drm_mode_vrefresh(mode) helper drivers/net/ethernet/freescale/fec_main.c: Keep downstream file. drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c Follow commit 46953f9 brcmfmac: fix missing checks for kmemdup sound/soc/fsl/Kconfig: Follow commit ea75122 ASoC: imx: fix fiq dependencies Logical Conflicts: sound/soc/fsl/fsl_sai.c: Revert upstream d7325abe29b as downstream fixed it differently drivers/clk/imx/clk-imx6sl.c Revert upstream bda9f84 as downstream implemented it differently 68c736e
Prevents: fs/proc/task_mmu.c:761:7: warning: ‘last_vma’ may be used uninitialized in this function [-Wmaybe-uninitialized] last_vma is guarded by 'rollup_mode = true' and thus never accessed uninitalized. Upstream in the meantime refactored the code and thus does not see the issue. Signed-off-by: Max Krummenacher <max.krummenacher@toradex.com>
schnitzeltony
pushed a commit
to schnitzeltony/linux-fslc
that referenced
this pull request
Aug 12, 2019
[ Upstream commit 3f167e1 ] ipv4_pdp_add() is called in RCU read-side critical section. So GFP_KERNEL should not be used in the function. This patch make ipv4_pdp_add() to use GFP_ATOMIC instead of GFP_KERNEL. Test commands: gtp-link add gtp1 & gtp-tunnel add gtp1 v1 100 200 1.1.1.1 2.2.2.2 Splat looks like: [ 130.618881] ============================= [ 130.626382] WARNING: suspicious RCU usage [ 130.626994] 5.2.0-rc6+ Freescale#50 Not tainted [ 130.627622] ----------------------------- [ 130.628223] ./include/linux/rcupdate.h:266 Illegal context switch in RCU read-side critical section! [ 130.629684] [ 130.629684] other info that might help us debug this: [ 130.629684] [ 130.631022] [ 130.631022] rcu_scheduler_active = 2, debug_locks = 1 [ 130.632136] 4 locks held by gtp-tunnel/1025: [ 130.632925] #0: 000000002b93c8b7 (cb_lock){++++}, at: genl_rcv+0x15/0x40 [ 130.634159] Freescale#1: 00000000f17bc999 (genl_mutex){+.+.}, at: genl_rcv_msg+0xfb/0x130 [ 130.635487] Freescale#2: 00000000c644ed8e (rtnl_mutex){+.+.}, at: gtp_genl_new_pdp+0x18c/0x1150 [gtp] [ 130.636936] Freescale#3: 0000000007a1cde7 (rcu_read_lock){....}, at: gtp_genl_new_pdp+0x187/0x1150 [gtp] [ 130.638348] [ 130.638348] stack backtrace: [ 130.639062] CPU: 1 PID: 1025 Comm: gtp-tunnel Not tainted 5.2.0-rc6+ Freescale#50 [ 130.641318] Call Trace: [ 130.641707] dump_stack+0x7c/0xbb [ 130.642252] ___might_sleep+0x2c0/0x3b0 [ 130.642862] kmem_cache_alloc_trace+0x1cd/0x2b0 [ 130.643591] gtp_genl_new_pdp+0x6c5/0x1150 [gtp] [ 130.644371] genl_family_rcv_msg+0x63a/0x1030 [ 130.645074] ? mutex_lock_io_nested+0x1090/0x1090 [ 130.645845] ? genl_unregister_family+0x630/0x630 [ 130.646592] ? debug_show_all_locks+0x2d0/0x2d0 [ 130.647293] ? check_flags.part.40+0x440/0x440 [ 130.648099] genl_rcv_msg+0xa3/0x130 [ ... ] Fixes: 459aa66 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
schnitzeltony
pushed a commit
to schnitzeltony/linux-fslc
that referenced
this pull request
Aug 12, 2019
[ Upstream commit 1788b85 ] gtp_encap_destroy() is called twice. 1. When interface is deleted. 2. When udp socket is destroyed. either gtp->sk0 or gtp->sk1u could be freed by sock_put() in gtp_encap_destroy(). so, when gtp_encap_destroy() is called again, it would uses freed sk pointer. patch makes gtp_encap_destroy() to set either gtp->sk0 or gtp->sk1u to null. in addition, both gtp->sk0 and gtp->sk1u pointer are protected by rtnl_lock. so, rtnl_lock() is added. Test command: gtp-link add gtp1 & killall gtp-link ip link del gtp1 Splat looks like: [ 83.182767] BUG: KASAN: use-after-free in __lock_acquire+0x3a20/0x46a0 [ 83.184128] Read of size 8 at addr ffff8880cc7d5360 by task ip/1008 [ 83.185567] CPU: 1 PID: 1008 Comm: ip Not tainted 5.2.0-rc6+ Freescale#50 [ 83.188469] Call Trace: [ ... ] [ 83.200126] lock_acquire+0x141/0x380 [ 83.200575] ? lock_sock_nested+0x3a/0xf0 [ 83.201069] _raw_spin_lock_bh+0x38/0x70 [ 83.201551] ? lock_sock_nested+0x3a/0xf0 [ 83.202044] lock_sock_nested+0x3a/0xf0 [ 83.202520] gtp_encap_destroy+0x18/0xe0 [gtp] [ 83.203065] gtp_encap_disable.isra.14+0x13/0x50 [gtp] [ 83.203687] gtp_dellink+0x56/0x170 [gtp] [ 83.204190] rtnl_delete_link+0xb4/0x100 [ ... ] [ 83.236513] Allocated by task 976: [ 83.236925] save_stack+0x19/0x80 [ 83.237332] __kasan_kmalloc.constprop.3+0xa0/0xd0 [ 83.237894] kmem_cache_alloc+0xd8/0x280 [ 83.238360] sk_prot_alloc.isra.42+0x50/0x200 [ 83.238874] sk_alloc+0x32/0x940 [ 83.239264] inet_create+0x283/0xc20 [ 83.239684] __sock_create+0x2dd/0x540 [ 83.240136] __sys_socket+0xca/0x1a0 [ 83.240550] __x64_sys_socket+0x6f/0xb0 [ 83.240998] do_syscall_64+0x9c/0x450 [ 83.241466] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 83.242061] [ 83.242249] Freed by task 0: [ 83.242616] save_stack+0x19/0x80 [ 83.243013] __kasan_slab_free+0x111/0x150 [ 83.243498] kmem_cache_free+0x89/0x250 [ 83.244444] __sk_destruct+0x38f/0x5a0 [ 83.245366] rcu_core+0x7e9/0x1c20 [ 83.245766] __do_softirq+0x213/0x8fa Fixes: 1e3a3ab ("gtp: make GTP sockets in gtp_newlink optional") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
schnitzeltony
pushed a commit
to schnitzeltony/linux-fslc
that referenced
this pull request
Aug 12, 2019
[ Upstream commit a2bed90 ] Current gtp_newlink() could be called after unregister_pernet_subsys(). gtp_newlink() uses gtp_net but it can be destroyed by unregister_pernet_subsys(). So unregister_pernet_subsys() should be called after rtnl_link_unregister(). Test commands: #SHELL 1 while : do for i in {1..5} do ./gtp-link add gtp$i & done killall gtp-link done #SHELL 2 while : do modprobe -rv gtp done Splat looks like: [ 753.176631] BUG: KASAN: use-after-free in gtp_newlink+0x9b4/0xa5c [gtp] [ 753.177722] Read of size 8 at addr ffff8880d48f2458 by task gtp-link/7126 [ 753.179082] CPU: 0 PID: 7126 Comm: gtp-link Tainted: G W 5.2.0-rc6+ Freescale#50 [ 753.185801] Call Trace: [ 753.186264] dump_stack+0x7c/0xbb [ 753.186863] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.187583] print_address_description+0xc7/0x240 [ 753.188382] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.189097] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.189846] __kasan_report+0x12a/0x16f [ 753.190542] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.191298] kasan_report+0xe/0x20 [ 753.191893] gtp_newlink+0x9b4/0xa5c [gtp] [ 753.192580] ? __netlink_ns_capable+0xc3/0xf0 [ 753.193370] __rtnl_newlink+0xb9f/0x11b0 [ ... ] [ 753.241201] Allocated by task 7186: [ 753.241844] save_stack+0x19/0x80 [ 753.242399] __kasan_kmalloc.constprop.3+0xa0/0xd0 [ 753.243192] __kmalloc+0x13e/0x300 [ 753.243764] ops_init+0xd6/0x350 [ 753.244314] register_pernet_operations+0x249/0x6f0 [ ... ] [ 753.251770] Freed by task 7178: [ 753.252288] save_stack+0x19/0x80 [ 753.252833] __kasan_slab_free+0x111/0x150 [ 753.253962] kfree+0xc7/0x280 [ 753.254509] ops_free_list.part.11+0x1c4/0x2d0 [ 753.255241] unregister_pernet_operations+0x262/0x390 [ ... ] [ 753.285883] list_add corruption. next->prev should be prev (ffff8880d48f2458), but was ffff8880d497d878. (next. [ 753.287241] ------------[ cut here ]------------ [ 753.287794] kernel BUG at lib/list_debug.c:25! [ 753.288364] invalid opcode: 0000 [Freescale#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 753.289099] CPU: 0 PID: 7126 Comm: gtp-link Tainted: G B W 5.2.0-rc6+ Freescale#50 [ 753.291036] RIP: 0010:__list_add_valid+0x74/0xd0 [ 753.291589] Code: 48 39 da 75 27 48 39 f5 74 36 48 39 dd 74 31 48 83 c4 08 b8 01 00 00 00 5b 5d c3 48 89 d9 48b [ 753.293779] RSP: 0018:ffff8880cae8f398 EFLAGS: 00010286 [ 753.294401] RAX: 0000000000000075 RBX: ffff8880d497d878 RCX: 0000000000000000 [ 753.296260] RDX: 0000000000000075 RSI: 0000000000000008 RDI: ffffed10195d1e69 [ 753.297070] RBP: ffff8880cd250ae0 R08: ffffed101b4bff21 R09: ffffed101b4bff21 [ 753.297899] R10: 0000000000000001 R11: ffffed101b4bff20 R12: ffff8880d497d878 [ 753.298703] R13: 0000000000000000 R14: ffff8880cd250ae0 R15: ffff8880d48f2458 [ 753.299564] FS: 00007f5f79805740(0000) GS:ffff8880da400000(0000) knlGS:0000000000000000 [ 753.300533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 753.301231] CR2: 00007fe8c7ef4f10 CR3: 00000000b71a6006 CR4: 00000000000606f0 [ 753.302183] Call Trace: [ 753.302530] gtp_newlink+0x5f6/0xa5c [gtp] [ 753.303037] ? __netlink_ns_capable+0xc3/0xf0 [ 753.303576] __rtnl_newlink+0xb9f/0x11b0 [ 753.304092] ? rtnl_link_unregister+0x230/0x230 Fixes: 459aa66 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
LeBlue
pushed a commit
to LeBlue/linux-fslc
that referenced
this pull request
Feb 25, 2020
[ Upstream commit 3f167e1 ] ipv4_pdp_add() is called in RCU read-side critical section. So GFP_KERNEL should not be used in the function. This patch make ipv4_pdp_add() to use GFP_ATOMIC instead of GFP_KERNEL. Test commands: gtp-link add gtp1 & gtp-tunnel add gtp1 v1 100 200 1.1.1.1 2.2.2.2 Splat looks like: [ 130.618881] ============================= [ 130.626382] WARNING: suspicious RCU usage [ 130.626994] 5.2.0-rc6+ Freescale#50 Not tainted [ 130.627622] ----------------------------- [ 130.628223] ./include/linux/rcupdate.h:266 Illegal context switch in RCU read-side critical section! [ 130.629684] [ 130.629684] other info that might help us debug this: [ 130.629684] [ 130.631022] [ 130.631022] rcu_scheduler_active = 2, debug_locks = 1 [ 130.632136] 4 locks held by gtp-tunnel/1025: [ 130.632925] #0: 000000002b93c8b7 (cb_lock){++++}, at: genl_rcv+0x15/0x40 [ 130.634159] Freescale#1: 00000000f17bc999 (genl_mutex){+.+.}, at: genl_rcv_msg+0xfb/0x130 [ 130.635487] Freescale#2: 00000000c644ed8e (rtnl_mutex){+.+.}, at: gtp_genl_new_pdp+0x18c/0x1150 [gtp] [ 130.636936] Freescale#3: 0000000007a1cde7 (rcu_read_lock){....}, at: gtp_genl_new_pdp+0x187/0x1150 [gtp] [ 130.638348] [ 130.638348] stack backtrace: [ 130.639062] CPU: 1 PID: 1025 Comm: gtp-tunnel Not tainted 5.2.0-rc6+ Freescale#50 [ 130.641318] Call Trace: [ 130.641707] dump_stack+0x7c/0xbb [ 130.642252] ___might_sleep+0x2c0/0x3b0 [ 130.642862] kmem_cache_alloc_trace+0x1cd/0x2b0 [ 130.643591] gtp_genl_new_pdp+0x6c5/0x1150 [gtp] [ 130.644371] genl_family_rcv_msg+0x63a/0x1030 [ 130.645074] ? mutex_lock_io_nested+0x1090/0x1090 [ 130.645845] ? genl_unregister_family+0x630/0x630 [ 130.646592] ? debug_show_all_locks+0x2d0/0x2d0 [ 130.647293] ? check_flags.part.40+0x440/0x440 [ 130.648099] genl_rcv_msg+0xa3/0x130 [ ... ] Fixes: 459aa66 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
LeBlue
pushed a commit
to LeBlue/linux-fslc
that referenced
this pull request
Feb 25, 2020
[ Upstream commit a2bed90 ] Current gtp_newlink() could be called after unregister_pernet_subsys(). gtp_newlink() uses gtp_net but it can be destroyed by unregister_pernet_subsys(). So unregister_pernet_subsys() should be called after rtnl_link_unregister(). Test commands: #SHELL 1 while : do for i in {1..5} do ./gtp-link add gtp$i & done killall gtp-link done #SHELL 2 while : do modprobe -rv gtp done Splat looks like: [ 753.176631] BUG: KASAN: use-after-free in gtp_newlink+0x9b4/0xa5c [gtp] [ 753.177722] Read of size 8 at addr ffff8880d48f2458 by task gtp-link/7126 [ 753.179082] CPU: 0 PID: 7126 Comm: gtp-link Tainted: G W 5.2.0-rc6+ Freescale#50 [ 753.185801] Call Trace: [ 753.186264] dump_stack+0x7c/0xbb [ 753.186863] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.187583] print_address_description+0xc7/0x240 [ 753.188382] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.189097] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.189846] __kasan_report+0x12a/0x16f [ 753.190542] ? gtp_newlink+0x9b4/0xa5c [gtp] [ 753.191298] kasan_report+0xe/0x20 [ 753.191893] gtp_newlink+0x9b4/0xa5c [gtp] [ 753.192580] ? __netlink_ns_capable+0xc3/0xf0 [ 753.193370] __rtnl_newlink+0xb9f/0x11b0 [ ... ] [ 753.241201] Allocated by task 7186: [ 753.241844] save_stack+0x19/0x80 [ 753.242399] __kasan_kmalloc.constprop.3+0xa0/0xd0 [ 753.243192] __kmalloc+0x13e/0x300 [ 753.243764] ops_init+0xd6/0x350 [ 753.244314] register_pernet_operations+0x249/0x6f0 [ ... ] [ 753.251770] Freed by task 7178: [ 753.252288] save_stack+0x19/0x80 [ 753.252833] __kasan_slab_free+0x111/0x150 [ 753.253962] kfree+0xc7/0x280 [ 753.254509] ops_free_list.part.11+0x1c4/0x2d0 [ 753.255241] unregister_pernet_operations+0x262/0x390 [ ... ] [ 753.285883] list_add corruption. next->prev should be prev (ffff8880d48f2458), but was ffff8880d497d878. (next. [ 753.287241] ------------[ cut here ]------------ [ 753.287794] kernel BUG at lib/list_debug.c:25! [ 753.288364] invalid opcode: 0000 [Freescale#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 753.289099] CPU: 0 PID: 7126 Comm: gtp-link Tainted: G B W 5.2.0-rc6+ Freescale#50 [ 753.291036] RIP: 0010:__list_add_valid+0x74/0xd0 [ 753.291589] Code: 48 39 da 75 27 48 39 f5 74 36 48 39 dd 74 31 48 83 c4 08 b8 01 00 00 00 5b 5d c3 48 89 d9 48b [ 753.293779] RSP: 0018:ffff8880cae8f398 EFLAGS: 00010286 [ 753.294401] RAX: 0000000000000075 RBX: ffff8880d497d878 RCX: 0000000000000000 [ 753.296260] RDX: 0000000000000075 RSI: 0000000000000008 RDI: ffffed10195d1e69 [ 753.297070] RBP: ffff8880cd250ae0 R08: ffffed101b4bff21 R09: ffffed101b4bff21 [ 753.297899] R10: 0000000000000001 R11: ffffed101b4bff20 R12: ffff8880d497d878 [ 753.298703] R13: 0000000000000000 R14: ffff8880cd250ae0 R15: ffff8880d48f2458 [ 753.299564] FS: 00007f5f79805740(0000) GS:ffff8880da400000(0000) knlGS:0000000000000000 [ 753.300533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 753.301231] CR2: 00007fe8c7ef4f10 CR3: 00000000b71a6006 CR4: 00000000000606f0 [ 753.302183] Call Trace: [ 753.302530] gtp_newlink+0x5f6/0xa5c [gtp] [ 753.303037] ? __netlink_ns_capable+0xc3/0xf0 [ 753.303576] __rtnl_newlink+0xb9f/0x11b0 [ 753.304092] ? rtnl_link_unregister+0x230/0x230 Fixes: 459aa66 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
Jan 19, 2021
commit d68b295 upstream. Commit 39d42fa ("dm crypt: add flags to optionally bypass kcryptd workqueues") made it possible for some code paths in dm-crypt to be executed in softirq context, when the underlying driver processes IO requests in interrupt/softirq context. In this case sometimes when allocating a new crypto request we may get a stacktrace like below: [ 210.103008][ C0] BUG: sleeping function called from invalid context at mm/mempool.c:381 [ 210.104746][ C0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2602, name: fio [ 210.106599][ C0] CPU: 0 PID: 2602 Comm: fio Tainted: G W 5.10.0+ Freescale#50 [ 210.108331][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 210.110212][ C0] Call Trace: [ 210.110921][ C0] <IRQ> [ 210.111527][ C0] dump_stack+0x7d/0xa3 [ 210.112411][ C0] ___might_sleep.cold+0x122/0x151 [ 210.113527][ C0] mempool_alloc+0x16b/0x2f0 [ 210.114524][ C0] ? __queue_work+0x515/0xde0 [ 210.115553][ C0] ? mempool_resize+0x700/0x700 [ 210.116586][ C0] ? crypt_endio+0x91/0x180 [ 210.117479][ C0] ? blk_update_request+0x757/0x1150 [ 210.118513][ C0] ? blk_mq_end_request+0x4b/0x480 [ 210.119572][ C0] ? blk_done_softirq+0x21d/0x340 [ 210.120628][ C0] ? __do_softirq+0x190/0x611 [ 210.121626][ C0] crypt_convert+0x29f9/0x4c00 [ 210.122668][ C0] ? _raw_spin_lock_irqsave+0x87/0xe0 [ 210.123824][ C0] ? kasan_set_track+0x1c/0x30 [ 210.124858][ C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [ 210.125930][ C0] ? kmem_cache_free+0x104/0x470 [ 210.126973][ C0] ? crypt_endio+0x91/0x180 [ 210.127947][ C0] kcryptd_crypt_read_convert+0x30e/0x420 [ 210.129165][ C0] blk_update_request+0x757/0x1150 [ 210.130231][ C0] blk_mq_end_request+0x4b/0x480 [ 210.131294][ C0] blk_done_softirq+0x21d/0x340 [ 210.132332][ C0] ? _raw_spin_lock+0x81/0xd0 [ 210.133289][ C0] ? blk_mq_stop_hw_queue+0x30/0x30 [ 210.134399][ C0] ? _raw_read_lock_irq+0x40/0x40 [ 210.135458][ C0] __do_softirq+0x190/0x611 [ 210.136409][ C0] ? handle_edge_irq+0x221/0xb60 [ 210.137447][ C0] asm_call_irq_on_stack+0x12/0x20 [ 210.138507][ C0] </IRQ> [ 210.139118][ C0] do_softirq_own_stack+0x37/0x40 [ 210.140191][ C0] irq_exit_rcu+0x110/0x1b0 [ 210.141151][ C0] common_interrupt+0x74/0x120 [ 210.142171][ C0] asm_common_interrupt+0x1e/0x40 Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context. Fixes: 39d42fa ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: stable@vger.kernel.org # v5.9+ Reported-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
Jan 19, 2021
…tirq commit 8abec36 upstream. Commit 39d42fa ("dm crypt: add flags to optionally bypass kcryptd workqueues") made it possible for some code paths in dm-crypt to be executed in softirq context, when the underlying driver processes IO requests in interrupt/softirq context. When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to avoid sending further requests to an already overloaded crypto driver. However, if the code is executing in softirq context, we might get the following stacktrace: [ 210.235213][ C0] BUG: scheduling while atomic: fio/2602/0x00000102 [ 210.236701][ C0] Modules linked in: [ 210.237566][ C0] CPU: 0 PID: 2602 Comm: fio Tainted: G W 5.10.0+ Freescale#50 [ 210.239292][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 210.241233][ C0] Call Trace: [ 210.241946][ C0] <IRQ> [ 210.242561][ C0] dump_stack+0x7d/0xa3 [ 210.243466][ C0] __schedule_bug.cold+0xb3/0xc2 [ 210.244539][ C0] __schedule+0x156f/0x20d0 [ 210.245518][ C0] ? io_schedule_timeout+0x140/0x140 [ 210.246660][ C0] schedule+0xd0/0x270 [ 210.247541][ C0] schedule_timeout+0x1fb/0x280 [ 210.248586][ C0] ? usleep_range+0x150/0x150 [ 210.249624][ C0] ? unpoison_range+0x3a/0x60 [ 210.250632][ C0] ? ____kasan_kmalloc.constprop.0+0x82/0xa0 [ 210.251949][ C0] ? unpoison_range+0x3a/0x60 [ 210.252958][ C0] ? __prepare_to_swait+0xa7/0x190 [ 210.254067][ C0] do_wait_for_common+0x2ab/0x370 [ 210.255158][ C0] ? usleep_range+0x150/0x150 [ 210.256192][ C0] ? bit_wait_io_timeout+0x160/0x160 [ 210.257358][ C0] ? blk_update_request+0x757/0x1150 [ 210.258582][ C0] ? _raw_spin_lock_irq+0x82/0xd0 [ 210.259674][ C0] ? _raw_read_unlock_irqrestore+0x30/0x30 [ 210.260917][ C0] wait_for_completion+0x4c/0x90 [ 210.261971][ C0] crypt_convert+0x19a6/0x4c00 [ 210.263033][ C0] ? _raw_spin_lock_irqsave+0x87/0xe0 [ 210.264193][ C0] ? kasan_set_track+0x1c/0x30 [ 210.265191][ C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [ 210.266283][ C0] ? kmem_cache_free+0x104/0x470 [ 210.267363][ C0] ? crypt_endio+0x91/0x180 [ 210.268327][ C0] kcryptd_crypt_read_convert+0x30e/0x420 [ 210.269565][ C0] blk_update_request+0x757/0x1150 [ 210.270563][ C0] blk_mq_end_request+0x4b/0x480 [ 210.271680][ C0] blk_done_softirq+0x21d/0x340 [ 210.272775][ C0] ? _raw_spin_lock+0x81/0xd0 [ 210.273847][ C0] ? blk_mq_stop_hw_queue+0x30/0x30 [ 210.275031][ C0] ? _raw_read_lock_irq+0x40/0x40 [ 210.276182][ C0] __do_softirq+0x190/0x611 [ 210.277203][ C0] ? handle_edge_irq+0x221/0xb60 [ 210.278340][ C0] asm_call_irq_on_stack+0x12/0x20 [ 210.279514][ C0] </IRQ> [ 210.280164][ C0] do_softirq_own_stack+0x37/0x40 [ 210.281281][ C0] irq_exit_rcu+0x110/0x1b0 [ 210.282286][ C0] common_interrupt+0x74/0x120 [ 210.283376][ C0] asm_common_interrupt+0x1e/0x40 [ 210.284496][ C0] RIP: 0010:_aesni_enc1+0x65/0xb0 Fix this by making crypt_convert function reentrant from the point of a single bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API backlogs a request in interrupt context. Fixes: 39d42fa ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
May 19, 2021
[ Upstream commit 5bbf219 ] An out of bounds write happens when setting the default power state. KASAN sees this as: [drm] radeon: 512M of GTT memory ready. [drm] GART: num cpu pages 131072, num gpu pages 131072 ================================================================== BUG: KASAN: slab-out-of-bounds in radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] Write of size 4 at addr ffff88810178d858 by task systemd-udevd/157 CPU: 0 PID: 157 Comm: systemd-udevd Not tainted 5.12.0-E620 Freescale#50 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 Call Trace: dump_stack+0xa5/0xe6 print_address_description.constprop.0+0x18/0x239 kasan_report+0x170/0x1a8 radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] radeon_atombios_get_power_modes+0x144/0x1888 [radeon] radeon_pm_init+0x1019/0x1904 [radeon] rs690_init+0x76e/0x84a [radeon] radeon_device_init+0x1c1a/0x21e5 [radeon] radeon_driver_load_kms+0xf5/0x30b [radeon] drm_dev_register+0x255/0x4a0 [drm] radeon_pci_probe+0x246/0x2f6 [radeon] pci_device_probe+0x1aa/0x294 really_probe+0x30e/0x850 driver_probe_device+0xe6/0x135 device_driver_attach+0xc1/0xf8 __driver_attach+0x13f/0x146 bus_for_each_dev+0xfa/0x146 bus_add_driver+0x2b3/0x447 driver_register+0x242/0x2c1 do_one_initcall+0x149/0x2fd do_init_module+0x1ae/0x573 load_module+0x4dee/0x5cca __do_sys_finit_module+0xf1/0x140 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Without KASAN, this will manifest later when the kernel attempts to allocate memory that was stomped, since it collides with the inline slab freelist pointer: invalid opcode: 0000 [Freescale#1] SMP NOPTI CPU: 0 PID: 781 Comm: openrc-run.sh Tainted: G W 5.10.12-gentoo-E620 Freescale#2 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 RIP: 0010:kfree+0x115/0x230 Code: 89 c5 e8 75 ea ff ff 48 8b 00 0f ba e0 09 72 63 e8 1f f4 ff ff 41 89 c4 48 8b 45 00 0f ba e0 10 72 0a 48 8b 45 08 a8 01 75 02 <0f> 0b 44 89 e1 48 c7 c2 00 f0 ff ff be 06 00 00 00 48 d3 e2 48 c7 RSP: 0018:ffffb42f40267e10 EFLAGS: 00010246 RAX: ffffd61280ee8d88 RBX: 0000000000000004 RCX: 000000008010000d RDX: 4000000000000000 RSI: ffffffffba1360b0 RDI: ffffd61280ee8d80 RBP: ffffd61280ee8d80 R08: ffffffffb91bebdf R09: 0000000000000000 R10: ffff8fe2c1047ac8 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000100 FS: 00007fe80eff6b68(0000) GS:ffff8fe339c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe80eec7bc0 CR3: 0000000038012000 CR4: 00000000000006f0 Call Trace: __free_fdtable+0x16/0x1f put_files_struct+0x81/0x9b do_exit+0x433/0x94d do_group_exit+0xa6/0xa6 __x64_sys_exit_group+0xf/0xf do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fe80ef64bea Code: Unable to access opcode bytes at RIP 0x7fe80ef64bc0. RSP: 002b:00007ffdb1c47528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe80ef64bea RDX: 00007fe80ef64f60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 00007fe80ee2c620 R11: 0000000000000246 R12: 00007fe80eff41e0 R13: 00000000ffffffff R14: 0000000000000024 R15: 00007fe80edf9cd0 Modules linked in: radeon(+) ath5k(+) snd_hda_codec_realtek ... Use a valid power_state index when initializing the "flags" and "misc" and "misc2" fields. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211537 Reported-by: Erhard F. <erhard_f@mailbox.org> Fixes: a48b9b4 ("drm/radeon/kms/pm: add asic specific callbacks for getting power state (v2)") Fixes: 79daedc ("drm/radeon/kms: minor pm cleanups") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
May 19, 2021
[ Upstream commit 5bbf219 ] An out of bounds write happens when setting the default power state. KASAN sees this as: [drm] radeon: 512M of GTT memory ready. [drm] GART: num cpu pages 131072, num gpu pages 131072 ================================================================== BUG: KASAN: slab-out-of-bounds in radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] Write of size 4 at addr ffff88810178d858 by task systemd-udevd/157 CPU: 0 PID: 157 Comm: systemd-udevd Not tainted 5.12.0-E620 Freescale#50 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 Call Trace: dump_stack+0xa5/0xe6 print_address_description.constprop.0+0x18/0x239 kasan_report+0x170/0x1a8 radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] radeon_atombios_get_power_modes+0x144/0x1888 [radeon] radeon_pm_init+0x1019/0x1904 [radeon] rs690_init+0x76e/0x84a [radeon] radeon_device_init+0x1c1a/0x21e5 [radeon] radeon_driver_load_kms+0xf5/0x30b [radeon] drm_dev_register+0x255/0x4a0 [drm] radeon_pci_probe+0x246/0x2f6 [radeon] pci_device_probe+0x1aa/0x294 really_probe+0x30e/0x850 driver_probe_device+0xe6/0x135 device_driver_attach+0xc1/0xf8 __driver_attach+0x13f/0x146 bus_for_each_dev+0xfa/0x146 bus_add_driver+0x2b3/0x447 driver_register+0x242/0x2c1 do_one_initcall+0x149/0x2fd do_init_module+0x1ae/0x573 load_module+0x4dee/0x5cca __do_sys_finit_module+0xf1/0x140 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Without KASAN, this will manifest later when the kernel attempts to allocate memory that was stomped, since it collides with the inline slab freelist pointer: invalid opcode: 0000 [Freescale#1] SMP NOPTI CPU: 0 PID: 781 Comm: openrc-run.sh Tainted: G W 5.10.12-gentoo-E620 Freescale#2 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 RIP: 0010:kfree+0x115/0x230 Code: 89 c5 e8 75 ea ff ff 48 8b 00 0f ba e0 09 72 63 e8 1f f4 ff ff 41 89 c4 48 8b 45 00 0f ba e0 10 72 0a 48 8b 45 08 a8 01 75 02 <0f> 0b 44 89 e1 48 c7 c2 00 f0 ff ff be 06 00 00 00 48 d3 e2 48 c7 RSP: 0018:ffffb42f40267e10 EFLAGS: 00010246 RAX: ffffd61280ee8d88 RBX: 0000000000000004 RCX: 000000008010000d RDX: 4000000000000000 RSI: ffffffffba1360b0 RDI: ffffd61280ee8d80 RBP: ffffd61280ee8d80 R08: ffffffffb91bebdf R09: 0000000000000000 R10: ffff8fe2c1047ac8 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000100 FS: 00007fe80eff6b68(0000) GS:ffff8fe339c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe80eec7bc0 CR3: 0000000038012000 CR4: 00000000000006f0 Call Trace: __free_fdtable+0x16/0x1f put_files_struct+0x81/0x9b do_exit+0x433/0x94d do_group_exit+0xa6/0xa6 __x64_sys_exit_group+0xf/0xf do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fe80ef64bea Code: Unable to access opcode bytes at RIP 0x7fe80ef64bc0. RSP: 002b:00007ffdb1c47528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe80ef64bea RDX: 00007fe80ef64f60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 00007fe80ee2c620 R11: 0000000000000246 R12: 00007fe80eff41e0 R13: 00000000ffffffff R14: 0000000000000024 R15: 00007fe80edf9cd0 Modules linked in: radeon(+) ath5k(+) snd_hda_codec_realtek ... Use a valid power_state index when initializing the "flags" and "misc" and "misc2" fields. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211537 Reported-by: Erhard F. <erhard_f@mailbox.org> Fixes: a48b9b4 ("drm/radeon/kms/pm: add asic specific callbacks for getting power state (v2)") Fixes: 79daedc ("drm/radeon/kms: minor pm cleanups") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
May 19, 2021
[ Upstream commit 5bbf219 ] An out of bounds write happens when setting the default power state. KASAN sees this as: [drm] radeon: 512M of GTT memory ready. [drm] GART: num cpu pages 131072, num gpu pages 131072 ================================================================== BUG: KASAN: slab-out-of-bounds in radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] Write of size 4 at addr ffff88810178d858 by task systemd-udevd/157 CPU: 0 PID: 157 Comm: systemd-udevd Not tainted 5.12.0-E620 Freescale#50 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 Call Trace: dump_stack+0xa5/0xe6 print_address_description.constprop.0+0x18/0x239 kasan_report+0x170/0x1a8 radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] radeon_atombios_get_power_modes+0x144/0x1888 [radeon] radeon_pm_init+0x1019/0x1904 [radeon] rs690_init+0x76e/0x84a [radeon] radeon_device_init+0x1c1a/0x21e5 [radeon] radeon_driver_load_kms+0xf5/0x30b [radeon] drm_dev_register+0x255/0x4a0 [drm] radeon_pci_probe+0x246/0x2f6 [radeon] pci_device_probe+0x1aa/0x294 really_probe+0x30e/0x850 driver_probe_device+0xe6/0x135 device_driver_attach+0xc1/0xf8 __driver_attach+0x13f/0x146 bus_for_each_dev+0xfa/0x146 bus_add_driver+0x2b3/0x447 driver_register+0x242/0x2c1 do_one_initcall+0x149/0x2fd do_init_module+0x1ae/0x573 load_module+0x4dee/0x5cca __do_sys_finit_module+0xf1/0x140 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Without KASAN, this will manifest later when the kernel attempts to allocate memory that was stomped, since it collides with the inline slab freelist pointer: invalid opcode: 0000 [Freescale#1] SMP NOPTI CPU: 0 PID: 781 Comm: openrc-run.sh Tainted: G W 5.10.12-gentoo-E620 Freescale#2 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 RIP: 0010:kfree+0x115/0x230 Code: 89 c5 e8 75 ea ff ff 48 8b 00 0f ba e0 09 72 63 e8 1f f4 ff ff 41 89 c4 48 8b 45 00 0f ba e0 10 72 0a 48 8b 45 08 a8 01 75 02 <0f> 0b 44 89 e1 48 c7 c2 00 f0 ff ff be 06 00 00 00 48 d3 e2 48 c7 RSP: 0018:ffffb42f40267e10 EFLAGS: 00010246 RAX: ffffd61280ee8d88 RBX: 0000000000000004 RCX: 000000008010000d RDX: 4000000000000000 RSI: ffffffffba1360b0 RDI: ffffd61280ee8d80 RBP: ffffd61280ee8d80 R08: ffffffffb91bebdf R09: 0000000000000000 R10: ffff8fe2c1047ac8 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000100 FS: 00007fe80eff6b68(0000) GS:ffff8fe339c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe80eec7bc0 CR3: 0000000038012000 CR4: 00000000000006f0 Call Trace: __free_fdtable+0x16/0x1f put_files_struct+0x81/0x9b do_exit+0x433/0x94d do_group_exit+0xa6/0xa6 __x64_sys_exit_group+0xf/0xf do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fe80ef64bea Code: Unable to access opcode bytes at RIP 0x7fe80ef64bc0. RSP: 002b:00007ffdb1c47528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe80ef64bea RDX: 00007fe80ef64f60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 00007fe80ee2c620 R11: 0000000000000246 R12: 00007fe80eff41e0 R13: 00000000ffffffff R14: 0000000000000024 R15: 00007fe80edf9cd0 Modules linked in: radeon(+) ath5k(+) snd_hda_codec_realtek ... Use a valid power_state index when initializing the "flags" and "misc" and "misc2" fields. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211537 Reported-by: Erhard F. <erhard_f@mailbox.org> Fixes: a48b9b4 ("drm/radeon/kms/pm: add asic specific callbacks for getting power state (v2)") Fixes: 79daedc ("drm/radeon/kms: minor pm cleanups") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
May 22, 2021
[ Upstream commit d5027ca ] Ritesh reported a bug [1] against UML, noting that it crashed on startup. The backtrace shows the following (heavily redacted): (gdb) bt ... Freescale#26 0x0000000060015b5d in sem_init () at ipc/sem.c:268 Freescale#27 0x00007f89906d92f7 in ?? () from /lib/x86_64-linux-gnu/libcom_err.so.2 Freescale#28 0x00007f8990ab8fb2 in call_init (...) at dl-init.c:72 ... Freescale#40 0x00007f89909bf3a6 in nss_load_library (...) at nsswitch.c:359 ... Freescale#44 0x00007f8990895e35 in _nss_compat_getgrnam_r (...) at nss_compat/compat-grp.c:486 Freescale#45 0x00007f8990968b85 in __getgrnam_r [...] Freescale#46 0x00007f89909d6b77 in grantpt [...] Freescale#47 0x00007f8990a9394e in __GI_openpty [...] Freescale#48 0x00000000604a1f65 in openpty_cb (...) at arch/um/os-Linux/sigio.c:407 Freescale#49 0x00000000604a58d0 in start_idle_thread (...) at arch/um/os-Linux/skas/process.c:598 Freescale#50 0x0000000060004a3d in start_uml () at arch/um/kernel/skas/process.c:45 Freescale#51 0x00000000600047b2 in linux_main (...) at arch/um/kernel/um_arch.c:334 Freescale#52 0x000000006000574f in main (...) at arch/um/os-Linux/main.c:144 indicating that the UML function openpty_cb() calls openpty(), which internally calls __getgrnam_r(), which causes the nsswitch machinery to get started. This loads, through lots of indirection that I snipped, the libcom_err.so.2 library, which (in an unknown function, "??") calls sem_init(). Now, of course it wants to get libpthread's sem_init(), since it's linked against libpthread. However, the dynamic linker looks up that symbol against the binary first, and gets the kernel's sem_init(). Hajime Tazaki noted that "objcopy -L" can localize a symbol, so the dynamic linker wouldn't do the lookup this way. I tried, but for some reason that didn't seem to work. Doing the same thing in the linker script instead does seem to work, though I cannot entirely explain - it *also* works if I just add "VERSION { { global: *; }; }" instead, indicating that something else is happening that I don't really understand. It may be that explicitly doing that marks them with some kind of empty version, and that's different from the default. Explicitly marking them with a version breaks kallsyms, so that doesn't seem to be possible. Marking all the symbols as local seems correct, and does seem to address the issue, so do that. Also do it for static link, nsswitch libraries could still be loaded there. [1] https://bugs.debian.org/983379 Reported-by: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Tested-By: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
May 22, 2021
[ Upstream commit d5027ca ] Ritesh reported a bug [1] against UML, noting that it crashed on startup. The backtrace shows the following (heavily redacted): (gdb) bt ... Freescale#26 0x0000000060015b5d in sem_init () at ipc/sem.c:268 Freescale#27 0x00007f89906d92f7 in ?? () from /lib/x86_64-linux-gnu/libcom_err.so.2 Freescale#28 0x00007f8990ab8fb2 in call_init (...) at dl-init.c:72 ... Freescale#40 0x00007f89909bf3a6 in nss_load_library (...) at nsswitch.c:359 ... Freescale#44 0x00007f8990895e35 in _nss_compat_getgrnam_r (...) at nss_compat/compat-grp.c:486 Freescale#45 0x00007f8990968b85 in __getgrnam_r [...] Freescale#46 0x00007f89909d6b77 in grantpt [...] Freescale#47 0x00007f8990a9394e in __GI_openpty [...] Freescale#48 0x00000000604a1f65 in openpty_cb (...) at arch/um/os-Linux/sigio.c:407 Freescale#49 0x00000000604a58d0 in start_idle_thread (...) at arch/um/os-Linux/skas/process.c:598 Freescale#50 0x0000000060004a3d in start_uml () at arch/um/kernel/skas/process.c:45 Freescale#51 0x00000000600047b2 in linux_main (...) at arch/um/kernel/um_arch.c:334 Freescale#52 0x000000006000574f in main (...) at arch/um/os-Linux/main.c:144 indicating that the UML function openpty_cb() calls openpty(), which internally calls __getgrnam_r(), which causes the nsswitch machinery to get started. This loads, through lots of indirection that I snipped, the libcom_err.so.2 library, which (in an unknown function, "??") calls sem_init(). Now, of course it wants to get libpthread's sem_init(), since it's linked against libpthread. However, the dynamic linker looks up that symbol against the binary first, and gets the kernel's sem_init(). Hajime Tazaki noted that "objcopy -L" can localize a symbol, so the dynamic linker wouldn't do the lookup this way. I tried, but for some reason that didn't seem to work. Doing the same thing in the linker script instead does seem to work, though I cannot entirely explain - it *also* works if I just add "VERSION { { global: *; }; }" instead, indicating that something else is happening that I don't really understand. It may be that explicitly doing that marks them with some kind of empty version, and that's different from the default. Explicitly marking them with a version breaks kallsyms, so that doesn't seem to be possible. Marking all the symbols as local seems correct, and does seem to address the issue, so do that. Also do it for static link, nsswitch libraries could still be loaded there. [1] https://bugs.debian.org/983379 Reported-by: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Tested-By: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
May 22, 2021
[ Upstream commit d5027ca ] Ritesh reported a bug [1] against UML, noting that it crashed on startup. The backtrace shows the following (heavily redacted): (gdb) bt ... Freescale#26 0x0000000060015b5d in sem_init () at ipc/sem.c:268 Freescale#27 0x00007f89906d92f7 in ?? () from /lib/x86_64-linux-gnu/libcom_err.so.2 Freescale#28 0x00007f8990ab8fb2 in call_init (...) at dl-init.c:72 ... Freescale#40 0x00007f89909bf3a6 in nss_load_library (...) at nsswitch.c:359 ... Freescale#44 0x00007f8990895e35 in _nss_compat_getgrnam_r (...) at nss_compat/compat-grp.c:486 Freescale#45 0x00007f8990968b85 in __getgrnam_r [...] Freescale#46 0x00007f89909d6b77 in grantpt [...] Freescale#47 0x00007f8990a9394e in __GI_openpty [...] Freescale#48 0x00000000604a1f65 in openpty_cb (...) at arch/um/os-Linux/sigio.c:407 Freescale#49 0x00000000604a58d0 in start_idle_thread (...) at arch/um/os-Linux/skas/process.c:598 Freescale#50 0x0000000060004a3d in start_uml () at arch/um/kernel/skas/process.c:45 Freescale#51 0x00000000600047b2 in linux_main (...) at arch/um/kernel/um_arch.c:334 Freescale#52 0x000000006000574f in main (...) at arch/um/os-Linux/main.c:144 indicating that the UML function openpty_cb() calls openpty(), which internally calls __getgrnam_r(), which causes the nsswitch machinery to get started. This loads, through lots of indirection that I snipped, the libcom_err.so.2 library, which (in an unknown function, "??") calls sem_init(). Now, of course it wants to get libpthread's sem_init(), since it's linked against libpthread. However, the dynamic linker looks up that symbol against the binary first, and gets the kernel's sem_init(). Hajime Tazaki noted that "objcopy -L" can localize a symbol, so the dynamic linker wouldn't do the lookup this way. I tried, but for some reason that didn't seem to work. Doing the same thing in the linker script instead does seem to work, though I cannot entirely explain - it *also* works if I just add "VERSION { { global: *; }; }" instead, indicating that something else is happening that I don't really understand. It may be that explicitly doing that marks them with some kind of empty version, and that's different from the default. Explicitly marking them with a version breaks kallsyms, so that doesn't seem to be possible. Marking all the symbols as local seems correct, and does seem to address the issue, so do that. Also do it for static link, nsswitch libraries could still be loaded there. [1] https://bugs.debian.org/983379 Reported-by: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Tested-By: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
LeBlue
pushed a commit
to LeBlue/linux-fslc
that referenced
this pull request
Jan 20, 2022
[ Upstream commit 5bbf219 ] An out of bounds write happens when setting the default power state. KASAN sees this as: [drm] radeon: 512M of GTT memory ready. [drm] GART: num cpu pages 131072, num gpu pages 131072 ================================================================== BUG: KASAN: slab-out-of-bounds in radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] Write of size 4 at addr ffff88810178d858 by task systemd-udevd/157 CPU: 0 PID: 157 Comm: systemd-udevd Not tainted 5.12.0-E620 Freescale#50 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 Call Trace: dump_stack+0xa5/0xe6 print_address_description.constprop.0+0x18/0x239 kasan_report+0x170/0x1a8 radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] radeon_atombios_get_power_modes+0x144/0x1888 [radeon] radeon_pm_init+0x1019/0x1904 [radeon] rs690_init+0x76e/0x84a [radeon] radeon_device_init+0x1c1a/0x21e5 [radeon] radeon_driver_load_kms+0xf5/0x30b [radeon] drm_dev_register+0x255/0x4a0 [drm] radeon_pci_probe+0x246/0x2f6 [radeon] pci_device_probe+0x1aa/0x294 really_probe+0x30e/0x850 driver_probe_device+0xe6/0x135 device_driver_attach+0xc1/0xf8 __driver_attach+0x13f/0x146 bus_for_each_dev+0xfa/0x146 bus_add_driver+0x2b3/0x447 driver_register+0x242/0x2c1 do_one_initcall+0x149/0x2fd do_init_module+0x1ae/0x573 load_module+0x4dee/0x5cca __do_sys_finit_module+0xf1/0x140 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Without KASAN, this will manifest later when the kernel attempts to allocate memory that was stomped, since it collides with the inline slab freelist pointer: invalid opcode: 0000 [Freescale#1] SMP NOPTI CPU: 0 PID: 781 Comm: openrc-run.sh Tainted: G W 5.10.12-gentoo-E620 Freescale#2 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 RIP: 0010:kfree+0x115/0x230 Code: 89 c5 e8 75 ea ff ff 48 8b 00 0f ba e0 09 72 63 e8 1f f4 ff ff 41 89 c4 48 8b 45 00 0f ba e0 10 72 0a 48 8b 45 08 a8 01 75 02 <0f> 0b 44 89 e1 48 c7 c2 00 f0 ff ff be 06 00 00 00 48 d3 e2 48 c7 RSP: 0018:ffffb42f40267e10 EFLAGS: 00010246 RAX: ffffd61280ee8d88 RBX: 0000000000000004 RCX: 000000008010000d RDX: 4000000000000000 RSI: ffffffffba1360b0 RDI: ffffd61280ee8d80 RBP: ffffd61280ee8d80 R08: ffffffffb91bebdf R09: 0000000000000000 R10: ffff8fe2c1047ac8 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000100 FS: 00007fe80eff6b68(0000) GS:ffff8fe339c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe80eec7bc0 CR3: 0000000038012000 CR4: 00000000000006f0 Call Trace: __free_fdtable+0x16/0x1f put_files_struct+0x81/0x9b do_exit+0x433/0x94d do_group_exit+0xa6/0xa6 __x64_sys_exit_group+0xf/0xf do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fe80ef64bea Code: Unable to access opcode bytes at RIP 0x7fe80ef64bc0. RSP: 002b:00007ffdb1c47528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe80ef64bea RDX: 00007fe80ef64f60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 00007fe80ee2c620 R11: 0000000000000246 R12: 00007fe80eff41e0 R13: 00000000ffffffff R14: 0000000000000024 R15: 00007fe80edf9cd0 Modules linked in: radeon(+) ath5k(+) snd_hda_codec_realtek ... Use a valid power_state index when initializing the "flags" and "misc" and "misc2" fields. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211537 Reported-by: Erhard F. <erhard_f@mailbox.org> Fixes: a48b9b4 ("drm/radeon/kms/pm: add asic specific callbacks for getting power state (v2)") Fixes: 79daedc ("drm/radeon/kms: minor pm cleanups") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
LeBlue
pushed a commit
to LeBlue/linux-fslc
that referenced
this pull request
Jan 20, 2022
[ Upstream commit d5027ca ] Ritesh reported a bug [1] against UML, noting that it crashed on startup. The backtrace shows the following (heavily redacted): (gdb) bt ... Freescale#26 0x0000000060015b5d in sem_init () at ipc/sem.c:268 Freescale#27 0x00007f89906d92f7 in ?? () from /lib/x86_64-linux-gnu/libcom_err.so.2 Freescale#28 0x00007f8990ab8fb2 in call_init (...) at dl-init.c:72 ... Freescale#40 0x00007f89909bf3a6 in nss_load_library (...) at nsswitch.c:359 ... Freescale#44 0x00007f8990895e35 in _nss_compat_getgrnam_r (...) at nss_compat/compat-grp.c:486 Freescale#45 0x00007f8990968b85 in __getgrnam_r [...] Freescale#46 0x00007f89909d6b77 in grantpt [...] Freescale#47 0x00007f8990a9394e in __GI_openpty [...] Freescale#48 0x00000000604a1f65 in openpty_cb (...) at arch/um/os-Linux/sigio.c:407 Freescale#49 0x00000000604a58d0 in start_idle_thread (...) at arch/um/os-Linux/skas/process.c:598 Freescale#50 0x0000000060004a3d in start_uml () at arch/um/kernel/skas/process.c:45 Freescale#51 0x00000000600047b2 in linux_main (...) at arch/um/kernel/um_arch.c:334 Freescale#52 0x000000006000574f in main (...) at arch/um/os-Linux/main.c:144 indicating that the UML function openpty_cb() calls openpty(), which internally calls __getgrnam_r(), which causes the nsswitch machinery to get started. This loads, through lots of indirection that I snipped, the libcom_err.so.2 library, which (in an unknown function, "??") calls sem_init(). Now, of course it wants to get libpthread's sem_init(), since it's linked against libpthread. However, the dynamic linker looks up that symbol against the binary first, and gets the kernel's sem_init(). Hajime Tazaki noted that "objcopy -L" can localize a symbol, so the dynamic linker wouldn't do the lookup this way. I tried, but for some reason that didn't seem to work. Doing the same thing in the linker script instead does seem to work, though I cannot entirely explain - it *also* works if I just add "VERSION { { global: *; }; }" instead, indicating that something else is happening that I don't really understand. It may be that explicitly doing that marks them with some kind of empty version, and that's different from the default. Explicitly marking them with a version breaks kallsyms, so that doesn't seem to be possible. Marking all the symbols as local seems correct, and does seem to address the issue, so do that. Also do it for static link, nsswitch libraries could still be loaded there. [1] https://bugs.debian.org/983379 Reported-by: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Tested-By: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
Apr 14, 2022
commit 6c8e2a2 upstream. Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(*ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x0 # free | grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment Freescale#50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: 802a3a9 ("mm: reclaim MADV_FREE pages") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Hill <daniel.hill@canonical.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Dongdong Tao <dongdong.tao@canonical.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Gerald Yang <gerald.yang@canonical.com> Cc: Heitor Alves de Siqueira <halves@canonical.com> Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> Cc: <stable@vger.kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [mfo: backport: replace folio/test_flag with page/flag equivalents; real Fixes: 854e9ed ("mm: support madvise(MADV_FREE)") in v4.] Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
Apr 14, 2022
commit 6c8e2a2 upstream. Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(*ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x0 # free | grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment Freescale#50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: 802a3a9 ("mm: reclaim MADV_FREE pages") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Hill <daniel.hill@canonical.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Dongdong Tao <dongdong.tao@canonical.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Gerald Yang <gerald.yang@canonical.com> Cc: Heitor Alves de Siqueira <halves@canonical.com> Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> Cc: <stable@vger.kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [mfo: backport: replace folio/test_flag with page/flag equivalents; real Fixes: 854e9ed ("mm: support madvise(MADV_FREE)") in v4.] Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
zandrey
pushed a commit
to zandrey/linux-fslc
that referenced
this pull request
Apr 21, 2022
commit 6c8e2a2 upstream. Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(*ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x0 # free | grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment Freescale#50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: 802a3a9 ("mm: reclaim MADV_FREE pages") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Hill <daniel.hill@canonical.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Dongdong Tao <dongdong.tao@canonical.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Gerald Yang <gerald.yang@canonical.com> Cc: Heitor Alves de Siqueira <halves@canonical.com> Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> Cc: <stable@vger.kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [mfo: backport: replace folio/test_flag with page/flag equivalents; real Fixes: 854e9ed ("mm: support madvise(MADV_FREE)") in v4.] Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
otavio
pushed a commit
that referenced
this pull request
Oct 4, 2022
…sion-50.- [reg]: Roll back to revision #50.[Yandong] Signed-off-by: Ke Feng <ke.feng@verisilicon.com> Signed-off-by: yuan.tian <yuan.tian@nxp.com>
MrCry0
pushed a commit
to MrCry0/linux-fslc
that referenced
this pull request
Jul 14, 2023
[ Upstream commit ab1de7e ] The commit 4f7e723 ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks() was missed and not handled, which will cause th following warning: WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 Freescale#50 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: events cpuset_hotplug_workfn RIP: 0010:lockdep_assert_cpus_held+0x32/0x40 <...> Call Trace: <TASK> cpuset_attach+0x40/0x240 cgroup_migrate_execute+0x452/0x5e0 ? _raw_spin_unlock_irq+0x28/0x40 cgroup_transfer_tasks+0x1f3/0x360 ? find_held_lock+0x32/0x90 ? cpuset_hotplug_workfn+0xc81/0xed0 cpuset_hotplug_workfn+0xcb1/0xed0 ? process_one_work+0x248/0x5b0 process_one_work+0x2b9/0x5b0 worker_thread+0x56/0x3b0 ? process_one_work+0x5b0/0x5b0 kthread+0xf1/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> So just use the cgroup_attach_{lock,unlock}() helper to fix it. Reported-by: Zhao Gongyi <zhaogongyi@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Fixes: 05c7b7a ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug") Cc: stable@vger.kernel.org # v5.17+ Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This merges v4.14.126 plus adds a commit on top which prevents a compiler warning I see when compiling imx_v7_defconfig.
P.S. For 4.14.126 exists a matching RT patch.