dwc_otg: more robust interrupt handling #330
Conversation
The dwc_otg driver will unmask certain interrupts on a transaction that previously halted in the error state, in order to reset the QTD error count. The various fine-grained interrupt handlers do not consider that other interrupts besides their own were unmasked. By disabling the two other interrupts that are only ever enabled in DMA mode for this purpose, we can avoid unnecessary function calls in the IRQ handler. This also prevents an unnecessary FIQ interrupt from being generated if the FIQ is enabled.
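For illustration, a minimal C sketch of that masking idea follows. The register layout, bit names, and helper are hypothetical (they do not reproduce the driver's real structures); the point is simply that the error-recovery path unmasks only the handshake interrupts it needs to reset the error count, while the bits that would only cost an extra IRQ/FIQ entry stay masked.

```c
/*
 * Hedged sketch of the masking change described above, not the driver's
 * real code: names, the bit layout, and the chosen bits are illustrative.
 */
#include <stdint.h>

union hcintmsk {                 /* hypothetical per-bit view of HCINTMSK */
	uint32_t d32;
	struct {
		uint32_t chhltd   : 1;
		uint32_t ack      : 1;
		uint32_t nak      : 1;
		uint32_t nyet     : 1; /* example of a bit left masked */
		uint32_t frmovrun : 1; /* example of a bit left masked */
		uint32_t reserved : 27;
	} b;
};

/* Build the mask used when re-enabling a channel that halted in error. */
static uint32_t hcintmsk_for_error_recovery(void)
{
	union hcintmsk msk = { .d32 = 0 };

	msk.b.chhltd = 1;  /* channel-halted is always needed */
	msk.b.ack = 1;     /* a handshake from the device resets the error count */
	msk.b.nak = 1;
	/* nyet/frmovrun (stand-ins for the DMA-mode-only bits) stay masked */
	return msk.d32;
}
```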
In the case of a transaction to a device that had previously aborted due to an error, several interrupts are enabled to reset the error count when the device responds. This has the side effect of making the FIQ thrash, because the hardware will generate multiple instances of NAK on an IN bulk/interrupt endpoint and multiple instances of ACK on an OUT bulk/interrupt endpoint. Make the FIQ mask and clear the associated interrupts. Additionally, on non-split transactions, make sure that only unmasked interrupts are cleared. The old behaviour caused a hard-to-trigger but serious race condition when an endpoint was awaiting error recovery while a transaction completed on an endpoint - due to the sequencing and timing of interrupts generated by the dwc_otg core, it was possible to confuse the IRQ handler.
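The "clear only unmasked interrupts" part can be sketched as a single masking step. Again, this is an illustration with made-up accessors rather than the driver's actual code: the handler acknowledges only the status bits it was woken for, so a pending-but-masked bit belonging to a different event is left latched instead of being lost.

```c
/*
 * Hedged sketch: when servicing a non-split channel halt, acknowledge
 * only the interrupt bits that are both pending and currently unmasked.
 * The pointer-based register access stands in for the driver's real
 * accessors.
 */
#include <stdint.h>

static void clear_unmasked_hc_interrupts(volatile uint32_t *hcint,
					 volatile uint32_t *hcintmsk)
{
	uint32_t pending = *hcint;
	uint32_t enabled = *hcintmsk;

	/* HCINT is write-1-to-clear: clear only what this handler owns. */
	*hcint = pending & enabled;
}
```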
I can say that this patch definitely squashes the issues that I've seen, at least. I can artificially create endpoints that are very lossy (via the world's worst 10-port hub and several pendrives), which completely broke USB with plenty of the above error message; after the patch the devices continuously reset/probe/error/reset (as they should do for dodgy connections).
I'm testing now, and no problems so far. I'll keep you updated.
I just got this:
I'd booted from nfs and had xbmc running in the background, but idle. I also get these failures without this PR.
The pastebin chopped off the registers in that error, which are the important bit. Can you still get at the message log? How frequent are these errors now vs before?
No sign of "handle_hc_chhltd_intr_dma" in /var/log/*. I would say it used to occur every couple of hours (with my work setup). No bluetooth - this is what I had (and still have) connected:
The only alternative could be something similar happening with a split transaction. I did specifically test this before and after with a FS device known to produce 2 data toggle errors on the first transaction after a reset; the result was negative for this error both before and after, which led me to believe that split transactions weren't affected. Data toggle errors halt the channel in split transactions and they are now handled exactly like HS transaction errors, so I used this as an example repeatable test.
Three positives from forum users: http://www.raspberrypi.org/phpBB3/viewtopic.php?f=28&t=49223&start=25 I have now tested split transactions using the same tactic I used to generate multiple errors with USB 2.0 devices - I have not seen any FS/LS devices cause the error condition to appear.
I've not had the handle_hc_chhltd_intr_dma message repeat. I'm happy to accept this PR, and produce updated firmware for wider testing.
Yep, it's clearly an improvement. Gordon
dwc_otg: more robust interrupt handling
I caught it again:
Running on latest "next" kernel:
Hm, I suspect that what's happening in this case is that the delay between ack/xfercomp and chhltd is becoming large. I suspect the dwc DMA controller has a lot of work to do because of the NFS root, which delays the halt - provoking the race between the FIQ and the IRQ. Given that the bogus channel halt isn't really a disaster, I should probably add a retry mechanism similar to what I did for split data toggle errors...
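A rough sketch of what such a retry mechanism could look like, purely as an illustration of the idea floated above (all names, bit positions, and limits are hypothetical, and this is not the fix that was merged): if a halt arrives with no usable cause bits, retry the transfer a bounded number of times before escalating it as a real error.

```c
/*
 * Hypothetical sketch of the retry idea mentioned above; not actual
 * driver code. If the channel-halted interrupt fires with no cause bit
 * set (the suspected FIQ/IRQ race), retry a bounded number of times
 * before treating the halt as a genuine failure.
 */
#include <stdbool.h>
#include <stdint.h>

#define BOGUS_HALT_RETRY_LIMIT 3
#define HCINT_CHHLTD (1u << 1)   /* placeholder bit for "channel halted" */

struct channel_state {
	unsigned int bogus_halt_count;  /* per-channel, reset on any real event */
};

/* Returns true if the halt looks bogus and the transfer should be retried. */
static bool should_retry_bogus_halt(struct channel_state *ch, uint32_t hcint)
{
	bool no_cause = (hcint & ~HCINT_CHHLTD) == 0;

	if (no_cause && ch->bogus_halt_count < BOGUS_HALT_RETRY_LIMIT) {
		ch->bogus_halt_count++;
		return true;            /* re-enable the channel and wait again */
	}

	ch->bogus_halt_count = 0;       /* genuine halt: handle normally */
	return false;
}
```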
This PR fixes issues such as the error message popping up in the log since db4fad7.
I would prefer a round of testing by people who have seen this issue before merging for general consumption; I will advertise as necessary in the relevant places.