Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from torvalds:master #339

Merged
merged 30 commits into from
Jan 26, 2021
Merged

[pull] master from torvalds:master #339

merged 30 commits into from
Jan 26, 2021

Conversation

pull[bot]
Copy link

@pull pull bot commented Jan 26, 2021

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

David Collins and others added 30 commits January 8, 2021 18:22
The final step in regulator_register() is to call
regulator_resolve_supply() for each registered regulator
(including the one in the process of being registered).  The
regulator_resolve_supply() function first checks if rdev->supply
is NULL, then it performs various steps to try to find the supply.
If successful, rdev->supply is set inside of set_supply().

This procedure can encounter a race condition if two concurrent
tasks call regulator_register() near to each other on separate CPUs
and one of the regulators has rdev->supply_name specified.  There
is currently nothing guaranteeing atomicity between the rdev->supply
check and set steps.  Thus, both tasks can observe rdev->supply==NULL
in their regulator_resolve_supply() calls.  This then results in
both creating a struct regulator for the supply.  One ends up
actually stored in rdev->supply and the other is lost (though still
present in the supply's consumer_list).

Here is a kernel log snippet showing the issue:

[   12.421768] gpu_cc_gx_gdsc: supplied by pm8350_s5_level
[   12.425854] gpu_cc_gx_gdsc: supplied by pm8350_s5_level
[   12.429064] debugfs: Directory 'regulator.4-SUPPLY' with parent
               '17a00000.rsc:rpmh-regulator-gfxlvl-pm8350_s5_level'
               already present!

Avoid this race condition by holding the rdev->mutex lock inside
of regulator_resolve_supply() while checking and setting
rdev->supply.

Signed-off-by: David Collins <collinsd@codeaurora.org>
Link: https://lore.kernel.org/r/1610068562-4410-1-git-send-email-collinsd@codeaurora.org
Signed-off-by: Mark Brown <broonie@kernel.org>
It appears that while we are now able to properly hide PMU
registers from the guest when a PMU isn't available (either
because none has been configured, the host doesn't have
the PMU support compiled in, or that the HW doesn't have
one at all), we are still exposing more than we should to
userspace.

Introduce a visibility callback gating all the PMU registers,
which covers both usrespace and guest.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that all PMU registers are gated behind a .visibility callback,
remove the other checks against an absent PMU.

Signed-off-by: Marc Zyngier <maz@kernel.org>
The KVM/arm64 PSCI relay assumes that SYSTEM_OFF and SYSTEM_RESET should
not return, as dictated by the PSCI spec. However, there is firmware out
there which breaks this assumption, leading to a hyp panic. Make KVM
more robust to broken firmware by allowing these to return.

Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201229160059.64135-1-dbrazdil@google.com
The reg_to_encoding() macro is a wrapper over sys_reg() and conveniently
takes a sys_reg_desc or a sys_reg_params argument and returns the 32 bit
register encoding. Use it instead of calling sys_reg() directly.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210106144218.110665-1-alexandru.elisei@arm.com
Release master that have been previously allocated if the number of
chipselect is invalid.

Fixes: 8e04187 ("spi: altera: add SPI core parameters support via platform data.")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210120082635.49304-1-bianpan2016@163.com
Signed-off-by: Mark Brown <broonie@kernel.org>
KASAN in HW_TAGS mode will store MTE tags in the top byte of the
pointer. When computing the offset for TPIDR_EL2 we don't want anything
in the top byte, so remove the tag to ensure the computation is correct
no matter what the tag.

Fixes: 94ab5b6 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Signed-off-by: Steven Price <steven.price@arm.com>
[maz: added comment]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210108161254.53674-1-steven.price@arm.com
When running on v8.0 HW, make sure we don't try to advertise
events in the 0x4000-0x403f range.

Cc: stable@vger.kernel.org
Fixes: 88865be ("KVM: arm64: Mask out filtered events in PCMEID{0,1}_EL1")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210121105636.1478491-1-maz@kernel.org
Add missing stubs to regulator/consumer.h in order to fix COMPILE_TEST
of the kernel. In particular this should fix compile-testing of OPP core
because of a missing stub for regulator_sync_voltage().

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Link: https://lore.kernel.org/r/20210120205844.12658-1-digetx@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The use of a tagged address could be pretty confusing for the
whole memslot infrastructure as well as the MMU notifiers.

Forbid it altogether, as it never quite worked the first place.

Cc: stable@vger.kernel.org
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
With commit eaa7995 (regulator: core: avoid
regulator_resolve_supply() race condition) we started holding the rdev
lock while resolving supplies, an operation that requires holding the
regulator_list_mutex. This results in lockdep warnings since in other
places we take the list mutex then the mutex on an individual rdev.

Since the goal is to make sure that we don't call set_supply() twice
rather than a concern about the cost of resolution pull the rdev lock
and check for duplicate resolution down to immediately before we do the
set_supply() and drop it again once the allocation is done.

Fixes: eaa7995 (regulator: core: avoid regulator_resolve_supply() race condition)
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20210122132042.10306-1-broonie@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Add compatible string for Cisco device present on the Cisco Petra
platform.

Signed-off-by: Daniel Walker <danielwa@cisco.com>
Cc: xe-linux-external@cisco.com
Link: https://lore.kernel.org/r/20210121231237.30664-2-danielwa@cisco.com
Signed-off-by: Mark Brown <broonie@kernel.org>
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.11, take #2

- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions
The documentation classifies KVM_ENABLE_CAP with KVM_CAP_ENABLE_CAP_VM
as a vcpu ioctl, which is incorrect. Fix it by specifying it as a VM
ioctl.

Fixes: e5d83c7 ("kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic")
Signed-off-by: Quentin Perret <qperret@google.com>
Message-Id: <20210108165349.747359-1-qperret@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add compile-time asserts in rsvd_bits() to guard against KVM passing in
garbage hardcoded values, and cap the upper bound at '63' for dynamic
values to prevent generating a mask that would overflow a u64.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210113204515.3473079-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
…esh()

Since we know vPMU will not work properly when (1) the guest bit_width(s)
of the [gp|fixed] counters are greater than the host ones, or (2) guest
requested architectural events exceeds the range supported by the host, so
we can setup a smaller left shift value and refresh the guest cpuid entry,
thus fixing the following UBSAN shift-out-of-bounds warning:

shift exponent 197 is too large for 64-bit type 'long long unsigned int'

Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348
 kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177
 kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308
 kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709
 kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
…ch_events[]

The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as
0x0300 in the intel_perfmon_event_map[]. Correct its usage.

Fixes: 62079d8 ("KVM: PMU: add proper support for fixed counter 2")
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The injection process of smi has two steps:

    Qemu                        KVM
Step1:
    cpu->interrupt_request &= \
        ~CPU_INTERRUPT_SMI;
    kvm_vcpu_ioctl(cpu, KVM_SMI)

                                call kvm_vcpu_ioctl_smi() and
                                kvm_make_request(KVM_REQ_SMI, vcpu);

Step2:
    kvm_vcpu_ioctl(cpu, KVM_RUN, 0)

                                call process_smi() if
                                kvm_check_request(KVM_REQ_SMI, vcpu) is
                                true, mark vcpu->arch.smi_pending = true;

The vcpu->arch.smi_pending will be set true in step2, unfortunately if
vcpu paused between step1 and step2, the kvm_run->immediate_exit will be
set and vcpu has to exit to Qemu immediately during step2 before mark
vcpu->arch.smi_pending true.
During VM migration, Qemu will get the smi pending status from KVM using
KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status
will be lost.

Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com>
Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Update various words, including the wrong parameter name and the vague
description of the usage of "slot" field.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Message-Id: <20201208043439.895-1-yuzenghui@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:

 CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
 CPU-1979 [002] 89.871218: kvm_exit:  reason MSR_WRITE
 CPU-1979 [002] 89.871259: kvm_exit:  reason MSR_WRITE

It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.

Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).

A trace obtained with this patch applied looks like this:

 CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
 CPU-14295 [000] 8388.395392: kvm_exit:  reason MSR_WRITE
 CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
 CPU-14295 [000] 8388.395503: kvm_exit:  reason EXTERNAL_INTERRUPT

Of course, not calling trace_kvm_entry() in common x86 code any
longer means that we need to adjust the SVM side of things too.

Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even when we are outside the nested guest, some vmcs02 fields
may not be in sync vs vmcs12.  This is intentional, even across
nested VM-exit, because the sync can be delayed until the nested
hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
rarely accessed fields.

However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
be able to restore it.  To fix that, call copy_vmcs02_to_vmcs12_rare()
before the vmcs12 contents are copied to userspace.

Fixes: 7952d76 ("KVM: nVMX: Sync rarely accessed guest fields only when needed")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB, the
GRPs' dirty bits are set from time zero and never cleared, i.e. will
always be seen as dirty.  The obvious alternative would be to clear
the dirty bits when appropriate, but removing the dirty checks is
desirable as it allows reverting GPR dirty+available tracking, which
adds overhead to all flavors of x86 VMs.

Note, unconditionally writing the GPRs in the GHCB is tacitly allowed
by the GHCB spec, which allows the hypervisor (or guest) to provide
unnecessary info; it's the guest's responsibility to consume only what
it needs (the hypervisor is untrusted after all).

  The guest and hypervisor can supply additional state if desired but
  must not rely on that additional state being provided.

Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert the dirty/available tracking of GPRs now that KVM copies the GPRs
to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty.  Per
commit de3cd11 ("KVM: x86: Omit caching logic for always-available
GPRs"), tracking for GPRs noticeably impacts KVM's code footprint.

This reverts commit 1c04d8c.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
… VMX

VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode.  Therefore we cannot
WARN in that case.

However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time.  Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independent
of is_guest_mode) do the parts that are needed.

Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
…out"

This reverts commit d3921cb.

Chris Wilson reports that it causes boot problems:

 "We have half a dozen or so different machines in CI that are silently
  failing to boot, that we believe is bisected to this patch"

and the CI team confirmed that a revert fixed the issues.

The cause is unknown for now, so let's revert it.

Link: https://lore.kernel.org/lkml/161160687463.28991.354987542182281928@build.alporthouse.com/
Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
…inux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
 "The main thing here is a change to make sure that we don't try to
  double resolve the supply of a regulator if we have two probes going
  on simultaneously, plus an incremental fix on top of that to resolve a
  lockdep issue it introduced.

  There's also a patch from Dmitry Osipenko adding stubs for some
  functions to avoid build issues in consumers in some configurations"

* tag 'regulator-fix-v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: Fix lockdep warning resolving supplies
  regulator: consumer: Add missing stubs to regulator/consumer.h
  regulator: core: avoid regulator_resolve_supply() race condition
…ernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "One new device ID here, plus an error handling fix - nothing
  remarkable in either"

* tag 'spi-fix-v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spidev: Add cisco device compatible
  spi: altera: Fix memory leak on error path
Pull kvm fixes from Paolo Bonzini:

 -  x86 bugfixes

 - Documentation fixes

 - Avoid performance regression due to SEV-ES patches

 - ARM:
     - Don't allow tagged pointers to point to memslots
     - Filter out ARMv8.1+ PMU events on v8.0 hardware
     - Hide PMU registers from userspace when no PMU is configured
     - More PMU cleanups
     - Don't try to handle broken PSCI firmware
     - More sys_reg() to reg_to_encoding() conversions

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
  KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written"
  KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guest
  KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
  kvm: tracing: Fix unmatched kvm_entry and kvm_exit events
  KVM: Documentation: Update description of KVM_{GET,CLEAR}_DIRTY_LOG
  KVM: x86: get smi pending status correctly
  KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
  KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
  KVM: x86: Add more protection against undefined behavior in rsvd_bits()
  KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM
  KVM: Forbid the use of tagged userspace addresses for memslots
  KVM: arm64: Filter out v8.1+ events on v8.0 HW
  KVM: arm64: Compute TPIDR_EL2 ignoring MTE tag
  KVM: arm64: Use the reg_to_encoding() macro instead of sys_reg()
  KVM: arm64: Allow PSCI SYSTEM_OFF/RESET to return
  KVM: arm64: Simplify handling of absent PMU system registers
  KVM: arm64: Hide PMU registers from userspace when not available
Remove the magical "repo-abbrev" comment added when this file was
introduced in e0ab1ec ([PATCH] add .mailmap for proper
git-shortlog output, 2007-02-14).

It's been an undocumented feature of git-shortlog(1), originally added
to git for Linus's use. Since then he's no longer using it[1], and
I've removed the feature in git.git's 4e168333a87 (shortlog: remove
unused(?) "repo-abbrev" feature, 2021-01-12). It's on the "master"
branch, but not yet in a release version.

Let's also remove it from linux.git, both as a heads-up to any
potential users of it in linux.git whose use would be broken sooner
than later by git itself, and because it'll eventually be entirely
redundant.

1. https://lore.kernel.org/git/CAHk-=wixHyBKZVUcxq+NCWMbkrX0xnppb7UCopRWw1+oExYpYw@mail.gmail.com/

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
@pull pull bot added the ⤵️ pull label Jan 26, 2021
@pull pull bot merged commit 2ab38c1 into vchong:master Jan 26, 2021
pull bot pushed a commit that referenced this pull request Jul 12, 2024
Add a test case which replaces an active ingress qdisc while keeping the
miniq in-tact during the transition period to the new clsact qdisc.

  # ./vmtest.sh -- ./test_progs -t tc_link
  [...]
  ./test_progs -t tc_link
  [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
  [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #332     tc_links_after:OK
  #333     tc_links_append:OK
  #334     tc_links_basic:OK
  #335     tc_links_before:OK
  #336     tc_links_chain_classic:OK
  #337     tc_links_chain_mixed:OK
  #338     tc_links_dev_chain0:OK
  #339     tc_links_dev_cleanup:OK
  #340     tc_links_dev_mixed:OK
  #341     tc_links_ingress:OK
  #342     tc_links_invalid:OK
  #343     tc_links_prepend:OK
  #344     tc_links_replace:OK
  #345     tc_links_revision:OK
  Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240708133130.11609-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
pull bot pushed a commit that referenced this pull request Jan 4, 2025
Using mutex lock in IO hot path causes the kernel BUG sleeping while
atomic. Shinichiro[1], first encountered this issue while running blktest
nvme/052 shown below:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by (udev-worker)/996:
 #0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
 #1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6a/0x90
 __might_resched.cold+0x1f7/0x23d
 ? __pfx___might_resched+0x10/0x10
 ? vsnprintf+0xdeb/0x18f0
 __mutex_lock+0xf4/0x1220
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_vsnprintf+0x10/0x10
 ? __pfx___mutex_lock+0x10/0x10
 ? snprintf+0xa5/0xe0
 ? xas_load+0x1ce/0x3f0
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
 nvmet_req_find_ns+0x24e/0x300 [nvmet]
 nvmet_req_init+0x694/0xd40 [nvmet]
 ? blk_mq_start_request+0x11c/0x750
 ? nvme_setup_cmd+0x369/0x990 [nvme_core]
 nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
 ? __pfx___lock_acquire+0x10/0x10
 ? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
 __blk_mq_issue_directly+0xe2/0x1d0
 ? __pfx___blk_mq_issue_directly+0x10/0x10
 ? blk_mq_request_issue_directly+0xc2/0x140
 blk_mq_plug_issue_direct+0x13f/0x630
 ? lock_acquire+0x2d/0xc0
 ? blk_mq_flush_plug_list+0xa75/0x1950
 blk_mq_flush_plug_list+0xa9d/0x1950
 ? __pfx_blk_mq_flush_plug_list+0x10/0x10
 ? __pfx_mpage_readahead+0x10/0x10
 __blk_flush_plug+0x278/0x4d0
 ? __pfx___blk_flush_plug+0x10/0x10
 ? lock_release+0x460/0x7a0
 blk_finish_plug+0x4e/0x90
 read_pages+0x51b/0xbc0
 ? __pfx_read_pages+0x10/0x10
 ? lock_release+0x460/0x7a0
 page_cache_ra_unbounded+0x326/0x5c0
 force_page_cache_ra+0x1ea/0x2f0
 filemap_get_pages+0x59e/0x17b0
 ? __pfx_filemap_get_pages+0x10/0x10
 ? lock_is_held_type+0xd5/0x130
 ? __pfx___might_resched+0x10/0x10
 ? find_held_lock+0x2d/0x110
 filemap_read+0x317/0xb70
 ? up_write+0x1ba/0x510
 ? __pfx_filemap_read+0x10/0x10
 ? inode_security+0x54/0xf0
 ? selinux_file_permission+0x36d/0x420
 blkdev_read_iter+0x143/0x3b0
 vfs_read+0x6ac/0xa20
 ? __pfx_vfs_read+0x10/0x10
 ? __pfx_vm_mmap_pgoff+0x10/0x10
 ? __pfx___seccomp_filter+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f565bd1ce11
Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
 </TASK>

Apparently, the above issue is caused due to using mutex lock while
we're in IO hot path. It's a regression caused with commit 5053639
("nvmet: fix nvme status code when namespace is disabled"). The mutex
->su_mutex is used to find whether a disabled nsid exists in the config
group or not. This is to differentiate between a nsid that is disabled
vs non-existent.

To mitigate the above issue, we've worked upon a fix[2] where we now
insert nsid in subsys Xarray as soon as it's created under config group
and later when that nsid is enabled, we add an Xarray mark on it and set
ns->enabled to true. The Xarray mark is useful while we need to loop
through all enabled namepsaces under a subsystem using xa_for_each_marked()
API. If later a nsid is disabled then we clear Xarray mark from it and also
set ns->enabled to false. It's only when nsid is deleted from the config
group we delete it from the Xarray.

So with this change, now we could easily differentiate a nsid is disabled
(i.e. Xarray entry for ns exists but ns->enabled is set to false) vs non-
existent (i.e.Xarray entry for ns doesn't exist).

Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
Fixes: 5053639 ("nvmet: fix nvme status code when namespace is disabled")
Fix-suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants