Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update from torvalds'es repository #2

Closed
wants to merge 21 commits into from

Conversation

aanisov
Copy link

@aanisov aanisov commented Nov 9, 2016

No description provided.

Colin Ian King and others added 21 commits October 24, 2016 06:05
If dev_kcalloc fails to allocate hw_dev->groups then the current
exit path is a direct return, causing a leak of resources such
as hwdev and ida is not removed.  Fix this by exiting via the
free_hwmon exit path that performs the necessary resource cleanup.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
When report count is more than one and report size is not 4 bytes, then we
need some packing into result buffer from the caller of function
sensor_hub_get_feature.
By default the value extracted from a field is 4 bytes from hid core
(using hid_hw_request(hsdev->hdev, report, HID_REQ_GET_REPORT)), even
if report size if less than 4 byte. So when we copy data to user buffer in
sensor_hub_get_feature, we need to only copy report size bytes even
when report count is more than 1. This is
not an issue for most of the sensor hub fields as report count will be 1
where we already copy only report size bytes, but some string fields
like description, it is a problem as the report count will be more than 1.
For example:
    Field(6)
      Physical(Sensor.OtherCustom)
      Application(Sensor.Sensor)
      Usage(11)
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
        Sensor.0306
      Report Size(16)
      Report Count(11)

Here since the report size is 2 bytes, we will have 2 additional bytes of
0s copied into user buffer, if we directly copy to user buffer from
report->field[]->value

This change will copy report size bytes into the buffer of caller for each
usage report->field[]->value. So for example without this change, the
data displayed for a custom sensor field "sensor-model":

76 00 101 00 110 00 111 00 118 00 111
(truncated to report count of 11)

With change

76 101 110 111 118 111 32 89 111 103 97
("Lenovo Yoga" in ASCII )

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Fix

  drivers/hid/intel-ish-hid/ipc/pci-ish.c:247:12: warning: ‘ish_suspend’ defined but not used [-Wunused-function]
   static int ish_suspend(struct device *device)
              ^
  drivers/hid/intel-ish-hid/ipc/pci-ish.c:282:12: warning: ‘ish_resume’ defined but not used [-Wunused-function]
   static int ish_resume(struct device *device)
            ^
by sticking them in the CONFIG_PM range too.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: linux-input@vger.kernel.org
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Like many similar devices it needs a quirk to work.
Issuing the request gets the device into an irrecoverable state.

Signed-off-by: Oliver Neukum <oneukum@suse.com>
CC: stable@vger.kernel.org
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Same operations are done in ish_hw_start() and _ish_hw_reset() to
wakeup ISH device. Consolidate them by introducing a new function
ish_wakeup() and move the code there.

Signed-off-by: Even Xu <even.xu@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Add a new function ish_disable_dma() and move DMA disable operations
here, so that this functionality can be reused.

Signed-off-by: Even Xu <even.xu@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
When built as a module, modprobe followed by rmmod can fail because
DMA was still active. So to fix this, DMA needs to be disabled during
module exit.

This change disables DMA during modules exit and change the ISH PCI
device status to D3.

Signed-off-by: Even Xu <even.xu@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
On some platforms ISH interrupt is shared, which causes request_irq to
fail. This requires IRQF_SHARED irq flag.

But IRQF_NO_SUSPEND and IRQF_SHARED should not be used together, so
removed IRQF_NO_SUSPEND flag. Anyway this driver doesn't require
IRQF_NO_SUSPEND, as this interrupt is not required during "noirq" phases
of suspending and resuming devices as well as during the time when
nonboot CPUs are taken offline and brought back online.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
User is unable to access to input-X-yyy and feature-X-yyy where
X is a hex value and more than 9 (e.g. input-a-yyy, feature-b-yyy) in HID
sensor custom sysfs interface.
This is because when creating the attribute, the attribute index is
written to using %x (hex). However, when reading and writing values into
the attribute, the attribute index is scanned using %d (decimal). Hence,
user is unable to access to attributes with index in hex values
(e.g. 'a', 'b', 'c') but able to access to attributes with index in
decimal values (e.g. 1, 2, 3,..).
This fix will change input-%d-%x-%s and feature-%d-%x-%s to input-%x-%x-%s
and feature-%x-%x-%s in show_values() and store_values() accordingly.

Signed-off-by: Ooi, Joyce <joyce.ooi@intel.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Commit efd9e03 ("arm64: Use static keys for CPU features")
introduced support for static keys in asm/cpufeature.h, including
linux/jump_label.h. When CC_HAVE_ASM_GOTO is not defined, this causes a
circular dependency via linux/atomic.h, asm/lse.h and asm/cpufeature.h.

This patch moves the capability macros out out of asm/cpufeature.h into
a separate asm/cpucaps.h and modifies some of the #includes accordingly.

Fixes: efd9e03 ("arm64: Use static keys for CPU features")
Reported-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
openrisc qemu tests fail with the following crash.

Unable to handle kernel access at virtual address 0xc0300c34

Oops#: 0001
CPU #: 0
   PC: c016c710    SR: 0000ae67    SP: c1017e04
   GPR00: 00000000 GPR01: c1017e04 GPR02: c0300c34 GPR03: c0300c34
   GPR04: 00000000 GPR05: c0300cb0 GPR06: c0300c34 GPR07: 000000ff
   GPR08: c107f074 GPR09: c0199ef4 GPR10: c1016000 GPR11: 00000000
   GPR12: 00000000 GPR13: c107f044 GPR14: c0473774 GPR15: 07ce0000
   GPR16: 00000000 GPR17: c107ed8a GPR18: 00009600 GPR19: c107f044
   GPR20: c107ee74 GPR21: 00000003 GPR22: c0473770 GPR23: 00000033
   GPR24: 000000bf GPR25: 00000019 GPR26: c046400c GPR27: 00000001
   GPR28: c0464028 GPR29: c1018000 GPR30: 00000006 GPR31: ccf37483
     RES: 00000000 oGPR11: ffffffff
     Process swapper (pid: 1, stackpage=c1001960)

     Stack: Stack dump [0xc1017cf8]:
     sp + 00: 0xc1017e04
     sp + 04: 0xc0300c34
     sp + 08: 0xc0300c34
     sp + 12: 0x00000000
...

Bisect points to commit d2ec3f7 ("pty: make ptmx file ops read-only
after init"). Fix by defining __ro_after_init for the openrisc
architecture, similar to parisc.

Fixes: d2ec3f7 ("pty: make ptmx file ops read-only after init")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Stafford Horne <shorne@gmail.com>
…/git/jikos/hid

Pull HID fixes from Jiri Kosina:

 - modprobe-after-rmmod load failure bugfix for intel-ish, from Even Xu

 - IRQ probing bugfix for intel-ish, from Srinivas Pandruvada

 - attribute parsing fix in hid-sensor, from Ooi, Joyce

 - other small misc fixes / quirky device additions

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
  HID: sensor: fix attributes in HID sensor interface
  HID: intel-ish-hid: request_irq failure
  HID: intel-ish-hid: Fix driver reinit failure
  HID: intel-ish-hid: Move DMA disable code to new function
  HID: intel-ish-hid: consolidate ish wake up operation
  HID: usbhid: add ATEN CS962 to list of quirky devices
  HID: intel-ish-hid: Fix !CONFIG_PM build warning
  HID: sensor-hub: Fix packing of result buffer for feature report
…linux/kernel/git/groeck/linux-staging

Pull hwmon fix from Guenter Roeck:
 "Fix resource leak on devm_kcalloc failure"

* tag 'hwmon-for-linus-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (core) fix resource leak on devm_kcalloc failure
…cm/linux/kernel/git/groeck/linux-staging

Pull openrisc fix from Guenter Roeck:
 "Fix openrisc crash caused by ro_init changes"

* tag 'openrisc-for-linus-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  openrisc: Define __ro_after_init to avoid crash
…git/arm64/linux

Pull arm64 fix from Will Deacon:
 "It's been pretty quiet on the fixes side of things for us, but Artem
  reported a build failure introduced during the merge window that
  appears with older GCCs that do not support asm goto. The fix is
  bigger than I'd like, but it's a mechnical move of some constants to
  break an include dependency between atomic.h and jump_label.h when
  !HAVE_JUMP_LABEL.

  Summary:

   - Fix build failure on compilers without asm goto"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Fix circular include of asm/lse.h through linux/jump_label.h
The 32-bit ARM DMA configuration code predates the IOMMU core's default
domain functionality, and instead relies on allocating its own domains
and attaching any devices using the generic IOMMU binding to them.
Unfortunately, it does this relatively early on in the creation of the
device, before we've seen our add_device callback, which leads us to
attempt to operate on a half-configured master.

To avoid a crash, check for this situation on attach, but refuse to
play, as there's nothing we can do. This at least allows VFIO to keep
working for people who update their 32-bit DTs to the generic binding,
albeit with a few (innocuous) warnings from the DMA layer on boot.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We now delay installing our per-bus iommu_ops until we know an SMMU has
successfully probed, as they don't serve much purpose beforehand, and
doing so also avoids fights between multiple IOMMU drivers in a single
kernel. However, the upshot of passing the return value of bus_set_iommu()
back from our probe function is that if there happens to be more than
one SMMUv3 device in a system, the second and subsequent probes will
wind up returning -EBUSY to the driver core and getting torn down again.

Avoid re-setting ops if ours are already installed, so that any genuine
failures stand out.

Fixes: 08d4ca2 ("iommu/arm-smmu: Support non-PCI devices with SMMUv3")
CC: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
CC: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
We seem to have forgotten to check that iommu_fwspecs actually belong to
us before we go ahead and dereference their private data. Oops.

Fixes: 021bb84 ("iommu/arm-smmu: Wire up generic configuration support")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When we iterate a master's config entries, what we generally care
about is the entry's stream map index, rather than the entry index
itself, so it's nice to have the iterator automatically assign the
former from the latter. Unfortunately, booting with KASAN reveals
the oversight that using a simple comma operator results in the
entry index being dereferenced before being checked for validity,
so we always access one element past the end of the fwspec array.

Flip things around so that the check always happens before the index
may be dereferenced.

Fixes: adfec2e ("iommu/arm-smmu: Convert to iommu_fwspec")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
It turns out that the disable_dmar_iommu() code-path tried
to get the device_domain_lock recursivly, which will
dead-lock when this code runs on dmar removal. Fix both
code-paths that could lead to the dead-lock.

Fixes: 55d9404 ('iommu/vt-d: Get rid of domain->iommu_lock')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
…x/kernel/git/joro/iommu

Pull IOMMU fixes from Joerg Roedel:

 - Four patches from Robin Murphy fix several issues with the recently
   merged generic DT-bindings support for arm-smmu drivers

 - A fix for a dead-lock issue in the VT-d driver, which shows up on
   iommu hotplug

* tag 'iommu-fixes-v4.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Fix dead-locks in disable_dmar_iommu() path
  iommu/arm-smmu: Fix out-of-bounds dereference
  iommu/arm-smmu: Check that iommu_fwspecs are ours
  iommu/arm-smmu: Don't inadvertently reject multiple SMMUv3s
  iommu/arm-smmu: Work around ARM DMA configuration
@aanisov aanisov closed this Nov 9, 2016
@aanisov aanisov reopened this Nov 9, 2016
@aanisov aanisov closed this Nov 9, 2016
aanisov pushed a commit that referenced this pull request Nov 17, 2016
A recent commit removed locking from netlink_diag_dump() but forgot
one error case.

=====================================
[ BUG: bad unlock balance detected! ]
4.9.0-rc3+ torvalds#336 Not tainted
-------------------------------------
syz-executor/4018 is trying to release lock ([   36.220068] nl_table_lock
) at:
[<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
but there are no more locks to release!

other info that might help us debug this:
3 locks held by syz-executor/4018:
 #0: [   36.220068]  (
sock_diag_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
 #1: [   36.220068]  (
sock_diag_table_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
 #2: [   36.220068]  (
nlk->cb_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82db6600>] netlink_dump+0x50/0xac0

stack backtrace:
CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ torvalds#336
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
 ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
 dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0
kernel/locking/lockdep.c:3388
 [<     inline     >] __lock_release kernel/locking/lockdep.c:3512
 [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
 [<     inline     >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
 [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
 [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
 [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110

Fixes: ad20207 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aanisov pushed a commit that referenced this pull request Nov 17, 2016
The following panic was caught when run ocfs2 disconfig single test
(block size 512 and cluster size 8192).  ocfs2_journal_dirty() return
-ENOSPC, that means credits were used up.

The total credit should include 3 times of "num_dx_leaves" from
ocfs2_dx_dir_rebalance(), because 2 times will be consumed in
ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in
ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() ->
ocfs2_dx_dir_format_cluster().  But only two times is included in
ocfs2_dx_dir_rebalance_credits(), fix it.

This can cause read-only fs(v4.1+) or panic for mainline linux depending
on mount option.

  ------------[ cut here ]------------
  kernel BUG at fs/ocfs2/journal.c:775!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2
  Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
  task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000
  RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  RSP: 0018:ffff8800a7d4b6d8  EFLAGS: 00010286
  RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9
  RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460
  RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460
  R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e
  R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070
  FS:  00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0
  Call Trace:
    ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2]
    ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2]
    ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2]
    ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2]
    ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2]
    ocfs2_mknod+0x5a2/0x1400 [ocfs2]
    ocfs2_create+0x73/0x180 [ocfs2]
    vfs_create+0xd8/0x100
    lookup_open+0x185/0x1c0
    do_last+0x36d/0x780
    path_openat+0x92/0x470
    do_filp_open+0x4a/0xa0
    do_sys_open+0x11a/0x230
    SyS_open+0x1e/0x20
    system_call_fastpath+0x12/0x71
  Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
  RIP  ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  ---[ end trace 91ac5312a6ee1288 ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

Link: http://lkml.kernel.org/r/1478248135-31963-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aanisov pushed a commit that referenced this pull request Dec 5, 2016
Missing initialization of udp_tunnel_sock_cfg causes to following
kernel panic, while kernel tries to execute gro_receive().

While being there, we converted udp_port_cfg to use the same
initialization scheme as udp_tunnel_sock_cfg.

------------[ cut here ]------------
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffffffffa0588c50
IP: [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
PGD 1c09067 PUD 1c0a063 PMD bb394067 PTE 80000000ad5e8163
Oops: 0011 [#1] SMP
Modules linked in: ib_rxe ip6_udp_tunnel udp_tunnel
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc3+ #2
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
task: ffff880235e4e680 ti: ffff880235e68000 task.ti: ffff880235e68000
RIP: 0010:[<ffffffffa0588c50>]
[<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP: 0018:ffff880237343c80  EFLAGS: 00010282
RAX: 00000000dffe482d RBX: ffff8800ae330900 RCX: 000000002001b712
RDX: ffff8800ae330900 RSI: ffff8800ae102578 RDI: ffff880235589c00
RBP: ffff880237343cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800ae33e262
R13: ffff880235589c00 R14: 0000000000000014 R15: ffff8800ae102578
FS:  0000000000000000(0000) GS:ffff880237340000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0588c50 CR3: 0000000001c06000 CR4: 00000000000006e0
Stack:
ffffffff8160860e ffff8800ae330900 ffff8800ae102578 0000000000000014
000000000000004e ffff8800ae102578 ffff880237343ce0 ffffffff816088fb
0000000000000000 ffff8800ae330900 0000000000000000 00000000ffad0000
Call Trace:
<IRQ>
[<ffffffff8160860e>] ? udp_gro_receive+0xde/0x130
[<ffffffff816088fb>] udp4_gro_receive+0x10b/0x2d0
[<ffffffff81611373>] inet_gro_receive+0x1d3/0x270
[<ffffffff81594e29>] dev_gro_receive+0x269/0x3b0
[<ffffffff81595188>] napi_gro_receive+0x38/0x120
[<ffffffffa011caee>] mlx5e_handle_rx_cqe+0x27e/0x340 [mlx5_core]
[<ffffffffa011d076>] mlx5e_poll_rx_cq+0x66/0x6d0 [mlx5_core]
[<ffffffffa011d7ae>] mlx5e_napi_poll+0x8e/0x400 [mlx5_core]
[<ffffffff815949a0>] net_rx_action+0x160/0x380
[<ffffffff816a9197>] __do_softirq+0xd7/0x2c5
[<ffffffff81085c35>] irq_exit+0xf5/0x100
[<ffffffff816a8f16>] do_IRQ+0x56/0xd0
[<ffffffff816a6dcc>] common_interrupt+0x8c/0x8c
<EOI>
[<ffffffff81061f96>] ? native_safe_halt+0x6/0x10
[<ffffffff81037ade>] default_idle+0x1e/0xd0
[<ffffffff8103828f>] arch_cpu_idle+0xf/0x20
[<ffffffff810c37dc>] default_idle_call+0x3c/0x50
[<ffffffff810c3b13>] cpu_startup_entry+0x323/0x3c0
[<ffffffff81050d8c>] start_secondary+0x15c/0x1a0
RIP  [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP <ffff880237343c80>
CR2: ffffffffa0588c50
---[ end trace 489ee31fa7614ac5 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------

Fixes: 8700e3e ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
aanisov pushed a commit that referenced this pull request Dec 5, 2016
Tushar Dave says:

====================
sparc: Enable sun4v hypervisor PCI IOMMU v2 APIs and ATU

ATU (Address Translation Unit) is a new IOMMU in SPARC supported with
sun4v hypervisor PCI IOMMU v2 APIs.

Current SPARC IOMMU supports only 32bit address ranges and one TSB
per PCIe root complex that has a 2GB per root complex DVMA space
limit. The limit has become a scalability bottleneck nowadays that
a typical 10G/40G NIC can consume 500MB DVMA space per instance.
When DVMA resource is exhausted, devices will not be usable
since the driver can't allocate DVMA.

For example, we recently experienced legacy IOMMU limitation while
using i40e driver in system with large number of CPUs (e.g. 128).
Four ports of i40e, each request 128 QP (Queue Pairs). Each queue has
512 (default) descriptors. So considering only RX queues (because RX
premap DMA buffers), i40e takes 4*128*512 number of DMA entries in
IOMMU table. Legacy IOMMU can have at max (2G/8K)- 1 entries available
in table. So bringing up four instance of i40e alone saturate existing
IOMMU resource.

ATU removes bottleneck by allowing guest os to create IOTSB of size
32G (or more) with 64bit address ranges available in ATU HW. 32G is
more than enough DVMA space to be shared by all PCIe devices under
root complex contrast to 2G space provided by legacy IOMMU.

ATU allows PCIe devices to use 64bit DMA addressing. Devices
which choose to use 32bit DMA mask will continue to work with the
existing legacy IOMMU.

The patch set is tested on sun4v (T1000, T2000, T3, T4, T5, T7, S7)
and sun4u SPARC.

Thanks.
-Tushar

v2->v3:
- Patch #5 addresses comment by Joe Perches.
 -- use %s, __func__ instead of embedding the function name.

v1->v2:
- Patch #2 addresses comments by Dave M.
 -- use page allocator to allocate IOTSB.
 -- use true/false with boolean variables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
aanisov pushed a commit that referenced this pull request Jan 5, 2017
When SCSI EH invokes zFCP's callbacks for eh_device_reset_handler() and
eh_target_reset_handler(), it expects us to relent the ownership over
the given scsi_cmnd and all other scsi_cmnds within the same scope - LUN
or target - when returning with SUCCESS from the callback ('release'
them).  SCSI EH can then reuse those commands.

We did not follow this rule to release commands upon SUCCESS; and if
later a reply arrived for one of those supposed to be released commands,
we would still make use of the scsi_cmnd in our ingress tasklet. This
will at least result in undefined behavior or a kernel panic because of
a wrong kernel pointer dereference.

To fix this, we NULLify all pointers to scsi_cmnds (struct zfcp_fsf_req
*)->data in the matching scope if a TMF was successful. This is done
under the locks (struct zfcp_adapter *)->abort_lock and (struct
zfcp_reqlist *)->lock to prevent the requests from being removed from
the request-hashtable, and the ingress tasklet from making use of the
scsi_cmnd-pointer in zfcp_fsf_fcp_cmnd_handler().

For cases where a reply arrives during SCSI EH, but before we get a
chance to NULLify the pointer - but before we return from the callback
-, we assume that the code is protected from races via the CAS operation
in blk_complete_request() that is called in scsi_done().

The following stacktrace shows an example for a crash resulting from the
previous behavior:

Unable to handle kernel pointer dereference at virtual kernel address fffffee17a672000
Oops: 0038 [#1] SMP
CPU: 2 PID: 0 Comm: swapper/2 Not tainted
task: 00000003f7ff5be0 ti: 00000003f3d38000 task.ti: 00000003f3d38000
Krnl PSW : 0404d00180000000 00000000001156b0 (smp_vcpu_scheduled+0x18/0x40)
           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 EA:3
Krnl GPRS: 000000200000007e 0000000000000000 fffffee17a671fd8 0000000300000015
           ffffffff80000000 00000000005dfde8 07000003f7f80e00 000000004fa4e800
           000000036ce8d8f8 000000036ce8d9c0 00000003ece8fe00 ffffffff969c9e93
           00000003fffffffd 000000036ce8da10 00000000003bf134 00000003f3b07918
Krnl Code: 00000000001156a2: a7190000        lghi    %r1,0
           00000000001156a6: a7380015        lhi    %r3,21
          #00000000001156aa: e32050000008    ag    %r2,0(%r5)
          >00000000001156b0: 482022b0        lh    %r2,688(%r2)
           00000000001156b4: ae123000        sigp    %r1,%r2,0(%r3)
           00000000001156b8: b2220020        ipm    %r2
           00000000001156bc: 8820001c        srl    %r2,28
           00000000001156c0: c02700000001    xilf    %r2,1
Call Trace:
([<0000000000000000>] 0x0)
 [<000003ff807bdb8e>] zfcp_fsf_fcp_cmnd_handler+0x3de/0x490 [zfcp]
 [<000003ff807be30a>] zfcp_fsf_req_complete+0x252/0x800 [zfcp]
 [<000003ff807c0a48>] zfcp_fsf_reqid_check+0xe8/0x190 [zfcp]
 [<000003ff807c194e>] zfcp_qdio_int_resp+0x66/0x188 [zfcp]
 [<000003ff80440c64>] qdio_kick_handler+0xdc/0x310 [qdio]
 [<000003ff804463d0>] __tiqdio_inbound_processing+0xf8/0xcd8 [qdio]
 [<0000000000141fd4>] tasklet_action+0x9c/0x170
 [<0000000000141550>] __do_softirq+0xe8/0x258
 [<000000000010ce0a>] do_softirq+0xba/0xc0
 [<000000000014187c>] irq_exit+0xc4/0xe8
 [<000000000046b526>] do_IRQ+0x146/0x1d8
 [<00000000005d6a3c>] io_return+0x0/0x8
 [<00000000005d6422>] vtime_stop_cpu+0x4a/0xa0
([<0000000000000000>] 0x0)
 [<0000000000103d8a>] arch_cpu_idle+0xa2/0xb0
 [<0000000000197f94>] cpu_startup_entry+0x13c/0x1f8
 [<0000000000114782>] smp_start_secondary+0xda/0xe8
 [<00000000005d6efe>] restart_int_handler+0x56/0x6c
 [<0000000000000000>] 0x0
Last Breaking-Event-Address:
 [<00000000003bf12e>] arch_spin_lock_wait+0x56/0xb0

Suggested-by: Steffen Maier <maier@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Fixes: ea127f9 ("[PATCH] s390 (7/7): zfcp host adapter.") (tglx/history.git)
Cc: <stable@vger.kernel.org> #2.6.32+
Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
aanisov pushed a commit that referenced this pull request Jan 5, 2017
…t level

Since quite a while, Linux issues enough SCSI commands per scsi_device
which successfully return with FCP_RESID_UNDER, FSF_FCP_RSP_AVAILABLE,
and SAM_STAT_GOOD.  This floods the HBA trace area and we cannot see
other and important HBA trace records long enough.

Therefore, do not trace HBA response errors for pure benign residual
under counts at the default trace level.

This excludes benign residual under count combined with other validity
bits set in FCP_RSP_IU, such as FCP_SNS_LEN_VAL.  For all those other
cases, we still do want to see both the HBA record and the corresponding
SCSI record by default.

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: a54ca0f ("[SCSI] zfcp: Redesign of the debug tracing for HBA records.")
Cc: <stable@vger.kernel.org> #2.6.37+
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
aanisov pushed a commit that referenced this pull request Jan 5, 2017
It is unavoidable that zfcp_scsi_queuecommand() has to finish requests
with DID_IMM_RETRY (like fc_remote_port_chkready()) during the time
window when zfcp detected an unavailable rport but
fc_remote_port_delete(), which is asynchronous via
zfcp_scsi_schedule_rport_block(), has not yet blocked the rport.

However, for the case when the rport becomes available again, we should
prevent unblocking the rport too early.  In contrast to other FCP LLDDs,
zfcp has to open each LUN with the FCP channel hardware before it can
send I/O to a LUN.  So if a port already has LUNs attached and we
unblock the rport just after port recovery, recoveries of LUNs behind
this port can still be pending which in turn force
zfcp_scsi_queuecommand() to unnecessarily finish requests with
DID_IMM_RETRY.

This also opens a time window with unblocked rport (until the followup
LUN reopen recovery has finished).  If a scsi_cmnd timeout occurs during
this time window fc_timed_out() cannot work as desired and such command
would indeed time out and trigger scsi_eh. This prevents a clean and
timely path failover.  This should not happen if the path issue can be
recovered on FC transport layer such as path issues involving RSCNs.

Fix this by only calling zfcp_scsi_schedule_rport_register(), to
asynchronously trigger fc_remote_port_add(), after all LUN recoveries as
children of the rport have finished and no new recoveries of equal or
higher order were triggered meanwhile.  Finished intentionally includes
any recovery result no matter if successful or failed (still unblock
rport so other successful LUNs work).  For simplicity, we check after
each finished LUN recovery if there is another LUN recovery pending on
the same port and then do nothing.  We handle the special case of a
successful recovery of a port without LUN children the same way without
changing this case's semantics.

For debugging we introduce 2 new trace records written if the rport
unblock attempt was aborted due to still unfinished or freshly triggered
recovery. The records are only written above the default trace level.

Benjamin noticed the important special case of new recovery that can be
triggered between having given up the erp_lock and before calling
zfcp_erp_action_cleanup() within zfcp_erp_strategy().  We must avoid the
following sequence:

ERP thread                 rport_work      other context
-------------------------  --------------  --------------------------------
port is unblocked, rport still blocked,
 due to pending/running ERP action,
 so ((port->status & ...UNBLOCK) != 0)
 and (port->rport == NULL)
unlock ERP
zfcp_erp_action_cleanup()
case ZFCP_ERP_ACTION_REOPEN_LUN:
zfcp_erp_try_rport_unblock()
((status & ...UNBLOCK) != 0) [OLD!]
                                           zfcp_erp_port_reopen()
                                           lock ERP
                                           zfcp_erp_port_block()
                                           port->status clear ...UNBLOCK
                                           unlock ERP
                                           zfcp_scsi_schedule_rport_block()
                                           port->rport_task = RPORT_DEL
                                           queue_work(rport_work)
                           zfcp_scsi_rport_work()
                           (port->rport_task != RPORT_ADD)
                           port->rport_task = RPORT_NONE
                           zfcp_scsi_rport_block()
                           if (!port->rport) return
zfcp_scsi_schedule_rport_register()
port->rport_task = RPORT_ADD
queue_work(rport_work)
                           zfcp_scsi_rport_work()
                           (port->rport_task == RPORT_ADD)
                           port->rport_task = RPORT_NONE
                           zfcp_scsi_rport_register()
                           (port->rport == NULL)
                           rport = fc_remote_port_add()
                           port->rport = rport;

Now the rport was erroneously unblocked while the zfcp_port is blocked.
This is another situation we want to avoid due to scsi_eh
potential. This state would at least remain until the new recovery from
the other context finished successfully, or potentially forever if it
failed.  In order to close this race, we take the erp_lock inside
zfcp_erp_try_rport_unblock() when checking the status of zfcp_port or
LUN.  With that, the possible corresponding rport state sequences would
be: (unblock[ERP thread],block[other context]) if the ERP thread gets
erp_lock first and still sees ((port->status & ...UNBLOCK) != 0),
(block[other context],NOP[ERP thread]) if the ERP thread gets erp_lock
after the other context has already cleard ...UNBLOCK from port->status.

Since checking fields of struct erp_action is unsafe because they could
have been overwritten (re-used for new recovery) meanwhile, we only
check status of zfcp_port and LUN since these are only changed under
erp_lock elsewhere. Regarding the check of the proper status flags (port
or port_forced are similar to the shown adapter recovery):

[zfcp_erp_adapter_shutdown()]
zfcp_erp_adapter_reopen()
 zfcp_erp_adapter_block()
  * clear UNBLOCK ---------------------------------------+
 zfcp_scsi_schedule_rports_block()                       |
 write_lock_irqsave(&adapter->erp_lock, flags);-------+  |
 zfcp_erp_action_enqueue()                            |  |
  zfcp_erp_setup_act()                                |  |
   * set ERP_INUSE -----------------------------------|--|--+
 write_unlock_irqrestore(&adapter->erp_lock, flags);--+  |  |
.context-switch.                                         |  |
zfcp_erp_thread()                                        |  |
 zfcp_erp_strategy()                                     |  |
  write_lock_irqsave(&adapter->erp_lock, flags);------+  |  |
  ...                                                 |  |  |
  zfcp_erp_strategy_check_target()                    |  |  |
   zfcp_erp_strategy_check_adapter()                  |  |  |
    zfcp_erp_adapter_unblock()                        |  |  |
     * set UNBLOCK -----------------------------------|--+  |
  zfcp_erp_action_dequeue()                           |     |
   * clear ERP_INUSE ---------------------------------|-----+
  ...                                                 |
  write_unlock_irqrestore(&adapter->erp_lock, flags);-+

Hence, we should check for both UNBLOCK and ERP_INUSE because they are
interleaved.  Also we need to explicitly check ERP_FAILED for the link
down case which currently does not clear the UNBLOCK flag in
zfcp_fsf_link_down_info_eval().

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 8830271 ("[SCSI] zfcp: Dont fail SCSI commands when transitioning to blocked fc_rport")
Fixes: a2fa0ae ("[SCSI] zfcp: Block FC transport rports early on errors")
Fixes: 5f852be ("[SCSI] zfcp: Fix deadlock between zfcp ERP and SCSI")
Fixes: 338151e ("[SCSI] zfcp: make use of fc_remote_port_delete when target port is unavailable")
Fixes: 3859f6a ("[PATCH] zfcp: add rports to enable scsi_add_device to work again")
Cc: <stable@vger.kernel.org> #2.6.32+
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
aanisov pushed a commit that referenced this pull request Jan 16, 2017
…s enabled

Nils Holland and Klaus Ethgen have reported unexpected OOM killer
invocations with 32b kernel starting with 4.8 kernels

	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
	kworker/u4:5 cpuset=/ mems_allowed=0
	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
	[...]
	Mem-Info:
	active_anon:58685 inactive_anon:90 isolated_anon:0
	 active_file:274324 inactive_file:281962 isolated_file:0
	 unevictable:0 dirty:649 writeback:0 unstable:0
	 slab_reclaimable:40662 slab_unreclaimable:17754
	 mapped:7382 shmem:202 pagetables:351 bounce:0
	 free:206736 free_pcp:332 free_cma:0
	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
	lowmem_reserve[]: 0 813 3474 3474
	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
	lowmem_reserve[]: 0 0 21292 21292
	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

the oom killer is clearly pre-mature because there there is still a lot
of page cache in the zone Normal which should satisfy this lowmem
request.  Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.

The code simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which doesn't make any sense.  We can simply end up
always seeing the resulting active and inactive counts 0 and return
false.  This issue is not limited to 32b kernels but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocations requests
targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled.  Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

We are losing empty LRU but non-zero lru size detection introduced by
ca70723 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.

Fixes: f8d1a31 ("mm: consider whether to decivate based on eligible zones inactive ratio")
Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Nils Holland <nholland@tisys.org>
Tested-by: Nils Holland <nholland@tisys.org>
Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aanisov pushed a commit that referenced this pull request Jan 30, 2017
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to
the removal of the cpu hotplug notifiers.

Usually I don't care much about out of tree modules, but LTTNG is widely
used in distros. There are two ways to solve that:

1) Reserve a hotplug state for LTTNG

2) Add a dynamic range for the prepare states.

While #1 is the simplest solution, #2 is the proper one as we can convert
in tree users, which do not care about ordering, to the dynamic range as
well.

Add a dynamic range which allows LTTNG to request states in the prepare
stage.

Reported-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
aanisov pushed a commit that referenced this pull request Feb 6, 2017
Syzkaller fuzzer managed to trigger this:

    BUG: sleeping function called from invalid context at mm/shmem.c:852
    in_atomic(): 1, irqs_disabled(): 0, pid: 529, name: khugepaged
    3 locks held by khugepaged/529:
     #0:  (shrinker_rwsem){++++..}, at: [<ffffffff818d7ef1>] shrink_slab.part.59+0x121/0xd30 mm/vmscan.c:451
     #1:  (&type->s_umount_key#29){++++..}, at: [<ffffffff81a63630>] trylock_super+0x20/0x100 fs/super.c:392
     #2:  (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] spin_lock include/linux/spinlock.h:302 [inline]
     #2:  (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] shmem_unused_huge_shrink+0x28e/0x1490 mm/shmem.c:427
    CPU: 2 PID: 529 Comm: khugepaged Not tainted 4.10.0-rc5+ torvalds#201
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
       shmem_undo_range+0xb20/0x2710 mm/shmem.c:852
       shmem_truncate_range+0x27/0xa0 mm/shmem.c:939
       shmem_evict_inode+0x35f/0xca0 mm/shmem.c:1030
       evict+0x46e/0x980 fs/inode.c:553
       iput_final fs/inode.c:1515 [inline]
       iput+0x589/0xb20 fs/inode.c:1542
       shmem_unused_huge_shrink+0xbad/0x1490 mm/shmem.c:446
       shmem_unused_huge_scan+0x10c/0x170 mm/shmem.c:512
       super_cache_scan+0x376/0x450 fs/super.c:106
       do_shrink_slab mm/vmscan.c:378 [inline]
       shrink_slab.part.59+0x543/0xd30 mm/vmscan.c:481
       shrink_slab mm/vmscan.c:2592 [inline]
       shrink_node+0x2c7/0x870 mm/vmscan.c:2592
       shrink_zones mm/vmscan.c:2734 [inline]
       do_try_to_free_pages+0x369/0xc80 mm/vmscan.c:2776
       try_to_free_pages+0x3c6/0x900 mm/vmscan.c:2982
       __perform_reclaim mm/page_alloc.c:3301 [inline]
       __alloc_pages_direct_reclaim mm/page_alloc.c:3322 [inline]
       __alloc_pages_slowpath+0xa24/0x1c30 mm/page_alloc.c:3683
       __alloc_pages_nodemask+0x544/0xae0 mm/page_alloc.c:3848
       __alloc_pages include/linux/gfp.h:426 [inline]
       __alloc_pages_node include/linux/gfp.h:439 [inline]
       khugepaged_alloc_page+0xc2/0x1b0 mm/khugepaged.c:750
       collapse_huge_page+0x182/0x1fe0 mm/khugepaged.c:955
       khugepaged_scan_pmd+0xfdf/0x12a0 mm/khugepaged.c:1208
       khugepaged_scan_mm_slot mm/khugepaged.c:1727 [inline]
       khugepaged_do_scan mm/khugepaged.c:1808 [inline]
       khugepaged+0xe9b/0x1590 mm/khugepaged.c:1853
       kthread+0x326/0x3f0 kernel/kthread.c:227
       ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

The iput() from atomic context was a bad idea: if after igrab() somebody
else calls iput() and we left with the last inode reference, our iput()
would lead to inode eviction and therefore sleeping.

This patch should fix the situation.

Link: http://lkml.kernel.org/r/20170131093141.GA15899@node.shutemov.name
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
iartemenko pushed a commit that referenced this pull request Feb 20, 2018
rcar_dmac_sleep_suspend uses the spin_lock/spin_unlock functions to acquire
lock by commit 9d867d5 ("dmaengine: rcar-dmac: Support S2RAM"),
the same lock is also acquired in rcar_dmac_isr_channel of the interrupt
handler.

But, rcar_dmac_sleep_suspend is called with the interrupt enabled from
the suspend callback of the power manager interface, If an interrupt occurs
while suspend is acquiring a lock, it may cause a deadlock.

=================================
[ INFO: inconsistent lock state ]
4.9.0-yocto-standard #1 Not tainted
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
sh/2967 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&rchan->lock)->rlock){?.....}, at: [<ffff0000084cf1c0>] rcar_dmac_sleep_suspend+0x50/0x120
state was registered at:

[<ffff000008109014>] mark_lock+0x1c4/0x6c8

[<ffff00000810a638>] __lock_acquire+0xba0/0x1728

[<ffff00000810b51c>] lock_acquire+0x4c/0x68

[<ffff0000089e67e8>] _raw_spin_lock+0x40/0x58

[<ffff0000084cf530>] rcar_dmac_isr_channel+0x20/0x1e8

[<ffff000008115e74>] __handle_irq_event_percpu+0x9c/0x128

[<ffff000008115f1c>] handle_irq_event_percpu+0x1c/0x58

[<ffff000008115fa0>] handle_irq_event+0x48/0x78

[<ffff0000081198d8>] handle_fasteoi_irq+0xb8/0x1b0

[<ffff000008114f14>] generic_handle_irq+0x24/0x38

[<ffff0000081155dc>] __handle_domain_irq+0x5c/0xb8

[<ffff000008081588>] gic_handle_irq+0x58/0xb0

[<ffff0000080827b4>] el1_irq+0xb4/0x12c

[<ffff00000882b5f0>] cpuidle_enter_state+0x158/0x228

[<ffff00000882b6f8>] cpuidle_enter+0x18/0x20

[<ffff000008102948>] call_cpuidle+0x18/0x38

[<ffff000008102b84>] cpu_startup_entry+0x13c/0x1e0

[<ffff0000089dfea0>] rest_init+0x148/0x158

[<ffff000008e50b54>] start_kernel+0x38c/0x3a0

[<ffff000008e501d8>] __primary_switched+0x5c/0x64
irq event stamp: 41905
hardirqs last  enabled at (41905): [<ffff0000089e2fac>] mutex_lock_nested+0x2ec/0x390
hardirqs last disabled at (41904): [<ffff0000089e2d38>] mutex_lock_nested+0x78/0x390
softirqs last  enabled at (41496): [<ffff0000080c41c0>] __do_softirq+0x218/0x288
softirqs last disabled at (41489): [<ffff0000080c4594>] irq_exit+0xbc/0xf0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&rchan->lock)->rlock);
  <Interrupt>
    lock(&(&rchan->lock)->rlock);

 *** DEADLOCK ***

5 locks held by sh/2967:
 #0:  (sb_writers#4){.+.+.+}, at: [<ffff000008204400>] vfs_write+0x168/0x1b8
 #1:  (&of->mutex){+.+.+.}, at: [<ffff000008287600>] kernfs_fop_write+0x88/0x1e8
 #2:  (s_active#88){.+.+.+}, at: [<ffff000008287608>] kernfs_fop_write+0x90/0x1e8
 #3:  (pm_mutex){+.+.+.}, at: [<ffff000008110e34>] pm_suspend+0x54/0x268
 #4:  (&dev->mutex){......}, at: [<ffff0000085d8e84>] __device_suspend+0xcc/0x298

stack backtrace:
CPU: 3 PID: 2967 Comm: sh Not tainted 4.9.0-yocto-standard-00002-g658096e81b08 #1
Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 es2.0 (DT)
Call trace:
[<ffff000008088938>] dump_backtrace+0x0/0x1a8
[<ffff000008088af4>] show_stack+0x14/0x20
[<ffff0000083bc54c>] dump_stack+0xb4/0xf0
[<ffff000008187538>] print_usage_bug.part.24+0x264/0x27c
[<ffff000008108fa0>] mark_lock+0x150/0x6c8
[<ffff00000810a100>] __lock_acquire+0x668/0x1728
[<ffff00000810b51c>] lock_acquire+0x4c/0x68
[<ffff0000089e67e8>] _raw_spin_lock+0x40/0x58
[<ffff0000084cf1c0>] rcar_dmac_sleep_suspend+0x50/0x120
[<ffff0000085d8440>] dpm_run_callback.isra.7+0x20/0x68
[<ffff0000085d8ec8>] __device_suspend+0x110/0x298
[<ffff0000085da13c>] dpm_suspend+0x114/0x248
[<ffff0000085da568>] dpm_suspend_start+0x70/0x80
[<ffff000008110a28>] suspend_devices_and_enter+0xb8/0x470
[<ffff000008110fd4>] pm_suspend+0x1f4/0x268
[<ffff00000810fbe0>] state_store+0x80/0x100
[<ffff0000083bec0c>] kobj_attr_store+0x14/0x28
[<ffff0000082884e0>] sysfs_kf_write+0x60/0x70
[<ffff000008287630>] kernfs_fop_write+0xb8/0x1e8
[<ffff000008203524>] __vfs_write+0x1c/0x110
[<ffff000008204338>] vfs_write+0xa0/0x1b8
[<ffff00000820572c>] SyS_write+0x44/0xa0
[<ffff000008082f4c>] __sys_trace_return+0x0/0x4

This patch replaces spin_lock/spin_unlock with
spin_lock_irqsave/spin_unlock_irqrestore.

Fixes: 9d867d5 ("dmaengine: rcar-dmac: Support S2RAM")
Signed-off-by: Takeshi Kihara <takeshi.kihara.df@renesas.com>
otyshchenko1 pushed a commit that referenced this pull request May 7, 2019
…-xhci0-usb3 to v4.9/rcar-ctc

* commit 'd92cf210d86995efa9fa176aa893702411e68596':
  Activate XHCI (USB3.0) controller on r8a
  usb: host: xhci-plat: add firmware for the R-Car M3-W xHCI controllers
  usb: host: xhci-plat: add support for the R-Car H3 xHCI controllers
  xhci-rcar: add firmware for R-Car H2/M2 USB 3.0 host controller

(cherry picked from commit 980bba68f4409bb131913c09e3ab374a570fd81d)
(cherry picked from commit b904874e9ce5fb99bff74447a4a341045f431c3f)
otyshchenko1 pushed a commit that referenced this pull request May 7, 2019
During system reboot or halt, with lockdep enabled:

    ================================
    WARNING: inconsistent lock state
    4.18.0-rc1-salvator-x-00002-g9203dbec90a68103 #41 Tainted: G        W
    --------------------------------
    inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
    reboot/2779 [HC0[0]:SC0[0]:HE1:SE1] takes:
    0000000098ae4ad3 (&(&rchan->lock)->rlock){?.-.}, at: rcar_dmac_shutdown+0x58/0x6c
    {IN-HARDIRQ-W} state was registered at:
      lock_acquire+0x208/0x238
      _raw_spin_lock+0x40/0x54
      rcar_dmac_isr_channel+0x28/0x200
      __handle_irq_event_percpu+0x1c0/0x3c8
      handle_irq_event_percpu+0x34/0x88
      handle_irq_event+0x48/0x78
      handle_fasteoi_irq+0xc4/0x12c
      generic_handle_irq+0x18/0x2c
      __handle_domain_irq+0xa8/0xac
      gic_handle_irq+0x78/0xbc
      el1_irq+0xec/0x1c0
      arch_cpu_idle+0xe8/0x1bc
      default_idle_call+0x2c/0x30
      do_idle+0x144/0x234
      cpu_startup_entry+0x20/0x24
      rest_init+0x27c/0x290
      start_kernel+0x430/0x45c
    irq event stamp: 12177
    hardirqs last  enabled at (12177): [<ffffff800881d804>] _raw_spin_unlock_irq+0x2c/0x4c
    hardirqs last disabled at (12176): [<ffffff800881d638>] _raw_spin_lock_irq+0x1c/0x60
    softirqs last  enabled at (11948): [<ffffff8008081da8>] __do_softirq+0x160/0x4ec
    softirqs last disabled at (11935): [<ffffff80080ec948>] irq_exit+0xa0/0xfc

    other info that might help us debug this:
     Possible unsafe locking scenario:

	   CPU0
	   ----
      lock(&(&rchan->lock)->rlock);
      <Interrupt>
	lock(&(&rchan->lock)->rlock);

     *** DEADLOCK ***

    3 locks held by reboot/2779:
     #0: 00000000bfabfa74 (reboot_mutex){+.+.}, at: sys_reboot+0xdc/0x208
     #1: 00000000c75d8c3a (&dev->mutex){....}, at: device_shutdown+0xc8/0x1c4
     #2: 00000000ebec58ec (&dev->mutex){....}, at: device_shutdown+0xd8/0x1c4

    stack backtrace:
    CPU: 6 PID: 2779 Comm: reboot Tainted: G        W         4.18.0-rc1-salvator-x-00002-g9203dbec90a68103 #41
    Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
    Call trace:
     dump_backtrace+0x0/0x148
     show_stack+0x14/0x1c
     dump_stack+0xb0/0xf0
     print_usage_bug.part.26+0x1c4/0x27c
     mark_lock+0x38c/0x610
     __lock_acquire+0x3fc/0x14d4
     lock_acquire+0x208/0x238
     _raw_spin_lock+0x40/0x54
     rcar_dmac_shutdown+0x58/0x6c
     platform_drv_shutdown+0x20/0x2c
     device_shutdown+0x160/0x1c4
     kernel_restart_prepare+0x34/0x3c
     kernel_restart+0x14/0x5c
     sys_reboot+0x160/0x208
     el0_svc_naked+0x30/0x34

rcar_dmac_stop_all_chan() takes the channel lock while stopping a
channel, but does not disable interrupts, leading to a deadlock when a
DMAC interrupt comes in.  Before, the same code block was called from an
interrupt handler, hence taking the spinlock was sufficient.

Fix this by disabling local interrupts while taking the spinlock.

Fixes: 9203dbe ("dmaengine: rcar-dmac: don't use DMAC error interrupt")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
(cherry picked from commit 45c9a60)
Signed-off-by: Hiroyuki Yokoyama <hiroyuki.yokoyama.vx@renesas.com>
arminn pushed a commit that referenced this pull request Nov 28, 2019
Sqlite user			Background GC
				- move_data_block
				  : move page #1
				 - f2fs_is_atomic_file
- f2fs_ioc_start_atomic_write
- f2fs_ioc_commit_atomic_write
 - commit_inmem_pages
   : commit page #1 & set node #2 dirty
				 - f2fs_submit_page_write
				  - f2fs_update_data_blkaddr
				   - set_page_dirty
				     : set node #2 dirty
 - f2fs_do_sync_file
  - fsync_node_pages
   : commit node #1 & node #2, then sudden power-cut

In a race case, we may check FI_ATOMIC_FILE flag before starting atomic
write flow, then we will commit meta data before data with reversed
order, after a sudden pow-cut, database transaction will be inconsistent.

So we'd better to exclude gc/atomic_write to each other by using lock
instead of flag checking.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
arminn pushed a commit that referenced this pull request Nov 28, 2019
syzbot has tested the proposed patch but the reproducer still triggered crash:
kernel BUG at fs/f2fs/inode.c:LINE!

F2FS-fs (loop1): invalid crc value
F2FS-fs (loop5): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop5): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop5): invalid crc value
------------[ cut here ]------------
kernel BUG at fs/f2fs/inode.c:238!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4886 Comm: syz-executor1 Not tainted 4.17.0-rc1+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:do_read_inode fs/f2fs/inode.c:238 [inline]
RIP: 0010:f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313
RSP: 0018:ffff8801c44a70e8 EFLAGS: 00010293
RAX: ffff8801ce208040 RBX: ffff8801b3621080 RCX: ffffffff82eace18
F2FS-fs (loop2): Magic Mismatch, valid(0xf2f52010) - read(0x0)
RDX: 0000000000000000 RSI: ffffffff82eaf047 RDI: 0000000000000007
RBP: ffff8801c44a7410 R08: ffff8801ce208040 R09: ffffed0039ee4176
R10: ffffed0039ee4176 R11: ffff8801cf720bb7 R12: ffff8801c0efa000
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f753aa9d700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
------------[ cut here ]------------
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel BUG at fs/f2fs/inode.c:238!
CR2: 0000000001b03018 CR3: 00000001c8b74000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 f2fs_fill_super+0x4377/0x7bf0 fs/f2fs/super.c:2842
 mount_bdev+0x30c/0x3e0 fs/super.c:1165
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1268
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2517 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2847
 ksys_mount+0x12d/0x140 fs/namespace.c:3063
 __do_sys_mount fs/namespace.c:3077 [inline]
 __se_sys_mount fs/namespace.c:3074 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3074
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457daa
RSP: 002b:00007f753aa9cba8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000000 RCX: 0000000000457daa
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f753aa9cbf0
RBP: 0000000000000064 R08: 0000000020016a00 R09: 0000000020000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000064 R14: 00000000006fcb80 R15: 0000000000000000
RIP: do_read_inode fs/f2fs/inode.c:238 [inline] RSP: ffff8801c44a70e8
RIP: f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313 RSP: ffff8801c44a70e8
invalid opcode: 0000 [#2] SMP KASAN
---[ end trace 1cbcbec2156680bc ]---

Reported-and-tested-by: syzbot+41a1b341571f0952badb@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
iusyk pushed a commit that referenced this pull request Feb 7, 2021
…adlock

This patch fixes deadlock warning in removing/rescanning through sysfs
when CONFIG_PROVE_LOCKING is enabled.

The issue can be reproduced by these steps:
1. Enable CONFIG_PROVE_LOCKING via defconfig or menuconfig
2. Insert Ethernet card into PCIe CH0 and start up.
   After kernel starting up, execute the following command.
   echo 1 > /sys/class/pci_bus/0000\:00/device/0000\:00\:00.0/remove
3. Rescan PCI device by this command
   echo 1 > /sys/class/pci_bus/0000\:00/rescan

The deadlock warnings will occur.
============================================
WARNING: possible recursive locking detected
4.14.70-ltsi-yocto-standard #27 Not tainted
--------------------------------------------
sh/3402 is trying to acquire lock:
 (kn->count#78){++++}, at: kernfs_remove_by_name_ns+0x50/0xa8

but task is already holding lock:
 (kn->count#78){++++}, at: kernfs_remove_self+0xe0/0x130

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(kn->count#78);
  lock(kn->count#78);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by sh/3402:
 #0:  (sb_writers#4){.+.+}, at: vfs_write+0x198/0x1b0
 #1:  (&of->mutex){+.+.}, at: kernfs_fop_write+0x108/0x210
 #2:  (kn->count#78){++++}, at: kernfs_remove_self+0xe0/0x130
 #3:  (pci_rescan_remove_lock){+.+.}, at: pci_lock_rescan_remove+0x1c/0x28

stack backtrace:
CPU: 3 PID: 3402 Comm: sh Not tainted 4.14.70-ltsi-yocto-standard #27
Hardware name: Renesas Salvator-X 2nd version board based on r8a7795
ES3.0+ with 8GiB (4 x 2 GiB) (DT)
Call trace:
dump_backtrace+0x0/0x3d8
show_stack+0x14/0x20
dump_stack+0xbc/0xf4
__lock_acquire+0x930/0x18a8
lock_acquire+0x48/0x68
__kernfs_remove+0x280/0x2f8
kernfs_remove_by_name_ns+0x50/0xa8
remove_files.isra.0+0x38/0x78
sysfs_remove_group+0x4c/0xa0
sysfs_remove_groups+0x38/0x60
device_remove_attrs+0x54/0x78
device_del+0x1ac/0x308
pci_remove_bus_device+0x78/0xf8
pci_remove_bus_device+0x34/0xf8
pci_stop_and_remove_bus_device_locked+0x24/0x38
remove_store+0x6c/0x78
dev_attr_store+0x18/0x28
sysfs_kf_write+0x4c/0x78
kernfs_fop_write+0x138/0x210
__vfs_write+0x18/0x118
vfs_write+0xa4/0x1b0
SyS_write+0x48/0xb0

This warning occurs due to a self-deletion attribute using in the sysfs
PCI device directory. This kind of attribute is really tricky,
it does not allow pci framework drop this attribute until all active
.show() and .store() callbacks have finished unless
sysfs_break_active_protection() is called.
Hence this patch avoids writing into this attribute triggers a deadlock.

Referrence commit 5b55b24 ("scsi: core: Avoid that SCSI device
removal through sysfs triggers a deadlock")
of scsi driver

Signed-off-by: Tho Vu <tho.vu.wh@rvc.renesas.com>
varder pushed a commit to varder/linux that referenced this pull request Jun 8, 2021
Prarit reported that depending on the affinity setting the

 ' irq $N: Affinity broken due to vector space exhaustion.'

message is showing up in dmesg, but the vector space on the CPUs in the
affinity mask is definitely not exhausted.

Shung-Hsi provided traces and analysis which pinpoints the problem:

The ordering of trying to assign an interrupt vector in
assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
valid node assigned. It does:

 1) Try the intersection of affinity mask and node mask
 2) Try the node mask
 3) Try the full affinity mask
 4) Try the full online mask

Obviously xen-troops#2 and xen-troops#3 are in the wrong order as the requested affinity
mask has to take precedence.

In the observed cases xen-troops#1 failed because the affinity mask did not contain
CPUs from node 0. That made it allocate a vector from node 0, thereby
breaking affinity and emitting the misleading message.

Revert the order of xen-troops#2 and xen-troops#3 so the full affinity mask without the node
intersection is tried before actually affinity is broken.

If no node is assigned then only the full affinity mask and if that fails
the full online mask is tried.

Fixes: d6ffc6a ("x86/vector: Respect affinity mask in irq descriptor")
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de
arminn pushed a commit to arminn/linux that referenced this pull request Aug 9, 2021
commit 510b80a upstream.

When user space brings PKRU into init state, then the kernel handling is
broken:

  T1 user space
     xsave(state)
     state.header.xfeatures &= ~XFEATURE_MASK_PKRU;
     xrstor(state)

  T1 -> kernel
     schedule()
       XSAVE(S) -> T1->xsave.header.xfeatures[PKRU] == 0
       T1->flags |= TIF_NEED_FPU_LOAD;

       wrpkru();

     schedule()
       ...
       pk = get_xsave_addr(&T1->fpu->state.xsave, XFEATURE_PKRU);
       if (pk)
	 wrpkru(pk->pkru);
       else
	 wrpkru(DEFAULT_PKRU);

Because the xfeatures bit is 0 and therefore the value in the xsave
storage is not valid, get_xsave_addr() returns NULL and switch_to()
writes the default PKRU. -> FAIL xen-troops#1!

So that wrecks any copy_to/from_user() on the way back to user space
which hits memory which is protected by the default PKRU value.

Assumed that this does not fail (pure luck) then T1 goes back to user
space and because TIF_NEED_FPU_LOAD is set it ends up in

  switch_fpu_return()
      __fpregs_load_activate()
        if (!fpregs_state_valid()) {
  	 load_XSTATE_from_task();
        }

But if nothing touched the FPU between T1 scheduling out and back in,
then the fpregs_state is still valid which means switch_fpu_return()
does nothing and just clears TIF_NEED_FPU_LOAD. Back to user space with
DEFAULT_PKRU loaded. -> FAIL xen-troops#2!

The fix is simple: if get_xsave_addr() returns NULL then set the
PKRU value to 0 instead of the restrictive default PKRU value in
init_pkru_value.

 [ bp: Massage in minor nitpicks from folks. ]

Fixes: 0cecca9 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144346.045616965@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
arminn pushed a commit to arminn/linux that referenced this pull request Aug 9, 2021
commit f9dfb5e upstream.

The XSAVE init code initializes all enabled and supported components with
XRSTOR(S) to init state. Then it XSAVEs the state of the components back
into init_fpstate which is used in several places to fill in the init state
of components.

This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
those use the init optimization and skip writing state of components which
are in init state. So init_fpstate.xsave still contains all zeroes after
this operation.

There are two ways to solve that:

   1) Use XSAVE unconditionally, but that requires to reshuffle the buffer when
      XSAVES is enabled because XSAVES uses compacted format.

   2) Save the components which are known to have a non-zero init state by other
      means.

Looking deeper, xen-troops#2 is the right thing to do because all components the
kernel supports have all-zeroes init state except the legacy features (FP,
SSE). Those cannot be hard coded because the states are not identical on all
CPUs, but they can be saved with FXSAVE which avoids all conditionals.

Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
a BUILD_BUG_ON() which reminds developers to validate that a newly added
component has all zeroes init state. As a bonus remove the now unused
copy_xregs_to_kernel_booting() crutch.

The XSAVE and reshuffle method can still be implemented in the unlikely
case that components are added which have a non-zero init state and no
other means to save them. For now, FXSAVE is just simple and good enough.

  [ bp: Fix a typo or two in the text. ]

Fixes: 6bad06b ("x86, xsave: Use xsaveopt in context-switch path when supported")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
arminn pushed a commit to arminn/linux that referenced this pull request Aug 9, 2021
commit f54b3ca upstream.

This reverts commit 1815d9c.

Unfortunately this inverts the locking hierarchy, so back to the
drawing board. Full lockdep splat below:

======================================================
WARNING: possible circular locking dependency detected
5.13.0-rc7-CI-CI_DRM_10254+ xen-troops#1 Not tainted
------------------------------------------------------
kms_frontbuffer/1087 is trying to acquire lock:
ffff88810dcd01a8 (&dev->master_mutex){+.+.}-{3:3}, at: drm_is_current_master+0x1b/0x40
but task is already holding lock:
ffff88810dcd0488 (&dev->mode_config.mutex){+.+.}-{3:3}, at: drm_mode_getconnector+0x1c6/0x4a0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> xen-troops#2 (&dev->mode_config.mutex){+.+.}-{3:3}:
       __mutex_lock+0xab/0x970
       drm_client_modeset_probe+0x22e/0xca0
       __drm_fb_helper_initial_config_and_unlock+0x42/0x540
       intel_fbdev_initial_config+0xf/0x20 [i915]
       async_run_entry_fn+0x28/0x130
       process_one_work+0x26d/0x5c0
       worker_thread+0x37/0x380
       kthread+0x144/0x170
       ret_from_fork+0x1f/0x30
-> xen-troops#1 (&client->modeset_mutex){+.+.}-{3:3}:
       __mutex_lock+0xab/0x970
       drm_client_modeset_commit_locked+0x1c/0x180
       drm_client_modeset_commit+0x1c/0x40
       __drm_fb_helper_restore_fbdev_mode_unlocked+0x88/0xb0
       drm_fb_helper_set_par+0x34/0x40
       intel_fbdev_set_par+0x11/0x40 [i915]
       fbcon_init+0x270/0x4f0
       visual_init+0xc6/0x130
       do_bind_con_driver+0x1e5/0x2d0
       do_take_over_console+0x10e/0x180
       do_fbcon_takeover+0x53/0xb0
       register_framebuffer+0x22d/0x310
       __drm_fb_helper_initial_config_and_unlock+0x36c/0x540
       intel_fbdev_initial_config+0xf/0x20 [i915]
       async_run_entry_fn+0x28/0x130
       process_one_work+0x26d/0x5c0
       worker_thread+0x37/0x380
       kthread+0x144/0x170
       ret_from_fork+0x1f/0x30
-> #0 (&dev->master_mutex){+.+.}-{3:3}:
       __lock_acquire+0x151e/0x2590
       lock_acquire+0xd1/0x3d0
       __mutex_lock+0xab/0x970
       drm_is_current_master+0x1b/0x40
       drm_mode_getconnector+0x37e/0x4a0
       drm_ioctl_kernel+0xa8/0xf0
       drm_ioctl+0x1e8/0x390
       __x64_sys_ioctl+0x6a/0xa0
       do_syscall_64+0x39/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of: &dev->master_mutex --> &client->modeset_mutex --> &dev->mode_config.mutex
 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&dev->mode_config.mutex);
                               lock(&client->modeset_mutex);
                               lock(&dev->mode_config.mutex);
  lock(&dev->master_mutex);
arminn pushed a commit to arminn/linux that referenced this pull request Aug 9, 2021
[ Upstream commit 85e8b03 ]

syzbot complained in neigh_reduce(), because rcu_read_lock_bh()
is treated differently than rcu_read_lock()

WARNING: suspicious RCU usage
5.13.0-rc6-syzkaller #0 Not tainted
-----------------------------
include/net/addrconf.h:313 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by kworker/0:0/5:
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:617 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2247
 xen-troops#1: ffffc90000ca7da8 ((work_completion)(&port->wq)){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2251
 xen-troops#2: ffffffff8bf795c0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x1da/0x3130 net/core/dev.c:4180

stack backtrace:
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.13.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events ipvlan_process_multicast
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 __in6_dev_get include/net/addrconf.h:313 [inline]
 __in6_dev_get include/net/addrconf.h:311 [inline]
 neigh_reduce drivers/net/vxlan.c:2167 [inline]
 vxlan_xmit+0x34d5/0x4c30 drivers/net/vxlan.c:2919
 __netdev_start_xmit include/linux/netdevice.h:4944 [inline]
 netdev_start_xmit include/linux/netdevice.h:4958 [inline]
 xmit_one net/core/dev.c:3654 [inline]
 dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3670
 __dev_queue_xmit+0x2133/0x3130 net/core/dev.c:4246
 ipvlan_process_multicast+0xa99/0xd70 drivers/net/ipvlan/ipvlan_core.c:287
 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276
 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422
 kthread+0x3b1/0x4a0 kernel/kthread.c:313
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

Fixes: f564f45 ("vxlan: add ipv6 proxy support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
arminn pushed a commit to arminn/linux that referenced this pull request Aug 9, 2021
[ Upstream commit d676598 ]

Patch was based on wrong presumption that be_poll can be called only
from bh context. It reintroducing old regression (also reverted) and
causing deadlock when we use netconsole with benet in bonding.

Old revert: commit 072a9c4 ("netpoll: revert 6bdb7fe and fix
be_poll() instead")

[  331.269715] bond0: (slave enp0s7f0): Releasing backup interface
[  331.270121] CPU: 4 PID: 1479 Comm: ifenslave Not tainted 5.13.0-rc7+ xen-troops#2
[  331.270122] Call Trace:
[  331.270122] [c00000001789f200] [c0000000008c505c] dump_stack+0x100/0x174 (unreliable)
[  331.270124] [c00000001789f240] [c008000001238b9c] be_poll+0x64/0xe90 [be2net]
[  331.270125] [c00000001789f330] [c000000000d1e6e4] netpoll_poll_dev+0x174/0x3d0
[  331.270127] [c00000001789f400] [c008000001bc167c] bond_poll_controller+0xb4/0x130 [bonding]
[  331.270128] [c00000001789f450] [c000000000d1e624] netpoll_poll_dev+0xb4/0x3d0
[  331.270129] [c00000001789f520] [c000000000d1ed88] netpoll_send_skb+0x448/0x470
[  331.270130] [c00000001789f5d0] [c0080000011f14f8] write_msg+0x180/0x1b0 [netconsole]
[  331.270131] [c00000001789f640] [c000000000230c0c] console_unlock+0x54c/0x790
[  331.270132] [c00000001789f7b0] [c000000000233098] vprintk_emit+0x2d8/0x450
[  331.270133] [c00000001789f810] [c000000000234758] vprintk+0xc8/0x270
[  331.270134] [c00000001789f850] [c000000000233c28] printk+0x40/0x54
[  331.270135] [c00000001789f870] [c000000000ccf908] __netdev_printk+0x150/0x198
[  331.270136] [c00000001789f910] [c000000000ccfdb4] netdev_info+0x68/0x94
[  331.270137] [c00000001789f950] [c008000001bcbd70] __bond_release_one+0x188/0x6b0 [bonding]
[  331.270138] [c00000001789faa0] [c008000001bcc6f4] bond_do_ioctl+0x42c/0x490 [bonding]
[  331.270139] [c00000001789fb60] [c000000000d0d17c] dev_ifsioc+0x17c/0x400
[  331.270140] [c00000001789fbc0] [c000000000d0db70] dev_ioctl+0x390/0x890
[  331.270141] [c00000001789fc10] [c000000000c7c76c] sock_do_ioctl+0xac/0x1b0
[  331.270142] [c00000001789fc90] [c000000000c7ffac] sock_ioctl+0x31c/0x6e0
[  331.270143] [c00000001789fd60] [c0000000005b9728] sys_ioctl+0xf8/0x150
[  331.270145] [c00000001789fdb0] [c0000000000336c0] system_call_exception+0x160/0x2f0
[  331.270146] [c00000001789fe10] [c00000000000d35c] system_call_common+0xec/0x278
[  331.270147] --- interrupt: c00 at 0x7fffa6c6ec00
[  331.270147] NIP:  00007fffa6c6ec00 LR: 0000000105c4185c CTR: 0000000000000000
[  331.270148] REGS: c00000001789fe80 TRAP: 0c00   Not tainted  (5.13.0-rc7+)
[  331.270148] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000428  XER: 00000000
[  331.270155] IRQMASK: 0
[  331.270156] GPR00: 0000000000000036 00007fffd494d5b0 00007fffa6d57100 0000000000000003
[  331.270158] GPR04: 0000000000008991 00007fffd494d6d0 0000000000000008 00007fffd494f28c
[  331.270161] GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
[  331.270164] GPR12: 0000000000000000 00007fffa6dfa220 0000000000000000 0000000000000000
[  331.270167] GPR16: 0000000105c44880 0000000000000000 0000000105c60088 0000000105c60318
[  331.270170] GPR20: 0000000105c602c0 0000000105c44560 0000000000000000 0000000000000000
[  331.270172] GPR24: 00007fffd494dc50 00007fffd494d6a8 0000000105c60008 00007fffd494d6d0
[  331.270175] GPR28: 00007fffd494f27e 0000000105c6026c 00007fffd494f284 0000000000000000
[  331.270178] NIP [00007fffa6c6ec00] 0x7fffa6c6ec00
[  331.270178] LR [0000000105c4185c] 0x105c4185c
[  331.270179] --- interrupt: c00

This reverts commit d0d006a.

Fixes: d0d006a ("be2net: disable bh with spin_lock in be_process_mcc")
Signed-off-by: Petr Oros <poros@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
lorc pushed a commit that referenced this pull request May 26, 2022
…adlock

This patch fixes deadlock warning in removing/rescanning through sysfs
when CONFIG_PROVE_LOCKING is enabled.

The issue can be reproduced by these steps:
1. Enable CONFIG_PROVE_LOCKING via defconfig or menuconfig
2. Insert Ethernet card into PCIe CH0 and start up.
   After kernel starting up, execute the following command.
   echo 1 > /sys/class/pci_bus/0000\:00/device/0000\:00\:00.0/remove
3. Rescan PCI device by this command
   echo 1 > /sys/class/pci_bus/0000\:00/bus_rescan

The deadlock warnings will occur.
============================================
WARNING: possible recursive locking detected
4.14.70-ltsi-yocto-standard #27 Not tainted
--------------------------------------------
sh/3402 is trying to acquire lock:
 (kn->count#78){++++}, at: kernfs_remove_by_name_ns+0x50/0xa8

but task is already holding lock:
 (kn->count#78){++++}, at: kernfs_remove_self+0xe0/0x130

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(kn->count#78);
  lock(kn->count#78);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by sh/3402:
 #0:  (sb_writers#4){.+.+}, at: vfs_write+0x198/0x1b0
 #1:  (&of->mutex){+.+.}, at: kernfs_fop_write+0x108/0x210
 #2:  (kn->count#78){++++}, at: kernfs_remove_self+0xe0/0x130
 #3:  (pci_rescan_remove_lock){+.+.}, at: pci_lock_rescan_remove+0x1c/0x28

stack backtrace:
CPU: 3 PID: 3402 Comm: sh Not tainted 4.14.70-ltsi-yocto-standard #27
Hardware name: Renesas Salvator-X 2nd version board based on r8a7795
ES3.0+ with 8GiB (4 x 2 GiB) (DT)
Call trace:
dump_backtrace+0x0/0x3d8
show_stack+0x14/0x20
dump_stack+0xbc/0xf4
__lock_acquire+0x930/0x18a8
lock_acquire+0x48/0x68
__kernfs_remove+0x280/0x2f8
kernfs_remove_by_name_ns+0x50/0xa8
remove_files.isra.0+0x38/0x78
sysfs_remove_group+0x4c/0xa0
sysfs_remove_groups+0x38/0x60
device_remove_attrs+0x54/0x78
device_del+0x1ac/0x308
pci_remove_bus_device+0x78/0xf8
pci_remove_bus_device+0x34/0xf8
pci_stop_and_remove_bus_device_locked+0x24/0x38
remove_store+0x6c/0x78
dev_attr_store+0x18/0x28
sysfs_kf_write+0x4c/0x78
kernfs_fop_write+0x138/0x210
__vfs_write+0x18/0x118
vfs_write+0xa4/0x1b0
SyS_write+0x48/0xb0

This warning occurs due to a self-deletion attribute using in the sysfs
PCI device directory. This kind of attribute is really tricky,
it does not allow pci framework drop this attribute until all active
.show() and .store() callbacks have finished unless
sysfs_break_active_protection() is called.
Hence this patch avoids writing into this attribute triggers a deadlock.

Referrence commit 5b55b24 ("scsi: core: Avoid that SCSI device
removal through sysfs triggers a deadlock")
of scsi driver

Signed-off-by: Tho Vu <tho.vu.wh@renesas.com>
Signed-off-by: Hoang Vo <hoang.vo.eb@renesas.com>
lorc pushed a commit that referenced this pull request Jul 16, 2024
…adlock

This patch fixes deadlock warning in removing/rescanning through sysfs
when CONFIG_PROVE_LOCKING is enabled.

The issue can be reproduced by these steps:
1. Enable CONFIG_PROVE_LOCKING via defconfig or menuconfig
2. Insert Ethernet card into PCIe CH0 and start up.
   After kernel starting up, execute the following command.
   echo 1 > /sys/class/pci_bus/0000\:00/device/0000\:00\:00.0/remove
3. Rescan PCI device by this command
   echo 1 > /sys/class/pci_bus/0000\:00/bus_rescan

The deadlock warnings will occur.
============================================
WARNING: possible recursive locking detected
4.14.70-ltsi-yocto-standard #27 Not tainted
--------------------------------------------
sh/3402 is trying to acquire lock:
 (kn->count#78){++++}, at: kernfs_remove_by_name_ns+0x50/0xa8

but task is already holding lock:
 (kn->count#78){++++}, at: kernfs_remove_self+0xe0/0x130

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(kn->count#78);
  lock(kn->count#78);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by sh/3402:
 #0:  (sb_writers#4){.+.+}, at: vfs_write+0x198/0x1b0
 #1:  (&of->mutex){+.+.}, at: kernfs_fop_write+0x108/0x210
 #2:  (kn->count#78){++++}, at: kernfs_remove_self+0xe0/0x130
 #3:  (pci_rescan_remove_lock){+.+.}, at: pci_lock_rescan_remove+0x1c/0x28

stack backtrace:
CPU: 3 PID: 3402 Comm: sh Not tainted 4.14.70-ltsi-yocto-standard #27
Hardware name: Renesas Salvator-X 2nd version board based on r8a7795
ES3.0+ with 8GiB (4 x 2 GiB) (DT)
Call trace:
dump_backtrace+0x0/0x3d8
show_stack+0x14/0x20
dump_stack+0xbc/0xf4
__lock_acquire+0x930/0x18a8
lock_acquire+0x48/0x68
__kernfs_remove+0x280/0x2f8
kernfs_remove_by_name_ns+0x50/0xa8
remove_files.isra.0+0x38/0x78
sysfs_remove_group+0x4c/0xa0
sysfs_remove_groups+0x38/0x60
device_remove_attrs+0x54/0x78
device_del+0x1ac/0x308
pci_remove_bus_device+0x78/0xf8
pci_remove_bus_device+0x34/0xf8
pci_stop_and_remove_bus_device_locked+0x24/0x38
remove_store+0x6c/0x78
dev_attr_store+0x18/0x28
sysfs_kf_write+0x4c/0x78
kernfs_fop_write+0x138/0x210
__vfs_write+0x18/0x118
vfs_write+0xa4/0x1b0
SyS_write+0x48/0xb0

This warning occurs due to a self-deletion attribute using in the sysfs
PCI device directory. This kind of attribute is really tricky,
it does not allow pci framework drop this attribute until all active
.show() and .store() callbacks have finished unless
sysfs_break_active_protection() is called.
Hence this patch avoids writing into this attribute triggers a deadlock.

Referrence commit 5b55b24 ("scsi: core: Avoid that SCSI device
removal through sysfs triggers a deadlock")
of scsi driver

Signed-off-by: Tho Vu <tho.vu.wh@renesas.com>
Signed-off-by: Hoang Vo <hoang.vo.eb@renesas.com>
Signed-off-by: Tin Tran <tin.tran.xk@renesas.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.