Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bcm2836 irq #3

Closed
wants to merge 27 commits into from
Closed

Conversation

andreamerello
Copy link

Fixed local IRQ base irq number was incorrect.
RPI2 boots now :)

NOTE: The of_iomap of 2nd register address space works only appending the DTB to the zImage, selecting CONFIG_ARM_APPENDED_DTB in kernel CFG: strong suspect that the bootloader (at least mine) pretends to patch the DTB while loading it, messing it up.

IPI send implemented (hopefully) in complementary manner to the current IPI receive code. Yet to be tested.

lategoodbye and others added 24 commits February 25, 2015 14:36
Since the prefix is already in use, we need to add it in the
vendor list.

Reviewed-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
This patch adds root compatible properties for the following boards:
- Raspberry Pi Model A
- Raspberry Pi Model A+
- Raspberry Pi Model B
- Raspberry Pi Model B (no P5)
- Raspberry Pi Model B rev2
- Raspberry Pi Model B+
- Raspberry Pi Compute Module

Reviewed-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
This new header file defines pincontrol constants to use
from bcm2835 DTS files for pincontrol properties option.

Reviewed-by: Stephen Warren <swarren@wwwdotorg.org>
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
This patch converts all bcm2835 dts and dtsi files to use the pinctrl
header file.

Reviewed-by: Stephen Warren <swarren@wwwdotorg.org>
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Device tree node names should contain the node's reg property address value.
The i2c0 node was apparently forgotten in commit 25b2f1b (ARM: bcm2835:
node name unit address cleanup).

Acked-by: Stephen Warren <swarren@wwwdotorg.org>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Eric Anholt <eric@anholt.net>
The 2836 dts will inherit all of this, because it shares the same
peripherals as 2835.

Signed-off-by: Eric Anholt <eric@anholt.net>
This is not hooked up to anything yet, but it means that our later
2836-specific changes will be more obvious.  The bcm2836-rpi-2-b.dts
actually comes from bcm2835-rpi-2-b-plus, which apparently is the
closest.

Signed-off-by: Eric Anholt <eric@anholt.net>
The bus address for peripherals comes from the platform.h diff between
2708 and 2709, and the CPU's DT definition is based on the 2709 DT.

Signed-off-by: Eric Anholt <eric@anholt.net>
This port to the Raspberry Pi 2 comes up with SD card and network
working.  It doesn't yet support SMP, the ARM local timer, or PMU
events, and serial is fixed in a follow-on patch.

v2: Implement Arnd's feedback to not split to ARCH_BCM2836, and
    instead use more conditionals in ARCH_BCM2835.  Also reduce diff
    between 2835 and 2836.

Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Eric Anholt <eric@anholt.net>
We were previously relying on the fixed clock registration in
clk-bcm2835, but there doesn't seem to be any real reason to not just
define it in the DT (and for the 2836 port, I would have needed to
change the clock's physical address in clk-bcm2835.c).  Also, because
we weren't registering the apb_pclk in clk-bcm2835 as a clock device,
we were picking up the uart clock node as apb_pclk by accident.

This gets serial working on 2836.

Signed-off-by: Eric Anholt <eric@anholt.net>
Implement BCM2835 mailbox support as a device registered with the
general purpose mailbox framework. Implementation based on commits by
Lubomir Rintel [1], Suman Anna and Jassi Brar [2] on which to base the
implementation.

[1] http://lists.infradead.org/pipermail/linux-rpi-kernel/2013-April/000528.html
[2] http://lists.infradead.org/pipermail/linux-rpi-kernel/2013-May/000546.html

v2: Squashed Craig's work for review, carried over to new version of
    Mailbox framework (changes by Lubomir)

v3: Fix multi-line comment style.  Refer to the documentation by
    filename.  Only declare one MODULE_AUTHOR.  Alphabetize includes.
    Drop some excessive dev_dbg()s (changes by anholt).

v4: Use the new bcm2835_peripheral_read_workaround(), drop the
    unnecessary wmb()s, make the messages be a pointer to u32, rather
    than u32-cast-as-pointer, fold in small static functions, drop
    extra error messages, clean up sizeof() arg for malloc, disable
    interrupts on unload (changes by anholt).

v5: Minor style fix.  Drop peripheral_read_workaround() since it's
    been noted that readl() covers the non-interrupt half of it, and I
    found that we would need to resolve read ordering in the interrupt
    handler anyway (changes by anholt).

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: Craig McGeachie <slapdau@yahoo.com.au>
Signed-off-by: Suman Anna <s-anna@ti.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Jassi Brar <jassisinghbrar@gmail.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: linux-rpi-kernel@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Eric Anholt <eric@anholt.net>
Acked-by: Lee Jones <lee.jones@linaro.org>
If we've declared a power domain in the OF, and the OF node is found
but the requested domain hasn't been registered on it yet, then we
probably have just tried to probe before the power domain driver has.
Defer our device's probe until it shows up.

Signed-off-by: Eric Anholt <eric@anholt.net>
This gives us control of the 3 power domains provided through the
firmware's mailbox.

v2: Convert from a function call to enable USB to an actual power
    domain so that USB's need for power can be expressed only in the
    devicetree, and add support for the other power domains at the
    same time. Move from drivers/mailbox/ to arch/arm/mach-*, where
    other power domain drivers appear to live.

v3: Fix some style, drop useless error message, rely on core return
    for probe defer, don't bother to devm_kfree() in probe error path.

v4: Fix copy-and-pasted bit twiddling for power off.  Fix return value
    of bcm_mbox_set_power() in case of repeated success.

Signed-off-by: Eric Anholt <eric@anholt.net>

Conflicts:
	arch/arm/mach-bcm/Kconfig
	arch/arm/mach-bcm/Makefile
…the DT.

This gets the USB power domain turned on so that DWC2 can probe, even
if the bootloader didn't turn it on for us (which is the case if you
use the RPi firmware without chainloading U-Boot).

Signed-off-by: Eric Anholt <eric@anholt.net>
@anholt
Copy link
Owner

anholt commented May 5, 2015

Could you provide a Signed-off-by according to Documentation/SubmittingPatches? In the comments here would be fine.

@andreamerello
Copy link
Author

I'm sorry, I just used to sign patches generated off my local git tree; I'm quite new to pull-request-based workflow and I didn't notice the signed-off-by was missing in commits themselves...

Please use this for the commits in this pull-request:
Signed-off-by: Andrea Merello andrea.merello@gmail.com

BTW I did some testing about IPIs code in this pull-request, and it seems working. I could see IPIs traveling across CPUs during boot by using debug prints (I'm booting all four cores with code in my bcm2386-smp branch, (but SMP boot does not reach completion yet)).

LOCAL_TIMER_INT_CONTROL0 (and well as it's siblings for other cores)
are actually bitfield.
We need to convert the integer irq index accordingly.

Signed-off-by: Andrea Merello <andrea.merello@gmail.com>
Masking/unmasking IRQs has to be done in different manner depending
by the IRQ number (most notably depending if it falls inside the
per-core IRQ bank), and IRQ index off the first per-core IRQ must
eventually be extracted.

Static constants seems to refer to HW irq numbers, but they was
compared to irq_data->irq. Change in irq_data->hw_irq.

Signed-off-by: Andrea Merello <andrea.merello@gmail.com>
No any functional change

Signed-off-by: Andrea Merello <andrea.merello@gmail.com>
anholt pushed a commit that referenced this pull request Jan 17, 2018
…s instances left.

commit b69f63e upstream.

Unregistering the driver before calling cpuhp_remove_multi_state() removes
any remaining hotplug cpu instances so __cpuhp_remove_state_cpuslocked()
doesn't emit this warning:

[  268.748362] Error: Removing state 147 which has instances left.
[  268.748373] ------------[ cut here ]------------
[  268.748386] WARNING: CPU: 2 PID: 5476 at kernel/cpu.c:1734 __cpuhp_remove_state_cpuslocked+0x454/0x4f0
[  268.748389] Modules linked in: arm_ccn(-) [last unloaded: arm_ccn]
[  268.748403] CPU: 2 PID: 5476 Comm: rmmod Tainted: G        W       4.14.0-rc4+ #3
[  268.748406] Hardware name: AMD Seattle/Seattle, BIOS 10:18:39 Dec  8 2016
[  268.748410] task: ffff8001a18ca000 task.stack: ffff80019c120000
[  268.748416] PC is at __cpuhp_remove_state_cpuslocked+0x454/0x4f0
[  268.748421] LR is at __cpuhp_remove_state_cpuslocked+0x448/0x4f0
[  268.748425] pc : [<ffff2000081729ec>] lr : [<ffff2000081729e0>] pstate: 60000145
[  268.748427] sp : ffff80019c127d30
[  268.748430] x29: ffff80019c127d30 x28: ffff8001a18ca000
[  268.748437] x27: ffff20000c2cb000 x26: 1fffe4000042d490
[  268.748443] x25: ffff20000216a480 x24: 0000000000000000
[  268.748449] x23: ffff20000b08e000 x22: 0000000000000001
[  268.748455] x21: 0000000000000093 x20: 00000000000016f8
[  268.748460] x19: ffff20000c2cbb80 x18: 0000ffffb5fe7c58
[  268.748466] x17: 00000000004402d0 x16: 1fffe40001864f01
[  268.748472] x15: ffff20000c4bf8b0 x14: 0000000000000000
[  268.748477] x13: 0000000000007032 x12: ffff20000829ae48
[  268.748483] x11: ffff20000c4bf000 x10: 0000000000000004
[  268.748488] x9 : 0000000000006fbc x8 : ffff20000c318a40
[  268.748494] x7 : 0000000000000000 x6 : ffff040001864f02
[  268.748500] x5 : 0000000000000000 x4 : 0000000000000000
[  268.748505] x3 : 0000000000000007 x2 : dfff200000000000
[  268.748510] x1 : 000000000000ad3d x0 : 00000000000001f0
[  268.748516] Call trace:
[  268.748521] Exception stack(0xffff80019c127bf0 to 0xffff80019c127d30)
[  268.748526] 7be0:                                   00000000000001f0 000000000000ad3d
[  268.748531] 7c00: dfff200000000000 0000000000000007 0000000000000000 0000000000000000
[  268.748535] 7c20: ffff040001864f02 0000000000000000 ffff20000c318a40 0000000000006fbc
[  268.748539] 7c40: 0000000000000004 ffff20000c4bf000 ffff20000829ae48 0000000000007032
[  268.748544] 7c60: 0000000000000000 ffff20000c4bf8b0 1fffe40001864f01 00000000004402d0
[  268.748548] 7c80: 0000ffffb5fe7c58 ffff20000c2cbb80 00000000000016f8 0000000000000093
[  268.748553] 7ca0: 0000000000000001 ffff20000b08e000 0000000000000000 ffff20000216a480
[  268.748557] 7cc0: 1fffe4000042d490 ffff20000c2cb000 ffff8001a18ca000 ffff80019c127d30
[  268.748562] 7ce0: ffff2000081729e0 ffff80019c127d30 ffff2000081729ec 0000000060000145
[  268.748566] 7d00: 00000000000001f0 0000000000000000 0001000000000000 0000000000000000
[  268.748569] 7d20: ffff80019c127d30 ffff2000081729ec
[  268.748575] [<ffff2000081729ec>] __cpuhp_remove_state_cpuslocked+0x454/0x4f0
[  268.748580] [<ffff200008172adc>] __cpuhp_remove_state+0x54/0x80
[  268.748597] [<ffff20000215dd84>] arm_ccn_exit+0x2c/0x70 [arm_ccn]
[  268.748604] [<ffff20000834cfbc>] SyS_delete_module+0x5a4/0x708
[  268.748607] Exception stack(0xffff80019c127ec0 to 0xffff80019c128000)
[  268.748612] 7ec0: 0000000019bb7258 0000000000000800 ba64d0fb3d26a800 00000000000000da
[  268.748616] 7ee0: 0000ffffb6144e28 0000ffffcd95b409 fefefefefefefeff 7f7f7f7f7f7f7f7f
[  268.748621] 7f00: 000000000000006a 1999999999999999 0000ffffb6179000 0000000000bbcc6d
[  268.748625] 7f20: 0000ffffb6176b98 0000ffffcd95c2d0 0000ffffb5fe7b58 0000ffffb6163000
[  268.748630] 7f40: 0000ffffb60ad3e0 00000000004402d0 0000ffffb5fe7c58 0000000019bb71f0
[  268.748634] 7f60: 0000ffffcd95c740 0000000000000000 0000000019bb71f0 0000000000416700
[  268.748639] 7f80: 0000000000000000 00000000004402e8 0000000019bb6010 0000ffffcd95c748
[  268.748643] 7fa0: 0000000000000000 0000ffffcd95c460 00000000004113a8 0000ffffcd95c460
[  268.748648] 7fc0: 0000ffffb60ad3e8 0000000080000000 0000000019bb7258 000000000000006a
[  268.748652] 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  268.748657] [<ffff200008084f9c>] __sys_trace_return+0x0/0x4
[  268.748661] ---[ end trace a996d358dcaa7f9c ]---

Fixes: 8df0387 ("bus/arm-ccn: Use cpu-hp's multi instance support instead custom list")
Signed-off-by: Kim Phillips <kim.phillips@arm.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Jan 17, 2018
[ Upstream commit 36b6f9f ]

Lockdep warns about a potential deadlock:

[   66.782842] ======================================================
[   66.782888] WARNING: possible circular locking dependency detected
[   66.782937] 4.14.0-rc2-test-test+ #48 Not tainted
[   66.782983] ------------------------------------------------------
[   66.783052] umount/336 is trying to acquire lock:
[   66.783117]  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff81032395>] rdt_kill_sb+0x215/0x390
[   66.783193]
               but task is already holding lock:
[   66.783244]  (rdtgroup_mutex){+.+.}, at: [<ffffffff810321b6>] rdt_kill_sb+0x36/0x390
[   66.783305]
               which lock already depends on the new lock.

[   66.783364]
               the existing dependency chain (in reverse order) is:
[   66.783419]
               -> #3 (rdtgroup_mutex){+.+.}:
[   66.783467]        __lock_acquire+0x1293/0x13f0
[   66.783509]        lock_acquire+0xaf/0x220
[   66.783543]        __mutex_lock+0x71/0x9b0
[   66.783575]        mutex_lock_nested+0x1b/0x20
[   66.783610]        intel_rdt_online_cpu+0x3b/0x430
[   66.783649]        cpuhp_invoke_callback+0xab/0x8e0
[   66.783687]        cpuhp_thread_fun+0x7a/0x150
[   66.783722]        smpboot_thread_fn+0x1cc/0x270
[   66.783764]        kthread+0x16e/0x190
[   66.783794]        ret_from_fork+0x27/0x40
[   66.783825]
               -> #2 (cpuhp_state){+.+.}:
[   66.783870]        __lock_acquire+0x1293/0x13f0
[   66.783906]        lock_acquire+0xaf/0x220
[   66.783938]        cpuhp_issue_call+0x102/0x170
[   66.783974]        __cpuhp_setup_state_cpuslocked+0x154/0x2a0
[   66.784023]        __cpuhp_setup_state+0xc7/0x170
[   66.784061]        page_writeback_init+0x43/0x67
[   66.784097]        pagecache_init+0x43/0x4a
[   66.784131]        start_kernel+0x3ad/0x3f7
[   66.784165]        x86_64_start_reservations+0x2a/0x2c
[   66.784204]        x86_64_start_kernel+0x72/0x75
[   66.784241]        verify_cpu+0x0/0xfb
[   66.784270]
               -> #1 (cpuhp_state_mutex){+.+.}:
[   66.784319]        __lock_acquire+0x1293/0x13f0
[   66.784355]        lock_acquire+0xaf/0x220
[   66.784387]        __mutex_lock+0x71/0x9b0
[   66.784419]        mutex_lock_nested+0x1b/0x20
[   66.784454]        __cpuhp_setup_state_cpuslocked+0x52/0x2a0
[   66.784497]        __cpuhp_setup_state+0xc7/0x170
[   66.784535]        page_alloc_init+0x28/0x30
[   66.784569]        start_kernel+0x148/0x3f7
[   66.784602]        x86_64_start_reservations+0x2a/0x2c
[   66.784642]        x86_64_start_kernel+0x72/0x75
[   66.784678]        verify_cpu+0x0/0xfb
[   66.784707]
               -> #0 (cpu_hotplug_lock.rw_sem){++++}:
[   66.784759]        check_prev_add+0x32f/0x6e0
[   66.784794]        __lock_acquire+0x1293/0x13f0
[   66.784830]        lock_acquire+0xaf/0x220
[   66.784863]        cpus_read_lock+0x3d/0xb0
[   66.784896]        rdt_kill_sb+0x215/0x390
[   66.784930]        deactivate_locked_super+0x3e/0x70
[   66.784968]        deactivate_super+0x40/0x60
[   66.785003]        cleanup_mnt+0x3f/0x80
[   66.785034]        __cleanup_mnt+0x12/0x20
[   66.785070]        task_work_run+0x8b/0xc0
[   66.785103]        exit_to_usermode_loop+0x94/0xa0
[   66.786804]        syscall_return_slowpath+0xe8/0x150
[   66.788502]        entry_SYSCALL_64_fastpath+0xab/0xad
[   66.790194]
               other info that might help us debug this:

[   66.795139] Chain exists of:
                 cpu_hotplug_lock.rw_sem --> cpuhp_state --> rdtgroup_mutex

[   66.800035]  Possible unsafe locking scenario:

[   66.803267]        CPU0                    CPU1
[   66.804867]        ----                    ----
[   66.806443]   lock(rdtgroup_mutex);
[   66.808002]                                lock(cpuhp_state);
[   66.809565]                                lock(rdtgroup_mutex);
[   66.811110]   lock(cpu_hotplug_lock.rw_sem);
[   66.812608]
                *** DEADLOCK ***

[   66.816983] 2 locks held by umount/336:
[   66.818418]  #0:  (&type->s_umount_key#35){+.+.}, at: [<ffffffff81229738>] deactivate_super+0x38/0x60
[   66.819922]  #1:  (rdtgroup_mutex){+.+.}, at: [<ffffffff810321b6>] rdt_kill_sb+0x36/0x390

When the resctrl filesystem is unmounted the locks should be obtain in the
locks in the same order as was done when the cpus came online:

      cpu_hotplug_lock before rdtgroup_mutex.

This also requires to switch the static_branch_disable() calls to the
_cpulocked variant because now cpu hotplug lock is held already.

[ tglx: Switched to cpus_read_[un]lock ]

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Acked-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/cc292e76be073f7260604651711c47b09fd0dc81.1508490116.git.reinette.chatre@intel.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Jan 17, 2018
commit 4991c09 upstream.

While testing on a large CPU system, detected the following RCU stall
many times over the span of the workload.  This problem is solved by
adding a cond_resched() in the change_pmd_range() function.

  INFO: rcu_sched detected stalls on CPUs/tasks:
   154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612
   (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864)
  Sending NMI from CPU 955 to CPUs 154:
  NMI backtrace for cpu 154
  CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3
  NIP:  c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18
  REGS: 00000000a4b0fb44 TRAP: 0501   Not tainted  (4.15.0-rc3+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22422082  XER: 00000000
  CFAR: 00000000006cf8f0 SOFTE: 1
  GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000
  GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00
  GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00
  GPR12: ffffffffffffffff c00000000fb25100
  NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c
  LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420
  Call Trace:
    flush_hash_range+0x48/0x100
    __flush_tlb_pending+0x44/0xd0
    hpte_need_flush+0x408/0x470
    change_protection_range+0xaac/0xf10
    change_prot_numa+0x30/0xb0
    task_numa_work+0x2d0/0x3e0
    task_work_run+0x130/0x190
    do_notify_resume+0x118/0x120
    ret_from_except_lite+0x70/0x74
  Instruction dump:
  60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
  e9410060 e9610068 e9810070 44000022 <7d806378> e9810028 f88c0000 f8ac0008

Link: http://lkml.kernel.org/r/20171214140551.5794-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Jan 30, 2018
commit ae59c3f upstream.

The rdma_ah_find_type() accesses the port array based on an index
controlled by userspace. The existing bounds check is after the first use
of the index, so userspace can generate an out of bounds access, as shown
by the KASN report below.

==================================================================
BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0xa8/0x3b0
Read of size 4 at addr ffff880019ae2268 by task ibv_rc_pingpong/409

CPU: 0 PID: 409 Comm: ibv_rc_pingpong Not tainted 4.15.0-rc2-00031-gb60a3faf5b83-dirty #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
 dump_stack+0xe9/0x18f
 print_address_description+0xa2/0x350
 kasan_report+0x3a5/0x400
 to_rdma_ah_attr+0xa8/0x3b0
 mlx5_ib_query_qp+0xd35/0x1330
 ib_query_qp+0x8a/0xb0
 ib_uverbs_query_qp+0x237/0x7f0
 ib_uverbs_write+0x617/0xd80
 __vfs_write+0xf7/0x500
 vfs_write+0x149/0x310
 SyS_write+0xca/0x190
 entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x7fe9c7a275a0
RSP: 002b:00007ffee5498738 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fe9c7ce4b00 RCX: 00007fe9c7a275a0
RDX: 0000000000000018 RSI: 00007ffee5498800 RDI: 0000000000000003
RBP: 000055d0c8d3f010 R08: 00007ffee5498800 R09: 0000000000000018
R10: 00000000000000ba R11: 0000000000000246 R12: 0000000000008000
R13: 0000000000004fb0 R14: 000055d0c8d3f050 R15: 00007ffee5498560

Allocated by task 1:
 __kmalloc+0x3f9/0x430
 alloc_mad_private+0x25/0x50
 ib_mad_post_receive_mads+0x204/0xa60
 ib_mad_init_device+0xa59/0x1020
 ib_register_device+0x83a/0xbc0
 mlx5_ib_add+0x50e/0x5c0
 mlx5_add_device+0x142/0x410
 mlx5_register_interface+0x18f/0x210
 mlx5_ib_init+0x56/0x63
 do_one_initcall+0x15b/0x270
 kernel_init_freeable+0x2d8/0x3d0
 kernel_init+0x14/0x190
 ret_from_fork+0x24/0x30

Freed by task 0:
(stack is not available)

The buggy address belongs to the object at ffff880019ae2000
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 104 bytes to the right of
 512-byte region [ffff880019ae2000, ffff880019ae2200)
The buggy address belongs to the page:
page:000000005d674e18 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
flags: 0x4000000000008100(slab|head)
raw: 4000000000008100 0000000000000000 0000000000000000 00000001000c000c
raw: dead000000000100 dead000000000200 ffff88001a402000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff880019ae2100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880019ae2180: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
>ffff880019ae2200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                          ^
 ffff880019ae2280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880019ae2300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Disabling lock debugging due to kernel taint

Fixes: 44c5848 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Feb 5, 2018
…l calls

[ Upstream commit 9faaff5 ]

Provide a different lockdep key for rxrpc_call::user_mutex when the call is
made on a kernel socket, such as by the AFS filesystem.

The problem is that lockdep registers a false positive between userspace
calling the sendmsg syscall on a user socket where call->user_mutex is held
whilst userspace memory is accessed whereas the AFS filesystem may perform
operations with mmap_sem held by the caller.

In such a case, the following warning is produced.

======================================================
WARNING: possible circular locking dependency detected
4.14.0-fscache+ raspberrypi#243 Tainted: G            E
------------------------------------------------------
modpost/16701 is trying to acquire lock:
 (&vnode->io_lock){+.+.}, at: [<ffffffffa000fc40>] afs_begin_vnode_operation+0x33/0x77 [kafs]

but task is already holding lock:
 (&mm->mmap_sem){++++}, at: [<ffffffff8104376a>] __do_page_fault+0x1ef/0x486

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&mm->mmap_sem){++++}:
       __might_fault+0x61/0x89
       _copy_from_iter_full+0x40/0x1fa
       rxrpc_send_data+0x8dc/0xff3
       rxrpc_do_sendmsg+0x62f/0x6a1
       rxrpc_sendmsg+0x166/0x1b7
       sock_sendmsg+0x2d/0x39
       ___sys_sendmsg+0x1ad/0x22b
       __sys_sendmsg+0x41/0x62
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

-> #2 (&call->user_mutex){+.+.}:
       __mutex_lock+0x86/0x7d2
       rxrpc_new_client_call+0x378/0x80e
       rxrpc_kernel_begin_call+0xf3/0x154
       afs_make_call+0x195/0x454 [kafs]
       afs_vl_get_capabilities+0x193/0x198 [kafs]
       afs_vl_lookup_vldb+0x5f/0x151 [kafs]
       afs_create_volume+0x2e/0x2f4 [kafs]
       afs_mount+0x56a/0x8d7 [kafs]
       mount_fs+0x6a/0x109
       vfs_kern_mount+0x67/0x135
       do_mount+0x90b/0xb57
       SyS_mount+0x72/0x98
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

-> #1 (k-sk_lock-AF_RXRPC){+.+.}:
       lock_sock_nested+0x74/0x8a
       rxrpc_kernel_begin_call+0x8a/0x154
       afs_make_call+0x195/0x454 [kafs]
       afs_fs_get_capabilities+0x17a/0x17f [kafs]
       afs_probe_fileserver+0xf7/0x2f0 [kafs]
       afs_select_fileserver+0x83f/0x903 [kafs]
       afs_fetch_status+0x89/0x11d [kafs]
       afs_iget+0x16f/0x4f8 [kafs]
       afs_mount+0x6c6/0x8d7 [kafs]
       mount_fs+0x6a/0x109
       vfs_kern_mount+0x67/0x135
       do_mount+0x90b/0xb57
       SyS_mount+0x72/0x98
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

-> #0 (&vnode->io_lock){+.+.}:
       lock_acquire+0x174/0x19f
       __mutex_lock+0x86/0x7d2
       afs_begin_vnode_operation+0x33/0x77 [kafs]
       afs_fetch_data+0x80/0x12a [kafs]
       afs_readpages+0x314/0x405 [kafs]
       __do_page_cache_readahead+0x203/0x2ba
       filemap_fault+0x179/0x54d
       __do_fault+0x17/0x60
       __handle_mm_fault+0x6d7/0x95c
       handle_mm_fault+0x24e/0x2a3
       __do_page_fault+0x301/0x486
       do_page_fault+0x236/0x259
       page_fault+0x22/0x30
       __clear_user+0x3d/0x60
       padzero+0x1c/0x2b
       load_elf_binary+0x785/0xdc7
       search_binary_handler+0x81/0x1ff
       do_execveat_common.isra.14+0x600/0x888
       do_execve+0x1f/0x21
       SyS_execve+0x28/0x2f
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

other info that might help us debug this:

Chain exists of:
  &vnode->io_lock --> &call->user_mutex --> &mm->mmap_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(&call->user_mutex);
                               lock(&mm->mmap_sem);
  lock(&vnode->io_lock);

 *** DEADLOCK ***

1 lock held by modpost/16701:
 #0:  (&mm->mmap_sem){++++}, at: [<ffffffff8104376a>] __do_page_fault+0x1ef/0x486

stack backtrace:
CPU: 0 PID: 16701 Comm: modpost Tainted: G            E   4.14.0-fscache+ raspberrypi#243
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 dump_stack+0x67/0x8e
 print_circular_bug+0x341/0x34f
 check_prev_add+0x11f/0x5d4
 ? add_lock_to_list.isra.12+0x8b/0x8b
 ? add_lock_to_list.isra.12+0x8b/0x8b
 ? __lock_acquire+0xf77/0x10b4
 __lock_acquire+0xf77/0x10b4
 lock_acquire+0x174/0x19f
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 __mutex_lock+0x86/0x7d2
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 afs_begin_vnode_operation+0x33/0x77 [kafs]
 afs_fetch_data+0x80/0x12a [kafs]
 afs_readpages+0x314/0x405 [kafs]
 __do_page_cache_readahead+0x203/0x2ba
 ? filemap_fault+0x179/0x54d
 filemap_fault+0x179/0x54d
 __do_fault+0x17/0x60
 __handle_mm_fault+0x6d7/0x95c
 handle_mm_fault+0x24e/0x2a3
 __do_page_fault+0x301/0x486
 do_page_fault+0x236/0x259
 page_fault+0x22/0x30
RIP: 0010:__clear_user+0x3d/0x60
RSP: 0018:ffff880071e93da0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 000000000000011c RCX: 000000000000011c
RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000060f720
RBP: 000000000060f720 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: ffff8800b5459b68 R12: ffff8800ce150e00
R13: 000000000060f720 R14: 00000000006127a8 R15: 0000000000000000
 padzero+0x1c/0x2b
 load_elf_binary+0x785/0xdc7
 search_binary_handler+0x81/0x1ff
 do_execveat_common.isra.14+0x600/0x888
 do_execve+0x1f/0x21
 SyS_execve+0x28/0x2f
 do_syscall_64+0x89/0x1be
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fdb6009ee07
RSP: 002b:00007fff566d9728 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 000055ba57280900 RCX: 00007fdb6009ee07
RDX: 000055ba5727f270 RSI: 000055ba5727cac0 RDI: 000055ba57280900
RBP: 000055ba57280900 R08: 00007fff566d9700 R09: 0000000000000000
R10: 000055ba5727cac0 R11: 0000000000000246 R12: 0000000000000000
R13: 000055ba5727cac0 R14: 000055ba5727f270 R15: 0000000000000000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Feb 22, 2018
Scenario:
1. Port down and do fail over
2. Ap do rds_bind syscall

PID: 47039  TASK: ffff89887e2fe640  CPU: 47  COMMAND: "kworker/u:6"
 #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9
 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3
 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518
 #3 [ffff898e35f15b60] no_context at ffffffff8104854c
 #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675
 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3
 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8
 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffff898e35f15dc8  RFLAGS: 00010282
    RAX: 00000000fffffffe  RBX: ffff889b77f6fc00  RCX:ffffffff81c99d88
    RDX: 0000000000000000  RSI: ffff896019ee08e8  RDI:ffff889b77f6fc00
    RBP: ffff898e35f15df0   R8: ffff896019ee08c8  R9:0000000000000000
    R10: 0000000000000400  R11: 0000000000000000  R12:ffff896019ee08c0
    R13: ffff889b77f6fe68  R14: ffffffff81c99d80  R15: ffffffffa022a1e0
    ORIG_RAX: ffffffffffffffff  CS: 0010 SS: 0018
 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm]
 #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6
 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0
 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6

PID: 45659  TASK: ffff880d313d2500  CPU: 31  COMMAND: "oracle_45659_ap"
 #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4
 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf
 #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7
 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb
 #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm]
 #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma]
 #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds]
 #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds]
 #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670

PID: 45659                          PID: 47039
rds_ib_laddr_check
  /* create id_priv with a null event_handler */
  rdma_create_id
  rdma_bind_addr
    cma_acquire_dev
      /* add id_priv to cma_dev->id_list */
      cma_attach_to_dev
                                    cma_ndev_work_handler
                                      /* event_hanlder is null */
                                      id_priv->id.event_handler

Signed-off-by: Guanglei Li <guanglei.li@oracle.com>
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
anholt pushed a commit that referenced this pull request Mar 3, 2018
…uct_mutex

This patch fixes lockdep issue due to circular locking dependency of
struct_mutex, i_mutex_key, mmap_sem, relay_channels_mutex.
For GuC log relay channel we create debugfs file that requires i_mutex_key
lock and we are doing that under struct_mutex. So we introduced newer
dependency as:
    &dev->struct_mutex --> &sb->s_type->i_mutex_key#3 --> &mm->mmap_sem
However, there is dependency from mmap_sem to struct_mutex. Hence we
separate the relay create/destroy operation from under struct_mutex.
Also added runtime check of relay buffer status.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

======================================================
WARNING: possible circular locking dependency detected
4.15.0-rc6-CI-Patchwork_7614+ #1 Not tainted
------------------------------------------------------
debugfs_test/1388 is trying to acquire lock:
 (&dev->struct_mutex){+.+.}, at: [<00000000d5e1d915>] i915_mutex_lock_interruptible+0x47/0x130 [i915]

but task is already holding lock:
 (&mm->mmap_sem){++++}, at: [<0000000029a9c131>] __do_page_fault+0x106/0x560

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&mm->mmap_sem){++++}:
       _copy_to_user+0x1e/0x70
       filldir+0x8c/0xf0
       dcache_readdir+0xeb/0x160
       iterate_dir+0xdc/0x140
       SyS_getdents+0xa0/0x130
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #2 (&sb->s_type->i_mutex_key#3){++++}:
       start_creating+0x59/0x110
       __debugfs_create_file+0x2e/0xe0
       relay_create_buf_file+0x62/0x80
       relay_late_setup_files+0x84/0x250
       guc_log_late_setup+0x4f/0x110 [i915]
       i915_guc_log_register+0x32/0x40 [i915]
       i915_driver_load+0x7b6/0x1720 [i915]
       i915_pci_probe+0x2e/0x90 [i915]
       pci_device_probe+0x9c/0x120
       driver_probe_device+0x2a3/0x480
       __driver_attach+0xd9/0xe0
       bus_for_each_dev+0x57/0x90
       bus_add_driver+0x168/0x260
       driver_register+0x52/0xc0
       do_one_initcall+0x39/0x150
       do_init_module+0x56/0x1ef
       load_module+0x231c/0x2d70
       SyS_finit_module+0xa5/0xe0
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #1 (relay_channels_mutex){+.+.}:
       relay_open+0x12c/0x2b0
       intel_guc_log_runtime_create+0xab/0x230 [i915]
       intel_guc_init+0x81/0x120 [i915]
       intel_uc_init+0x29/0xa0 [i915]
       i915_gem_init+0x182/0x530 [i915]
       i915_driver_load+0xaa9/0x1720 [i915]
       i915_pci_probe+0x2e/0x90 [i915]
       pci_device_probe+0x9c/0x120
       driver_probe_device+0x2a3/0x480
       __driver_attach+0xd9/0xe0
       bus_for_each_dev+0x57/0x90
       bus_add_driver+0x168/0x260
       driver_register+0x52/0xc0
       do_one_initcall+0x39/0x150
       do_init_module+0x56/0x1ef
       load_module+0x231c/0x2d70
       SyS_finit_module+0xa5/0xe0
       entry_SYSCALL_64_fastpath+0x1c/0x89

-> #0 (&dev->struct_mutex){+.+.}:
       __mutex_lock+0x81/0x9b0
       i915_mutex_lock_interruptible+0x47/0x130 [i915]
       i915_gem_fault+0x201/0x790 [i915]
       __do_fault+0x15/0x70
       __handle_mm_fault+0x677/0xdc0
       handle_mm_fault+0x14f/0x2f0
       __do_page_fault+0x2d1/0x560
       page_fault+0x4c/0x60

other info that might help us debug this:

Chain exists of:
  &dev->struct_mutex --> &sb->s_type->i_mutex_key#3 --> &mm->mmap_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(&sb->s_type->i_mutex_key#3);
                               lock(&mm->mmap_sem);
  lock(&dev->struct_mutex);

 *** DEADLOCK ***

1 lock held by debugfs_test/1388:
 #0:  (&mm->mmap_sem){++++}, at: [<0000000029a9c131>] __do_page_fault+0x106/0x560

stack backtrace:
CPU: 2 PID: 1388 Comm: debugfs_test Not tainted 4.15.0-rc6-CI-Patchwork_7614+ #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.10 09/29/2016
Call Trace:
 dump_stack+0x5f/0x86
 print_circular_bug.isra.18+0x1d0/0x2c0
 __lock_acquire+0x14ae/0x1b60
 ? lock_acquire+0xaf/0x200
 lock_acquire+0xaf/0x200
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 __mutex_lock+0x81/0x9b0
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? i915_mutex_lock_interruptible+0x47/0x130 [i915]
 i915_mutex_lock_interruptible+0x47/0x130 [i915]
 ? __pm_runtime_resume+0x4f/0x80
 i915_gem_fault+0x201/0x790 [i915]
 __do_fault+0x15/0x70
 ? _raw_spin_unlock+0x29/0x40
 __handle_mm_fault+0x677/0xdc0
 handle_mm_fault+0x14f/0x2f0
 __do_page_fault+0x2d1/0x560
 ? page_fault+0x36/0x60
 page_fault+0x4c/0x60

v2: Added lock protection to guc->log.runtime.relay_chan (Chris)
    Fixed locking inside guc_flush_logs uncovered by new lockdep.

v3: Locking guc_read_update_log_buffer entirely with relay_lock. (Chris)
    Prepared intel_guc_init_early. Moved relay_lock inside relay_create
    relay_destroy, relay_file_create, guc_read_update_log_buffer. (Michal)
    Removed struct_mutex lock around guc_log_flush and removed usage
    of guc_log_has_relay() from runtime_create path as it needs
    struct_mutex lock.

v4: Handle NULL relay sub buffer pointer earlier in read_update_log_buffer
    (Chris). Fixed comment suffix **/. (Michal)

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104693
Testcase: igt/debugfs_test/read_all_entries # with enable_guc=1 and guc_log_level=1
Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Marta Lofstedt <marta.lofstedt@intel.com>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/1516808821-3638-3-git-send-email-sagar.a.kamble@intel.com
anholt pushed a commit that referenced this pull request Mar 9, 2018
commit 7ba7166 upstream.

It was reported by Sergey Senozhatsky that if THP (Transparent Huge
Page) and frontswap (via zswap) are both enabled, when memory goes low
so that swap is triggered, segfault and memory corruption will occur in
random user space applications as follow,

kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
 #0  0x00007fc08889ae0d _int_malloc (libc.so.6)
 #1  0x00007fc08889c2f3 malloc (libc.so.6)
 #2  0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
 #3  0x0000560e6005e75c n/a (urxvt)
 #4  0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
 #5  0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
 #6  0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
 #7  0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
 #8  0x0000560e6005cb55 ev_run (urxvt)
 #9  0x0000560e6003b9b9 main (urxvt)
 #10 0x00007fc08883af4a __libc_start_main (libc.so.6)
 #11 0x0000560e6003f9da _start (urxvt)

After bisection, it was found the first bad commit is bd4c82c ("mm,
THP, swap: delay splitting THP after swapped out").

The root cause is as follows:

When the pages are written to swap device during swapping out in
swap_writepage(), zswap (fontswap) is tried to compress the pages to
improve performance.  But zswap (frontswap) will treat THP as a normal
page, so only the head page is saved.  After swapping in, tail pages
will not be restored to their original contents, causing memory
corruption in the applications.

This is fixed by refusing to save page in the frontswap store functions
if the page is a THP.  So that the THP will be swapped out to swap
device.

Another choice is to split THP if frontswap is enabled.  But it is found
that the frontswap enabling isn't flexible.  For example, if
CONFIG_ZSWAP=y (cannot be module), frontswap will be enabled even if
zswap itself isn't enabled.

Frontswap has multiple backends, to make it easy for one backend to
enable THP support, the THP checking is put in backend frontswap store
functions instead of the general interfaces.

Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
Fixes: bd4c82c ("mm, THP, swap: delay splitting THP after swapped out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>	[put THP checking in backend]
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shaohua Li <shli@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: <stable@vger.kernel.org>	[4.14]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Mar 9, 2018
commit 5ee39a7 upstream.

aarch64 unhandled signal kernel messages are very verbose, suggesting
them to be more of a debugging aid:

sigsegv[33]: unhandled level 2 translation fault (11) at 0x00000000, esr
0x92000046, in sigsegv[400000+71000]
CPU: 1 PID: 33 Comm: sigsegv Tainted: G        W        4.15.0-rc3+ #3
Hardware name: linux,dummy-virt (DT)
pstate: 60000000 (nZCv daif -PAN -UAO)
pc : 0x4003f4
lr : 0x4006bc
sp : 0000fffffe94a060
x29: 0000fffffe94a070 x28: 0000000000000000
x27: 0000000000000000 x26: 0000000000000000
x25: 0000000000000000 x24: 00000000004001b0
x23: 0000000000486ac8 x22: 00000000004001c8
x21: 0000000000000000 x20: 0000000000400be8
x19: 0000000000400b30 x18: 0000000000484728
x17: 000000000865ffc8 x16: 000000000000270f
x15: 00000000000000b0 x14: 0000000000000002
x13: 0000000000000001 x12: 0000000000000000
x11: 0000000000000000 x10: 0008000020008008
x9 : 000000000000000f x8 : ffffffffffffffff
x7 : 0004000000000000 x6 : ffffffffffffffff
x5 : 0000000000000000 x4 : 0000000000000000
x3 : 00000000004003e4 x2 : 0000fffffe94a1e8
x1 : 000000000000000a x0 : 0000000000000000

Disable them by default, so they can be enabled using
/proc/sys/debug/exception-trace.

Cc: <stable@vger.kernel.org>
Signed-off-by: Michael Weiser <michael.weiser@gmx.de>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
stschake pushed a commit to stschake/linux that referenced this pull request Mar 11, 2018
aarch64 unhandled signal kernel messages are very verbose, suggesting
them to be more of a debugging aid:

sigsegv[33]: unhandled level 2 translation fault (11) at 0x00000000, esr
0x92000046, in sigsegv[400000+71000]
CPU: 1 PID: 33 Comm: sigsegv Tainted: G        W        4.15.0-rc3+ anholt#3
Hardware name: linux,dummy-virt (DT)
pstate: 60000000 (nZCv daif -PAN -UAO)
pc : 0x4003f4
lr : 0x4006bc
sp : 0000fffffe94a060
x29: 0000fffffe94a070 x28: 0000000000000000
x27: 0000000000000000 x26: 0000000000000000
x25: 0000000000000000 x24: 00000000004001b0
x23: 0000000000486ac8 x22: 00000000004001c8
x21: 0000000000000000 x20: 0000000000400be8
x19: 0000000000400b30 x18: 0000000000484728
x17: 000000000865ffc8 x16: 000000000000270f
x15: 00000000000000b0 x14: 0000000000000002
x13: 0000000000000001 x12: 0000000000000000
x11: 0000000000000000 x10: 0008000020008008
x9 : 000000000000000f x8 : ffffffffffffffff
x7 : 0004000000000000 x6 : ffffffffffffffff
x5 : 0000000000000000 x4 : 0000000000000000
x3 : 00000000004003e4 x2 : 0000fffffe94a1e8
x1 : 000000000000000a x0 : 0000000000000000

Disable them by default, so they can be enabled using
/proc/sys/debug/exception-trace.

Cc: <stable@vger.kernel.org>
Signed-off-by: Michael Weiser <michael.weiser@gmx.de>
Signed-off-by: Will Deacon <will.deacon@arm.com>
stschake pushed a commit to stschake/linux that referenced this pull request Mar 11, 2018
It was reported by Sergey Senozhatsky that if THP (Transparent Huge
Page) and frontswap (via zswap) are both enabled, when memory goes low
so that swap is triggered, segfault and memory corruption will occur in
random user space applications as follow,

kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
 #0  0x00007fc08889ae0d _int_malloc (libc.so.6)
 anholt#1  0x00007fc08889c2f3 malloc (libc.so.6)
 anholt#2  0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
 anholt#3  0x0000560e6005e75c n/a (urxvt)
 anholt#4  0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
 anholt#5  0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
 anholt#6  0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
 anholt#7  0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
 anholt#8  0x0000560e6005cb55 ev_run (urxvt)
 anholt#9  0x0000560e6003b9b9 main (urxvt)
 anholt#10 0x00007fc08883af4a __libc_start_main (libc.so.6)
 anholt#11 0x0000560e6003f9da _start (urxvt)

After bisection, it was found the first bad commit is bd4c82c ("mm,
THP, swap: delay splitting THP after swapped out").

The root cause is as follows:

When the pages are written to swap device during swapping out in
swap_writepage(), zswap (fontswap) is tried to compress the pages to
improve performance.  But zswap (frontswap) will treat THP as a normal
page, so only the head page is saved.  After swapping in, tail pages
will not be restored to their original contents, causing memory
corruption in the applications.

This is fixed by refusing to save page in the frontswap store functions
if the page is a THP.  So that the THP will be swapped out to swap
device.

Another choice is to split THP if frontswap is enabled.  But it is found
that the frontswap enabling isn't flexible.  For example, if
CONFIG_ZSWAP=y (cannot be module), frontswap will be enabled even if
zswap itself isn't enabled.

Frontswap has multiple backends, to make it easy for one backend to
enable THP support, the THP checking is put in backend frontswap store
functions instead of the general interfaces.

Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
Fixes: bd4c82c ("mm, THP, swap: delay splitting THP after swapped out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>	[put THP checking in backend]
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shaohua Li <shli@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: <stable@vger.kernel.org>	[4.14]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
stschake pushed a commit to stschake/linux that referenced this pull request Mar 11, 2018
I recently noticed a crash on arm64 when feeding a bogus index
into BPF tail call helper. The crash would not occur when the
interpreter is used, but only in case of JIT. Output looks as
follows:

  [  347.007486] Unable to handle kernel paging request at virtual address fffb850e96492510
  [...]
  [  347.043065] [fffb850e96492510] address between user and kernel address ranges
  [  347.050205] Internal error: Oops: 96000004 [anholt#1] SMP
  [...]
  [  347.190829] x13: 0000000000000000 x12: 0000000000000000
  [  347.196128] x11: fffc047ebe782800 x10: ffff808fd7d0fd10
  [  347.201427] x9 : 0000000000000000 x8 : 0000000000000000
  [  347.206726] x7 : 0000000000000000 x6 : 001c991738000000
  [  347.212025] x5 : 0000000000000018 x4 : 000000000000ba5a
  [  347.217325] x3 : 00000000000329c4 x2 : ffff808fd7cf0500
  [  347.222625] x1 : ffff808fd7d0fc00 x0 : ffff808fd7cf0500
  [  347.227926] Process test_verifier (pid: 4548, stack limit = 0x000000007467fa61)
  [  347.235221] Call trace:
  [  347.237656]  0xffff000002f3a4fc
  [  347.240784]  bpf_test_run+0x78/0xf8
  [  347.244260]  bpf_prog_test_run_skb+0x148/0x230
  [  347.248694]  SyS_bpf+0x77c/0x1110
  [  347.251999]  el0_svc_naked+0x30/0x34
  [  347.255564] Code: 9100075a d280220a 8b0a002a d37df04b (f86b694b)
  [...]

In this case the index used in BPF r3 is the same as in r1
at the time of the call, meaning we fed a pointer as index;
here, it had the value 0xffff808fd7cf0500 which sits in x2.

While I found tail calls to be working in general (also for
hitting the error cases), I noticed the following in the code
emission:

  # bpftool p d j i 988
  [...]
  38:   ldr     w10, [x1,x10]
  3c:   cmp     w2, w10
  40:   b.ge    0x000000000000007c              <-- signed cmp
  44:   mov     x10, #0x20                      // anholt#32
  48:   cmp     x26, x10
  4c:   b.gt    0x000000000000007c
  50:   add     x26, x26, #0x1
  54:   mov     x10, #0x110                     // raspberrypi#272
  58:   add     x10, x1, x10
  5c:   lsl     x11, x2, anholt#3
  60:   ldr     x11, [x10,x11]                  <-- faulting insn (f86b694b)
  64:   cbz     x11, 0x000000000000007c
  [...]

Meaning, the tests passed because commit ddb5599 ("arm64:
bpf: implement bpf_tail_call() helper") was using signed compares
instead of unsigned which as a result had the test wrongly passing.

Change this but also the tail call count test both into unsigned
and cap the index as u32. Latter we did as well in 90caccd
("bpf: fix bpf_tail_call() x64 JIT") and is needed in addition here,
too. Tested on HiSilicon Hi1616.

Result after patch:

  # bpftool p d j i 268
  [...]
  38:	ldr	w10, [x1,x10]
  3c:	add	w2, w2, #0x0
  40:	cmp	w2, w10
  44:	b.cs	0x0000000000000080
  48:	mov	x10, #0x20                  	// anholt#32
  4c:	cmp	x26, x10
  50:	b.hi	0x0000000000000080
  54:	add	x26, x26, #0x1
  58:	mov	x10, #0x110                 	// raspberrypi#272
  5c:	add	x10, x1, x10
  60:	lsl	x11, x2, anholt#3
  64:	ldr	x11, [x10,x11]
  68:	cbz	x11, 0x0000000000000080
  [...]

Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
stschake pushed a commit to stschake/linux that referenced this pull request Mar 16, 2018
Currently we can crash perf record when running in pipe mode, like:

  $ perf record ls | perf report
  # To display the perf.data header info, please use --header/--header-only options.
  #
  perf: Segmentation fault
  Error:
  The - file has no samples!

The callstack of the crash is:

    0x0000000000515242 in perf_event__synthesize_event_update_name
  3513            ev = event_update_event__new(len + 1, PERF_EVENT_UPDATE__NAME, evsel->id[0]);
  (gdb) bt
  #0  0x0000000000515242 in perf_event__synthesize_event_update_name
  anholt#1  0x00000000005158a4 in perf_event__synthesize_extra_attr
  anholt#2  0x0000000000443347 in record__synthesize
  anholt#3  0x00000000004438e3 in __cmd_record
  anholt#4  0x000000000044514e in cmd_record
  anholt#5  0x00000000004cbc95 in run_builtin
  anholt#6  0x00000000004cbf02 in handle_internal_command
  anholt#7  0x00000000004cc054 in run_argv
  anholt#8  0x00000000004cc422 in main

The reason of the crash is that the evsel does not have ids array
allocated and the pipe's synthesize code tries to access it.

We don't force evsel ids allocation when we have single event, because
it's not needed. However we need it when we are in pipe mode even for
single event as a key for evsel update event.

Fixing this by forcing evsel ids allocation event for single event, when
we are in pipe mode.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180302161354.30192-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
anholt pushed a commit that referenced this pull request Apr 17, 2018
[ upstream commit 16338a9 ]

I recently noticed a crash on arm64 when feeding a bogus index
into BPF tail call helper. The crash would not occur when the
interpreter is used, but only in case of JIT. Output looks as
follows:

  [  347.007486] Unable to handle kernel paging request at virtual address fffb850e96492510
  [...]
  [  347.043065] [fffb850e96492510] address between user and kernel address ranges
  [  347.050205] Internal error: Oops: 96000004 [#1] SMP
  [...]
  [  347.190829] x13: 0000000000000000 x12: 0000000000000000
  [  347.196128] x11: fffc047ebe782800 x10: ffff808fd7d0fd10
  [  347.201427] x9 : 0000000000000000 x8 : 0000000000000000
  [  347.206726] x7 : 0000000000000000 x6 : 001c991738000000
  [  347.212025] x5 : 0000000000000018 x4 : 000000000000ba5a
  [  347.217325] x3 : 00000000000329c4 x2 : ffff808fd7cf0500
  [  347.222625] x1 : ffff808fd7d0fc00 x0 : ffff808fd7cf0500
  [  347.227926] Process test_verifier (pid: 4548, stack limit = 0x000000007467fa61)
  [  347.235221] Call trace:
  [  347.237656]  0xffff000002f3a4fc
  [  347.240784]  bpf_test_run+0x78/0xf8
  [  347.244260]  bpf_prog_test_run_skb+0x148/0x230
  [  347.248694]  SyS_bpf+0x77c/0x1110
  [  347.251999]  el0_svc_naked+0x30/0x34
  [  347.255564] Code: 9100075a d280220a 8b0a002a d37df04b (f86b694b)
  [...]

In this case the index used in BPF r3 is the same as in r1
at the time of the call, meaning we fed a pointer as index;
here, it had the value 0xffff808fd7cf0500 which sits in x2.

While I found tail calls to be working in general (also for
hitting the error cases), I noticed the following in the code
emission:

  # bpftool p d j i 988
  [...]
  38:   ldr     w10, [x1,x10]
  3c:   cmp     w2, w10
  40:   b.ge    0x000000000000007c              <-- signed cmp
  44:   mov     x10, #0x20                      // #32
  48:   cmp     x26, x10
  4c:   b.gt    0x000000000000007c
  50:   add     x26, x26, #0x1
  54:   mov     x10, #0x110                     // raspberrypi#272
  58:   add     x10, x1, x10
  5c:   lsl     x11, x2, #3
  60:   ldr     x11, [x10,x11]                  <-- faulting insn (f86b694b)
  64:   cbz     x11, 0x000000000000007c
  [...]

Meaning, the tests passed because commit ddb5599 ("arm64:
bpf: implement bpf_tail_call() helper") was using signed compares
instead of unsigned which as a result had the test wrongly passing.

Change this but also the tail call count test both into unsigned
and cap the index as u32. Latter we did as well in 90caccd
("bpf: fix bpf_tail_call() x64 JIT") and is needed in addition here,
too. Tested on HiSilicon Hi1616.

Result after patch:

  # bpftool p d j i 268
  [...]
  38:	ldr	w10, [x1,x10]
  3c:	add	w2, w2, #0x0
  40:	cmp	w2, w10
  44:	b.cs	0x0000000000000080
  48:	mov	x10, #0x20                  	// #32
  4c:	cmp	x26, x10
  50:	b.hi	0x0000000000000080
  54:	add	x26, x26, #0x1
  58:	mov	x10, #0x110                 	// raspberrypi#272
  5c:	add	x10, x1, x10
  60:	lsl	x11, x2, #3
  64:	ldr	x11, [x10,x11]
  68:	cbz	x11, 0x0000000000000080
  [...]

Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Apr 17, 2018
[ Upstream commit 72d5481 ]

It is unlikely request_threaded_irq will fail, but if it does for some
reason we should clear iommu->pr_irq in the error path. Also
intel_svm_finish_prq shouldn't try to clean up the page request
interrupt if pr_irq is 0. Without these, if request_threaded_irq were
to fail the following occurs:

fail with no fixes:

[    0.683147] ------------[ cut here ]------------
[    0.683148] NULL pointer, cannot free irq
[    0.683158] WARNING: CPU: 1 PID: 1 at kernel/irq/irqdomain.c:1632 irq_domain_free_irqs+0x126/0x140
[    0.683160] Modules linked in:
[    0.683163] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.15.0-rc2 #3
[    0.683165] Hardware name:                  /NUC7i3BNB, BIOS BNKBL357.86A.0036.2017.0105.1112 01/05/2017
[    0.683168] RIP: 0010:irq_domain_free_irqs+0x126/0x140
[    0.683169] RSP: 0000:ffffc90000037ce8 EFLAGS: 00010292
[    0.683171] RAX: 000000000000001d RBX: ffff880276283c00 RCX: ffffffff81c5e5e8
[    0.683172] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000246
[    0.683174] RBP: ffff880276283c00 R08: 0000000000000000 R09: 000000000000023c
[    0.683175] R10: 0000000000000007 R11: 0000000000000000 R12: 000000000000007a
[    0.683176] R13: 0000000000000001 R14: 0000000000000000 R15: 0000010010000000
[    0.683178] FS:  0000000000000000(0000) GS:ffff88027ec80000(0000) knlGS:0000000000000000
[    0.683180] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.683181] CR2: 0000000000000000 CR3: 0000000001c09001 CR4: 00000000003606e0
[    0.683182] Call Trace:
[    0.683189]  intel_svm_finish_prq+0x3c/0x60
[    0.683191]  free_dmar_iommu+0x1ac/0x1b0
[    0.683195]  init_dmars+0xaaa/0xaea
[    0.683200]  ? klist_next+0x19/0xc0
[    0.683203]  ? pci_do_find_bus+0x50/0x50
[    0.683205]  ? pci_get_dev_by_id+0x52/0x70
[    0.683208]  intel_iommu_init+0x498/0x5c7
[    0.683211]  pci_iommu_init+0x13/0x3c
[    0.683214]  ? e820__memblock_setup+0x61/0x61
[    0.683217]  do_one_initcall+0x4d/0x1a0
[    0.683220]  kernel_init_freeable+0x186/0x20e
[    0.683222]  ? set_debug_rodata+0x11/0x11
[    0.683225]  ? rest_init+0xb0/0xb0
[    0.683226]  kernel_init+0xa/0xff
[    0.683229]  ret_from_fork+0x1f/0x30
[    0.683259] Code: 89 ee 44 89 e7 e8 3b e8 ff ff 5b 5d 44 89 e7 44 89 ee 41 5c 41 5d 41 5e e9 a8 84 ff ff 48 c7 c7 a8 71 a7 81 31 c0 e8 6a d3 f9 ff <0f> ff 5b 5d 41 5c 41 5d 41 5
e c3 0f 1f 44 00 00 66 2e 0f 1f 84
[    0.683285] ---[ end trace f7650e42792627ca ]---

with iommu->pr_irq = 0, but no check in intel_svm_finish_prq:

[    0.669561] ------------[ cut here ]------------
[    0.669563] Trying to free already-free IRQ 0
[    0.669573] WARNING: CPU: 3 PID: 1 at kernel/irq/manage.c:1546 __free_irq+0xa4/0x2c0
[    0.669574] Modules linked in:
[    0.669577] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.15.0-rc2 #4
[    0.669579] Hardware name:                  /NUC7i3BNB, BIOS BNKBL357.86A.0036.2017.0105.1112 01/05/2017
[    0.669581] RIP: 0010:__free_irq+0xa4/0x2c0
[    0.669582] RSP: 0000:ffffc90000037cc0 EFLAGS: 00010082
[    0.669584] RAX: 0000000000000021 RBX: 0000000000000000 RCX: ffffffff81c5e5e8
[    0.669585] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 0000000000000046
[    0.669587] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000023c
[    0.669588] R10: 0000000000000007 R11: 0000000000000000 R12: ffff880276253960
[    0.669589] R13: ffff8802762538a4 R14: ffff880276253800 R15: ffff880276283600
[    0.669593] FS:  0000000000000000(0000) GS:ffff88027ed80000(0000) knlGS:0000000000000000
[    0.669594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.669596] CR2: 0000000000000000 CR3: 0000000001c09001 CR4: 00000000003606e0
[    0.669602] Call Trace:
[    0.669616]  free_irq+0x30/0x60
[    0.669620]  intel_svm_finish_prq+0x34/0x60
[    0.669623]  free_dmar_iommu+0x1ac/0x1b0
[    0.669627]  init_dmars+0xaaa/0xaea
[    0.669631]  ? klist_next+0x19/0xc0
[    0.669634]  ? pci_do_find_bus+0x50/0x50
[    0.669637]  ? pci_get_dev_by_id+0x52/0x70
[    0.669639]  intel_iommu_init+0x498/0x5c7
[    0.669642]  pci_iommu_init+0x13/0x3c
[    0.669645]  ? e820__memblock_setup+0x61/0x61
[    0.669648]  do_one_initcall+0x4d/0x1a0
[    0.669651]  kernel_init_freeable+0x186/0x20e
[    0.669653]  ? set_debug_rodata+0x11/0x11
[    0.669656]  ? rest_init+0xb0/0xb0
[    0.669658]  kernel_init+0xa/0xff
[    0.669661]  ret_from_fork+0x1f/0x30
[    0.669662] Code: 7a 08 75 0e e9 c3 01 00 00 4c 39 7b 08 74 57 48 89 da 48 8b 5a 18 48 85 db 75 ee 89 ee 48 c7 c7 78 67 a7 81 31 c0 e8 4c 37 fa ff <0f> ff 48 8b 34 24 4c 89 ef e
8 0e 4c 68 00 49 8b 46 40 48 8b 80
[    0.669688] ---[ end trace 58a470248700f2fc ]---

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Apr 17, 2018
[ Upstream commit d754941 ]

If, for any reason, userland shuts down iscsi transport interfaces
before proper logouts - like when logging in to LUNs manually, without
logging out on server shutdown, or when automated scripts can't
umount/logout from logged LUNs - kernel will hang forever on its
sd_sync_cache() logic, after issuing the SYNCHRONIZE_CACHE cmd to all
still existent paths.

PID: 1 TASK: ffff8801a69b8000 CPU: 1 COMMAND: "systemd-shutdow"
 #0 [ffff8801a69c3a30] __schedule at ffffffff8183e9ee
 #1 [ffff8801a69c3a80] schedule at ffffffff8183f0d5
 #2 [ffff8801a69c3a98] schedule_timeout at ffffffff81842199
 #3 [ffff8801a69c3b40] io_schedule_timeout at ffffffff8183e604
 #4 [ffff8801a69c3b70] wait_for_completion_io_timeout at ffffffff8183fc6c
 #5 [ffff8801a69c3bd0] blk_execute_rq at ffffffff813cfe10
 #6 [ffff8801a69c3c88] scsi_execute at ffffffff815c3fc7
 #7 [ffff8801a69c3cc8] scsi_execute_req_flags at ffffffff815c60fe
 #8 [ffff8801a69c3d30] sd_sync_cache at ffffffff815d37d7
 #9 [ffff8801a69c3da8] sd_shutdown at ffffffff815d3c3c

This happens because iscsi_eh_cmd_timed_out(), the transport layer
timeout helper, would tell the queue timeout function (scsi_times_out)
to reset the request timer over and over, until the session state is
back to logged in state. Unfortunately, during server shutdown, this
might never happen again.

Other option would be "not to handle" the issue in the transport
layer. That would trigger the error handler logic, which would also need
the session state to be logged in again.

Best option, for such case, is to tell upper layers that the command was
handled during the transport layer error handler helper, marking it as
DID_NO_CONNECT, which will allow completion and inform about the
problem.

After the session was marked as ISCSI_STATE_FAILED, due to the first
timeout during the server shutdown phase, all subsequent cmds will fail
to be queued, allowing upper logic to fail faster.

Signed-off-by: Rafael David Tinoco <rafael.tinoco@canonical.com>
Reviewed-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Apr 23, 2018
Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
a 1GB hugepage.  This happens because get_user_pages_fast() is not aware
of a migration entry on pud that was created in the 1st madvise() event.

I think that conversion to pud-aligned migration entry is working, but
other MM code walking over page table isn't prepared for it.  We need
some time and effort to make all this work properly, so this patch
avoids the reported bug by just disabling error handling for 1GB
hugepage.

[n-horiguchi@ah.jp.nec.com: v2]
  Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
anholt pushed a commit that referenced this pull request Apr 23, 2018
Patch series "kexec_file, x86, powerpc: refactoring for other
architecutres", v2.

This is a preparatory patchset for adding kexec_file support on arm64.

It was originally included in a arm64 patch set[1], but Philipp is also
working on their kexec_file support on s390[2] and some changes are now
conflicting.

So these common parts were extracted and put into a separate patch set
for better integration.  What's more, my original patch#4 was split into
a few small chunks for easier review after Dave's comment.

As such, the resulting code is basically identical with my original, and
the only *visible* differences are:

 - renaming of _kexec_kernel_image_probe() and  _kimage_file_post_load_cleanup()

 - change one of types of arguments at prepare_elf64_headers()

Those, unfortunately, require a couple of trivial changes on the rest
(#1, #6 to #13) of my arm64 kexec_file patch set[1].

Patch #1 allows making a use of purgatory optional, particularly useful
for arm64.

Patch #2 commonalizes arch_kexec_kernel_{image_probe, image_load,
verify_sig}() and arch_kimage_file_post_load_cleanup() across
architectures.

Patches #3-#7 are also intended to generalize parse_elf64_headers(),
along with exclude_mem_range(), to be made best re-use of.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/561182.html
[2] http://lkml.iu.edu//hypermail/linux/kernel/1802.1/02596.html

This patch (of 7):

On arm64, crash dump kernel's usable memory is protected by *unmapping*
it from kernel virtual space unlike other architectures where the region
is just made read-only.  It is highly unlikely that the region is
accidentally corrupted and this observation rationalizes that digest
check code can also be dropped from purgatory.  The resulting code is so
simple as it doesn't require a bit ugly re-linking/relocation stuff,
i.e.  arch_kexec_apply_relocations_add().

Please see:

   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-December/545428.html

All that the purgatory does is to shuffle arguments and jump into a new
kernel, while we still need to have some space for a hash value
(purgatory_sha256_digest) which is never checked against.

As such, it doesn't make sense to have trampline code between old kernel
and new kernel on arm64.

This patch introduces a new configuration, ARCH_HAS_KEXEC_PURGATORY, and
allows related code to be compiled in only if necessary.

[takahiro.akashi@linaro.org: fix trivial screwup]
  Link: http://lkml.kernel.org/r/20180309093346.GF25863@linaro.org
Link: http://lkml.kernel.org/r/20180306102303.9063-2-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
anholt pushed a commit that referenced this pull request Apr 23, 2018
Patch series "kexec_file: Clean up purgatory load", v2.

Following the discussion with Dave and AKASHI, here are the common code
patches extracted from my recent patch set (Add kexec_file_load support
to s390) [1].  The patches were extracted to allow upstream integration
together with AKASHI's common code patches before the arch code gets
adjusted to the new base.

The reason for this series is to prepare common code for adding
kexec_file_load to s390 as well as cleaning up the mis-use of the
sh_offset field during purgatory load.  In detail this series contains:

Patch #1&2: Minor cleanups/fixes.

Patch #3-9: Clean up the purgatory load/relocation code.  Especially
remove the mis-use of the purgatory_info->sechdrs->sh_offset field,
currently holding a pointer into either kexec_purgatory (ro) or
purgatory_buf (rw) depending on the section.  With these patches the
section address will be calculated verbosely and sh_offset will contain
the offset of the section in the stripped purgatory binary
(purgatory_buf).

Patch #10: Allows architectures to set the purgatory load address.  This
patch is important for s390 as the kernel and purgatory have to be
loaded to fixed addresses.  In current code this is impossible as the
purgatory load is opaque to the architecture.

Patch #11: Moves x86 purgatories sha implementation to common lib/
directory to allow reuse in other architectures.

This patch (of 11)

When building the kernel with CONFIG_KEXEC_FILE enabled gcc prints a
compile warning multiple times.

  In file included from <path>/linux/init/initramfs.c:526:0:
  <path>/include/linux/kexec.h:120:9: warning: `struct kimage' declared inside parameter list [enabled by default]
           unsigned long cmdline_len);
           ^

This is because the typedefs for kexec_file_load uses struct kimage
before it is declared.  Fix this by simply forward declaring struct
kimage.

Link: http://lkml.kernel.org/r/20180321112751.22196-2-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
anholt pushed a commit that referenced this pull request May 7, 2018
[ Upstream commit 2c0aa08 ]

Scenario:
1. Port down and do fail over
2. Ap do rds_bind syscall

PID: 47039  TASK: ffff89887e2fe640  CPU: 47  COMMAND: "kworker/u:6"
 #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9
 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3
 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518
 #3 [ffff898e35f15b60] no_context at ffffffff8104854c
 #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675
 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3
 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8
 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffff898e35f15dc8  RFLAGS: 00010282
    RAX: 00000000fffffffe  RBX: ffff889b77f6fc00  RCX:ffffffff81c99d88
    RDX: 0000000000000000  RSI: ffff896019ee08e8  RDI:ffff889b77f6fc00
    RBP: ffff898e35f15df0   R8: ffff896019ee08c8  R9:0000000000000000
    R10: 0000000000000400  R11: 0000000000000000  R12:ffff896019ee08c0
    R13: ffff889b77f6fe68  R14: ffffffff81c99d80  R15: ffffffffa022a1e0
    ORIG_RAX: ffffffffffffffff  CS: 0010 SS: 0018
 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm]
 #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6
 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0
 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6

PID: 45659  TASK: ffff880d313d2500  CPU: 31  COMMAND: "oracle_45659_ap"
 #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4
 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf
 #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7
 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb
 #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm]
 #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma]
 #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds]
 #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds]
 #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670

PID: 45659                          PID: 47039
rds_ib_laddr_check
  /* create id_priv with a null event_handler */
  rdma_create_id
  rdma_bind_addr
    cma_acquire_dev
      /* add id_priv to cma_dev->id_list */
      cma_attach_to_dev
                                    cma_ndev_work_handler
                                      /* event_hanlder is null */
                                      id_priv->id.event_handler

Signed-off-by: Guanglei Li <guanglei.li@oracle.com>
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
anholt pushed a commit that referenced this pull request Jul 25, 2018
…sfers

This bug happens only when the UDC needs to sleep during usb_ep_dequeue,
as is the case for (at least) dwc3.

[  382.200896] BUG: scheduling while atomic: screen/1808/0x00000100
[  382.207124] 4 locks held by screen/1808:
[  382.211266]  #0:  (rcu_callback){....}, at: [<c10b4ff0>] rcu_process_callbacks+0x260/0x440
[  382.219949]  #1:  (rcu_read_lock_sched){....}, at: [<c1358ba0>] percpu_ref_switch_to_atomic_rcu+0xb0/0x130
[  382.230034]  #2:  (&(&ctx->ctx_lock)->rlock){....}, at: [<c11f0c73>] free_ioctx_users+0x23/0xd0
[  382.230096]  #3:  (&(&ffs->eps_lock)->rlock){....}, at: [<f81e7710>] ffs_aio_cancel+0x20/0x60 [usb_f_fs]
[  382.230160] Modules linked in: usb_f_fs libcomposite configfs bnep btsdio bluetooth ecdh_generic brcmfmac brcmutil intel_powerclamp coretemp dwc3 kvm_intel ulpi udc_core kvm irqbypass crc32_pclmul crc32c_intel pcbc dwc3_pci aesni_intel aes_i586 crypto_simd cryptd ehci_pci ehci_hcd gpio_keys usbcore basincove_gpadc industrialio usb_common
[  382.230407] CPU: 1 PID: 1808 Comm: screen Not tainted 4.14.0-edison+ #117
[  382.230416] Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542 2015.01.21:18.19.48
[  382.230425] Call Trace:
[  382.230438]  <SOFTIRQ>
[  382.230466]  dump_stack+0x47/0x62
[  382.230498]  __schedule_bug+0x61/0x80
[  382.230522]  __schedule+0x43/0x7a0
[  382.230587]  schedule+0x5f/0x70
[  382.230625]  dwc3_gadget_ep_dequeue+0x14c/0x270 [dwc3]
[  382.230669]  ? do_wait_intr_irq+0x70/0x70
[  382.230724]  usb_ep_dequeue+0x19/0x90 [udc_core]
[  382.230770]  ffs_aio_cancel+0x37/0x60 [usb_f_fs]
[  382.230798]  kiocb_cancel+0x31/0x40
[  382.230822]  free_ioctx_users+0x4d/0xd0
[  382.230858]  percpu_ref_switch_to_atomic_rcu+0x10a/0x130
[  382.230881]  ? percpu_ref_exit+0x40/0x40
[  382.230904]  rcu_process_callbacks+0x2b3/0x440
[  382.230965]  __do_softirq+0xf8/0x26b
[  382.231011]  ? __softirqentry_text_start+0x8/0x8
[  382.231033]  do_softirq_own_stack+0x22/0x30
[  382.231042]  </SOFTIRQ>
[  382.231071]  irq_exit+0x45/0xc0
[  382.231089]  smp_apic_timer_interrupt+0x13c/0x150
[  382.231118]  apic_timer_interrupt+0x35/0x3c
[  382.231132] EIP: __copy_user_ll+0xe2/0xf0
[  382.231142] EFLAGS: 00210293 CPU: 1
[  382.231154] EAX: bfd4508c EBX: 00000004 ECX: 00000003 EDX: f3d8fe50
[  382.231165] ESI: f3d8fe51 EDI: bfd4508d EBP: f3d8fe14 ESP: f3d8fe08
[  382.231176]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  382.231265]  core_sys_select+0x25f/0x320
[  382.231346]  ? __wake_up_common_lock+0x62/0x80
[  382.231399]  ? tty_ldisc_deref+0x13/0x20
[  382.231438]  ? ldsem_up_read+0x1b/0x40
[  382.231459]  ? tty_ldisc_deref+0x13/0x20
[  382.231479]  ? tty_write+0x29f/0x2e0
[  382.231514]  ? n_tty_ioctl+0xe0/0xe0
[  382.231541]  ? tty_write_unlock+0x30/0x30
[  382.231566]  ? __vfs_write+0x22/0x110
[  382.231604]  ? security_file_permission+0x2f/0xd0
[  382.231635]  ? rw_verify_area+0xac/0x120
[  382.231677]  ? vfs_write+0x103/0x180
[  382.231711]  SyS_select+0x87/0xc0
[  382.231739]  ? SyS_write+0x42/0x90
[  382.231781]  do_fast_syscall_32+0xd6/0x1a0
[  382.231836]  entry_SYSENTER_32+0x47/0x71
[  382.231848] EIP: 0xb7f75b05
[  382.231857] EFLAGS: 00000246 CPU: 1
[  382.231868] EAX: ffffffda EBX: 00000400 ECX: bfd4508c EDX: bfd4510c
[  382.231878] ESI: 00000000 EDI: 00000000 EBP: 00000000 ESP: bfd45020
[  382.231889]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[  382.232281] softirq: huh, entered softirq 9 RCU c10b4d90 with preempt_count 00000100, exited with 00000000?

Tested-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
lategoodbye pushed a commit that referenced this pull request Sep 5, 2018
Fix the warning below by calling rhashtable_lookup_fast.
Also, make some code movements for better quality and human
readability.

[  342.450870] WARNING: suspicious RCU usage
[  342.455856] 4.18.0-rc2+ #17 Tainted: G           O
[  342.462210] -----------------------------
[  342.467202] ./include/linux/rhashtable.h:481 suspicious rcu_dereference_check() usage!
[  342.476568]
[  342.476568] other info that might help us debug this:
[  342.476568]
[  342.486978]
[  342.486978] rcu_scheduler_active = 2, debug_locks = 1
[  342.495211] 4 locks held by modprobe/3934:
[  342.500265]  #0: 00000000e23116b2 (mlx5_intf_mutex){+.+.}, at:
mlx5_unregister_interface+0x18/0x90 [mlx5_core]
[  342.511953]  #1: 00000000ca16db96 (rtnl_mutex){+.+.}, at: unregister_netdev+0xe/0x20
[  342.521109]  #2: 00000000a46e2c4b (&priv->state_lock){+.+.}, at: mlx5e_close+0x29/0x60
[mlx5_core]
[  342.531642]  #3: 0000000060c5bde3 (mem_id_lock){+.+.}, at: xdp_rxq_info_unreg+0x93/0x6b0
[  342.541206]
[  342.541206] stack backtrace:
[  342.547075] CPU: 12 PID: 3934 Comm: modprobe Tainted: G           O      4.18.0-rc2+ #17
[  342.556621] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[  342.565606] Call Trace:
[  342.568861]  dump_stack+0x78/0xb3
[  342.573086]  xdp_rxq_info_unreg+0x3f5/0x6b0
[  342.578285]  ? __call_rcu+0x220/0x300
[  342.582911]  mlx5e_free_rq+0x38/0xc0 [mlx5_core]
[  342.588602]  mlx5e_close_channel+0x20/0x120 [mlx5_core]
[  342.594976]  mlx5e_close_channels+0x26/0x40 [mlx5_core]
[  342.601345]  mlx5e_close_locked+0x44/0x50 [mlx5_core]
[  342.607519]  mlx5e_close+0x42/0x60 [mlx5_core]
[  342.613005]  __dev_close_many+0xb1/0x120
[  342.617911]  dev_close_many+0xa2/0x170
[  342.622622]  rollback_registered_many+0x148/0x460
[  342.628401]  ? __lock_acquire+0x48d/0x11b0
[  342.633498]  ? unregister_netdev+0xe/0x20
[  342.638495]  rollback_registered+0x56/0x90
[  342.643588]  unregister_netdevice_queue+0x7e/0x100
[  342.649461]  unregister_netdev+0x18/0x20
[  342.654362]  mlx5e_remove+0x2a/0x50 [mlx5_core]
[  342.659944]  mlx5_remove_device+0xe5/0x110 [mlx5_core]
[  342.666208]  mlx5_unregister_interface+0x39/0x90 [mlx5_core]
[  342.673038]  cleanup+0x5/0xbfc [mlx5_core]
[  342.678094]  __x64_sys_delete_module+0x16b/0x240
[  342.683725]  ? do_syscall_64+0x1c/0x210
[  342.688476]  do_syscall_64+0x5a/0x210
[  342.693025]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 8d5d885 ("xdp: rhashtable with allocator ID to pointer mapping")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
lategoodbye pushed a commit that referenced this pull request Sep 5, 2018
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core().  But there is
some code that we do not really need to run when we are coming from such
path.

free_area_init_core() performs the following actions:

1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
   when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
   memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone

When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.

Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages.  For the same reason, action #3 results
always in manages_pages being 0.

Action #5 and #6 are performed later on when onlining the pages:
 online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
 online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

This patch does two things:

First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:

1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data

These two functions are: pgdat_init_internals() and zone_init_internals().

The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():

Currently, we call free_area_init_node() from the memhotplug path.  In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.

Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.

Actually, we re-set these values to 0 later on with the calls to:

reset_node_managed_pages()
reset_node_present_pages()

The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:

online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.

[osalvador@techadventures.net: fix section usage]
  Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
  Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lategoodbye pushed a commit that referenced this pull request Sep 5, 2018
Patch series "add support for relative references in special sections", v10.

This adds support for emitting special sections such as initcall arrays,
PCI fixups and tracepoints as relative references rather than absolute
references.  This reduces the size by 50% on 64-bit architectures, but
more importantly, it removes the need for carrying relocation metadata for
these sections in relocatable kernels (e.g., for KASLR) that needs to be
fixed up at boot time.  On arm64, this reduces the vmlinux footprint of
such a reference by 8x (8 byte absolute reference + 24 byte RELA entry vs
4 byte relative reference)

Patch #3 was sent out before as a single patch.  This series supersedes
the previous submission.  This version makes relative ksymtab entries
dependent on the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS rather
than trying to infer from kbuild test robot replies for which
architectures it should be blacklisted.

Patch #1 introduces the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS,
and sets it for the main architectures that are expected to benefit the
most from this feature, i.e., 64-bit architectures or ones that use
runtime relocations.

Patch #2 add support for #define'ing __DISABLE_EXPORTS to get rid of
ksymtab/kcrctab sections in decompressor and EFI stub objects when
rebuilding existing C files to run in a different context.

Patches #4 - #6 implement relative references for initcalls, PCI fixups
and tracepoints, respectively, all of which produce sections with order
~1000 entries on an arm64 defconfig kernel with tracing enabled.  This
means we save about 28 KB of vmlinux space for each of these patches.

[From the v7 series blurb, which included the jump_label patches as well]:

  For the arm64 kernel, all patches combined reduce the memory footprint
  of vmlinux by about 1.3 MB (using a config copied from Ubuntu that has
  KASLR enabled), of which ~1 MB is the size reduction of the RELA section
  in .init, and the remaining 300 KB is reduction of .text/.data.

This patch (of 6):

Before updating certain subsystems to use place relative 32-bit
relocations in special sections, to save space and reduce the number of
absolute relocations that need to be processed at runtime by relocatable
kernels, introduce the Kconfig symbol and define it for some architectures
that should be able to support and benefit from it.

Link: http://lkml.kernel.org/r/20180704083651.24360-2-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: James Morris <jmorris@namei.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
Cc: James Morris <james.morris@microsoft.com>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lategoodbye pushed a commit that referenced this pull request May 25, 2019
Patch series "lib/sort & lib/list_sort: faster and smaller", v2.

Because CONFIG_RETPOLINE has made indirect calls much more expensive, I
thought I'd try to reduce the number made by the library sort functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization.  The built-in swap has special cases
for aligned 4- and 8-byte objects.  But those are almost never used;
most calls to sort() work on larger structures, which fall back to the
byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
thrashing the store buffers as much.)

Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that nice
simple solid heapsort is preferable to more complex algorithms (sorry,
Andrey), but it's possible to implement heapsort with far fewer
comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
than the way it's been done up to now.  And with some care, the code
ends up smaller, as well.  This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added
to the net code of late.  The great majority of the callers use the
builtin swap functions, so replace the indirect call to sort_func with a
(highly preditable) series of if() statements.  Rather surprisingly,
this decreased code size, as the swap functions were inlined and their
prologue & epilogue code eliminated.

lib/list_sort.c is a bit trickier, as merge sort is already close to
optimal, and we don't want to introduce triumphs of theory over
practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size
and removes the part[MAX_LIST_LENGTH+1] pointer array (and the
corresponding upper limit on efficiently sortable input size).

Patch #5 improves the algorithm.  The previous code is already optimal
for power-of-two (or slightly smaller) size inputs, but when the input
size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but
they all depend on the "breadth-first" merge order which was replaced by
commit 835cc0c with a more cache-friendly "depth-first" order.
Some hard thinking came up with a depth-first algorithm which defers
merges as little as possible while avoiding bad merges.  This saves
0.2*n compares, averaged over all sizes.

The code size increase is minimal (64 bytes on x86-64, reducing the net
savings to 26%), but the comments expanded significantly to document the
clever algorithm.

TESTING NOTES: I have some ugly user-space benchmarking code which I
used for testing before moving this code into the kernel.  Shout if you
want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last
round of minor edits to quell checkpatch.  I figure there will be at
least one round of comments and final testing.

This patch (of 5):

Rather than having special-case swap functions for 4- and 8-byte
objects, special-case aligned multiples of 4 or 8 bytes.  This speeds up
most users of sort() by avoiding fallback to the byte copy loop.

Despite what ca96ab8 ("lib/sort: Add 64 bit swap function") claims,
very few users of sort() sort pointers (or pointer-sized objects); most
sort structures containing at least two words.  (E.g.
drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct
acpi_fan_fps.)

The functions also got renamed to reflect the fact that they support
multiple words.  In the great tradition of bikeshedding, the names were
by far the most contentious issue during review of this patch series.

x86-64 code size 872 -> 886 bytes (+14)

With feedback from Andy Shevchenko, Rasmus Villemoes and Geert
Uytterhoeven.

Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lategoodbye pushed a commit that referenced this pull request Oct 10, 2019
Observe a segmentation fault when 'perf stat' is asked to repeat forever
with the interval option.

Without fix:

  # perf stat -r 0 -I 5000 -e cycles -a sleep 10
  #           time             counts unit events
       5.000211692  3,13,89,82,34,157      cycles
      10.000380119  1,53,98,52,22,294      cycles
      10.040467280       17,16,79,265      cycles
  Segmentation fault

This problem was only observed when we use forever option aka -r 0 and
works with limited repeats. Calling print_counter with ts being set to
NULL, is not a correct option when interval is set. Hence avoid
print_counter(NULL,..)  if interval is set.

With fix:

  # perf stat -r 0 -I 5000 -e cycles -a sleep 10
   #           time             counts unit events
       5.019866622  3,15,14,43,08,697      cycles
      10.039865756  3,15,16,31,95,261      cycles
      10.059950628     1,26,05,47,158      cycles
       5.009902655  3,14,52,62,33,932      cycles
      10.019880228  3,14,52,22,89,154      cycles
      10.030543876       66,90,18,333      cycles
       5.009848281  3,14,51,98,25,437      cycles
      10.029854402  3,15,14,93,04,918      cycles
       5.009834177  3,14,51,95,92,316      cycles

Committer notes:

Did the 'git bisect' to find the cset introducing the problem to add the
Fixes tag below, and at that time the problem reproduced as:

  (gdb) run stat -r0 -I500 sleep 1
  <SNIP>
  Program received signal SIGSEGV, Segmentation fault.
  print_interval (prefix=prefix@entry=0x7fffffffc8d0 "", ts=ts@entry=0x0) at builtin-stat.c:866
  866		sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, csv_sep);
  (gdb) bt
  #0  print_interval (prefix=prefix@entry=0x7fffffffc8d0 "", ts=ts@entry=0x0) at builtin-stat.c:866
  #1  0x000000000041860a in print_counters (ts=ts@entry=0x0, argc=argc@entry=2, argv=argv@entry=0x7fffffffd640) at builtin-stat.c:938
  #2  0x0000000000419a7f in cmd_stat (argc=2, argv=0x7fffffffd640, prefix=<optimized out>) at builtin-stat.c:1411
  #3  0x000000000045c65a in run_builtin (p=p@entry=0x6291b8 <commands+216>, argc=argc@entry=5, argv=argv@entry=0x7fffffffd640) at perf.c:370
  #4  0x000000000045c893 in handle_internal_command (argc=5, argv=0x7fffffffd640) at perf.c:429
  #5  0x000000000045c8f1 in run_argv (argcp=argcp@entry=0x7fffffffd4ac, argv=argv@entry=0x7fffffffd4a0) at perf.c:473
  #6  0x000000000045cac9 in main (argc=<optimized out>, argv=<optimized out>) at perf.c:588
  (gdb)

Mostly the same as just before this patch:

  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005874a7 in print_interval (config=0xa1f2a0 <stat_config>, evlist=0xbc9b90, prefix=0x7fffffffd1c0 "`", ts=0x0) at util/stat-display.c:964
  964		sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
  (gdb) bt
  #0  0x00000000005874a7 in print_interval (config=0xa1f2a0 <stat_config>, evlist=0xbc9b90, prefix=0x7fffffffd1c0 "`", ts=0x0) at util/stat-display.c:964
  #1  0x0000000000588047 in perf_evlist__print_counters (evlist=0xbc9b90, config=0xa1f2a0 <stat_config>, _target=0xa1f0c0 <target>, ts=0x0, argc=2, argv=0x7fffffffd670)
      at util/stat-display.c:1172
  #2  0x000000000045390f in print_counters (ts=0x0, argc=2, argv=0x7fffffffd670) at builtin-stat.c:656
  #3  0x0000000000456bb5 in cmd_stat (argc=2, argv=0x7fffffffd670) at builtin-stat.c:1960
  #4  0x00000000004dd2e0 in run_builtin (p=0xa30e00 <commands+288>, argc=5, argv=0x7fffffffd670) at perf.c:310
  #5  0x00000000004dd54d in handle_internal_command (argc=5, argv=0x7fffffffd670) at perf.c:362
  #6  0x00000000004dd694 in run_argv (argcp=0x7fffffffd4cc, argv=0x7fffffffd4c0) at perf.c:406
  #7  0x00000000004dda11 in main (argc=5, argv=0x7fffffffd670) at perf.c:531
  (gdb)

Fixes: d4f63a4 ("perf stat: Introduce print_counters function")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org # v4.2+
Link: http://lore.kernel.org/lkml/20190904094738.9558-3-srikar@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
lategoodbye pushed a commit that referenced this pull request Oct 10, 2019
We release wrong pointer on error path in cpu_cache_level__read
function, leading to segfault:

  (gdb) r record ls
  Starting program: /root/perf/tools/perf/perf record ls
  ...
  [ perf record: Woken up 1 times to write data ]
  double free or corruption (out)

  Thread 1 "perf" received signal SIGABRT, Aborted.
  0x00007ffff7463798 in raise () from /lib64/power9/libc.so.6
  (gdb) bt
  #0  0x00007ffff7463798 in raise () from /lib64/power9/libc.so.6
  #1  0x00007ffff7443bac in abort () from /lib64/power9/libc.so.6
  #2  0x00007ffff74af8bc in __libc_message () from /lib64/power9/libc.so.6
  #3  0x00007ffff74b92b8 in malloc_printerr () from /lib64/power9/libc.so.6
  #4  0x00007ffff74bb874 in _int_free () from /lib64/power9/libc.so.6
  #5  0x0000000010271260 in __zfree (ptr=0x7fffffffa0b0) at ../../lib/zalloc..
  #6  0x0000000010139340 in cpu_cache_level__read (cache=0x7fffffffa090, cac..
  #7  0x0000000010143c90 in build_caches (cntp=0x7fffffffa118, size=<optimiz..
  ...

Releasing the proper pointer.

Fixes: 720e98b ("perf tools: Add perf data cache feature")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org: # v4.6+
Link: http://lore.kernel.org/lkml/20190912105235.10689-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
lategoodbye pushed a commit that referenced this pull request Oct 10, 2019
I'm seeing a bunch of debug prints from a user of print_hex_dump_bytes()
in my kernel logs, but I don't have CONFIG_DYNAMIC_DEBUG enabled nor do I
have DEBUG defined in my build.  The problem is that
print_hex_dump_bytes() calls a wrapper function in lib/hexdump.c that
calls print_hex_dump() with KERN_DEBUG level.  There are three cases to
consider here

  1. CONFIG_DYNAMIC_DEBUG=y  --> call dynamic_hex_dum()
  2. CONFIG_DYNAMIC_DEBUG=n && DEBUG --> call print_hex_dump()
  3. CONFIG_DYNAMIC_DEBUG=n && !DEBUG --> stub it out

Right now, that last case isn't detected and we still call
print_hex_dump() from the stub wrapper.

Let's make print_hex_dump_bytes() only call print_hex_dump_debug() so that
it works properly in all cases.

Case #1, print_hex_dump_debug() calls dynamic_hex_dump() and we get same
behavior.  Case #2, print_hex_dump_debug() calls print_hex_dump() with
KERN_DEBUG and we get the same behavior.  Case #3, print_hex_dump_debug()
is a nop, changing behavior to what we want, i.e.  print nothing.

Link: http://lkml.kernel.org/r/20190816235624.115280-1-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lategoodbye pushed a commit that referenced this pull request Oct 10, 2019
Ido Schimmel says:

====================
mlxsw: Various fixes

This patchset includes two small fixes for the mlxsw driver and one
patch which clarifies recently introduced devlink-trap documentation.

Patch #1 clears the port's VLAN filters during port initialization. This
ensures that the drop reason reported to the user is consistent. The
problem is explained in detail in the commit message.

Patch #2 clarifies the description of one of the traps exposed via
devlink-trap.

Patch #3 from Danielle forbids the installation of a tc filter with
multiple mirror actions since this is not supported by the device. The
failure is communicated to the user via extack.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants