Skip to content
This repository has been archived by the owner on Oct 3, 2024. It is now read-only.

The Centos7.5 can not support #42

Closed
alanxie1129 opened this issue Jun 4, 2018 · 4 comments
Closed

The Centos7.5 can not support #42

alanxie1129 opened this issue Jun 4, 2018 · 4 comments

Comments

@alanxie1129
Copy link

git clone https://github.com/intel/gvt-linux.git

cd gvt-linux

git checkout gvt-stable-4.14

echo ""|make oldconfig

vi .config

change y for CONFIG_DRM_I915_GVT, CONFIG_DRM_I915_GVT_KVMGT and CONFIG_DRM_I915_GVT_XENGT CONFIG_VFIO_MDEV and CONFIG_VFIO_MDEV_DEVICE.

make -j8 && make modules_install && make install

vi /boot/grub2/grub.cfg

menuentry 'CentOS Linux (4.14.15+) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-862.el7.x86_64-advanced-dacc0767-4de5-45f5-a580-3d7cfa65073d' {
load_video
set gfxpayload=keep
insmod gzio
insmod part_msdos
insmod xfs
set root='hd0,msdos1'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1' 0216cf0f-102e-41e8-b562-ce6d35d06fc1
else
search --no-floppy --fs-uuid --set=root 0216cf0f-102e-41e8-b562-ce6d35d06fc1
fi
linux16 /vmlinuz-4.14.15+ root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=en_US.UTF-8 i915.enable_gvt=1 kvm.ignore_msrs=1
initrd16 /initramfs-4.14.15+.img
}

done

reboot os from new kernel boot

no this /sys/class/mdev_bus

but same practice for ubuntu 18.04 is good ,Centos7.5 is not support

@TerrenceXu
Copy link

Hi, Centos7.5 is also supported for GVT-g.
You can add (add_drivers+="kvmgt vfio-iommu-type1 vfio-mdev") into /etc/dracut.conf, and regenerate initrd, and try again. :)

@alanxie1129
Copy link
Author

@TerrenceXu
Did you mean rhel does not need to install a new kernel? Just add add_drivers+="kvmgt vfio-iommu-type1 vfio-mdev" in /etc/dracut.conf? Can you tell us the detailed steps?

@alanxie1129
Copy link
Author

edit /etc/dracut.conf
add_drivers+="kvmgt vfio-iommu-type1 vfio-mdev"

dracut -f /boot/initramfs-3.10.0-862.el7.x86_64.img
or
dracut -f /boot/initramfs-4.17.0-1.el7.elrepo.x86_64.img

Kernel startup failed

zhenyw pushed a commit that referenced this issue Nov 19, 2018
Increase kasan instrumented kernel stack size from 32k to 64k. Other
architectures seems to get away with just doubling kernel stack size under
kasan, but on s390 this appears to be not enough due to bigger frame size.
The particular pain point is kasan inlined checks (CONFIG_KASAN_INLINE
vs CONFIG_KASAN_OUTLINE). With inlined checks one particular case hitting
stack overflow is fs sync on xfs filesystem:

 #0 [9a0681e8]  704 bytes  check_usage at 34b1fc
 #1 [9a0684a8]  432 bytes  check_usage at 34c710
 #2 [9a068658]  1048 bytes  validate_chain at 35044a
 #3 [9a068a70]  312 bytes  __lock_acquire at 3559fe
 #4 [9a068ba8]  440 bytes  lock_acquire at 3576ee
 #5 [9a068d60]  104 bytes  _raw_spin_lock at 21b44e0
 #6 [9a068dc8]  1992 bytes  enqueue_entity at 2dbf72
 #7 [9a069590]  1496 bytes  enqueue_task_fair at 2df5f0
 #8 [9a069b68]  64 bytes  ttwu_do_activate at 28f438
 #9 [9a069ba8]  552 bytes  try_to_wake_up at 298c4c
 #10 [9a069dd0]  168 bytes  wake_up_worker at 23f97c
 #11 [9a069e78]  200 bytes  insert_work at 23fc2e
 #12 [9a069f40]  648 bytes  __queue_work at 2487c0
 #13 [9a06a1c8]  200 bytes  __queue_delayed_work at 24db28
 #14 [9a06a290]  248 bytes  mod_delayed_work_on at 24de84
 #15 [9a06a388]  24 bytes  kblockd_mod_delayed_work_on at 153e2a0
 #16 [9a06a3a0]  288 bytes  __blk_mq_delay_run_hw_queue at 158168c
 #17 [9a06a4c0]  192 bytes  blk_mq_run_hw_queue at 1581a3c
 #18 [9a06a580]  184 bytes  blk_mq_sched_insert_requests at 15a2192
 #19 [9a06a638]  1024 bytes  blk_mq_flush_plug_list at 1590f3a
 #20 [9a06aa38]  704 bytes  blk_flush_plug_list at 1555028
 #21 [9a06acf8]  320 bytes  schedule at 219e476
 #22 [9a06ae38]  760 bytes  schedule_timeout at 21b0aac
 #23 [9a06b130]  408 bytes  wait_for_common at 21a1706
 #24 [9a06b2c8]  360 bytes  xfs_buf_iowait at fa1540
 #25 [9a06b430]  256 bytes  __xfs_buf_submit at fadae6
 #26 [9a06b530]  264 bytes  xfs_buf_read_map at fae3f6
 #27 [9a06b638]  656 bytes  xfs_trans_read_buf_map at 10ac9a8
 #28 [9a06b8c8]  304 bytes  xfs_btree_kill_root at e72426
 #29 [9a06b9f8]  288 bytes  xfs_btree_lookup_get_block at e7bc5e
 #30 [9a06bb18]  624 bytes  xfs_btree_lookup at e7e1a6
 #31 [9a06bd88]  2664 bytes  xfs_alloc_ag_vextent_near at dfa070
 #32 [9a06c7f0]  144 bytes  xfs_alloc_ag_vextent at dff3ca
 #33 [9a06c880]  1128 bytes  xfs_alloc_vextent at e05fce
 #34 [9a06cce8]  584 bytes  xfs_bmap_btalloc at e58342
 #35 [9a06cf30]  1336 bytes  xfs_bmapi_write at e618de
 #36 [9a06d468]  776 bytes  xfs_iomap_write_allocate at ff678e
 #37 [9a06d770]  720 bytes  xfs_map_blocks at f82af8
 #38 [9a06da40]  928 bytes  xfs_writepage_map at f83cd6
 #39 [9a06dde0]  320 bytes  xfs_do_writepage at f85872
 #40 [9a06df20]  1320 bytes  write_cache_pages at 73dfe8
 #41 [9a06e448]  208 bytes  xfs_vm_writepages at f7f892
 #42 [9a06e518]  88 bytes  do_writepages at 73fe6a
 #43 [9a06e570]  872 bytes  __writeback_single_inode at a20cb6
 #44 [9a06e8d8]  664 bytes  writeback_sb_inodes at a23be2
 #45 [9a06eb70]  296 bytes  __writeback_inodes_wb at a242e0
 #46 [9a06ec98]  928 bytes  wb_writeback at a2500e
 #47 [9a06f038]  848 bytes  wb_do_writeback at a260ae
 #48 [9a06f388]  536 bytes  wb_workfn at a28228
 #49 [9a06f5a0]  1088 bytes  process_one_work at 24a234
 #50 [9a06f9e0]  1120 bytes  worker_thread at 24ba26
 #51 [9a06fe40]  104 bytes  kthread at 26545a
 #52 [9a06fea8]             kernel_thread_starter at 21b6b62

To be able to increase the stack size to 64k reuse LLILL instruction
in __switch_to function to load 64k - STACK_FRAME_OVERHEAD - __PT_SIZE
(65192) value as unsigned.

Reported-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
zhenyw pushed a commit that referenced this issue Dec 10, 2018
Driver doesn't release rtnl lock if registration with
L2 driver (bnxt_re_register_netdev) fais and this causes
hang while requesting for the next lock.

[  371.635416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  371.635417] kworker/u48:1   D    0   634      2 0x80000000
[  371.635423] Workqueue: bnxt_re bnxt_re_task [bnxt_re]
[  371.635424] Call Trace:
[  371.635426]  ? __schedule+0x36b/0xbd0
[  371.635429]  schedule+0x39/0x90
[  371.635430]  schedule_preempt_disabled+0x11/0x20
[  371.635431]  __mutex_lock+0x45b/0x9c0
[  371.635433]  ? __mutex_lock+0x16d/0x9c0
[  371.635435]  ? bnxt_re_ib_reg+0x2b/0xb30 [bnxt_re]
[  371.635438]  ? wake_up_klogd+0x37/0x40
[  371.635442]  bnxt_re_ib_reg+0x2b/0xb30 [bnxt_re]
[  371.635447]  bnxt_re_task+0xfd/0x180 [bnxt_re]
[  371.635449]  process_one_work+0x216/0x5b0
[  371.635450]  ? process_one_work+0x189/0x5b0
[  371.635453]  worker_thread+0x4e/0x3d0
[  371.635455]  kthread+0x10e/0x140
[  371.635456]  ? process_one_work+0x5b0/0x5b0
[  371.635458]  ? kthread_stop+0x220/0x220
[  371.635460]  ret_from_fork+0x3a/0x50
[  371.635477] INFO: task NetworkManager:1228 blocked for more than 120 seconds.
[  371.635478]       Tainted: G    B      OE     4.20.0-rc1+ #42
[  371.635479] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Release the rtnl_lock correctly in the failure path.

Fixes: de5c95d ("RDMA/bnxt_re: Fix system crash during RDMA resource initialization")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
zhenyw pushed a commit that referenced this issue Dec 10, 2018
When bnxt_re_ib_reg returns failure, the device structure gets
freed. Driver tries to access the device pointer
after it is freed.

[ 4871.034744] Failed to register with netedev: 0xffffffa1
[ 4871.034765] infiniband (null): Failed to register with IB: 0xffffffea
[ 4871.046430] ==================================================================
[ 4871.046437] BUG: KASAN: use-after-free in bnxt_re_task+0x63/0x180 [bnxt_re]
[ 4871.046439] Write of size 4 at addr ffff880fa8406f48 by task kworker/u48:2/17813

[ 4871.046443] CPU: 20 PID: 17813 Comm: kworker/u48:2 Kdump: loaded Tainted: G B OE  4.20.0-rc1+ #42
[ 4871.046444] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
[ 4871.046447] Workqueue: bnxt_re bnxt_re_task [bnxt_re]
[ 4871.046449] Call Trace:
[ 4871.046454]  dump_stack+0x91/0xeb
[ 4871.046458]  print_address_description+0x6a/0x2a0
[ 4871.046461]  kasan_report+0x176/0x2d0
[ 4871.046463]  ? bnxt_re_task+0x63/0x180 [bnxt_re]
[ 4871.046466]  bnxt_re_task+0x63/0x180 [bnxt_re]
[ 4871.046470]  process_one_work+0x216/0x5b0
[ 4871.046471]  ? process_one_work+0x189/0x5b0
[ 4871.046475]  worker_thread+0x4e/0x3d0
[ 4871.046479]  kthread+0x10e/0x140
[ 4871.046480]  ? process_one_work+0x5b0/0x5b0
[ 4871.046482]  ? kthread_stop+0x220/0x220
[ 4871.046486]  ret_from_fork+0x3a/0x50

[ 4871.046492] The buggy address belongs to the page:
[ 4871.046494] page:ffffea003ea10180 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[ 4871.046495] flags: 0x57ffffc0000000()
[ 4871.046498] raw: 0057ffffc0000000 0000000000000000 ffffea003ea10188 0000000000000000
[ 4871.046500] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 4871.046501] page dumped because: kasan: bad access detected

Avoid accessing the device structure once it is freed.

Fixes: 497158a ("RDMA/bnxt_re: Fix the ib_reg failure cleanup")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
zhenyw pushed a commit that referenced this issue Dec 10, 2018
We see the following lockdep warning:

[ 2284.078521] ======================================================
[ 2284.078604] WARNING: possible circular locking dependency detected
[ 2284.078604] 4.19.0+ #42 Tainted: G            E
[ 2284.078604] ------------------------------------------------------
[ 2284.078604] rmmod/254 is trying to acquire lock:
[ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0
[ 2284.078604]
[ 2284.078604] but task is already holding lock:
[ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc]
[ 2284.078604]
[ 2284.078604] which lock already depends on the new lock.
[ 2284.078604]
[ 2284.078604]
[ 2284.078604] the existing dependency chain (in reverse order) is:
[ 2284.078604]
[ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
[ 2284.078604]        tipc_node_timeout+0x20a/0x330 [tipc]
[ 2284.078604]        call_timer_fn+0xa1/0x280
[ 2284.078604]        run_timer_softirq+0x1f2/0x4d0
[ 2284.078604]        __do_softirq+0xfc/0x413
[ 2284.078604]        irq_exit+0xb5/0xc0
[ 2284.078604]        smp_apic_timer_interrupt+0xac/0x210
[ 2284.078604]        apic_timer_interrupt+0xf/0x20
[ 2284.078604]        default_idle+0x1c/0x140
[ 2284.078604]        do_idle+0x1bc/0x280
[ 2284.078604]        cpu_startup_entry+0x19/0x20
[ 2284.078604]        start_secondary+0x187/0x1c0
[ 2284.078604]        secondary_startup_64+0xa4/0xb0
[ 2284.078604]
[ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
[ 2284.078604]        del_timer_sync+0x34/0xa0
[ 2284.078604]        tipc_node_delete+0x1a/0x40 [tipc]
[ 2284.078604]        tipc_node_stop+0xcb/0x190 [tipc]
[ 2284.078604]        tipc_net_stop+0x154/0x170 [tipc]
[ 2284.078604]        tipc_exit_net+0x16/0x30 [tipc]
[ 2284.078604]        ops_exit_list.isra.8+0x36/0x70
[ 2284.078604]        unregister_pernet_operations+0x87/0xd0
[ 2284.078604]        unregister_pernet_subsys+0x1d/0x30
[ 2284.078604]        tipc_exit+0x11/0x6f2 [tipc]
[ 2284.078604]        __x64_sys_delete_module+0x1df/0x240
[ 2284.078604]        do_syscall_64+0x66/0x460
[ 2284.078604]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2284.078604]
[ 2284.078604] other info that might help us debug this:
[ 2284.078604]
[ 2284.078604]  Possible unsafe locking scenario:
[ 2284.078604]
[ 2284.078604]        CPU0                    CPU1
[ 2284.078604]        ----                    ----
[ 2284.078604]   lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604]                                lock((&n->timer)#2);
[ 2284.078604]                                lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604]   lock((&n->timer)#2);
[ 2284.078604]
[ 2284.078604]  *** DEADLOCK ***
[ 2284.078604]
[ 2284.078604] 3 locks held by rmmod/254:
[ 2284.078604]  #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30
[ 2284.078604]  #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc]
[ 2284.078604]  #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19
[...}

The reason is that the node timer handler sometimes needs to delete a
node which has been disconnected for too long. To do this, it grabs
the lock 'node_list_lock', which may at the same time be held by the
generic node cleanup function, tipc_node_stop(), during module removal.
Since the latter is calling del_timer_sync() inside the same lock, we
have a potential deadlock.

We fix this letting the timer cleanup function use spin_trylock()
instead of just spin_lock(), and when it fails to grab the lock it
just returns so that the timer handler can terminate its execution.
This is safe to do, since tipc_node_stop() anyway is about to
delete both the timer and the node instance.

Fixes: 6a939f3 ("tipc: Auto removal of peer down node instance")
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
zhenyw pushed a commit that referenced this issue Mar 29, 2019
We might have never enabled (started) the psock's parser, in which case it
will not get stopped when destroying the psock. This leads to a warning
when trying to cancel parser's work from psock's deferred destructor:

[  405.325769] WARNING: CPU: 1 PID: 3216 at net/strparser/strparser.c:526 strp_done+0x3c/0x40
[  405.326712] Modules linked in: [last unloaded: test_bpf]
[  405.327359] CPU: 1 PID: 3216 Comm: kworker/1:164 Tainted: G        W         5.0.0 #42
[  405.328294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[  405.329712] Workqueue: events sk_psock_destroy_deferred
[  405.330254] RIP: 0010:strp_done+0x3c/0x40
[  405.330706] Code: 28 e8 b8 d5 6b ff 48 8d bb 80 00 00 00 e8 9c d5 6b ff 48 8b 7b 18 48 85 ff 74 0d e8 1e a5 e8 ff 48 c7 43 18 00 00 00 00 5b c3 <0f> 0b eb cf 66 66 66 66 90 55 89 f5 53 48 89 fb 48 83 c7 28 e8 0b
[  405.332862] RSP: 0018:ffffc900026bbe50 EFLAGS: 00010246
[  405.333482] RAX: ffffffff819323e0 RBX: ffff88812cb83640 RCX: ffff88812cb829e8
[  405.334228] RDX: 0000000000000001 RSI: ffff88812cb837e8 RDI: ffff88812cb83640
[  405.335366] RBP: ffff88813fd22680 R08: 0000000000000000 R09: 000073746e657665
[  405.336472] R10: 8080808080808080 R11: 0000000000000001 R12: ffff88812cb83600
[  405.337760] R13: 0000000000000000 R14: ffff88811f401780 R15: ffff88812cb837e8
[  405.338777] FS:  0000000000000000(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
[  405.339903] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  405.340821] CR2: 00007fb11489a6b8 CR3: 000000012d4d6000 CR4: 00000000000406e0
[  405.341981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  405.343131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  405.344415] Call Trace:
[  405.344821]  sk_psock_destroy_deferred+0x23/0x1b0
[  405.345585]  process_one_work+0x1ae/0x3e0
[  405.346110]  worker_thread+0x3c/0x3b0
[  405.346576]  ? pwq_unbound_release_workfn+0xd0/0xd0
[  405.347187]  kthread+0x11d/0x140
[  405.347601]  ? __kthread_parkme+0x80/0x80
[  405.348108]  ret_from_fork+0x35/0x40
[  405.348566] ---[ end trace a4a3af4026a327d4 ]---

Stop psock's parser just before canceling its work.

Fixes: 1d79895 ("sk_msg: Always cancel strp work before freeing the psock")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
zhenyw pushed a commit that referenced this issue Jun 11, 2019
Commit 3309bec ("KVM: PPC: Book3S HV: Fix lockdep warning when
entering the guest") moved calls to trace_hardirqs_{on,off} in the
entry path used for HPT guests.  Similar code exists in the new
streamlined entry path used for radix guests on POWER9.  This makes
the same change there, so as to avoid lockdep warnings such as this:

[  228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
[  228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270
[  228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat
+xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter
+ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf
+uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs
+zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor
+raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea
+sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core
[  228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42
[  228.686911] NIP:  c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20
[  228.686963] REGS: c000200cdb50f570 TRAP: 0700   Not tainted  (5.2.0-rc1-xive+)
[  228.687001] MSR:  9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 48222222  XER: 20040000
[  228.687060] CFAR: c000000000116db0 IRQMASK: 1
[  228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e
[  228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f
[  228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000
[  228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000
[  228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900
[  228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074
[  228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000
[  228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0
[  228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270
[  228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270
[  228.687466] Call Trace:
[  228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable)
[  228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0
[  228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100
[  228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340
[  228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv]
[  228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv]
[  228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[  228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[  228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm]
[  228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0
[  228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120
[  228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80
[  228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70
[  228.688109] Instruction dump:
[  228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87
[  228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87
[  228.688251] irq event stamp: 205
[  228.688287] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
[  228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
[  228.688412] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
[  228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
[  228.688513] ---[ end trace eb16f6260022a812 ]---
[  228.688548] possible reason: unannotated irqs-off.
[  228.688571] irq event stamp: 205
[  228.688607] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
[  228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
[  228.688719] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
[  228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210

Cc: stable@vger.kernel.org # v4.20+
Fixes: 95a6432 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
zhenyw pushed a commit that referenced this issue Aug 21, 2019
…ath device

When the user issues a command with side effects, we will end up freezing
the namespace request queue when updating disk info (and the same for
the corresponding mpath disk node).

However, we are not freezing the mpath node request queue,
which means that mpath I/O can still come in and block on blk_queue_enter
(called from nvme_ns_head_make_request -> direct_make_request).

This is a deadlock, because blk_queue_enter will block until the inner
namespace request queue is unfroze, but that process is blocked because
the namespace revalidation is trying to update the mpath disk info
and freeze its request queue (which will never complete because
of the I/O that is blocked on blk_queue_enter).

Fix this by freezing all the subsystem nsheads request queues before
executing the passthru command. Given that these commands are infrequent
we should not worry about this temporary I/O freeze to keep things sane.

Here is the matching hang traces:
--
[ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds.
[ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42
[ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 374.487274] systemd-udevd D 0 17994 1 0x00000000
[ 374.493407] Call Trace:
[ 374.496145] __schedule+0x2ef/0x620
[ 374.500047] schedule+0x38/0xa0
[ 374.503569] blk_queue_enter+0x139/0x220
[ 374.507959] ? remove_wait_queue+0x60/0x60
[ 374.512540] direct_make_request+0x60/0x130
[ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core]
[ 374.523740] ? generic_make_request_checks+0x307/0x6f0
[ 374.529484] generic_make_request+0x10d/0x2e0
[ 374.534356] submit_bio+0x75/0x140
[ 374.538163] ? guard_bio_eod+0x32/0xe0
[ 374.542361] submit_bh_wbc+0x171/0x1b0
[ 374.546553] block_read_full_page+0x1ed/0x330
[ 374.551426] ? check_disk_change+0x70/0x70
[ 374.556008] ? scan_shadow_nodes+0x30/0x30
[ 374.560588] blkdev_readpage+0x18/0x20
[ 374.564783] do_read_cache_page+0x301/0x860
[ 374.569463] ? blkdev_writepages+0x10/0x10
[ 374.574037] ? prep_new_page+0x88/0x130
[ 374.578329] ? get_page_from_freelist+0xa2f/0x1280
[ 374.583688] ? __alloc_pages_nodemask+0x179/0x320
[ 374.588947] read_cache_page+0x12/0x20
[ 374.593142] read_dev_sector+0x2d/0xd0
[ 374.597337] read_lba+0x104/0x1f0
[ 374.601046] find_valid_gpt+0xfa/0x720
[ 374.605243] ? string_nocheck+0x58/0x70
[ 374.609534] ? find_valid_gpt+0x720/0x720
[ 374.614016] efi_partition+0x89/0x430
[ 374.618113] ? string+0x48/0x60
[ 374.621632] ? snprintf+0x49/0x70
[ 374.625339] ? find_valid_gpt+0x720/0x720
[ 374.629828] check_partition+0x116/0x210
[ 374.634214] rescan_partitions+0xb6/0x360
[ 374.638699] __blkdev_reread_part+0x64/0x70
[ 374.643377] blkdev_reread_part+0x23/0x40
[ 374.647860] blkdev_ioctl+0x48c/0x990
[ 374.651956] block_ioctl+0x41/0x50
[ 374.655766] do_vfs_ioctl+0xa7/0x600
[ 374.659766] ? locks_lock_inode_wait+0xb1/0x150
[ 374.664832] ksys_ioctl+0x67/0x90
[ 374.668539] __x64_sys_ioctl+0x1a/0x20
[ 374.672732] do_syscall_64+0x5a/0x1c0
[ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds.
[ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42
[ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 374.760170] nvmeadm D 0 49141 36333 0x00004080
[ 374.766301] Call Trace:
[ 374.769038] __schedule+0x2ef/0x620
[ 374.772939] schedule+0x38/0xa0
[ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100
[ 374.781614] ? remove_wait_queue+0x60/0x60
[ 374.786192] blk_mq_freeze_queue+0x1a/0x20
[ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core]
[ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core]
[ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core]
[ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core]
[ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core]
[ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core]
[ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core]
[ 374.833114] do_vfs_ioctl+0xa7/0x600
[ 374.837117] ? __audit_syscall_entry+0xdd/0x130
[ 374.842184] ksys_ioctl+0x67/0x90
[ 374.845891] __x64_sys_ioctl+0x1a/0x20
[ 374.850082] do_syscall_64+0x5a/0x1c0
[ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
--

Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
zhenyw pushed a commit that referenced this issue Dec 5, 2019
Thomas has noticed the following NULL ptr dereference when using cgroup
v1 kmem limit:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
RIP: 0010:create_empty_buffers+0x24/0x100
Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
Call Trace:
 create_page_buffers+0x4d/0x60
 __block_write_begin_int+0x8e/0x5a0
 ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
 ? jbd2__journal_start+0xd7/0x1f0
 ext4_da_write_begin+0x112/0x3d0
 generic_perform_write+0xf1/0x1b0
 ? file_update_time+0x70/0x140
 __generic_file_write_iter+0x141/0x1a0
 ext4_file_write_iter+0xef/0x3b0
 __vfs_write+0x17e/0x1e0
 vfs_write+0xa5/0x1a0
 ksys_write+0x57/0xd0
 do_syscall_64+0x55/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
fails __GFP_NOFAIL charge when the kmem limit is reached.  This is a wrong
behavior because nofail allocations are not allowed to fail.  Normal
charge path simply forces the charge even if that means to cross the
limit.  Kmem accounting should be doing the same.

Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
@TerrenceXu
Copy link

Close it and focus on the latest centos.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants