Call commit callbacks from the tail of the list #6986

Merged
merged 1 commit into openzfs:master on Dec 22, 2017

Conversation

lidongyang
Contributor

@lidongyang lidongyang commented Dec 20, 2017

Our ZFS-backed Lustre MDT hit soft lockups under heavy metadata
workload:
[ 3597.867291] NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [tx_commit_cb:67888]
[ 3597.867329] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dm_mirror dm_region_hash dm_log sb_edac zfs(POE) edac_core intel_powerclamp coretemp zunicode(POE) zavl(POE) intel_rapl icp(POE) iosf_mbi kvm_intel zcommon(POE) znvpair(POE) spl(OE) kvm iTCO_wdt iTCO_vendor_support irqbypass dm_round_robin crc32_pclmul ghash_clmulni_intel sg ipmi_si hpilo hpwdt aesni_intel ipmi_devintf lrw gf128mul glue_helper ipmi_msghandler ioatdma ablk_helper pcspkr i2c_i801 wmi cryptd nfsd dm_multipath lpc_ich shpchp dca acpi_cpufreq acpi_power_meter dm_mod
[ 3597.867344] auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) ip_tables xfs libcrc32c mlx5_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm tg3 mlx5_core(OE) devlink mlx_compat(OE) crct10dif_pclmul ptp crct10dif_common serio_raw crc32c_intel i2c_core pps_core hpsa(OE) scsi_transport_sas
[ 3597.867346] CPU: 20 PID: 67888 Comm: tx_commit_cb Tainted: P OE ------------ 3.10.0-693.2.2.el7.x86_64 #1
[ 3597.867348] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
[ 3597.867349] task: ffff889c3d6f1fa0 ti: ffff8898b7dec000 task.ti: ffff8898b7dec000
[ 3597.867357] RIP: 0010:[] [] native_queued_spin_lock_slowpath+0x116/0x1e0
[ 3597.867358] RSP: 0018:ffff8898b7defcd0 EFLAGS: 00000246
[ 3597.867358] RAX: 0000000000000000 RBX: 000000010001a588 RCX: 0000000000a10000
[ 3597.867359] RDX: ffff88befed97880 RSI: 0000000000c10001 RDI: ffff885e58a78138
[ 3597.867359] RBP: ffff8898b7defcd0 R08: ffff88befec97880 R09: 0000000000000000
[ 3597.867360] R10: 00000000ea2b3a01 R11: ffffea017ba8acc0 R12: ffff885efbf32f40
[ 3597.867361] R13: ffff88bebc38a48d R14: 0000000000016d39 R15: ffffffff811de591
[ 3597.867361] FS: 0000000000000000(0000) GS:ffff88befec80000(0000) knlGS:0000000000000000
[ 3597.867362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3597.867362] CR2: 00007fda2a9a2090 CR3: 0000005eef36d000 CR4: 00000000003407e0
[ 3597.867363] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3597.867363] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3597.867364] Stack:
[ 3597.867365] ffff8898b7defce0 ffffffff8169e61f ffff8898b7defcf0 ffffffff816abb70
[ 3597.867366] ffff8898b7defd68 ffffffffc136314e ffff885e58a78138 ffff885c6376a650
[ 3597.867367] ffff8898b7defd10 ffff8898b7defd10 0000000000000000 0000000000000000
[ 3597.867367] Call Trace:
[ 3597.867375] [] queued_spin_lock_slowpath+0xb/0xf
[ 3597.867378] [] _raw_spin_lock+0x20/0x30
[ 3597.867445] [] ptlrpc_commit_replies+0x7e/0x380 [ptlrpc]
[ 3597.867481] [] tgt_cb_last_committed+0x2c2/0x3d0 [ptlrpc]
[ 3597.867489] [] osd_trans_commit_cb+0x14b/0x490 [osd_zfs]
[ 3597.867529] [] dmu_tx_do_callbacks+0x44/0x70 [zfs]
[ 3597.867554] [] txg_do_callbacks+0x14/0x30 [zfs]
[ 3597.867561] [] taskq_thread+0x246/0x470 [spl]
[ 3597.867564] [] ? wake_up_state+0x20/0x20
[ 3597.867568] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 3597.867571] [] kthread+0xcf/0xe0
[ 3597.867573] [] ? insert_kthread_work+0x40/0x40
[ 3597.867576] [] ret_from_fork+0x58/0x90
[ 3597.867577] [] ? insert_kthread_work+0x40/0x40
[ 3597.867588] Code: 0d 48 98 83 e2 30 48 81 c2 80 78 01 00 48 03 14 c5 e0 fd b0 81 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40 08 <85> c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2

This patch makes ZFS run commit callbacks with the highest transaction
number first, saving the subsequent calls to ptlrpc_commit_replies()
in Lustre, which makes the problem go away.
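
For illustration, here is a minimal C sketch of the idea, using a hypothetical doubly linked list and callback type rather than ZFS's actual list_t/dmu_tx_callback_t structures; it shows only the ordering the patch introduces, not the patch itself:

/*
 * Sketch only: callbacks registered by later (higher) transactions sit
 * at the tail of the list, so walking from the tail runs the highest
 * transaction number first.  Lustre can then take its fast path and
 * skip ptlrpc_commit_replies() for the older callbacks that follow.
 */
#include <stdlib.h>

struct commit_cb {
	struct commit_cb *prev;
	struct commit_cb *next;
	void (*func)(void *arg, int error);
	void *arg;
};

struct cb_list {
	struct commit_cb *head;		/* oldest registration */
	struct commit_cb *tail;		/* newest registration */
};

static void
run_commit_callbacks(struct cb_list *l, int error)
{
	struct commit_cb *cb;

	/* Walk from the tail: newest (highest transaction) callbacks first. */
	while ((cb = l->tail) != NULL) {
		l->tail = cb->prev;
		if (l->tail != NULL)
			l->tail->next = NULL;
		else
			l->head = NULL;
		cb->func(cb->arg, error);
		free(cb);	/* in this sketch the list owns the entry */
	}
}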

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>

Contributor

@behlendorf behlendorf left a comment

Thanks for opening this PR and linking to the upstream Lustre issue. We should be able to make this change since this interface is currently only used by Lustre and we never made any promises about the calling order.

@lidongyang @bzzz77 am I correct in assuming the primary benefit of this change is reduced lock contention in Lustre, thanks to the new fast path in tgt_cb_last_committed()? Can you share any performance numbers? Does this change improve the overall metadata rates?
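
For context, here is a rough C sketch of the kind of fast path being referred to, with hypothetical names; Lustre's actual tgt_cb_last_committed() is more involved (locking, per-export state). The point is only the early return, which running callbacks newest-first turns into the common case:

#include <stdint.h>
#include <stdbool.h>

/* Conceptual sketch with hypothetical names, not Lustre's real code. */
struct target_state {
	uint64_t last_committed;	/* highest committed transno seen */
};

static bool
commit_cb_fast_path(struct target_state *ts, uint64_t transno)
{
	/*
	 * An equal or higher transno was already committed, so there is
	 * nothing new to tell clients: skip the expensive walk over the
	 * reply list (ptlrpc_commit_replies()) entirely.
	 */
	if (transno <= ts->last_committed)
		return (true);

	/* Slow path: record the new high-water mark and notify clients. */
	ts->last_committed = transno;
	return (false);
}

The real code also has to synchronize access to last_committed; that is omitted here.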

I also have a few small review comments which need to be addressed.

  • There's a nice comment in include/sys/dmu.h above the dmu_tx_callback_register prototype which describes the callback interface. Please add a few lines describing the order in which the callbacks are now run (one possible wording is sketched after this list).

  • Massage the commit message to pass the style checker. You can run make checkstyle locally until it passes.

  • I happened to notice the dmu_tx_callback_register() prototype is declared redundantly in include/sys/dmu_tx.h. Please remove it from include/sys/dmu_tx.h. While you're there, also move the dmu_tx_do_callbacks prototype from include/sys/dmu_tx.h to include/sys/dmu.h so these are all in one place and only one place.
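
On the first point above, one possible wording for the added note in the include/sys/dmu.h comment block, purely illustrative and not necessarily the text that landed:

/*
 * Note: commit callbacks are invoked starting from the tail of the
 * registration list, i.e. in reverse registration order, so callbacks
 * added for later (higher) transactions run first.  Callers must not
 * depend on any other ordering guarantee.
 */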

Our ZFS-backed Lustre MDT had soft lockups under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is that ZFS was not taking advantage of the fast path in
Lustre's transaction callback handling, where Lustre skips the calls
to ptlrpc_commit_replies() when it has already seen a higher transaction
number.

This patch corrects that; it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
@lidongyang
Contributor Author

lidongyang commented Dec 21, 2017

@behlendorf yep, you are right about tgt_cb_last_committed(), and it did improve the performance numbers. We got them using mdtest:
Before the patch:

-- started at 12/15/2017 18:23:34 --

mdtest-1.9.4-rc1 was launched with 120 total task(s) on 30 node(s)
Command line used: /system/Benchmarks/Filesystem/mdtest/mdtest_ompi211 -u -d /g/data1b/fu2/jas900/mdtest/mdt0/25@/g/data1b/fu2/jas900/mdtest/mdt1/25@/g/data1b/fu2/jas900/mdtest/mdt2/25@/g/data1b/fu2/jas900/mdtest/mdt3/25 -n 32768 -i 2
Path: /g/data1b/fu2/jas900/mdtest/mdt0
FS: 11714.4 TiB   Used FS: 0.1%   Inodes: 191.8 Mi   Used Inodes: 24.9%

120 tasks, 3932160 files/directories

SUMMARY: (of 2 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      46209.655      45325.344      45767.499        442.155
   Directory stat    :     422943.103     395433.941     409188.522      13754.581
   Directory removal :     109555.447     107923.481     108739.464        815.983
   File creation     :      84488.802      82024.716      83256.759       1232.043
   File stat         :     290506.847     260661.397     275584.122      14922.725
   File read         :     308452.944     306049.686     307251.315       1201.629
   File removal      :     125729.192     122936.462     124332.827       1396.365
   Tree creation     :        280.482        262.247        271.365          9.118
   Tree removal      :         56.103         28.919         42.511         13.592

-- finished at 12/15/2017 18:31:30 --

Note that the MDT would get soft lockups and eventually crash.

After the patch:

-- started at 12/16/2017 18:22:51 --

mdtest-1.9.4-rc1 was launched with 120 total task(s) on 30 node(s)
Command line used: /system/Benchmarks/Filesystem/mdtest/mdtest_ompi211 -u -d /g/data1b/fu2/jas900/mdtest/mdt0/26@/g/data1b/fu2/jas900/mdtest/mdt1/26@/g/data1b/fu2/jas900/mdtest/mdt2/26@/g/data1b/fu2/jas900/mdtest/mdt3/26 -n 32768 -i 2
Path: /g/data1b/fu2/jas900/mdtest/mdt0
FS: 11714.4 TiB   Used FS: 0.1%   Inodes: 191.6 Mi   Used Inodes: 25.6%

120 tasks, 3932160 files/directories

SUMMARY: (of 2 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      54567.094      53627.233      54097.164        469.930
   Directory stat    :     430300.347     389450.366     409875.357      20424.991
   Directory removal :     155325.535     155181.753     155253.644         71.891
   File creation     :     129396.860     125304.163     127350.511       2046.348
   File stat         :     305263.708     301667.676     303465.692       1798.016
   File read         :     305201.679     303282.957     304242.318        959.361
   File removal      :     199668.640     198088.652     198878.646        789.994
   Tree creation     :        677.521        371.513        524.517        153.004
   Tree removal      :         51.097         31.695         41.396          9.701

-- finished at 12/16/2017 18:28:59 --

For the operations which generate a transaction and a new last_committed, like creation and removal, we can see an improvement. For others like stat and read, nothing really changed. It makes sense, since the patch only changes the transaction callbacks.

Thanks
Dongyang

Contributor

@behlendorf behlendorf left a comment

Looks good, thanks for the quick turn around and performance results.

@behlendorf behlendorf requested a review from ofaaland December 21, 2017 20:44
@codecov

codecov bot commented Dec 22, 2017

Codecov Report

Merging #6986 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #6986      +/-   ##
==========================================
+ Coverage   75.33%   75.36%   +0.02%     
==========================================
  Files         296      296              
  Lines       95454    95454              
==========================================
+ Hits        71915    71935      +20     
+ Misses      23539    23519      -20
Flag       Coverage Δ
#kernel    74.59% <0%> (+0.01%) ⬆️
#user      67.67% <100%> (+0.07%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@ofaaland ofaaland left a comment

LGTM

@behlendorf behlendorf merged commit 823d48b into openzfs:master Dec 22, 2017
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Dec 31, 2017
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 16, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 18, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 18, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 19, 2018
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Jan 29, 2018
tonyhutter pushed a commit that referenced this pull request Feb 6, 2018
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Feb 13, 2018