Call commit callbacks from the tail of the list #6986

Merged
merged 1 commit into openzfs:master on Dec 22, 2017

Conversation

lidongyang
Contributor

@lidongyang lidongyang commented Dec 20, 2017

Our ZFS-backed Lustre MDT hit soft lockups under heavy metadata
workload:
[ 3597.867291] NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [tx_commit_cb:67888]
[ 3597.867329] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dm_mirror dm_region_hash dm_log sb_edac zfs(POE) edac_core intel_powerclamp coretemp zunicode(POE) zavl(POE) intel_rapl icp(POE) iosf_mbi kvm_intel zcommon(POE) znvpair(POE) spl(OE) kvm iTCO_wdt iTCO_vendor_support irqbypass dm_round_robin crc32_pclmul ghash_clmulni_intel sg ipmi_si hpilo hpwdt aesni_intel ipmi_devintf lrw gf128mul glue_helper ipmi_msghandler ioatdma ablk_helper pcspkr i2c_i801 wmi cryptd nfsd dm_multipath lpc_ich shpchp dca acpi_cpufreq acpi_power_meter dm_mod
[ 3597.867344] auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) ip_tables xfs libcrc32c mlx5_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm tg3 mlx5_core(OE) devlink mlx_compat(OE) crct10dif_pclmul ptp crct10dif_common serio_raw crc32c_intel i2c_core pps_core hpsa(OE) scsi_transport_sas
[ 3597.867346] CPU: 20 PID: 67888 Comm: tx_commit_cb Tainted: P OE ------------ 3.10.0-693.2.2.el7.x86_64 #1
[ 3597.867348] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
[ 3597.867349] task: ffff889c3d6f1fa0 ti: ffff8898b7dec000 task.ti: ffff8898b7dec000
[ 3597.867357] RIP: 0010:[] [] native_queued_spin_lock_slowpath+0x116/0x1e0
[ 3597.867358] RSP: 0018:ffff8898b7defcd0 EFLAGS: 00000246
[ 3597.867358] RAX: 0000000000000000 RBX: 000000010001a588 RCX: 0000000000a10000
[ 3597.867359] RDX: ffff88befed97880 RSI: 0000000000c10001 RDI: ffff885e58a78138
[ 3597.867359] RBP: ffff8898b7defcd0 R08: ffff88befec97880 R09: 0000000000000000
[ 3597.867360] R10: 00000000ea2b3a01 R11: ffffea017ba8acc0 R12: ffff885efbf32f40
[ 3597.867361] R13: ffff88bebc38a48d R14: 0000000000016d39 R15: ffffffff811de591
[ 3597.867361] FS: 0000000000000000(0000) GS:ffff88befec80000(0000) knlGS:0000000000000000
[ 3597.867362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3597.867362] CR2: 00007fda2a9a2090 CR3: 0000005eef36d000 CR4: 00000000003407e0
[ 3597.867363] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3597.867363] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3597.867364] Stack:
[ 3597.867365] ffff8898b7defce0 ffffffff8169e61f ffff8898b7defcf0 ffffffff816abb70
[ 3597.867366] ffff8898b7defd68 ffffffffc136314e ffff885e58a78138 ffff885c6376a650
[ 3597.867367] ffff8898b7defd10 ffff8898b7defd10 0000000000000000 0000000000000000
[ 3597.867367] Call Trace:
[ 3597.867375] [] queued_spin_lock_slowpath+0xb/0xf
[ 3597.867378] [] _raw_spin_lock+0x20/0x30
[ 3597.867445] [] ptlrpc_commit_replies+0x7e/0x380 [ptlrpc]
[ 3597.867481] [] tgt_cb_last_committed+0x2c2/0x3d0 [ptlrpc]
[ 3597.867489] [] osd_trans_commit_cb+0x14b/0x490 [osd_zfs]
[ 3597.867529] [] dmu_tx_do_callbacks+0x44/0x70 [zfs]
[ 3597.867554] [] txg_do_callbacks+0x14/0x30 [zfs]
[ 3597.867561] [] taskq_thread+0x246/0x470 [spl]
[ 3597.867564] [] ? wake_up_state+0x20/0x20
[ 3597.867568] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 3597.867571] [] kthread+0xcf/0xe0
[ 3597.867573] [] ? insert_kthread_work+0x40/0x40
[ 3597.867576] [] ret_from_fork+0x58/0x90
[ 3597.867577] [] ? insert_kthread_work+0x40/0x40
[ 3597.867588] Code: 0d 48 98 83 e2 30 48 81 c2 80 78 01 00 48 03 14 c5 e0 fd b0 81 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40 08 <85> c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2

This patch makes ZFS run commit callbacks with the highest transaction
number first, saving the subsequent calls to ptlrpc_commit_replies()
in Lustre, which makes the problem go away.
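
For illustration, here is a minimal C sketch of the idea, using a hypothetical doubly linked list and callback type rather than ZFS's actual list_t/dmu_tx_callback_t structures; it shows only the ordering the patch introduces, not the patch itself:

/*
 * Sketch only: callbacks registered by later (higher) transactions sit
 * at the tail of the list, so walking from the tail runs the highest
 * transaction number first.  Lustre can then take its fast path and
 * skip ptlrpc_commit_replies() for the older callbacks that follow.
 */
#include <stdlib.h>

struct commit_cb {
	struct commit_cb *prev;
	struct commit_cb *next;
	void (*func)(void *arg, int error);
	void *arg;
};

struct cb_list {
	struct commit_cb *head;		/* oldest registration */
	struct commit_cb *tail;		/* newest registration */
};

static void
run_commit_callbacks(struct cb_list *l, int error)
{
	struct commit_cb *cb;

	/* Walk from the tail: newest (highest transaction) callbacks first. */
	while ((cb = l->tail) != NULL) {
		l->tail = cb->prev;
		if (l->tail != NULL)
			l->tail->next = NULL;
		else
			l->head = NULL;
		cb->func(cb->arg, error);
		free(cb);	/* in this sketch the list owns the entry */
	}
}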

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>

Contributor

@behlendorf behlendorf left a comment

Thanks for opening this PR and linking to the upstream Lustre issue. We should be able to make this change since this interface is currently only used by Lustre and we never made any promises about the calling order.

@lidongyang @bzzz77 am I correct in assuming the primary benefit of this change is reduced lock contention in Lustre, thanks to the new fast path in tgt_cb_last_committed()? Can you share any performance numbers? Does this change improve the overall metadata rates?
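
For context, here is a rough C sketch of the kind of fast path being referred to, with hypothetical names; Lustre's actual tgt_cb_last_committed() is more involved (locking, per-export state). The point is only the early return, which running callbacks newest-first turns into the common case:

#include <stdint.h>
#include <stdbool.h>

/* Conceptual sketch with hypothetical names, not Lustre's real code. */
struct target_state {
	uint64_t last_committed;	/* highest committed transno seen */
};

static bool
commit_cb_fast_path(struct target_state *ts, uint64_t transno)
{
	/*
	 * An equal or higher transno was already committed, so there is
	 * nothing new to tell clients: skip the expensive walk over the
	 * reply list (ptlrpc_commit_replies()) entirely.
	 */
	if (transno <= ts->last_committed)
		return (true);

	/* Slow path: record the new high-water mark and notify clients. */
	ts->last_committed = transno;
	return (false);
}

The real code also has to synchronize access to last_committed; that is omitted here.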

I also have a few small review comments which need to be addressed.

  • There's a nice comment in include/sys/dmu.h above the dmu_tx_callback_register prototype which describes the callback interface. Please add a few lines describing the order in which the callbacks are now run (one possible wording is sketched after this list).

  • Massage the commit message to pass the style checker. You can run make checkstyle locally until it passes.

  • I happened to notice the dmu_tx_callback_register() prototype is declared redundantly in include/sys/dmu_tx.h. Please remove it from include/sys/dmu_tx.h. While you're there, also move the dmu_tx_do_callbacks prototype from include/sys/dmu_tx.h to include/sys/dmu.h so these are all in one place and only one place.
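
On the first point above, one possible wording for the added note in the include/sys/dmu.h comment block, purely illustrative and not necessarily the text that landed:

/*
 * Note: commit callbacks are invoked starting from the tail of the
 * registration list, i.e. in reverse registration order, so callbacks
 * added for later (higher) transactions run first.  Callers must not
 * depend on any other ordering guarantee.
 */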

Our ZFS-backed Lustre MDT had soft lockups under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is that ZFS was not taking advantage of the fast path in
Lustre's transaction callback handling, where Lustre skips the calls
to ptlrpc_commit_replies() when it has already seen a higher transaction
number.

This patch corrects that; it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
@lidongyang
Contributor Author

lidongyang commented Dec 21, 2017

@behlendorf yep, you are right about tgt_cb_last_committed(), and it did improve the performance numbers. We got them using mdtest:
Before the patch:

-- started at 12/15/2017 18:23:34 --

mdtest-1.9.4-rc1 was launched with 120 total task(s) on 30 node(s)
Command line used: /system/Benchmarks/Filesystem/mdtest/mdtest_ompi211 -u -d /g/data1b/fu2/jas900/mdtest/mdt0/25@/g/data1b/fu2/jas900/mdtest/mdt1/25@/g/data1b/fu2/jas900/mdtest/mdt2/25@/g/data1b/fu2/jas900/mdtest/mdt3/25 -n 32768 -i 2
Path: /g/data1b/fu2/jas900/mdtest/mdt0
FS: 11714.4 TiB   Used FS: 0.1%   Inodes: 191.8 Mi   Used Inodes: 24.9%

120 tasks, 3932160 files/directories

SUMMARY: (of 2 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      46209.655      45325.344      45767.499        442.155
   Directory stat    :     422943.103     395433.941     409188.522      13754.581
   Directory removal :     109555.447     107923.481     108739.464        815.983
   File creation     :      84488.802      82024.716      83256.759       1232.043
   File stat         :     290506.847     260661.397     275584.122      14922.725
   File read         :     308452.944     306049.686     307251.315       1201.629
   File removal      :     125729.192     122936.462     124332.827       1396.365
   Tree creation     :        280.482        262.247        271.365          9.118
   Tree removal      :         56.103         28.919         42.511         13.592

-- finished at 12/15/2017 18:31:30 --

Note that the MDT would get soft lockups and eventually crash.

After the patch:

-- started at 12/16/2017 18:22:51 --

mdtest-1.9.4-rc1 was launched with 120 total task(s) on 30 node(s)
Command line used: /system/Benchmarks/Filesystem/mdtest/mdtest_ompi211 -u -d /g/data1b/fu2/jas900/mdtest/mdt0/26@/g/data1b/fu2/jas900/mdtest/mdt1/26@/g/data1b/fu2/jas900/mdtest/mdt2/26@/g/data1b/fu2/jas900/mdtest/mdt3/26 -n 32768 -i 2
Path: /g/data1b/fu2/jas900/mdtest/mdt0
FS: 11714.4 TiB   Used FS: 0.1%   Inodes: 191.6 Mi   Used Inodes: 25.6%

120 tasks, 3932160 files/directories

SUMMARY: (of 2 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   Directory creation:      54567.094      53627.233      54097.164        469.930
   Directory stat    :     430300.347     389450.366     409875.357      20424.991
   Directory removal :     155325.535     155181.753     155253.644         71.891
   File creation     :     129396.860     125304.163     127350.511       2046.348
   File stat         :     305263.708     301667.676     303465.692       1798.016
   File read         :     305201.679     303282.957     304242.318        959.361
   File removal      :     199668.640     198088.652     198878.646        789.994
   Tree creation     :        677.521        371.513        524.517        153.004
   Tree removal      :         51.097         31.695         41.396          9.701

-- finished at 12/16/2017 18:28:59 --

For the operations which generate a transaction and a new last_committed, like creation and removal, we can see an improvement. For others like stat and read, nothing really changed. It makes sense, since the patch only changes the transaction callbacks.

Thanks
Dongyang

Contributor

@behlendorf behlendorf left a comment

Looks good, thanks for the quick turn around and performance results.

@behlendorf behlendorf requested a review from ofaaland December 21, 2017 20:44
@codecov

codecov bot commented Dec 22, 2017

Codecov Report

Merging #6986 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #6986      +/-   ##
==========================================
+ Coverage   75.33%   75.36%   +0.02%     
==========================================
  Files         296      296              
  Lines       95454    95454              
==========================================
+ Hits        71915    71935      +20     
+ Misses      23539    23519      -20
Flag       Coverage Δ
#kernel    74.59% <0%> (+0.01%) ⬆️
#user      67.67% <100%> (+0.07%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@ofaaland ofaaland left a comment

LGTM

@behlendorf behlendorf merged commit 823d48b into openzfs:master Dec 22, 2017
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Dec 31, 2017
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 16, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 18, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 18, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jan 19, 2018
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Jan 29, 2018
tonyhutter pushed a commit that referenced this pull request Feb 6, 2018
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Feb 13, 2018