Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report long delays in writing MMP blocks #7212

Closed
wants to merge 2 commits into from

Conversation

ofaaland
Copy link
Contributor

@ofaaland ofaaland commented Feb 22, 2018

Description

Check time taken to acquire SCL_STATE in mmp_write_uberblock() and if that time is greater than 1/10 of MMP_MIN_INTERVAL report it viz zfs_dbgmsg().

Add an entry to multihost_history when no leaf can be found to write an MMP block to, and record the reason in the error column. The error code for such entries is a bitfield, displayed in hex:

0x1 At least one vdev (interior or leaf) was not writeable.
0x2 At least one writeable leaf vdev was found, but it had a pending MMP write.

timestamp = the time in seconds since the epoch when no leaf could be found originally.

duration = the time (in ns) during which no MMP block was written for this reason. This does not include the preceeding inter-write period nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when this occurred.

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

Sample output, truncated to fit:

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Motivation and Context

To assist troubleshooting #7118 and similar problems.

How Has This Been Tested?

I ran ztest locally. I also loaded the modules, imported a pool, injected delays via zinject, and extracted multihost_history. I confirmed that the error value reported and the time spans reported were consistent with the injected delays and with the timestamps and durations of the successful MMP writes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@ofaaland ofaaland added the Status: Work in Progress Not yet ready for general review label Feb 22, 2018
@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch 2 times, most recently from b8cf5bc to 9c01b92 Compare February 27, 2018 21:36
@ofaaland ofaaland changed the title WIP Detect long SCL_STATE acquisition time in mmp_write_uberblock Report long delays in writing MMP blocks Feb 27, 2018
@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch from 9c01b92 to 5a63955 Compare February 27, 2018 21:50
@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch 2 times, most recently from 8b57257 to 94bb49e Compare February 27, 2018 22:21
* MMP_TREE_RO At least one leaf was found, but none were writeable.
*/
typedef enum mmp_error {
MMP_WRITE_TREE_RO = ENETDOWN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tethering these to EHWPOISON + 1 might be a better choice since then they won't overlap with any legitimate existing errno for the architecture. Mind you this value will vary between architectures but that's also true for ENETDOWN. Alternately, you could start they at say 500 which is above most values I'm aware of.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to returning bitmap instead as discussed.

module/zfs/mmp.c Outdated

*error = 0;
if (out_vd == NULL) {
if (fail_mask && MMP_FAIL_WRITE_PENDING)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be & not &&.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

block of code removed entirely

module/zfs/mmp.c Outdated
if (out_vd == NULL) {
if (fail_mask && MMP_FAIL_WRITE_PENDING)
*error = MMP_WRITE_PENDING;
else if (fail_mask && MMP_FAIL_WRITE_PENDING)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be & not &&.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

block of code removed entirely

module/zfs/mmp.c Outdated
*/

static vdev_t *
mmp_random_leaf(vdev_t *in_vd, int *error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels wrong, since we're already changing this prototype let's make it a bit more conventional.

static int mmp_random_leaf(vdev_t *in_vd, vdev_t **out_vd);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

module/zfs/mmp.c Outdated
vd = mmp_random_leaf(spa->spa_root_vdev, &error);

mutex_enter(&mmp->mmp_io_lock);

if (vd == NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a brief comment before this block describing the skipped history entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

(u_longlong_t)smh->mmp_kstat_id, (u_longlong_t)smh->txg,
(u_longlong_t)smh->timestamp, (longlong_t)smh->io_error,
(longlong_t)smh->duration, (u_longlong_t)smh->mmp_delay,
(u_longlong_t)smh->vdev_guid, (u_longlong_t)smh->vdev_label,
(smh->vdev_path ? smh->vdev_path : "-"));
(smh->vdev_path ? smh->vdev_path : "-"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] accidentally added extra white space before smh->vdev_path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* Set duration in existing "skip" record to how long we have waited for a leaf
* vdev to become available.
*
* Important that we start search at the end of the list where new records are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/at the end of the list/at the head of the list/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -730,6 +730,7 @@ typedef struct spa_mmp_history {
char *vdev_path;
uint64_t vdev_label; /* vdev label */
int io_error; /* error status of MMP write */
uint64_t error_start; /* hrtime of start of error period */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be hrtime_t.

@behlendorf
Copy link
Contributor

It looks like ztest caught an issue here too, see the CentOS 7 ztest log.

@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch 3 times, most recently from 43089bc to 21ce6e8 Compare February 28, 2018 21:14
@ofaaland
Copy link
Contributor Author

Re: the ztest failure, I am not taking any form of config lock while walking the tree to count leaves. Looks like SCL_VDEV is for exactly that purpose, but I propose I do that in a separate patch with any other locking changes that I find are required as I continue to investigate MMP. OK?

@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch from 21ce6e8 to 8dd07e8 Compare February 28, 2018 23:43
@ofaaland
Copy link
Contributor Author

ofaaland commented Mar 1, 2018

I'll add a test to trigger the "skipped MMP write" entries in multihost_history, which should address the code coverage. Consider me duly chastised.

module/zfs/mmp.c Outdated
* issued mmp write: records time issued, error status, etc.
* skipped mmp write: an mmp write could not be issued because no
* suitable leaf vdev was available. See comment
* above struct spa_mmp_history for details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] formating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch from 8dd07e8 to 46def72 Compare March 3, 2018 00:31
@ofaaland
Copy link
Contributor Author

ofaaland commented Mar 3, 2018

Refreshed the patch with changes to address feedback.

@ofaaland ofaaland removed the Status: Work in Progress Not yet ready for general review label Mar 3, 2018
MMP_WRITE_PENDING,
MMP_WRITE_OTHER
} mmp_error_t;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deadcode, please remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

module/zfs/mmp.c Outdated
static int
mmp_random_leaf(vdev_t *in_vd, vdev_t **out_vd)
{
int error = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe int error_mask would be clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch from 46def72 to 99d6842 Compare March 3, 2018 07:57
ofaaland added 2 commits March 5, 2018 17:11
If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
@ofaaland ofaaland force-pushed the b_mmp_write_delayed branch from 99d6842 to cf117a6 Compare March 6, 2018 01:12
@ofaaland
Copy link
Contributor Author

ofaaland commented Mar 6, 2018

Rebased on master

@codecov
Copy link

codecov bot commented Mar 6, 2018

Codecov Report

Merging #7212 into master will increase coverage by 0.17%.
The diff coverage is 96.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7212      +/-   ##
==========================================
+ Coverage   76.35%   76.53%   +0.17%     
==========================================
  Files         327      327              
  Lines      103874   103922      +48     
==========================================
+ Hits        79314    79536     +222     
+ Misses      24560    24386     -174
Flag Coverage Δ
#kernel 76.15% <96.61%> (-0.03%) ⬇️
#user 65.84% <61.9%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2705ebf...cf117a6. Read the comment docs.

@behlendorf behlendorf closed this in 14c240c Mar 6, 2018
behlendorf pushed a commit that referenced this pull request Mar 6, 2018
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Mar 12, 2018
If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes openzfs#7212
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Mar 13, 2018
This is a squashed patchset for zfs-0.7.7.  The individual commits are
in the tonyhutter:zfs-0.7.7-hutter branch.  I squashed the commits so
that buildbot wouldn't have to run against each one, and because
github/builbot seem to have a maximum limit of 30 commits they can
test from a PR.

- Fix MMP write frequency for large pools openzfs#7205 openzfs#7289
- Handle zio_resume and mmp => off openzfs#7286
- Fix zfs-kmod builds when using rpm >= 4.14 openzfs#7284
- zdb and inuse tests don't pass with real disks openzfs#6939 openzfs#7261
- Take user namespaces into account in policy checks openzfs#6800 openzfs#7270
- Detect long config lock acquisition in mmp openzfs#7212
- Linux 4.16 compat: get_disk_and_module() openzfs#7264
- Change checksum & IO delay ratelimit values openzfs#7252
- Increment zil_itx_needcopy_bytes properly openzfs#6988 openzfs#7176
- Fix some typos openzfs#7237
- Fix zpool(8) list example to match actual format openzfs#7244
- Add SMART self-test results to zpool status -c openzfs#7178
- Add scrub after resilver zed script openzfs#4662 openzfs#7086
- Fix free memory calculation on v3.14+ openzfs#7170
- Report duration and error in mmp_history entries openzfs#7190
- Do not initiate MMP writes while pool is suspended openzfs#7182
- Linux 4.16 compat: use correct *_dec_and_test()
- Allow modprobe to fail when called within systemd openzfs#7174
- Add SMART attributes for SSD and NVMe openzfs#7183 openzfs#7193
- Correct count_uberblocks in mmp.kshlib openzfs#7191
- Fix config issues: frame size and headers openzfs#7169
- Clarify zinject(8) explanation of -e openzfs#7172
- OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio openzfs#7168
- 'zfs receive' fails with "dataset is busy" openzfs#7129 openzfs#7154
- contrib/initramfs: add missing conf.d/zfs openzfs#7158
- mmp should use a fixed tag for spa_config locks openzfs#6530 openzfs#7155
- Handle zap_add() failures in mixed case mode openzfs#7011 openzfs#7054
- Fix zdb -ed on objset for exported pool openzfs#7099 openzfs#6464
- Fix zdb -E segfault openzfs#7099
- Fix zdb -R decompression openzfs#7099 openzfs#4984
- Fix racy assignment of zcb.zcb_haderrors openzfs#7099
- Fix zle_decompress out of bound access openzfs#7099
- Fix zdb -c traverse stop on damaged objset root openzfs#7099
- Linux 4.11 compat: avoid refcount_t name conflict openzfs#7148
- Linux 4.16 compat: inode_set_iversion() openzfs#7148
- OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable openzfs#7141
- Remove deprecated zfs_arc_p_aggressive_disable openzfs#7135
- Fix default libdir for Debian/Ubuntu openzfs#7083 openzfs#7101
- Bug fix in qat_compress.c for vmalloc addr check openzfs#7125
- Fix systemd_ RPM macros usage on Debian-based distributions openzfs#7074 openzfs#7100
- Emit an error message before MMP suspends pool openzfs#7048
- ZTS: Fix create-o_ashift test case openzfs#6924 openzfs#6977
- Fix --with-systemd on Debian-based distributions (openzfs#6963) openzfs#6591 openzfs#6963
- Remove vn_rename and vn_remove dependency openzfs/spl#648 openzfs#6753
- Add support for "--enable-code-coverage" option openzfs#6670
- Make "-fno-inline" compile option more accessible openzfs#6605
- Add configure option to enable gcov analysis openzfs#6642
- Implement --enable-debuginfo to force debuginfo openzfs#2734
- Make --enable-debug fail when given bogus args openzfs#2734

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Requires-spl: refs/pull/690/head
@tonyhutter tonyhutter mentioned this pull request Mar 13, 2018
13 tasks
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Mar 13, 2018
This is a squashed patchset for zfs-0.7.7.  The individual commits are
in the tonyhutter:zfs-0.7.7-hutter branch.  I squashed the commits so
that buildbot wouldn't have to run against each one, and because
github/builbot seem to have a maximum limit of 30 commits they can
test from a PR.

- Fix MMP write frequency for large pools openzfs#7205 openzfs#7289
- Handle zio_resume and mmp => off openzfs#7286
- Fix zfs-kmod builds when using rpm >= 4.14 openzfs#7284
- zdb and inuse tests don't pass with real disks openzfs#6939 openzfs#7261
- Take user namespaces into account in policy checks openzfs#6800 openzfs#7270
- Detect long config lock acquisition in mmp openzfs#7212
- Linux 4.16 compat: get_disk_and_module() openzfs#7264
- Change checksum & IO delay ratelimit values openzfs#7252
- Increment zil_itx_needcopy_bytes properly openzfs#6988 openzfs#7176
- Fix some typos openzfs#7237
- Fix zpool(8) list example to match actual format openzfs#7244
- Add SMART self-test results to zpool status -c openzfs#7178
- Add scrub after resilver zed script openzfs#4662 openzfs#7086
- Fix free memory calculation on v3.14+ openzfs#7170
- Report duration and error in mmp_history entries openzfs#7190
- Do not initiate MMP writes while pool is suspended openzfs#7182
- Linux 4.16 compat: use correct *_dec_and_test()
- Allow modprobe to fail when called within systemd openzfs#7174
- Add SMART attributes for SSD and NVMe openzfs#7183 openzfs#7193
- Correct count_uberblocks in mmp.kshlib openzfs#7191
- Fix config issues: frame size and headers openzfs#7169
- Clarify zinject(8) explanation of -e openzfs#7172
- OpenZFS 8857 - zio_remove_child() panic due to already destroyed
  parent zio openzfs#7168
- 'zfs receive' fails with "dataset is busy" openzfs#7129 openzfs#7154
- contrib/initramfs: add missing conf.d/zfs openzfs#7158
- mmp should use a fixed tag for spa_config locks openzfs#6530 openzfs#7155
- Handle zap_add() failures in mixed case mode openzfs#7011 openzfs#7054
- Fix zdb -ed on objset for exported pool openzfs#7099 openzfs#6464
- Fix zdb -E segfault openzfs#7099
- Fix zdb -R decompression openzfs#7099 openzfs#4984
- Fix racy assignment of zcb.zcb_haderrors openzfs#7099
- Fix zle_decompress out of bound access openzfs#7099
- Fix zdb -c traverse stop on damaged objset root openzfs#7099
- Linux 4.11 compat: avoid refcount_t name conflict openzfs#7148
- Linux 4.16 compat: inode_set_iversion() openzfs#7148
- OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common
  contains a use after end of the lifetime of a local variable openzfs#7141
- Remove deprecated zfs_arc_p_aggressive_disable openzfs#7135
- Fix default libdir for Debian/Ubuntu openzfs#7083 openzfs#7101
- Bug fix in qat_compress.c for vmalloc addr check openzfs#7125
- Fix systemd_ RPM macros usage on Debian-based distributions openzfs#7074
  openzfs#7100
- Emit an error message before MMP suspends pool openzfs#7048
- ZTS: Fix create-o_ashift test case openzfs#6924 openzfs#6977
- Fix --with-systemd on Debian-based distributions (openzfs#6963) openzfs#6591 openzfs#6963
- Remove vn_rename and vn_remove dependency openzfs/spl#648 openzfs#6753
- Add support for "--enable-code-coverage" option openzfs#6670
- Make "-fno-inline" compile option more accessible openzfs#6605
- Add configure option to enable gcov analysis openzfs#6642
- Implement --enable-debuginfo to force debuginfo openzfs#2734
- Make --enable-debug fail when given bogus args openzfs#2734

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Requires-spl: refs/pull/690/head
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Mar 13, 2018
If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes openzfs#7212
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Mar 13, 2018
This is a squashed patchset for zfs-0.7.7.  The individual commits are
in the tonyhutter:zfs-0.7.7-hutter branch.  I squashed the commits so
that buildbot wouldn't have to run against each one, and because
github/builbot seem to have a maximum limit of 30 commits they can
test from a PR.

- Fix MMP write frequency for large pools openzfs#7205 openzfs#7289
- Handle zio_resume and mmp => off openzfs#7286
- Fix zfs-kmod builds when using rpm >= 4.14 openzfs#7284
- zdb and inuse tests don't pass with real disks openzfs#6939 openzfs#7261
- Take user namespaces into account in policy checks openzfs#6800 openzfs#7270
- Detect long config lock acquisition in mmp openzfs#7212
- Linux 4.16 compat: get_disk_and_module() openzfs#7264
- Change checksum & IO delay ratelimit values openzfs#7252
- Increment zil_itx_needcopy_bytes properly openzfs#6988 openzfs#7176
- Fix some typos openzfs#7237
- Fix zpool(8) list example to match actual format openzfs#7244
- Add SMART self-test results to zpool status -c openzfs#7178
- Add scrub after resilver zed script openzfs#4662 openzfs#7086
- Fix free memory calculation on v3.14+ openzfs#7170
- Report duration and error in mmp_history entries openzfs#7190
- Do not initiate MMP writes while pool is suspended openzfs#7182
- Linux 4.16 compat: use correct *_dec_and_test()
- Allow modprobe to fail when called within systemd openzfs#7174
- Add SMART attributes for SSD and NVMe openzfs#7183 openzfs#7193
- Correct count_uberblocks in mmp.kshlib openzfs#7191
- Fix config issues: frame size and headers openzfs#7169
- Clarify zinject(8) explanation of -e openzfs#7172
- OpenZFS 8857 - zio_remove_child() panic due to already destroyed
  parent zio openzfs#7168
- 'zfs receive' fails with "dataset is busy" openzfs#7129 openzfs#7154
- contrib/initramfs: add missing conf.d/zfs openzfs#7158
- mmp should use a fixed tag for spa_config locks openzfs#6530 openzfs#7155
- Handle zap_add() failures in mixed case mode openzfs#7011 openzfs#7054
- Fix zdb -ed on objset for exported pool openzfs#7099 openzfs#6464
- Fix zdb -E segfault openzfs#7099
- Fix zdb -R decompression openzfs#7099 openzfs#4984
- Fix racy assignment of zcb.zcb_haderrors openzfs#7099
- Fix zle_decompress out of bound access openzfs#7099
- Fix zdb -c traverse stop on damaged objset root openzfs#7099
- Linux 4.11 compat: avoid refcount_t name conflict openzfs#7148
- Linux 4.16 compat: inode_set_iversion() openzfs#7148
- OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common
  contains a use after end of the lifetime of a local variable openzfs#7141
- Remove deprecated zfs_arc_p_aggressive_disable openzfs#7135
- Fix default libdir for Debian/Ubuntu openzfs#7083 openzfs#7101
- Bug fix in qat_compress.c for vmalloc addr check openzfs#7125
- Fix systemd_ RPM macros usage on Debian-based distributions openzfs#7074
  openzfs#7100
- Emit an error message before MMP suspends pool openzfs#7048
- ZTS: Fix create-o_ashift test case openzfs#6924 openzfs#6977
- Fix --with-systemd on Debian-based distributions (openzfs#6963) openzfs#6591 openzfs#6963
- Remove vn_rename and vn_remove dependency openzfs/spl#648 openzfs#6753
- Fix "--enable-code-coverage" debug build openzfs#6674
- Update codecov.yml openzfs#6669
- Add support for "--enable-code-coverage" option openzfs#6670
- Make "-fno-inline" compile option more accessible openzfs#6605
- Add configure option to enable gcov analysis openzfs#6642
- Implement --enable-debuginfo to force debuginfo openzfs#2734
- Make --enable-debug fail when given bogus args openzfs#2734

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Requires-spl: refs/pull/690/head
tonyhutter pushed a commit that referenced this pull request Mar 19, 2018
If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
@ofaaland ofaaland deleted the b_mmp_write_delayed branch April 2, 2018 17:32
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 16, 2018
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes openzfs#7212
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request May 4, 2018
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes openzfs#7212
tonyhutter pushed a commit that referenced this pull request May 10, 2018
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants