
MMP delay logic changes #7068

Closed
jhammond-intel wants to merge 2 commits

Conversation

@jhammond-intel (Contributor) commented Jan 19, 2018

Description

RFC: MMP delay logic changes from @adilger, based on the discussion in #7045 and #7048. (I have not tested these yet, but I plan to.)

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

Andreas Dilger added 2 commits January 19, 2018 15:46
In mmp_thread(), if the mmp_delay is increasing beyond the fixed
mmp_interval, use the longer mmp_delay value when determining if
the pool should be suspended.  Otherwise, if the system is very
busy (but still getting MMP writes to disk) it may incorrectly
suspend the pool, even though a peer system would not import it.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
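
A minimal sketch of the check this commit describes, assuming names that only approximate those in module/zfs/mmp.c (mmp->mmp_delay, mmp->mmp_last_write); illustrative only, not the literal patch:

    /*
     * Use the observed inter-write delay (mmp_delay) rather than the
     * fixed interval when it is longer, so a busy-but-alive writer
     * does not trigger a spurious suspend.
     */
    hrtime_t interval = MSEC2NSEC(zfs_multihost_interval);
    hrtime_t effective = MAX(interval, mmp->mmp_delay);

    if (zfs_multihost_fail_intervals > 0 &&
        gethrtime() - mmp->mmp_last_write >
        zfs_multihost_fail_intervals * effective)
        zio_suspend(spa, NULL); /* too long without a completed MMP write */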
Base zfs_multihost_fail_intervals on zfs_multihost_import_intervals
by default, so that the number of failed writes for suspending a
pool is not larger than the number of intervals a peer will use when
checking if it is safe to import the pool.

Otherwise, with poorly chosen module parameters (fail_intervals >
import_intervals), the current importer could write MMP updates
irregularly yet always succeed within mmp_fail_intervals (and so
never suspend the pool), while a peer system imports the pool after
waiting an insufficient number of update intervals.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
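
A sketch of the defaulting rule this commit describes, under the assumption that a value of 0 (suspension disabled) would be left untouched; variable handling is illustrative, not the literal patch:

    /*
     * Cap the effective fail intervals at the number of intervals an
     * importing peer will wait, so the writer cannot be outrun.
     */
    uint_t fail_intervals = zfs_multihost_fail_intervals;

    if (fail_intervals > zfs_multihost_import_intervals)
        fail_intervals = zfs_multihost_import_intervals;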
@behlendorf requested a review from @ofaaland on January 19, 2018 22:09
@adilger (Contributor) commented Jan 19, 2018

I would consider these to be more of a "bug fix" than "new feature", but either way I thought it would be useful to get some feedback sooner rather than later...

This still needs updates to the man/man5/zfs-module-parameters.5 descriptions (I'm already working on that).

I was thinking about the issue of zfs_multihost_import_intervals vs. zfs_multihost_fail_intervals inconsistency between nodes. Is there a reason these values are module parameters rather than pool parameters? It would make sense to have them as pool parameters so that they are consistent between any nodes that are trying to import the pool. There may be a bit of a chicken-and-egg problem, but it should still be possible to read these parameters from the dataset without actually doing an import. That is fodder for a separate patch, however.

module/zfs/mmp.c (Outdated)

    @@ -392,6 +391,30 @@ mmp_thread(void *arg)
            else
                    next_time = start + MSEC2NSEC(MMP_DEFAULT_INTERVAL);

            /*
             * Don't allow fail_intervals larger than import_intervals,

Would this be better implemented in a callback, so you can reject bad values at the time the user tries to set them? One such example is:

  • 0582e40 Add callback for zfs_multihost_interval
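
For illustration, a hedged sketch of what such a setter callback might look like in the kernel module; the handler name and the exact rejection policy here are assumptions, not the code from 0582e40:

    #include <linux/kernel.h>
    #include <linux/moduleparam.h>

    /*
     * Hypothetical setter: reject out-of-range values when the user
     * writes the parameter, instead of clamping later in mmp_thread().
     */
    static int
    param_set_multihost_fail_intervals(const char *val,
        const struct kernel_param *kp)
    {
        unsigned long v;
        int error = kstrtoul(val, 0, &v);

        if (error)
            return (error);
        if (v > zfs_multihost_import_intervals)
            return (-EINVAL);

        return (param_set_uint(val, kp));
    }

    module_param_call(zfs_multihost_fail_intervals,
        param_set_multihost_fail_intervals, param_get_uint,
        &zfs_multihost_fail_intervals, 0644);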

@ofaaland (Contributor) commented Jan 19, 2018

> Is there a reason these values are module parameters rather than pool parameters? It would make sense to have them as pool parameters so that they are consistent between any nodes that are trying to import the pool. There may be a bit of a chicken-and-egg problem, but it should still be possible to read these parameters from the dataset without actually doing an import.

The two reasons I remember both arise from the fact that pool properties are stored in an object in the pool.

  1. We wanted to make the go/no-go decision very early in the import process, to minimize the chance that some later patch introduces dangerous code before the MMP check is performed.

  2. Getting far enough into import to read an object from the pool sometimes results in the node crashing when the pool is changing underneath it. Several areas of the ZFS kernel module code assume the pool does not change during the import process. We found that assumption was made about vdev count, log device count, and IIRC device size. There may well be more; these are just the ones that came up early.

I agree, it would be great not to depend on uniform settings among the nodes.

@adilger (Contributor) commented Jan 19, 2018

What about storing the zfs_multihost_import_interval parameter in the per-VDEV nvlist? This is relatively easily parsed (I did that in the util-linux-ng version of libblkid) and could be accessed very early in the import process without any risk of state changing.
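
A rough sketch of what reading such a value could look like; the nvlist key name is hypothetical, while nvlist_lookup_uint64() is the real libnvpair call:

    #include <libnvpair.h>

    /*
     * Return the import-interval count stored in a vdev label nvlist,
     * or 0 if the (hypothetical) key is absent.
     */
    static uint64_t
    get_import_intervals(nvlist_t *label)
    {
        uint64_t intervals = 0;

        (void) nvlist_lookup_uint64(label, "mmp_import_intervals",
            &intervals);
        return (intervals);
    }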

@ofaaland (Contributor) commented

> What about storing the zfs_multihost_import_interval parameter in the per-VDEV nvlist?

One might end up with different values from different VDEVs, for example if some devices get disconnected from host A but remain connected to host B. But if we take the MAX() of all values found we would still be guaranteed to have a safe value.

It wouldn't be a pool property in the strict sense, though the user would think of it that way. That sort of breaks the model, so I'm not sure how best to cope with it.

I agree that this storage location for zfs_multihost_import_interval has advantages.
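
Continuing the sketch above, the MAX() rule could look like this, assuming labels[] and nlabels come from scanning each device's label:

    /*
     * The largest value seen across labels is the most conservative,
     * and therefore safe, number of intervals to wait before import.
     */
    uint64_t safe_intervals = 0;

    for (int i = 0; i < nlabels; i++) {
        uint64_t v = get_import_intervals(labels[i]);
        if (v > safe_intervals)
            safe_intervals = v;
    }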

@ofaaland (Contributor) commented

Looking into this, I see other issues in MMP that are not dangerous but make MMP more likely to suspend the pool than it should be, or more difficult to debug than it needs to be. I'm working through those and will be back with more information soon.

        "zfs_multihost_import_interval %u, "
        "using smaller value",
        zfs_multihost_fail_intervals,
        zfs_multihost_import_interval);


zfs_multihost_import_interval -> zfs_multihost_import_intervals ?

@behlendorf (Contributor) commented

I'm going to close out this PR for now. Several of the changes proposed here have already been committed as part of other patches. The remaining changes in behavior could benefit from additional discussion in a new PR.

@behlendorf closed this on Jun 4, 2018