Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ratelimit deadman zevents as with delay zevents #11786

Merged
merged 3 commits into from
Apr 7, 2021

Conversation

ghost
Copy link

@ghost ghost commented Mar 23, 2021

Motivation and Context

Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Description

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Make existing deadman tests run on FreeBSD.

Add a test for the deadman event ratelimit.

How Has This Been Tested?

Ran ZTS on FreeBSD and Linux.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ghost ghost added the Status: Code Review Needed Ready for review and testing label Mar 23, 2021
@behlendorf
Copy link
Contributor

Good find. It just looks like the test case still needs to be refined a little.

http://build.zfsonlinux.org/builders/CentOS%208%20x86_64%20%28TEST%29/builds/3320/steps/shell_4/logs/summary

@ghost ghost force-pushed the ratelimit-deadman branch from f2741f8 to 42e3c5f Compare March 24, 2021 21:45
@ghost
Copy link
Author

ghost commented Mar 24, 2021

  • Add sleep 10 after writing in the test to give the events time to trickle through the kernel

@ghost ghost force-pushed the ratelimit-deadman branch from 42e3c5f to eeced36 Compare March 25, 2021 16:43
@ghost
Copy link
Author

ghost commented Mar 25, 2021

  • Change the sleep into a loop that will try for up to 30 seconds

Starting to wonder how much work zpool wait -t events would be to implement...

@ghost ghost force-pushed the ratelimit-deadman branch from eeced36 to 9007277 Compare March 25, 2021 21:51
@ghost
Copy link
Author

ghost commented Mar 25, 2021

  • Doubled the timeout waiting for events

Increasing the timeout seemed to help the first time, but one or two still failed. This really shouldn't take that long, but we seem to get unlucky on AWS from time to time. I hope 60 seconds is plenty of time in practice.

@behlendorf
Copy link
Contributor

Even with the 60s limit we seem to have hit it with the Github Actions CI builders. That's really quite a long time, and taking a quick look at the console logs it does look like the deadman ran. So it seems there should have been events. But it's suspicious that both the Github Actions CI builders failed.

log_fail "No dropped events after $TIMEOUT seconds"
fi
sleep 1
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right to me. Why are we expecting dropped events here? I'd expect to only see drops reported when we overflow the zfs_zevent_len_max queue depth and the ZED doesn't consume the event before it's discard. Events which are internally ratelimited on the kernel side and never posted aren't considered to be drops.

What we've been meaning to do for a while is increase the default zfs_zevent_len_max to make drops far more uncommon. The current value is really very low, and scaled of the number of cpus which doesn't make much sense.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I definitely misunderstood that. I'll look for another way to test this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact zfs_is_ratelimiting_event() does increment ratelimit_dropped via fm_erpt_dropped_increment() when the rate limiting kicks in, so I should expect to see dropped events. ZED is not running during these tests, either. I wonder if it's possible the zio is actually completing in under 1ms, so no deadman events are posted? That feels too good to be true.

@behlendorf behlendorf self-requested a review April 1, 2021 15:51
@ghost ghost force-pushed the ratelimit-deadman branch from 9007277 to df58a01 Compare April 1, 2021 18:01
@ghost
Copy link
Author

ghost commented Apr 1, 2021

  • Rebased
  • Removed the check for dropped events

Looks like something slipped in to break the build on stable/12

log_must zinject -a
log_must dd if=$mntpnt/file of=$TEST_BASE_DIR/devnull oflag=sync

sleep 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the sleep here needed with such a short deadman timeout?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's going on here. I put the sleep there under the understanding that events might not have finished making their way into the queue by the time we go to look for them. Ten seconds should have been overkill, but apparently it's not even enough in some cases. There are some test bots failing to see any deadman events somehow.

Unfortunately, I can't reproduce these failures locally. The tests are working as intended on FreeBSD and Linux in my dev environments. I'll continue auditing the zevent code. Perhaps there are some more hidden issues like the ratelimit_dropped one I came across already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest updating the test to use the file_wait_event function from events_common.ksh. That checks the ZED log for the message and you can set a maximum timeout... although I guess that wouldn't be portable for FreeBSD. But you could write something similar.

One other thing you might try is to replace the zinject -a with a zpool export; zpool import. While zinject -a should drop the cache, it may not drop everything if for example the buffers are still being written back. In which case they'll be cache hits for your dd read and you won't trigger the deadman probably.

@ghost ghost force-pushed the ratelimit-deadman branch from df58a01 to 1c9dab8 Compare April 5, 2021 19:16
@ghost
Copy link
Author

ghost commented Apr 5, 2021

  • Rebased
  • Export and import rather than using zinject to clear the ARC

log_must file_write -b 1048576 -c 8 -o create -d 0 -f $mntpnt/file
log_must zpool export $TESTPOOL
log_must zpool import $TESTPOOL
log_must dd if=$mntpnt/file of=$TEST_BASE_DIR/devnull oflag=sync
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like in the most recent results we didn't see any delays. Perhaps we should make sure the deadman will trigger by additionally injecting some IO delays in to the pipeline. Check out the apply_zinject_delays function in perf.shlib for how to use zinject for this.

Ryan Moeller added 3 commits April 6, 2021 17:43
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
@ghost ghost force-pushed the ratelimit-deadman branch from 1c9dab8 to ddf3ec8 Compare April 6, 2021 21:53
@ghost
Copy link
Author

ghost commented Apr 6, 2021

  • Rebased
  • Inject delay for testing

@ghost
Copy link
Author

ghost commented Apr 7, 2021

🎉

@behlendorf
Copy link
Contributor

Alright! Let me just resubmit those builds for good measure then this should be good to go.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Apr 7, 2021
@behlendorf behlendorf merged commit e778b04 into openzfs:master Apr 7, 2021
@ghost ghost deleted the ratelimit-deadman branch April 8, 2021 04:59
mcmilk pushed a commit to mcmilk/zfs that referenced this pull request Apr 10, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
adamdmoss pushed a commit to adamdmoss/zfs that referenced this pull request Apr 10, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
mcmilk pushed a commit to mcmilk/zfs that referenced this pull request Apr 10, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Apr 14, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 7, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
ghost pushed a commit to truenas/zfs that referenced this pull request May 13, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
behlendorf pushed a commit that referenced this pull request May 20, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11786
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes openzfs#11786
tonyhutter pushed a commit that referenced this pull request Jun 23, 2021
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11786
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants