Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ENOSPC in "Handle zap_add() failures in mixed ..." #7421

Merged
merged 1 commit into from
Apr 18, 2018

Conversation

tuxoko
Copy link
Contributor

@tuxoko tuxoko commented Apr 10, 2018

Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Signed-off-by: Chunwei Chen david.chen@nutanix.com

Description

Motivation and Context

#7401

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@tuxoko
Copy link
Contributor Author

tuxoko commented Apr 10, 2018

@behlendorf behlendorf requested review from ahrens and behlendorf April 10, 2018 17:33
@tuxoko tuxoko force-pushed the i7401 branch 3 times, most recently from eb6d663 to cc83da9 Compare April 10, 2018 17:50
@trisk
Copy link
Contributor

trisk commented Apr 10, 2018

If mze_canfit_fzap_leaf is canonical (and not overly conservative), can we make it a condition for all paths in zap_add, not just microzap to fatzap upgrades? That way fzap_add is guaranteed to succeed.

@ahrens
Copy link
Member

ahrens commented Apr 10, 2018

Thanks for your analysis, and for adding a test case!

Are we seeing that 2 ZAP's end up with the same salt, or is it that even with (slightly?) different salts, the low bits of the hash value are preserved? Either way, perhaps we should consider initializing the salt with random_get_pseudo_bytes() rather than the current hacky way.

@tuxoko
Copy link
Contributor Author

tuxoko commented Apr 10, 2018

@ahrens
The current salt on Linux is different but pretty limited. Because it XOR two pointer so the upper bits are all zeros. However, I did try using random_get_pseudo_bytes() but still get the ENOSPC error.

@trisk
The point for mze_canfit_fzap_leaf is to make sure you can fit those entries with same hash value (mixed mode) when you upgrade, so you won't fail during upgrade and end up losing entries. But during normal zap_add, we need to let it fail in order for mixed mode to work.

@ryao
Copy link
Contributor

ryao commented Apr 10, 2018

@tuxoko Good work. This explains things nicely.

@trisk
Copy link
Contributor

trisk commented Apr 10, 2018

@tuxoko Thanks, makes sense that we'd need the failure to allow expansions on an existing fatzap.
Changes look good!

@ahrens
Copy link
Member

ahrens commented Apr 10, 2018

@tuxoko I tested with salt = random_get_pseudo_bytes() and I also saw that the directory entries have similar order when sorted by hash value. This is most easily noticed by observing that the inode (object) numbers returned from getdents() (or in zdb) of the new directory are nearly sequential, indicating that the new sorting (dirent order) is similar to the old sorting (object ID, indicating creation order). Example output below.

I would guess that this is a property of the zap hash function. We should consider replacing it with something better, e.g. cityhash https://github.com/google/cityhash/. But that would require a ZPL on-disk version bump, and obviously your change should go forward without that.

illumos$ truss -v getdents ls -U /test/fs3/fs
getdents64(3, 0xFEDE4000, 8192)                 = 8192
    ino=37169 off=269237847 rlen=32  "100763-8531"
    ino=37172 off=269243910 rlen=32  "100742-7627"
    ino=37171 off=269244466 rlen=32  "100721-8797"
    ino=37175 off=269246886 rlen=32  "100728-5858"
    ino=37174 off=269247203 rlen=32  "100735-7141"
    ino=37173 off=269247978 rlen=32  "100749-2542"
    ino=37177 off=269251869 rlen=32  "100742-3581"
    ino=37176 off=269252650 rlen=32  "100777-1775"
    ino=37179 off=269255038 rlen=32  "100728-3104"
    ino=37178 off=269257892 rlen=32  "100721-2524"
    ino=37181 off=269260142 rlen=32  "100735-9054"
    ino=37180 off=269260397 rlen=32  "100763-4310"
    ino=37185 off=269263243 rlen=32  "100742-9732"
    ino=37184 off=269263496 rlen=32  "100714-4476"
    ino=37183 off=269264877 rlen=32  "100770-4050"
    ino=37182 off=269266879 rlen=32  "100721-6682"
    ino=37186 off=269269057 rlen=32  "100756-2242"
    ino=37187 off=269273562 rlen=32  "100763-6424"
    ino=58243 off=269275058 rlen=32  "100749-870"
    ino=37189 off=269278921 rlen=32  "100728-1630"
    ino=58241 off=269279797 rlen=32  "100742-116"
    ino=37190 off=269281985 rlen=32  "100763-2782"
    ino=37193 off=269285466 rlen=32  "100770-6764"
    ino=37191 off=269286719 rlen=32  "100714-6342"
    ino=37192 off=269286866 rlen=32  "100728-5596"
    ino=37196 off=269288173 rlen=32  "100777-7273"
    ino=37195 off=269291482 rlen=32  "100742-5087"
    ino=37198 off=269293604 rlen=32  "100735-1447"
    ino=37197 off=269295405 rlen=32  "100749-4044"
    ino=37199 off=269298881 rlen=32  "100742-1321"

@ryao
Copy link
Contributor

ryao commented Apr 10, 2018

@ahrens It is not likely to change your conclusions, but I should point out that the implementations of random_get_pseudo_bytes() differ between Illumos and Linux. Illumos uses a "FIPS 186-2 algorithm" while we use a 128-bit Xorshift Pseudo Random Number Generator:

https://github.com/zfsonlinux/spl/blob/master/module/spl/spl-generic.c#L64

That implies that the tests that the two of you did were using different sources of randomness, but came to the same result. You might have expected that, but I thought I would state it explicitly.

done

log_must test $NR_FILES -eq $(ls -U $TESTDIR/src | wc -l)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a quick comment here explaining why you use cp_files instead of just cp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@tonyhutter
Copy link
Contributor

I separated out your test case code and tried it on master with/without the bad patch (cc63068). With the bad patch, it correctly failed 18/20 times. When I bumped NR_FILES from 30,000 to 60,000, it correctly failed 20/20 times, while only increasing the test time from 1min -> 1min30s. Without the bad patch, it correctly passed 10/10 times. I also tested your full patch in this PR and saw it correctly passed 10/10 times.

@tuxoko tuxoko force-pushed the i7401 branch 2 times, most recently from 2ddec57 to 2ac5646 Compare April 11, 2018 21:14
@tuxoko
Copy link
Contributor Author

tuxoko commented Apr 11, 2018

@tonyhutter
I bumped NR_FILES to 60000. And also change the how files are created to speed up the test case.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tuxoko thanks for getting to the bottom of this.


WD=$(pwd)
cd $TESTDIR/src
# create NR_FILES in BATCH at a time to prevent overlowing argument buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overflowing?


log_must test $NR_FILES -eq $(ls -U $TESTDIR/src | wc -l)

# copy files from src to dst, use cp_files to make sure readdir order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/make sure/ensure

Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
@tuxoko
Copy link
Contributor Author

tuxoko commented Apr 17, 2018

Fix the comment.

@codecov
Copy link

codecov bot commented Apr 17, 2018

Codecov Report

Merging #7421 into master will decrease coverage by 0.02%.
The diff coverage is 81.39%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7421      +/-   ##
==========================================
- Coverage   76.53%    76.5%   -0.03%     
==========================================
  Files         335      331       -4     
  Lines      107100   104299    -2801     
==========================================
- Hits        81965    79792    -2173     
+ Misses      25135    24507     -628
Flag Coverage Δ
#kernel 76.64% <96.42%> (-0.41%) ⬇️
#user 65.65% <67.39%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d68ac65...983d87d. Read the comment docs.

@sanjeevbagewadi
Copy link
Contributor

Thank you @tuxoko for handling this while I was away. I agree with the changes.

The optimisation for bailing out early was to avoid growing the table unnecessarily. But, as seen in this bug we cannot deterministically decide when. Hence, we would need to live with the extra indirection.

@behlendorf behlendorf merged commit 599b864 into openzfs:master Apr 18, 2018
@lnicola
Copy link
Contributor

lnicola commented Apr 24, 2018

This isn't the best place to ask, but is there a way to detect if you've been affected by this issue yet? If not, will the FAQ when there is?

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 15, 2018
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#7401 
Closes openzfs#7421
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 23, 2018
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#7401 
Closes openzfs#7421
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 5, 2018
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#7401 
Closes openzfs#7421
@mailinglists35
Copy link

mailinglists35 commented Sep 30, 2018

@ryao apologies for interrupting you with my tagging, I know this is probably very low priority but is there any way for a user without zfs source code reading ability to verify if is there any impact of this issue?

I read on the remedy pull request that this is triggered by copying of "large amount of files", but it does not mention the time distribution of those files.

Is there a rate number to compare that guarantees the issue was (or not) triggered? Like N files per second or per cpu cycles?

Thank you! :)

amotin added a commit to amotin/zfs that referenced this pull request Apr 26, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored-By:	iXsystems, Inc.
Closes openzfs#13215
amotin added a commit to amotin/zfs that referenced this pull request Apr 26, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored-By:	iXsystems, Inc.
Closes openzfs#13215
amotin added a commit to amotin/zfs that referenced this pull request Apr 26, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored-By:	iXsystems, Inc.
Closes openzfs#13215
amotin added a commit to amotin/zfs that referenced this pull request Apr 27, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored-By:	iXsystems, Inc.
Closes openzfs#13215
behlendorf pushed a commit that referenced this pull request May 17, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in #7421, but it
was never fully ported to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #13215
Closes #16138
amotin added a commit to amotin/zfs that referenced this pull request May 23, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#13215
Closes openzfs#16138
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 23, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#13215
Closes openzfs#16138
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 23, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#13215
Closes openzfs#16138
behlendorf pushed a commit that referenced this pull request May 29, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in #7421, but it
was never fully ported to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #13215
Closes #16138
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Originally Solaris didn't expect errors there, but they may happen
if we fail to add entry into ZAP.  Linux fixed it in openzfs#7421, but it
was never fully ported to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#13215
Closes openzfs#16138
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.