Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix out-of-order ZIL txtype lost on hardlinked files #9061

Merged
merged 1 commit into from
Aug 14, 2019

Conversation

tuxoko
Copy link
Contributor

@tuxoko tuxoko commented Jul 18, 2019

We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Motivation and Context

#8769

Description

How Has This Been Tested?

Tested with the repro in #8769

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jul 18, 2019
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job running this down, the proposed fix makes good sense. This looks like a potentially very long standing bug. As part of this change we're going to want to update the slog/slog_replay_fs.ksh test to cover this case. In the original issue you mentioned running in to an issue with zpool freeze, is that still the case?

@tuxoko
Copy link
Contributor Author

tuxoko commented Jul 18, 2019 via email

@tuxoko
Copy link
Contributor Author

tuxoko commented Jul 18, 2019

I tried today on master but didn't hit any issue with zpool freeze

Actually, let me take that back. I hit it after another try...

[11669.936739] VERIFY3(0 == dmu_object_claim_dnsize(zfsvfs->z_os, obj, DMU_OT_PLAIN_FILE_CONTENTS, 0, obj_type, bonuslen, dnodesize, tx)) failed (0 == 28)
[11669.936743] PANIC at zfs_znode.c:725:zfs_mknode()
[11669.936744] Showing stack for process 2871
[11669.936747] CPU: 1 PID: 2871 Comm: mount.zfs Tainted: P           OE     4.18.0-25-generic #26~18.04.1-Ubuntu
[11669.936747] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[11669.936748] Call Trace:
[11669.936755]  dump_stack+0x63/0x85
[11669.936761]  spl_dumpstack+0x29/0x30 [spl]
[11669.936764]  spl_panic+0xc8/0x110 [spl]
[11669.936813]  zfs_mknode+0x80d/0xf30 [zfs]
[11669.936818]  ? spl_kmem_zalloc+0xe9/0x150 [spl]
[11669.936844]  ? dsl_dir_tempreserve_space+0x1fe/0x460 [zfs]
[11669.936873]  ? txg_rele_to_quiesce+0x35/0x40 [zfs]
[11669.936893]  ? dmu_tx_assign+0x1e7/0x460 [zfs]
[11669.936922]  zfs_symlink+0x357/0x6f0 [zfs]
[11669.936970]  zfs_replay_create+0x3ad/0x420 [zfs]
[11669.937001]  zil_replay_log_record+0xdc/0x1a0 [zfs]
[11669.937028]  ? zil_replay_error.isra.18+0xb0/0xb0 [zfs]
[11669.937054]  zil_parse+0x657/0x900 [zfs]
[11669.937080]  ? zil_replay_error.isra.18+0xb0/0xb0 [zfs]
[11669.937105]  ? zil_set_logbias+0x20/0x20 [zfs]
[11669.937132]  zil_replay+0xbc/0x120 [zfs]
[11669.937157]  zfsvfs_setup+0x165/0x200 [zfs]
[11669.937186]  zfs_domount+0x326/0x3e0 [zfs]
[11669.937211]  zpl_mount+0x143/0x180 [zfs]
[11669.937214]  mount_fs+0x37/0x150
[11669.937218]  vfs_kern_mount.part.25+0x5d/0x110
[11669.937219]  do_mount+0x5ed/0xce0
[11669.937221]  ? copy_mount_options+0x2c/0x220
[11669.937223]  ksys_mount+0x98/0xe0
[11669.937224]  __x64_sys_mount+0x25/0x30
[11669.937226]  do_syscall_64+0x5a/0x120
[11669.937229]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[11669.937231] RIP: 0033:0x7faf0df473ca
[11669.937231] Code: 48 8b 0d c1 8a 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8e 8a 2c 00 f7 d8 64 89 01 48 
[11669.937258] RSP: 002b:00007fff46ff1e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[11669.937260] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007faf0df473ca
[11669.937260] RDX: 0000562700d079cb RSI: 00007fff46ff5fa0 RDI: 00007fff46ff78f0
[11669.937261] RBP: 00007fff46ff6fe0 R08: 00007fff46ff2fa0 R09: 00005627027bae80
[11669.937262] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000000000
[11669.937263] R13: 00005627027b3590 R14: 00007fff46ff1fa0 R15: 00007fff46ff2fa0

We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Closes openzfs#8769
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
@behlendorf
Copy link
Contributor

That failure looks like #7151 and #8910 which up till now we haven't been able to reproduce.

@tuxoko
Copy link
Contributor Author

tuxoko commented Jul 19, 2019

The repro in #8769 seems to be pretty reliable if you do it with zpool freeze, and it only happen on second time.

However, I have never hit it with "echo b > /proc/sysrq-trigger". So I think this is probably a separate issue in zpool freeze. Unless they also did zpool freeze on their pool.

@codecov
Copy link

codecov bot commented Jul 19, 2019

Codecov Report

Merging #9061 into master will increase coverage by 0.07%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9061      +/-   ##
==========================================
+ Coverage   78.67%   78.74%   +0.07%     
==========================================
  Files         400      400              
  Lines      121009   120993      -16     
==========================================
+ Hits        95199    95276      +77     
+ Misses      25810    25717      -93
Flag Coverage Δ
#kernel 79.44% <100%> (-0.11%) ⬇️
#user 66.38% <ø> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 43a8536...b466e78. Read the comment docs.

Copy link
Member

@prakashsurya prakashsurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

@behlendorf
Copy link
Contributor

The repro in #8769 seems to be pretty reliable if you do it with zpool freeze, and it only happen on second time.

I'm not sure what you mean by the 'second time' here. Do you mean running it twice using the same pool?

@tuxoko
Copy link
Contributor Author

tuxoko commented Jul 27, 2019 via email

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 13, 2019
@behlendorf behlendorf merged commit 8e556c5 into openzfs:master Aug 14, 2019
@DSpeichert DSpeichert mentioned this pull request Aug 20, 2019
12 tasks
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 21, 2019
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#8769
Closes openzfs#9061
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 22, 2019
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#8769
Closes openzfs#9061
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 23, 2019
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#8769
Closes openzfs#9061
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 17, 2019
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#8769
Closes openzfs#9061
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 18, 2019
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#8769
Closes openzfs#9061
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 23, 2019
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#8769
Closes openzfs#9061
tonyhutter pushed a commit that referenced this pull request Sep 26, 2019
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #8769
Closes #9061
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants