
Behavior of ext4 on ZVOL (allocated with overcommit) when out of space on pool occurs. #2458

Closed
pavel-odintsov opened this issue Jul 3, 2014 · 6 comments
Labels: Type: Documentation

@pavel-odintsov

Hello!

My issue concerns the behavior of ext4 on a ZVOL (allocated with overcommit) when the pool runs out of space.

I created a mirrored pool of about 220 GB:

zpool create -f vzpool mirror /dev/sdc /dev/sdd
zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
vzpool   222G  1,09M   222G     0%  1.00x  ONLINE  -

Then I created a sparse 300 GB zvol on this pool:

zfs create -s -V 300G vzpool/ct101

And formatted it as ext4:

parted -s /dev/vzpool/ct101 mklabel gpt
parted /dev/vzpool/ct101 "mkpart primary 1 -1"
mkfs.ext4 /dev/vzpool/ct101-part1

After this I tried to emulate an "out of space on pool" condition using the following commands:

mount /dev/vzpool/ct101-part1 /mnt
openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | pv -b > /mnt/temp_file_unlimited_Gb
ctrl+a d   # detach the screen session; the write keeps running
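The same fill-until-failure experiment can be sketched in Python. This is a toy illustration, not the commands actually used above; the `fill_until_enospc` helper and its injectable `write` parameter are hypothetical, introduced only so the behavior can be exercised without actually filling a disk:

```python
import errno
import os

def fill_until_enospc(path, chunk=1 << 20, write=os.write):
    """Append zero-filled chunks to `path` until the kernel (or an
    injected `write`) reports ENOSPC; return bytes written so far."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    buf = b"\0" * chunk
    total = 0
    try:
        while True:
            try:
                total += write(fd, buf)
            except OSError as e:
                if e.errno == errno.ENOSPC:
                    break  # filesystem/pool is out of space: stop cleanly
                raise      # any other I/O error is unexpected here
    finally:
        os.close(fd)
    return total
```

Unlike the openssl pipe above, this stops at the first ENOSPC instead of continuing to hammer a failing device.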

After writing about 230 GB to this partition, I got the following errors in dmesg:

Jul  2 17:48:26 zfs kernel: [20935.912818] EXT4-fs warning (device zd0p1): ext4_end_bio:
Jul  2 17:48:26 zfs kernel: [20935.912835] EXT4-fs warning (device zd0p1): ext4_end_bio: I/O error writing to inode 12 (size 0 starting block 58907648)
Jul  2 17:48:26 zfs kernel: [20935.912872] Buffer I/O error on device zd0p1, logical block 58907264
Jul  2 17:50:39 zfs kernel: [21068.632693] ------------[ cut here ]------------
Jul  2 17:50:39 zfs kernel: [21068.632742] WARNING: at fs/buffer.c:1182 mark_buffer_dirty+0x82/0xa0() (Tainted: P           ---------------   )
Jul  2 17:50:39 zfs kernel: [21068.632821] Hardware name: System Product Name
Jul  2 17:50:39 zfs kernel: [21068.632859] Modules linked in: vzethdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nfs lockd fscache auth_rpcgss nfs_acl sunrpc nf_conntrack vziolimit vzdquota ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vzevent vznetdev vzmon vzdev ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ext3 jbd cpufreq_ondemand acpi_cpufreq freq_table mperf iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core sg lpc_ich mfd_core shpchp e1000e ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif xhci_hcd video output ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jul  2 17:50:39 zfs kernel: [21068.633409] Pid: 4125, comm: jbd2/zd0p1-8 veid: 0 Tainted: P           ---------------    2.6.32-042stab090.5 #1
Jul  2 17:50:39 zfs kernel: [21068.633488] Call Trace:
Jul  2 17:50:39 zfs kernel: [21068.633528]  [<ffffffff81074f97>] ? warn_slowpath_common+0x87/0xc0
Jul  2 17:50:39 zfs kernel: [21068.633573]  [<ffffffff81074fea>] ? warn_slowpath_null+0x1a/0x20
Jul  2 17:50:39 zfs kernel: [21068.633617]  [<ffffffff811e5092>] ? mark_buffer_dirty+0x82/0xa0
Jul  2 17:50:39 zfs kernel: [21068.633666]  [<ffffffffa0090315>] ? __jbd2_journal_temp_unlink_buffer+0xa5/0x140 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.633745]  [<ffffffffa00903c6>] ? __jbd2_journal_unfile_buffer+0x16/0x30 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.633822]  [<ffffffffa0090728>] ? __jbd2_journal_refile_buffer+0xc8/0xf0 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.633900]  [<ffffffffa00933a8>] ? jbd2_journal_commit_transaction+0xd38/0x1500 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.633978]  [<ffffffff81009770>] ? __switch_to+0xd0/0x320
Jul  2 17:50:39 zfs kernel: [21068.634023]  [<ffffffff81088cdb>] ? try_to_del_timer_sync+0x7b/0xe0
Jul  2 17:50:39 zfs kernel: [21068.634073]  [<ffffffffa0098bb8>] ? kjournald2+0xb8/0x220 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.634118]  [<ffffffff810a1550>] ? autoremove_wake_function+0x0/0x40
Jul  2 17:50:39 zfs kernel: [21068.634167]  [<ffffffffa0098b00>] ? kjournald2+0x0/0x220 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.634212]  [<ffffffff810a0f36>] ? kthread+0x96/0xa0
Jul  2 17:50:39 zfs kernel: [21068.634254]  [<ffffffff8100c34a>] ? child_rip+0xa/0x20
Jul  2 17:50:39 zfs kernel: [21068.634306]  [<ffffffff810a0ea0>] ? kthread+0x0/0xa0
Jul  2 17:50:39 zfs kernel: [21068.634350]  [<ffffffff8100c340>] ? child_rip+0x0/0x20
Jul  2 17:50:39 zfs kernel: [21068.634392] ---[ end trace 520ff3534d63f82f ]---
Jul  2 17:50:39 zfs kernel: [21068.634431] Tainting kernel with flag 0x9
Jul  2 17:50:39 zfs kernel: [21068.634470] Pid: 4125, comm: jbd2/zd0p1-8 veid: 0 Tainted: P           ---------------    2.6.32-042stab090.5 #1
Jul  2 17:50:39 zfs kernel: [21068.634548] Call Trace:
Jul  2 17:50:39 zfs kernel: [21068.634585]  [<ffffffff81074e21>] ? add_taint+0x71/0x80
Jul  2 17:50:39 zfs kernel: [21068.634627]  [<ffffffff81074fa4>] ? warn_slowpath_common+0x94/0xc0
Jul  2 17:50:39 zfs kernel: [21068.634672]  [<ffffffff81074fea>] ? warn_slowpath_null+0x1a/0x20
Jul  2 17:50:39 zfs kernel: [21068.634715]  [<ffffffff811e5092>] ? mark_buffer_dirty+0x82/0xa0
Jul  2 17:50:39 zfs kernel: [21068.634762]  [<ffffffffa0090315>] ? __jbd2_journal_temp_unlink_buffer+0xa5/0x140 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.634841]  [<ffffffffa00903c6>] ? __jbd2_journal_unfile_buffer+0x16/0x30 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.634918]  [<ffffffffa0090728>] ? __jbd2_journal_refile_buffer+0xc8/0xf0 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.634996]  [<ffffffffa00933a8>] ? jbd2_journal_commit_transaction+0xd38/0x1500 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.635072]  [<ffffffff81009770>] ? __switch_to+0xd0/0x320
Jul  2 17:50:39 zfs kernel: [21068.635116]  [<ffffffff81088cdb>] ? try_to_del_timer_sync+0x7b/0xe0
Jul  2 17:50:39 zfs kernel: [21068.635164]  [<ffffffffa0098bb8>] ? kjournald2+0xb8/0x220 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.635208]  [<ffffffff810a1550>] ? autoremove_wake_function+0x0/0x40
Jul  2 17:50:39 zfs kernel: [21068.635256]  [<ffffffffa0098b00>] ? kjournald2+0x0/0x220 [jbd2]
Jul  2 17:50:39 zfs kernel: [21068.635305]  [<ffffffff810a0f36>] ? kthread+0x96/0xa0
Jul  2 17:50:39 zfs kernel: [21068.635348]  [<ffffffff8100c34a>] ? child_rip+0xa/0x20
Jul  2 17:50:39 zfs kernel: [21068.635392]  [<ffffffff810a0ea0>] ? kthread+0x0/0xa0
Jul  2 17:50:39 zfs kernel: [21068.635433]  [<ffffffff8100c340>] ? child_rip+0x0/0x20
Jul  2 17:51:13 zfs kernel: [21102.995512] __ratelimit: 628 callbacks suppressed

And all space on the pool was depleted:

[root@zfs ~]# zfs  list
NAME           USED  AVAIL  REFER  MOUNTPOINT
vzpool         219G      0   272K  /vzpool
vzpool/ct101   219G      0   219G  -
[root@zfs ~]# zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
vzpool   222G   219G  3,47G    98%  1.00x  ONLINE  -

And the ext4 filesystem was completely corrupted:

LANG=C mount /dev/vzpool/ct101-part1 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/zd0p1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

But enlarging the pool or running fsck.ext4 can fix this.

Yes, I have read about the "unexpected behavior" of overcommit in man zfs. But overcommit is a really useful feature: it allows allocating many partitions whose total size exceeds the available disk space, and it is widely used in LVM.

Could you add warnings to dmesg from ZFS about pool depletion and return a "no space left" error to ext4 before space actually runs out? And maybe it is possible to return some specific error code to ext4 so it can safely save its data and stop operations on this device.

@mpatlasov

There is a twofold solution to the problem Pavel set forth:

Firstly, we could add an ad-hoc callback to struct backing_dev_info, for example block_device_full(). The zvol could then register its zvol_device_full() method there, returning "true" if the ZFS pool is close to depletion. ext4 running on the zvol could call that method in the context of syscalls and return ENOSPC when the method returns "true". In fact, the block_device_full() callback could be called in exactly the same places where ext4 checks disk quotas.
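A user-space sketch of this first idea, under the assumption of a hypothetical full-callback hook (the names `ext4_write` and `zvol_device_full`, the toy pool, and the 95% threshold are all invented for illustration; the real change would live in struct backing_dev_info and the kernel write path):

```python
import errno

POOL_SIZE = 100          # toy pool capacity, in arbitrary units
pool = {"used": 0}

def zvol_device_full():
    """zvol-side answer for the proposed block_device_full() hook:
    report 'full' slightly before real depletion (here at 95%)."""
    return pool["used"] >= POOL_SIZE * 0.95

def ext4_write(nbytes, full_cb=zvol_device_full):
    """Filesystem-side check, placed where ext4 already checks disk
    quotas: fail early with a clean ENOSPC instead of letting the
    write reach a depleted pool and come back as an I/O error."""
    if full_cb():
        raise OSError(errno.ENOSPC, "No space left on device")
    pool["used"] += nbytes
    return nbytes
```

Writes succeed until the pool crosses the threshold, after which the filesystem sees ENOSPC while it can still handle it gracefully.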

Secondly, if the zvol cannot process an incoming request due to pool depletion, it could put the request into a separate queue and schedule periodic checks for free space (e.g. once every few seconds), rather than failing the request immediately with EIO. Later, when the system administrator frees some disk space, those requests could be retried. In this approach, end users would observe that some applications are frozen for a while, maybe even a long while, but they would have a chance to proceed smoothly after some disk space is freed.
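This second idea can be sketched as a toy model (the `RetryingZvol` class is hypothetical; a real implementation would sit in the zvol request path and use a timer-driven check, not an explicit `free_space()` call):

```python
import collections

class RetryingZvol:
    """Queue writes that hit pool depletion and retry them after
    space is freed, instead of failing them immediately with EIO."""
    def __init__(self, pool_free):
        self.pool_free = pool_free
        self.pending = collections.deque()   # FIFO preserves write order

    def submit(self, req_bytes):
        if not self.pending and req_bytes <= self.pool_free:
            self.pool_free -= req_bytes
            return "completed"
        # The submitting application appears frozen while its
        # request waits here.
        self.pending.append(req_bytes)
        return "queued"

    def free_space(self, nbytes):
        """Admin freed some space; the periodic check drains the
        queue in order and returns the requests that completed."""
        self.pool_free += nbytes
        done = []
        while self.pending and self.pending[0] <= self.pool_free:
            req = self.pending.popleft()
            self.pool_free -= req
            done.append(req)
        return done
```

Note that once anything is queued, later requests queue behind it even if they would fit, so writes are never reordered around a stalled one.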

@pavel-odintsov
Author

Hello again!

You can see code examples for this approach on the ext4 side here: https://github.com/pavel-odintsov/openvz_rhel6_kernel_mirror/blob/f310e3d87d79ed06262852d770d5c140ae59422f/fs/ext4/ext4.h#L2180 (function check_bd_full), and on the block device side here: https://github.com/pavel-odintsov/openvz_rhel6_kernel_mirror/blob/f310e3d87d79ed06262852d770d5c140ae59422f/drivers/block/ploop/dev.c#L3495 (function ploop_bd_full).

This code was written for the OpenVZ project (for the RHEL 2.6.32 kernel) and is licensed under GPLv2, so you can use it without any problems.

@ryao
Contributor

ryao commented Jul 11, 2014

@pavel-odintsov If the number of outstanding IOs on zvols exceeded 32, then reordering could have occurred that might have worsened things. #2484 should address that.

@mpatlasov what makes you think that EIO was returned? The ext4 source code will print that message for any kind of IO error, including ENOSPC:

http://lxr.free-electrons.com/source/fs/ext4/page-io.c#L298

We return ENOSPC on datasets when we run out of space. It would be surprising if we returned EIO on a zvol given that the internal transaction paths are the same. A cursory look through the code suggests that this is indeed the case. Specifically, zvol_write() calls dmu_tx_assign(), which should return ENOSPC if we lack sufficient space to complete the transaction.

That being said, I suspect that your idea to retry requests would make the problem worse by allowing ext4 to believe things are fine when they are not. We are already doing that to some extent because of how zvol transaction processing works when all 32 zvol taskq threads are busy. That allows for reordering that could affect the flushes ext4 uses as part of its journalling. #2484 should address that by changing the zvol code to block those submitting requests until ZFS has decided on the fate of the request. That way the ext4 code does not proceed to do other things thinking that the request is in flight when it is possible for those other things to occur before the first request.
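The ordering hazard described above, and the #2484-style fix, can be modeled in a few lines. This is a deterministic toy, not the actual zvol taskq code; the 32-slot batches and the shuffle merely stand in for concurrent, unordered completion:

```python
import random

def taskq_completion_order(requests, slots=32, seed=1):
    """Toy model of the old behavior: up to `slots` requests are in
    flight at once and may complete in any order, so a FLUSH can
    finish before a WRITE that was submitted earlier."""
    rng = random.Random(seed)
    completed, inflight = [], []
    for r in requests:
        inflight.append(r)
        if len(inflight) == slots:
            rng.shuffle(inflight)          # concurrent completion
            completed.extend(inflight)
            inflight = []
    rng.shuffle(inflight)
    completed.extend(inflight)
    return completed

def blocking_completion_order(requests):
    """Toy model of the #2484 behavior: each submitter blocks until
    ZFS has decided the fate of its request, so completion order is
    exactly submission order."""
    return list(requests)
```

With blocking submission, a journalling filesystem's flush barriers keep their intended position relative to earlier writes.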

Beyond that, we could look into implementing hooks to signal consumers like ext4 to stop writing when the pool is full such as marking the device congested or using the device full callback OpenVZ implemented.

@behlendorf
Contributor

@pavel-odintsov My suggestion for this particular case is to mount ext4 with the errors=remount-ro mount option. This will cause the ext4 filesystem to immediately go read-only and limit any further damage if errors (EIO, ENOSPC, etc.) are encountered from the block layer. You can then take whatever administrative action is needed. I don't think this is something we can (or should) be trying to tackle in ZFS.

@behlendorf behlendorf added this to the 0.7.0 milestone Jul 16, 2014
@pavel-odintsov
Author

@behlendorf, your suggestion works really well in this case. We tested it thoroughly and found no issues. Thank you!

@behlendorf
Contributor

@pavel-odintsov thanks for the follow-up. I'm glad that solution worked for you. As I mentioned above, I don't think this is something we should be trying to tackle in ZFS, so I'm closing out this issue.
