
zpool hangs after some load with zvols #9693

Status: Closed
Labels: Status: Stale (no recent activity for issue), Type: Defect (incorrect behavior, e.g. crash, hang)

glztmf opened this issue Dec 7, 2019 · 11 comments

glztmf commented Dec 7, 2019

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  10 (buster)
Linux Kernel          5.0.15-1-pve
Architecture          x86_64
ZFS Version           0.8.1-pve1
SPL Version           0.8.1-pve1

Describe the problem you're observing

Some operations on a specific zpool and its zvols hang.
Virtual machines freeze.

Describe how to reproduce the problem

Install PVE 6.0-4.
Create an additional mirror zpool.
Create several VMs with virtual disks on the new zpool.
Add load to the VMs' disks and wait for a few days (a rough shell sketch of these steps is below).
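
For illustration, a minimal shell sketch of the setup described above, modeled on the pool history shown later in this report; the device names and the zvol size are placeholders, not values taken from the affected system:

# Two two-way mirrors, as in the zpl2 pool below (replace sdX..sdW with real disks).
zpool create zpl2 mirror /dev/sdX /dev/sdY mirror /dev/sdZ /dev/sdW
# One zvol per VM disk; Proxmox runs an equivalent command when a disk is added.
zfs create -V 200G zpl2/vm-100-disk-0
# Attach the zvols to the VMs and generate sustained I/O inside them, e.g. with fio.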

Include any warning/errors/backtraces from the system logs

Inside the VMs' terminals:

INFO: task jbd2/vda1-8:478 blocked for more than 120 seconds.
INFO: task systemd-journal:563 blocked for more than 120 seconds.
...

Processes:

# ps aux | egrep ' D | D< |ST[A]T'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       485  0.0  0.0      0     0 ?        D<   Dec05   2:17 [zvol]
root      4323  0.0  0.0      0     0 ?        D    Dec05   0:01 [txg_quiesce]
root     18842  1.2  0.0      0     0 ?        D    13:14   0:45 [md15_resync]
root     28115  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28117  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28119  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28120  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28126  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28127  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28128  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28129  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28130  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28131  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28132  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28133  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28134  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28135  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28136  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28137  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28138  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28139  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28140  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28141  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28142  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28143  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28145  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28150  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28151  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28152  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28153  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28154  0.0  0.0      0     0 ?        D<   Dec06   0:29 [zvol]
root     28155  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28156  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]
root     28157  0.0  0.0      0     0 ?        D<   Dec06   0:00 [zvol]

From dmesg:

[Fri Dec  6 16:30:23 2019] INFO: task zvol:485 blocked for more than 120 seconds.
[Fri Dec  6 16:30:23 2019]       Tainted: P          IO      5.0.15-1-pve #1
[Fri Dec  6 16:30:23 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Dec  6 16:30:23 2019] zvol            D    0   485      2 0x80000000
[Fri Dec  6 16:30:23 2019] Call Trace:
[Fri Dec  6 16:30:23 2019]  __schedule+0x2d4/0x870
[Fri Dec  6 16:30:23 2019]  ? try_to_wake_up+0x59/0x4f0
[Fri Dec  6 16:30:23 2019]  schedule+0x2c/0x70
[Fri Dec  6 16:30:23 2019]  rwsem_down_write_failed+0x160/0x340
[Fri Dec  6 16:30:23 2019]  call_rwsem_down_write_failed+0x17/0x30
[Fri Dec  6 16:30:23 2019]  ? spl_kmem_free+0x33/0x40 [spl]
[Fri Dec  6 16:30:23 2019]  down_write+0x2d/0x40
[Fri Dec  6 16:30:23 2019]  dmu_zfetch+0x134/0x590 [zfs]
[Fri Dec  6 16:30:23 2019]  dmu_buf_hold_array_by_dnode+0x379/0x450 [zfs]
[Fri Dec  6 16:30:23 2019]  dmu_write_uio_dnode+0x4c/0x140 [zfs]
[Fri Dec  6 16:30:23 2019]  zvol_write+0x190/0x620 [zfs]
[Fri Dec  6 16:30:23 2019]  taskq_thread+0x2ec/0x4d0 [spl]
[Fri Dec  6 16:30:23 2019]  ? wake_up_q+0x80/0x80
[Fri Dec  6 16:30:23 2019]  kthread+0x120/0x140
[Fri Dec  6 16:30:23 2019]  ? task_done+0xb0/0xb0 [spl]
[Fri Dec  6 16:30:23 2019]  ? __kthread_parkme+0x70/0x70
[Fri Dec  6 16:30:23 2019]  ret_from_fork+0x35/0x40
[Fri Dec  6 16:30:23 2019] INFO: task txg_quiesce:4323 blocked for more than 120 seconds.
[Fri Dec  6 16:30:23 2019]       Tainted: P          IO      5.0.15-1-pve #1
[Fri Dec  6 16:30:23 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Dec  6 16:30:23 2019] txg_quiesce     D    0  4323      2 0x80000000
[Fri Dec  6 16:30:23 2019] Call Trace:
[Fri Dec  6 16:30:23 2019]  __schedule+0x2d4/0x870
[Fri Dec  6 16:30:23 2019]  schedule+0x2c/0x70
[Fri Dec  6 16:30:23 2019]  cv_wait_common+0x104/0x130 [spl]
[Fri Dec  6 16:30:23 2019]  ? wait_woken+0x80/0x80
[Fri Dec  6 16:30:23 2019]  __cv_wait+0x15/0x20 [spl]
[Fri Dec  6 16:30:23 2019]  txg_quiesce_thread+0x2ac/0x3a0 [zfs]
[Fri Dec  6 16:30:23 2019]  ? txg_sync_thread+0x4c0/0x4c0 [zfs]
[Fri Dec  6 16:30:23 2019]  thread_generic_wrapper+0x74/0x90 [spl]
[Fri Dec  6 16:30:23 2019]  kthread+0x120/0x140
[Fri Dec  6 16:30:23 2019]  ? __thread_exit+0x20/0x20 [spl]
[Fri Dec  6 16:30:23 2019]  ? __kthread_parkme+0x70/0x70
[Fri Dec  6 16:30:23 2019]  ret_from_fork+0x35/0x40

The zfs create command hangs (the strace ends in an ioctl that does not return):

# strace zfs create zpl2/test2
stat("/sys/module/zfs/features.pool/org.zfsonlinux:allocation_classes", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
stat("/sys/module/zfs/features.pool/com.datto:resilver_defer", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=2995, ...}) = 0
read(6, "# Locale name alias data base.\n#"..., 3072) = 2995
read(6, "", 3072)                       = 0
close(6)                                = 0
openat(AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x12, 0), 0x7fff81e9c610) = 0
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f154c79b000
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x12, 0), 0x7fff81e99010) = 0
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x5, 0), 0x7fff81e95980) = 0
brk(0x559c52bc3000)                     = 0x559c52bc3000
munmap(0x7f154c79b000, 266240)          = 0
brk(0x559c52c0c000)                     = 0x559c52c0c000
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x12, 0), 0x7fff81e9cb00) = -1 ENOENT (No such file or directory)
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x5, 0), 0x7fff81e9cac0) = 0
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x12, 0), 0x7fff81e9c690) = 0
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x27, 0), 0x7fff81e9c610) = 0
ioctl(5, _IOC(_IOC_NONE, 0x5a, 0x17, 0)

Creating a snapshot hangs the same way:

# strace zfs snapshot zpl2/vm-1414-disk-1@test
execve("/usr/sbin/zfs", ["zfs", "snapshot", "zpl2/vm-1414-disk-1@test"], 0x7fffde7df1d0 /* 21 vars */) = 0
brk(NULL)                               = 0x55e876424000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...
stat("/sys/module/zfs/features.pool/com.datto:resilver_defer", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=2995, ...}) = 0
read(6, "# Locale name alias data base.\n#"..., 3072) = 2995
read(6, "", 3072)                       = 0
close(6)                                = 0
openat(AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7faaa3621000
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x12, 0), 0x7ffd4c711990) = 0
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x5, 0), 0x7ffd4c70e300) = 0
brk(0x55e876466000)                     = 0x55e876466000
munmap(0x7faaa3621000, 266240)          = 0
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x5, 0), 0x7ffd4c711920) = 0
brk(0x55e87648f000)                     = 0x55e87648f000
ioctl(5, _IOC(_IOC_NONE, 0x5a, 0x23, 0)

The zpool history command hangs after printing the header:

stat("/sys/module/zfs/features.pool/org.zfsonlinux:project_quota", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
stat("/sys/module/zfs/features.pool/org.zfsonlinux:allocation_classes", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
stat("/sys/module/zfs/features.pool/com.datto:resilver_defer", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0x5, 0), 0x7ffe6f208270) = 0
brk(0x564e6bc5c000)                     = 0x564e6bc5c000
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=2995, ...}) = 0
read(6, "# Locale name alias data base.\n#"..., 3072) = 2995
read(6, "", 3072)                       = 0
close(6)                                = 0
openat(AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0x8), ...}) = 0
write(1, "History for 'zpl2':\n", 20History for 'zpl2':
)   = 20
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f217eed1000
ioctl(3, _IOC(_IOC_NONE, 0x5a, 0xa, 0)

zpool status works fine:

# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 48.7M in 0 days 00:00:01 with 0 errors on Wed Nov 27 11:41:49 2019
config:

        NAME                                              STATE     READ WRITE CKSUM
        rpool                                             ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            scsi-3600508b1001cd542127ea37ea7a8acf9-part3  ONLINE       0     0     0
            scsi-3600508b1001c39b7de4f60b664d58be3-part3  ONLINE       0     0     0

errors: No known data errors

  pool: zpl1
 state: ONLINE
  scan: resilvered 1007M in 0 days 00:00:09 with 0 errors on Tue Nov 26 19:49:33 2019
config:

        NAME        STATE     READ WRITE CKSUM
        zpl1        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda4    ONLINE       0     0     0
            sdb4    ONLINE       0     0     0

errors: No known data errors

  pool: zpl2
 state: ONLINE
  scan: none requested
config:

        NAME                                        STATE     READ WRITE CKSUM
        zpl2                                        ONLINE       0     0     0
          mirror-0                                  ONLINE       0     0     0
            wwn-0x600508b1001cf654d572b375c56c4532  ONLINE       0     0     0
            wwn-0x600508b1001cd152a9b5ce7a105d408f  ONLINE       0     0     0
          mirror-1                                  ONLINE       0     0     0
            wwn-0x600508b1001c1b5698104782c7f381cd  ONLINE       0     0     0
            wwn-0x600508b1001c25c5205c24bf340de661  ONLINE       0     0     0

errors: No known data errors

zdb commands work fine after running zpool set cachefile=/etc/zfs/zpool.cache zpl2:

# zdb -C zpl2

MOS Configuration:
        version: 5000
        name: 'zpl2'
        state: 0
        txg: 348433
        pool_guid: 8066186125374951052
        errata: 0
        hostid: 319288231
        hostname: 'docker14'
        com.delphix:has_per_vdev_zaps
        vdev_children: 2
        vdev_tree:
            type: 'root'
            id: 0
            guid: 8066186125374951052
            create_txg: 4
            children[0]:
                type: 'mirror'
                id: 0
                guid: 13861948267117057889
                metaslab_array: 144
                metaslab_shift: 33
                ashift: 12
                asize: 1000156430336
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 129
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 4319338434976575004
                    path: '/dev/disk/by-id/wwn-0x600508b1001cf654d572b375c56c4532-part1'
                    devid: 'scsi-3600508b1001cf654d572b375c56c4532-part1'
                    phys_path: 'pci-0000:05:00.0-scsi-0:1:0:4'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 130
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 14373262178903517398
                    path: '/dev/disk/by-id/wwn-0x600508b1001cd152a9b5ce7a105d408f-part1'
                    devid: 'scsi-3600508b1001cd152a9b5ce7a105d408f-part1'
                    phys_path: 'pci-0000:05:00.0-scsi-0:1:0:5'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 131
            children[1]:
                type: 'mirror'
                id: 1
                guid: 6843532063015181877
                metaslab_array: 135
                metaslab_shift: 33
                ashift: 12
                asize: 1000156430336
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 132
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 15637452578554080026
                    path: '/dev/disk/by-id/wwn-0x600508b1001c1b5698104782c7f381cd-part1'
                    devid: 'scsi-3600508b1001c1b5698104782c7f381cd-part1'
                    phys_path: 'pci-0000:05:00.0-scsi-0:1:0:6'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 133
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 995581400251175409
                    path: '/dev/disk/by-id/wwn-0x600508b1001c25c5205c24bf340de661-part1'
                    devid: 'scsi-3600508b1001c25c5205c24bf340de661-part1'
                    phys_path: 'pci-0000:05:00.0-scsi-0:1:0:7'
                    whole_disk: 1
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 134
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data

# zdb -b zpl2

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 1, metaslab 115 of 116 ...
 267G completed (3891MB/s) estimated time remaining: 0hr 00min 00sec
        No leaks (block sum matches space maps exactly)

        bp count:              35053166
        ganged count:                 0
        bp logical:        292479080960      avg:   8343
        bp physical:       287418612736      avg:   8199     compression:   1.02
        bp allocated:      288112029696      avg:   8219     compression:   1.02
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:      288112029696     used: 14.46%

        additional, non-pointer bps of type 0:        118
        Dittoed blocks on same vdev: 8616

# zdb -h zpl2

History:
2019-11-12.13:54:54 zpool create zpl2 mirror sde sdf mirror sdg sdh
2019-11-12.17:09:49 zfs create -V 209715200k zpl2/vm-100-disk-0
2019-11-14.10:10:52 zpool import -c /etc/zfs/zpool.cache -aN
2019-11-15.13:29:14 zpool import -c /etc/zfs/zpool.cache -aN
2019-11-15.14:37:57 zpool import -c /etc/zfs/zpool.cache -aN
2019-02-14.13:12:06 zpool import -c /etc/zfs/zpool.cache -aN
2019-11-15.15:28:33 zpool import -c /etc/zfs/zpool.cache -aN
2019-11-20.00:03:44 zfs snapshot zpl2/vm-100-disk-0@repos_deletion
2019-11-20.00:16:17 zfs rollback zpl2/vm-100-disk-0@repos_deletion
2019-11-21.14:00:25 zfs create -V 209715200k zpl2/vm-1410-disk-0
2019-11-22.12:43:12 zfs destroy -r zpl2/vm-100-disk-0
2019-11-26.13:59:13 zpool import -d /dev/disk/by-id/ -o cachefile=none zpl2
2019-11-26.15:29:08 zfs snapshot zpl2/vm-1410-disk-0@failed
2019-11-26.15:30:56 zpool set listsnapshots=on zpl2
2019-11-26.15:32:14 zfs clone zpl2/vm-1410-disk-0@failed zpl2/vm-1410-disk-0-clone
2019-11-26.19:33:11 zfs destroy zpl2/vm-1410-disk-0-clone
2019-11-27.12:24:23 zpool import -d /dev/disk/by-id/ -o cachefile=none zpl2
2019-11-29.14:44:12 zpool import -d /dev/disk/by-id/ -o cachefile=none zpl2
2019-12-02.10:14:15 zfs create -V 5242880k zpl2/vm-1414-disk-0
2019-12-02.10:23:46 zpool import -d /dev/disk/by-id/ -o cachefile=none zpl2
2019-12-02.10:26:55 zfs create -V 5242880k zpl2/vm-1414-disk-1
2019-12-02.10:30:24 zfs create -V 31457280k zpl2/vm-1413-disk-0
2019-12-02.10:36:06 zfs recv -ve zpl2
2019-12-02.10:37:50 zfs recv -ve zpl2
2019-12-02.10:38:27 zfs destroy -r zpl2/vm-1413-disk-0
2019-12-02.10:41:56 zfs recv -ve zpl2
2019-12-02.11:19:41 zfs create zpl2/docker
2019-12-02.11:19:52 zfs destroy zpl2/docker
2019-12-02.11:52:18 zfs create zpl2/docker -o mountpoint=/var/lib/docker
2019-12-02.11:53:28 zfs destroy zpl2/vm-1413-disk-0@migrate
2019-12-02.11:53:32 zfs destroy zpl2/vm-1412-disk-0@migrate
2019-12-02.11:53:38 zfs destroy zpl2/vm-1411-disk-0@migrate
2019-12-02.11:53:48 zfs destroy zpl2/vm-1410-disk-0@failed
2019-12-02.11:53:58 zfs snapshot zpl2/vm-1410-disk-0@norm
2019-12-05.12:40:02 zpool import -d /dev/disk/by-id/ -o cachefile=none zpl2
2019-12-05.13:15:39 zfs set atime=off zpl2
2019-12-05.13:21:14 zfs set compression=off zpl2

Fabian-Gruenbichler (Contributor) commented Dec 9, 2019 via email

behlendorf added the Type: Defect (incorrect behavior, e.g. crash, hang) label Dec 9, 2019
glztmf (Author) commented Dec 11, 2019

@Fabian-Gruenbichler, thanks, I will try ZoL 0.8.2 a little later.
For now I'm trying to find an exact way to reproduce the problem so that I can test other versions the same way.

glztmf (Author) commented Dec 13, 2019

The same hangs occur with PVE 6.1-3 and ZFS 0.8.2.

I created 2 VMs and ran the following fio tests simultaneously:

fio --loops=100 --size=500m --filename=/mnt/fiotest.tmp --stonewall --ioengine=libaio \
  --name=1 --bs=8k --rw=randrw \
  --name=2 --bs=512k --rw=randrw \
  --name=3 --bs=512k --rw=randrw \
  --name=4 --bs=512k --rw=randrw \
  --name=5 --bs=512k --iodepth=32 --rw=randrw \
  --name=6 --bs=512k --iodepth=32 --rw=randrw

After about an hour the VMs and ZFS hang with similar symptoms.

Processes with D state:

# ps -e -o pid,state,start,command | egrep ' D | D< | Dl |STAT'
  506 D 15:13:13 [zvol]
 7014 D 15:19:04 [txg_sync]
19278 D 16:26:38 [zvol]
19279 D 16:26:38 [zvol]
19280 D 16:26:38 [zvol]
19281 D 16:26:38 [zvol]
19282 D 16:26:38 [zvol]
19283 D 16:26:38 [zvol]
19284 D 16:26:38 [zvol]
19285 D 16:26:38 [zvol]
19286 D 16:26:38 [zvol]
19287 D 16:26:38 [zvol]
19288 D 16:26:38 [zvol]
19289 D 16:26:38 [zvol]
19291 D 16:26:38 [zvol]
19293 D 16:26:38 [zvol]
19294 D 16:26:38 [zvol]
19295 D 16:26:38 [zvol]
19296 D 16:26:38 [zvol]
19297 D 16:26:38 [zvol]
19298 D 16:26:38 [zvol]
19300 D 16:26:38 [zvol]
19301 D 16:26:38 [zvol]
19303 D 16:26:38 [zvol]
19304 D 16:26:38 [zvol]
19305 D 16:26:38 [zvol]
19322 D 16:26:38 [zvol]
19348 D 16:26:38 [zvol]
19349 D 16:26:38 [zvol]
19350 D 16:26:38 [zvol]
19351 D 16:26:38 [zvol]
19352 D 16:26:38 [zvol]
19353 D 16:26:38 [zvol]
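
The filtered list above always shows the [zvol] workers and a txg thread stuck in state D. A hypothetical watch loop (not part of the original report) based on the same idea could capture the moment the hang starts during the fio run:

# Poll once a minute; brief D states are normal, so this is only a rough trigger.
while ! ps -e -o state= | grep -q '^D'; do
    sleep 60
done
# Dump the stuck processes once something enters uninterruptible sleep.
ps -e -o pid,state,start,command | grep -E ' D | D< | Dl |STAT'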

Inside the VMs' terminals there are similar messages:

task sometask:xxx blocked for more than 120 seconds.

And fio I/O drops to zero, but fio itself doesn't hang:

fio-3.7
Starting 6 processes
Jobs: 1 (f=1): [m(1),P(5)][36.6%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 24m:15s]

I also found that the zfs module option isn't applied at startup:

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=10737418240

... because after startup the parameter still has its default value:

# cat /sys/module/zfs/parameters/zfs_arc_max
0

Am I missing something? Or should I create one more issue...
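
As a side note on the zfs_arc_max observation: a value of 0 in /sys/module/zfs/parameters/zfs_arc_max means the built-in default (roughly half of RAM) is in effect, not the modprobe.d value. A rough sketch of the usual handling, assuming a Debian/Proxmox system where the zfs module is loaded from the initramfs, so /etc/modprobe.d changes only take effect at boot after the initramfs is regenerated:

# Check the currently effective limit (0 = built-in default of about 1/2 of RAM).
cat /sys/module/zfs/parameters/zfs_arc_max
# Apply the 10 GiB limit immediately, without a reboot.
echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max
# Make the /etc/modprobe.d/zfs.conf option also apply at boot.
update-initramfs -u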

glztmf (Author) commented Dec 17, 2019

The same thing happens with virtual disks that use a 4k volblocksize.

  PID STAT  STARTED COMMAND
  510 D<   20:52:21 [zvol]
 3298 S+   12:57:01 grep -E  D | D< | Dl |STAT
19814 D    20:52:42 [txg_sync]
25706 D<   12:50:13 [zvol]
25707 D<   12:50:13 [zvol]
25709 D<   12:50:13 [zvol]
25710 D<   12:50:13 [zvol]
25711 D<   12:50:13 [zvol]
25712 D<   12:50:13 [zvol]
25713 D<   12:50:13 [zvol]
25714 D<   12:50:13 [zvol]
25715 D<   12:50:13 [zvol]
25716 D<   12:50:13 [zvol]
25717 D<   12:50:13 [zvol]
25718 D<   12:50:13 [zvol]
25719 D<   12:50:13 [zvol]
25720 D<   12:50:13 [zvol]
25721 D<   12:50:13 [zvol]
25722 D<   12:50:13 [zvol]
25723 D<   12:50:13 [zvol]
25724 D<   12:50:13 [zvol]
25725 D<   12:50:13 [zvol]
25726 D<   12:50:13 [zvol]
25727 D<   12:50:13 [zvol]
25728 D<   12:50:13 [zvol]
25729 D<   12:50:13 [zvol]
25730 D<   12:50:13 [zvol]
25731 D<   12:50:13 [zvol]
25732 D<   12:50:13 [zvol]
25733 D<   12:50:13 [zvol]
25734 D<   12:50:13 [zvol]
25735 D<   12:50:13 [zvol]
25736 D<   12:50:13 [zvol]
25737 D<   12:50:13 [zvol]

Interestingly, there are always 31 newly frozen [zvol] processes plus 1 old [zvol] process; I wonder what that means.
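
For what it's worth, 31 + 1 = 32 matches the default size of the zvol taskq in ZoL 0.8.x (the zvol_threads module parameter, default 32), so this looks like all of the zvol worker threads blocking at once rather than 31 arbitrary new processes. A quick check, assuming the 0.8.x tunable:

# Number of zvol taskq worker threads (defaults to 32 in 0.8.x).
cat /sys/module/zfs/parameters/zvol_threads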

devZer0 commented Apr 13, 2020

I think this one is related: #10095

glztmf closed this as completed Apr 15, 2020
devZer0 commented Apr 16, 2020

Just out of interest - why has this been closed without further comment or notice?

glztmf reopened this Apr 17, 2020
devZer0 commented Apr 17, 2020

Probably related: #9172

glztmf (Author) commented Apr 19, 2020

Probably a duplicate of #9172.

The errors are similar, but the ZFS versions differ a lot, and in #9172 only a zvol with compression enabled hangs, whereas I have compression off and my hangs affect management commands for any ZFS volume/dataset.
Also, we moved to PVE 5.4-3 with ZFS 0.7.13 and such errors have never occurred there.

glztmf (Author) commented Apr 19, 2020

Just out of interest - why has this been closed without further comment or notice?

It was an accident... from my smartphone :)

stale bot commented Apr 19, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale (no recent activity for issue) label Apr 19, 2021
stale bot closed this as completed Jul 18, 2021
devZer0 commented Jul 18, 2021

I still think it sucks that tickets get closed automatically because of inactivity.

It's like sweeping dirt under the carpet.
