
Another deadlock with zfs 0.6.4-1 on CentOS 7 #3350

Closed

deajan opened this issue Apr 27, 2015 · 3 comments

Comments

deajan commented Apr 27, 2015

Hello,

I have a production backup server that just hit a deadlock with the latest zfs release.
Here's the info I can provide:

I have a cron task that writes memory and ARC stats to a file every 3 minutes.
Here are the results, followed of course by the kernel crash.
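
For reference, a minimal sketch of that cron task (the script name and log path are my own placeholders, not the exact ones in use):

$ crontab -l
*/3 * * * * /usr/local/bin/zfs-stats.sh

$ cat /usr/local/bin/zfs-stats.sh
#!/bin/bash
# Append a timestamped snapshot of memory and ARC stats to a log file.
LOG=/var/log/zfs-stats.log
{
    date
    free -h
    cat /proc/spl/kstat/zfs/arcstats
    cat /proc/meminfo
    vmstat -s
    iostat
} >> "$LOG"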

Kernel dump: https://gist.github.com/deajan/fe7c9b6be3c267d21d8c
Stats from a few minutes before the crash (Mon. April 27 11:03:01 CEST 2015):

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7,6G        1,9G        248M         11M        5,5G        5,2G
Swap:          7,9G        1,2G        6,7G

$ cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 1506919283 215396624384947
name                            type data
hits                            4    10636266
misses                          4    2251647
demand_data_hits                4    5255589
demand_data_misses              4    84153
demand_metadata_hits            4    4566727
demand_metadata_misses          4    840784
prefetch_data_hits              4    209142
prefetch_data_misses            4    1283268
prefetch_metadata_hits          4    604808
prefetch_metadata_misses        4    43442
mru_hits                        4    3849050
mru_ghost_hits                  4    111664
mfu_hits                        4    5973266
mfu_ghost_hits                  4    90140
deleted                         4    2474179
recycle_miss                    4    90336
mutex_miss                      4    91
evict_skip                      4    16178666
evict_l2_cached                 4    0
evict_l2_eligible               4    301244682752
evict_l2_ineligible             4    7440224256
hash_elements                   4    12224
hash_elements_max               4    131784
hash_collisions                 4    301592
hash_chains                     4    50
hash_chain_max                  4    5
p                               4    522418608
c                               4    552516192
c_min                           4    4194304
c_max                           4    4085360640
size                            4    552456008
hdr_size                        4    4675368
data_size                       4    454426624
meta_size                       4    59717120
other_size                      4    33636896
anon_size                       4    77742080
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    415407616
mru_evict_data                  4    362676224
mru_evict_metadata              4    33717760
mru_ghost_size                  4    96797696
mru_ghost_evict_data            4    82837504
mru_ghost_evict_metadata        4    13960192
mfu_size                        4    20994048
mfu_evict_data                  4    15204352
mfu_evict_metadata              4    16384
mfu_ghost_size                  4    343501824
mfu_ghost_evict_data            4    321912832
mfu_ghost_evict_metadata        4    21588992
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    7643
memory_indirect_count           4    306329
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    98029384
arc_meta_limit                  4    3064020480
arc_meta_max                    4    1401057672

$ cat /proc/meminfo
MemTotal:        7979220 kB
MemFree:          254616 kB
MemAvailable:    5484280 kB
Buffers:          173432 kB
Cached:          5132560 kB
SwapCached:       116264 kB
Active:          1465892 kB
Inactive:        5045156 kB
Active(anon):     474756 kB
Inactive(anon):   742112 kB
Active(file):     991136 kB
Inactive(file):  4303044 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8273916 kB
SwapFree:        7052236 kB
Dirty:            123664 kB
Writeback:         19512 kB
AnonPages:       1104120 kB
Mapped:            33076 kB
Shmem:             11716 kB
Slab:             455640 kB
SReclaimable:     188900 kB
SUnreclaim:       266740 kB
KernelStack:        7056 kB
PageTables:        24612 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12263524 kB
Committed_AS:    4094208 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      772320 kB
VmallocChunk:   34357153600 kB
HardwareCorrupted:     0 kB
AnonHugePages:     86016 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      113076 kB
DirectMap2M:     8243200 kB

$ vmstat -s
      7979220 K total memory
      1963020 K used memory
      1465720 K active memory
      5045156 K inactive memory
       254568 K free memory
       173432 K buffer memory
      5588200 K swap cache
      8273916 K total swap
      1221680 K used swap
      7052236 K free swap
      1088633 non-nice user cpu ticks
          224 nice user cpu ticks
      6014115 system cpu ticks
     75510165 idle cpu ticks
      2774503 IO-wait cpu ticks
           61 IRQ cpu ticks
        28052 softirq cpu ticks
            0 stolen cpu ticks
    475326274 pages paged in
    476810360 pages paged out
       344960 pages swapped in
       636601 pages swapped out
    192548532 interrupts
    315798094 CPU context switches
   1429909993 boot time
      1583936 forks

$ iostat
Linux 3.10.0-229.1.2.el7.x86_64 (backupmaster.siege.local)      27/04/2015      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,27    0,00    7,07    3,25    0,00   88,40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc               1,71        17,07        15,31    3677537    3296769
sda              17,45       375,39       756,95   80854880  163036844
sdb              17,51       379,63       756,95   81768396  163036844
sdd               1,23         0,57        15,31     123160    3296769
sde              28,39      1434,16       653,83  308901121  140827692
dm-0              5,25        16,50        15,31    3554405    3296769
dm-1              0,02         0,13         0,01      28558       2068
dm-2              5,23        16,37        15,30    3525422    3294701
dm-3              4,56         6,41        11,82    1381004    2546404
dm-4              0,50         8,06         2,48    1736931     533463
dm-5              0,17         1,89         1,00     407071     214834

When the crash happened, there was a backup running over SMB and I was rolling back a snapshot.

Anything else I can provide?

dweeezil (Contributor) commented:

@deajan All the processes you show are blocked on a superblock's znodes lock. See #3308 for a bit of rationale as to why they shouldn't even be doing this. You've unfortunately got a kernel version on which the referenced callback is actually enabled. I have a feeling the triggering event is the rollback you mentioned since that can cause the lock to be taken in other places, none of which appear to be shown in your list of blocked processes.
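
A sketch of how blocked-task stack traces like those in the linked gist can be captured (run as root; `<pid>` is a placeholder):

echo w > /proc/sysrq-trigger   # dump all blocked (D-state) task stacks to the kernel log
dmesg | tail -n 200            # read the traces back
cat /proc/<pid>/stack          # or inspect a single hung process by PID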


deajan commented Apr 27, 2015

@dweeezil Thanks for the background :)
I'll be happy to upgrade to zfs-testing whenever your PR is merged, and I'll report back if I can reproduce the error.
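
For reference, that upgrade on CentOS 7 would look roughly like this (assuming the standard zfs.repo file shipped with the ZoL repository package):

$ yum-config-manager --enable zfs-testing
$ yum update zfs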

Note to my future self:
To reproduce the error:

zfs snapshot storage/alfresco@whatever
zfs clone storage/alfresco@whatever storage/wip
zfs promote storage/wip
zfs destroy -r storage/alfresco@whatever
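
A rough sketch for exercising that sequence in a loop (same dataset names as above; the extra promote swaps the clone/origin relationship back so each iteration can be cleaned up):

for i in $(seq 1 20); do
    zfs snapshot storage/alfresco@repro$i
    zfs clone storage/alfresco@repro$i storage/wip
    zfs promote storage/wip
    zfs promote storage/alfresco    # swap the origin back so cleanup succeeds
    zfs destroy storage/wip
    zfs destroy storage/alfresco@repro$i
done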


deajan commented Sep 17, 2016

In order to close this old bug report, I've retried these commands today on the same system with zfs 0.6.5.7.
No more deadlocks.
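
A quick way to confirm which version is actually installed and loaded:

rpm -q zfs spl                 # installed package versions
cat /sys/module/zfs/version    # version of the loaded kernel module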

deajan closed this as completed Sep 17, 2016