ZFS Slow Write Performance / z_wr_iss stuck in native_queued_spin_lock_slowpath #4834

Closed · bdaroz opened this issue Jul 8, 2016 · 22 comments
Labels: Type: Performance (performance improvement or performance problem)

@bdaroz commented Jul 8, 2016

Running Ubuntu 15.10 with ZoL 0.6.5.7-1~wily from the zfs-native PPA. This seems to have started in one of the last few releases. It's not 100% reproducible, but when it does occur it's most noticeable when copying a large file from one volume to another location on the same volume. Normal transfer rates would be in the ~100MB/s range, but when this occurs they drop below 5MB/s.

The system is a Xeon E3-1241 with 16GB of ECC RAM. The problem occurs on any pool (including some test mirrors, raidz1 and raidz2); the info from the main pool, storage, is included here. All info was captured during the slow transfer speeds. All drives are 4TB 5400RPM hard drives, with no log or cache devices, and all are LUKS encrypted (which is why the normal high-water mark for transfers is in the ~100MB/s to ~130MB/s range).

perf top output during the slow transfer period:

Samples: 530K of event 'cycles', Event count (approx.): 189958246497
Overhead  Shared Object                   Symbol
  31.80%  [kernel]                        [k] native_queued_spin_lock_slowpath
   8.86%  [kernel]                        [k] isolate_lru_pages.isra.46
   5.77%  [kernel]                        [k] shrink_page_list
   5.49%  [kernel]                        [k] putback_inactive_pages
   4.44%  [kernel]                        [k] __page_check_address
   4.13%  [kernel]                        [k] __isolate_lru_page
   4.07%  [kernel]                        [k] unlock_page
   3.10%  [kernel]                        [k] __anon_vma_interval_tree_subtree_search
   2.69%  [kernel]                        [k] down_read_trylock
   2.28%  [kernel]                        [k] page_lock_anon_vma_read
   2.20%  [kernel]                        [k] __wake_up_bit
   2.13%  [kernel]                        [k] mm_find_pmd
   2.12%  [kernel]                        [k] _raw_spin_lock
   1.87%  [kernel]                        [k] mutex_spin_on_owner.isra.3
   1.69%  [kernel]                        [k] page_evictable
   1.68%  [kernel]                        [k] page_mapping
   1.37%  [kernel]                        [k] rmap_walk
   1.29%  [kernel]                        [k] page_referenced
   1.05%  [kernel]                        [k] up_read
   1.01%  [kernel]                        [k] page_referenced_one
   0.75%  [kernel]                        [k] __mod_zone_page_state
   0.70%  [kernel]                        [k] osq_lock
   0.66%  [kernel]                        [k] mem_cgroup_page_lruvec
   0.66%  [kernel]                        [k] anon_vma_interval_tree_iter_first
   0.59%  [kernel]                        [k] mem_cgroup_update_lru_size
   0.33%  [kernel]                        [k] shrink_inactive_list
   0.31%  [kernel]                        [k] _aesni_dec4

Output from top (some process names scrubbed):

top - 17:56:49 up 7 days, 19:26,  3 users,  load average: 1.77, 2.86, 2.73
Tasks: 1117 total,  10 running, 1107 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us, 38.4 sy,  0.0 ni, 60.5 id,  0.8 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16391864 total, 12492060 used,  3899804 free,      408 buffers
KiB Swap:  8388604 total,   397600 used,  7991004 free.   181956 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                
11322 root      20   0   19288   1652   1348 R  95.9  0.0  12:57.19 cp                                                                                                                                                                     
19802 [scrub]  30  10  341492  17628      0 R  27.2  0.1 600:52.53 [scrub]                                                                                                                                                       
 3299 root       1 -19       0      0      0 D  23.3  0.0 199:31.94 z_wr_iss                                                                                                                                                               
 3300 root       1 -19       0      0      0 R  23.3  0.0 199:43.56 z_wr_iss                                                                                                                                                               
 3304 root       1 -19       0      0      0 R  22.0  0.0 200:18.37 z_wr_iss                                                                                                                                                               
 3302 root       1 -19       0      0      0 R  21.7  0.0 200:08.62 z_wr_iss                                                                                                                                                               
 3301 root       1 -19       0      0      0 R  20.7  0.0 199:44.48 z_wr_iss                                                                                                                                                               
 3303 root       1 -19       0      0      0 D  18.4  0.0 200:16.53 z_wr_iss                                                                                                                                                               
 3648 root      20   0       0      0      0 D  17.1  0.0 137:38.64 txg_sync                                                                                                                                                               
 3333 root       0 -20       0      0      0 S   4.3  0.0   1:39.13 z_wr_int_1                                                                                                                                                             
 3341 root       0 -20       0      0      0 S   3.9  0.0   1:43.50 z_wr_int_2                                                                                                                                                             
 3315 root       0 -20       0      0      0 S   3.6  0.0   1:36.53 z_wr_int_0                                                                                                                                                             
 3334 root       0 -20       0      0      0 S   3.3  0.0   1:38.98 z_wr_int_2                                                                                                                                                             
 3336 root       0 -20       0      0      0 S   3.3  0.0   1:37.05 z_wr_int_2                                                                                                                                                             
 3346 root       0 -20       0      0      0 S   2.6  0.0   1:39.53 z_wr_int_3 

ARC Summary output:

ZFS Subsystem Report                Fri Jul 08 17:57:48 2016
ARC Summary: (HEALTHY)
    Memory Throttle Count:          0

ARC Misc:
    Deleted:                95.87m
    Mutex Misses:               1.44k
    Evict Skips:                1.44k

ARC Size:               100.01% 7.82    GiB
    Target Size: (Adaptive)     100.00% 7.82    GiB
    Min Size (Hard Limit):      0.40%   32.00   MiB
    Max Size (High Water):      250:1   7.82    GiB

ARC Size Breakdown:
    Recently Used Cache Size:   9.28%   742.66  MiB
    Frequently Used Cache Size: 90.72%  7.09    GiB

ARC Hash Breakdown:
    Elements Max:               866.86k
    Elements Current:       100.00% 866.86k
    Collisions:             25.69m
    Chain Max:              7
    Chains:                 135.59k

ARC Total accesses:                 849.17m
    Cache Hit Ratio:        85.43%  725.44m
    Cache Miss Ratio:       14.57%  123.73m
    Actual Hit Ratio:       78.98%  670.69m

    Data Demand Efficiency:     99.32%  565.66m
    Data Prefetch Efficiency:   38.40%  136.58m

    CACHE HITS BY CACHE LIST:
      Anonymously Used:     6.48%   46.99m
      Most Recently Used:       21.13%  153.26m
      Most Frequently Used:     71.33%  517.43m
      Most Recently Used Ghost: 0.67%   4.84m
      Most Frequently Used Ghost:   0.40%   2.92m

    CACHE HITS BY DATA TYPE:
      Demand Data:          77.44%  561.81m
      Prefetch Data:        7.23%   52.45m
      Demand Metadata:      15.01%  108.88m
      Prefetch Metadata:        0.32%   2.30m

    CACHE MISSES BY DATA TYPE:
      Demand Data:          3.11%   3.85m
      Prefetch Data:        68.00%  84.13m
      Demand Metadata:      28.22%  34.91m
      Prefetch Metadata:        0.68%   840.09k


File-Level Prefetch: (HEALTHY)
DMU Efficiency:                 1.16b
    Hit Ratio:          81.81%  945.40m
    Miss Ratio:         18.19%  210.23m

    Colinear:               210.23m
      Hit Ratio:            0.01%   15.91k
      Miss Ratio:           99.99%  210.21m

    Stride:                 866.23m
      Hit Ratio:            99.95%  865.79m
      Miss Ratio:           0.05%   434.10k

DMU Misc: 
    Reclaim:                210.21m
      Successes:            2.36%   4.97m
      Failures:         97.64%  205.24m

    Streams:                79.65m
      +Resets:          0.06%   44.26k
      -Resets:          99.94%  79.61m
      Bogus:                0


ZFS Tunable:
    metaslab_debug_load                               0
    zfs_arc_min_prefetch_lifespan                     0
    zfetch_max_streams                                8
    zfs_nopwrite_enabled                              1
    zfetch_min_sec_reap                               2
    zfs_dbgmsg_enable                                 0
    zfs_dirty_data_max_max_percent                    25
    zfs_arc_p_aggressive_disable                      1
    spa_load_verify_data                              1
    zfs_zevent_cols                                   80
    zfs_dirty_data_max_percent                        10
    zfs_sync_pass_dont_compress                       5
    l2arc_write_max                                   8388608
    zfs_vdev_scrub_max_active                         2
    zfs_vdev_sync_write_min_active                    10
    zvol_prefetch_bytes                               131072
    metaslab_aliquot                                  524288
    zfs_no_scrub_prefetch                             0
    zfs_arc_shrink_shift                              0
    zfetch_block_cap                                  256
    zfs_txg_history                                   0
    zfs_delay_scale                                   500000
    zfs_vdev_async_write_active_min_dirty_percent     30
    metaslab_debug_unload                             0
    zfs_read_history                                  0
    zvol_max_discard_blocks                           16384
    zfs_recover                                       0
    l2arc_headroom                                    2
    zfs_deadman_synctime_ms                           1000000
    zfs_scan_idle                                     50
    zfs_free_min_time_ms                              1000
    zfs_dirty_data_max                                1678526873
    zfs_vdev_async_read_min_active                    1
    zfs_mg_noalloc_threshold                          0
    zfs_dedup_prefetch                                0
    zfs_vdev_max_active                               1000
    l2arc_write_boost                                 8388608
    zfs_resilver_min_time_ms                          3000
    zfs_vdev_async_write_max_active                   10
    zil_slog_limit                                    1048576
    zfs_prefetch_disable                              0
    zfs_resilver_delay                                2
    metaslab_lba_weighting_enabled                    1
    zfs_mg_fragmentation_threshold                    85
    l2arc_feed_again                                  1
    zfs_zevent_console                                0
    zfs_immediate_write_sz                            32768
    zfs_dbgmsg_maxsize                                4194304
    zfs_free_leak_on_eio                              0
    zfs_deadman_enabled                               1
    metaslab_bias_enabled                             1
    zfs_arc_p_dampener_disable                        1
    zfs_object_mutex_size                             64
    zfs_metaslab_fragmentation_threshold              70
    zfs_no_scrub_io                                   0
    metaslabs_per_vdev                                200
    zfs_dbuf_state_index                              0
    zfs_vdev_sync_read_min_active                     10
    metaslab_fragmentation_factor_enabled             1
    zvol_inhibit_dev                                  0
    zfs_vdev_async_write_active_max_dirty_percent     60
    zfs_vdev_cache_size                               0
    zfs_vdev_mirror_switch_us                         10000
    zfs_dirty_data_sync                               67108864
    spa_config_path                                   /etc/zfs/zpool.cache
    zfs_dirty_data_max_max                            4196317184
    zfs_arc_lotsfree_percent                          10
    zfs_zevent_len_max                                128
    zfs_scan_min_time_ms                              1000
    zfs_arc_sys_free                                  0
    zfs_arc_meta_strategy                             1
    zfs_vdev_cache_bshift                             16
    zfs_arc_meta_adjust_restarts                      4096
    zfs_max_recordsize                                1048576
    zfs_vdev_scrub_min_active                         1
    zfs_vdev_read_gap_limit                           32768
    zfs_arc_meta_limit                                0
    zfs_vdev_sync_write_max_active                    10
    l2arc_norw                                        0
    zfs_arc_meta_prune                                10000
    metaslab_preload_enabled                          1
    l2arc_nocompress                                  0
    zvol_major                                        230
    zfs_vdev_aggregation_limit                        131072
    zfs_flags                                         0
    spa_asize_inflation                               24
    zfs_admin_snapshot                                0
    l2arc_feed_secs                                   1
    zio_taskq_batch_pct                               75
    zfs_sync_pass_deferred_free                       2
    zfs_disable_dup_eviction                          0
    zfs_arc_grow_retry                                0
    zfs_read_history_hits                             0
    zfs_vdev_async_write_min_active                   1
    zfs_vdev_async_read_max_active                    3
    zfs_scrub_delay                                   4
    zfs_delay_min_dirty_percent                       60
    zfs_free_max_blocks                               100000
    zfs_vdev_cache_max                                16384
    zio_delay_max                                     30000
    zfs_top_maxinflight                               32
    spa_slop_shift                                    5
    zfs_vdev_write_gap_limit                          4096
    spa_load_verify_metadata                          1
    spa_load_verify_maxinflight                       10000
    l2arc_noprefetch                                  1
    zfs_vdev_scheduler                                noop
    zfs_expire_snapshot                               300
    zfs_sync_pass_rewrite                             2
    zil_replay_disable                                0
    zfs_nocacheflush                                  0
    zfs_arc_max                                       0
    zfs_arc_min                                       0
    zfs_read_chunk_size                               1048576
    zfs_txg_timeout                                   5
    zfs_pd_bytes_max                                  52428800
    l2arc_headroom_boost                              200
    zfs_send_corrupt_data                             0
    l2arc_feed_min_ms                                 200
    zfs_arc_meta_min                                  0
    zfs_arc_average_blocksize                         8192
    zfetch_array_rd_sz                                1048576
    zfs_autoimport_disable                            1
    zfs_arc_p_min_shift                               0
    zio_requeue_io_start_cut_in_line                  1
    zfs_vdev_sync_read_max_active                     10
    zfs_mdcomp_disable                                0
    zfs_arc_num_sublists_per_state                    8

zpool iostat 5 output, as well as a sample of the verbose output; the slow copy in progress is from and to the storage pool.

# zpool iostat 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T    125     67  9.02M  1.39M
storage2    5.41T  5.46T      9     38   743K  1.30M
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     68     89  1.05M  1.01M
storage2    5.41T  5.46T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     50     69   784K   605K
storage2    5.41T  5.46T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     37     64   615K   558K
storage2    5.41T  5.46T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     57     46   913K   390K
storage2    5.41T  5.46T      6      0   482K      0
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     80     91  1.25M  1.00M
storage2    5.41T  5.46T      9      7   710K  34.5K
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     38     83   616K   901K
storage2    5.41T  5.46T      1     38  8.79K   185K
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     78     19  1.24M   306K
storage2    5.41T  5.46T      0      8      0  39.4K
----------  -----  -----  -----  -----  -----  -----
storage     27.2T  16.4T     37     97   669K  1.08M
storage2    5.41T  5.46T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

And verbose sample:

                            capacity     operations    bandwidth
pool                     alloc   free   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
storage                  27.2T  16.4T    108     88  1.69M  1.36M
  raidz2                 13.6T  8.15T     51     50   817K   801K
    scsi-4TB-XJ40_crypt      -      -     17     24   112K   202K
    scsi-4TB-MQEF_crypt      -      -     18     25   122K   202K
    scsi-4TB-8RAB_crypt      -      -     18     26   194K   202K
    scsi-4TB-MTA7_crypt      -      -     20     27   146K   202K
    scsi-4TB-37AM_crypt      -      -     10     27   117K   201K
    scsi-4TB-3EDW_crypt      -      -      5     25  83.6K   200K
  raidz2                 13.6T  8.23T     57     37   915K   587K
    scsi-4TB-MXV5_crypt      -      -     22     20   113K   154K
    scsi-4TB-XPPY_crypt      -      -     29     13   135K   149K
    scsi-4TB-1FZ4_crypt      -      -     23     18   167K   149K
    scsi-4TB-1G0N_crypt      -      -     29     19   154K   155K
    scsi-4TB-CNKU_crypt      -      -     29     17   127K   149K
    scsi-4TB-1G6M_crypt      -      -     15     20  66.1K   155K
-----------------------  -----  -----  -----  -----  -----  -----
storage2                 5.41T  5.46T      0      0      0      0
  raidz2                 5.41T  5.46T      0      0      0      0
    scsi-2TB-4536_crypt      -      -      0      0      0      0
    scsi-2TB-9126_crypt      -      -      0      0      0      0
    scsi-2TB-3929_crypt      -      -      0      0      0      0
    scsi-2TB-CRLZ_crypt      -      -      0      0      0      0
    scsi-2TB-8061_crypt      -      -      0      0      0      0
    scsi-2TB-NVR2_crypt      -      -      0      0      0      0
-----------------------  -----  -----  -----  -----  -----  -----

Sample iostat from a drive in the array - typical for all the drives:

# iostat -dx 30 sdy
Linux 4.2.0-41-generic (tyche)  07/08/2016  _x86_64_    (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdy               0.54     0.00   10.92    7.27   766.20   189.66   105.12     0.16    8.53   12.54    2.49   5.11   9.29

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdy               0.30     0.00    4.60   22.30    46.00   122.13    12.50     0.38   14.11   79.36    0.65  11.77  31.67

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdy               0.43     0.00    4.97   21.60    53.33   153.33    15.56     0.33   12.42   64.16    0.52   8.55  22.71

Properties of the volume:

# zfs get all storage/[scrubbed]
NAME                PROPERTY              VALUE                  SOURCE
storage/[scrubbed]  type                  filesystem             -
storage/[scrubbed]  creation              Mon Jun 20  4:08 2016  -
storage/[scrubbed]  used                  118G                   -
storage/[scrubbed]  available             10.0T                  -
storage/[scrubbed]  referenced            118G                   -
storage/[scrubbed]  compressratio         1.00x                  -
storage/[scrubbed]  mounted               yes                    -
storage/[scrubbed]  quota                 none                   default
storage/[scrubbed]  reservation           none                   default
storage/[scrubbed]  recordsize            16K                    local
storage/[scrubbed]  mountpoint            /mnt/zfs/[scrubbed]    inherited from storage
storage/[scrubbed]  sharenfs              off                    default
storage/[scrubbed]  checksum              on                     default
storage/[scrubbed]  compression           lz4                    local
storage/[scrubbed]  atime                 off                    inherited from storage
storage/[scrubbed]  devices               on                     default
storage/[scrubbed]  exec                  on                     default
storage/[scrubbed]  setuid                on                     default
storage/[scrubbed]  readonly              off                    default
storage/[scrubbed]  zoned                 off                    default
storage/[scrubbed]  snapdir               hidden                 default
storage/[scrubbed]  aclinherit            restricted             default
storage/[scrubbed]  canmount              on                     default
storage/[scrubbed]  xattr                 sa                     inherited from storage
storage/[scrubbed]  copies                1                      default
storage/[scrubbed]  version               5                      -
storage/[scrubbed]  utf8only              off                    -
storage/[scrubbed]  normalization         none                   -
storage/[scrubbed]  casesensitivity       sensitive              -
storage/[scrubbed]  vscan                 off                    default
storage/[scrubbed]  nbmand                off                    default
storage/[scrubbed]  sharesmb              off                    default
storage/[scrubbed]  refquota              none                   default
storage/[scrubbed]  refreservation        none                   default
storage/[scrubbed]  primarycache          all                    default
storage/[scrubbed]  secondarycache        all                    default
storage/[scrubbed]  usedbysnapshots       0                      -
storage/[scrubbed]  usedbydataset         118G                   -
storage/[scrubbed]  usedbychildren        0                      -
storage/[scrubbed]  usedbyrefreservation  0                      -
storage/[scrubbed]  logbias               latency                default
storage/[scrubbed]  dedup                 off                    default
storage/[scrubbed]  mlslabel              none                   default
storage/[scrubbed]  sync                  standard               default
storage/[scrubbed]  refcompressratio      1.00x                  -
storage/[scrubbed]  written               118G                   -
storage/[scrubbed]  logicalused           116G                   -
storage/[scrubbed]  logicalreferenced     116G                   -
storage/[scrubbed]  filesystem_limit      none                   default
storage/[scrubbed]  snapshot_limit        none                   default
storage/[scrubbed]  filesystem_count      none                   default
storage/[scrubbed]  snapshot_count        none                   default
storage/[scrubbed]  snapdev               hidden                 default
storage/[scrubbed]  acltype               posixacl               inherited from storage
storage/[scrubbed]  context               none                   default
storage/[scrubbed]  fscontext             none                   default
storage/[scrubbed]  defcontext            none                   default
storage/[scrubbed]  rootcontext           none                   default
storage/[scrubbed]  relatime              off                    default
storage/[scrubbed]  redundant_metadata    all                    default
storage/[scrubbed]  overlay               off                    default

And finally zpool list output:

# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  43.6T  27.2T  16.4T     1.44G    14%    62%  1.00x  ONLINE  -
  raidz2  21.8T  13.6T  8.15T         -    11%    62%
    scsi-4TB-XJ40_crypt      -      -      -         -      -      -
    scsi-4TB-MQEF_crypt      -      -      -         -      -      -
    scsi-4TB-8RAB_crypt      -      -      -         -      -      -
    scsi-4TB-MTA7_crypt      -      -      -         -      -      -
    scsi-4TB-37AM_crypt      -      -      -         -      -      -
    scsi-4TB-3EDW_crypt      -      -      -         -      -      -
  raidz2  21.8T  13.6T  8.22T     1.44G    18%    62%
    scsi-4TB-MXV5_crypt      -      -      -         -      -      -
    scsi-4TB-XPPY_crypt      -      -      -         -      -      -
    scsi-4TB-1FZ4_crypt      -      -      -         -      -      -
    scsi-4TB-1G0N_crypt      -      -      -         -      -      -
    scsi-4TB-CNKU_crypt      -      -      -         -      -      -
    scsi-4TB-1G6M_crypt      -      -      -         -      -      -

I started a copy of a small number of files (about 4-5GB each) totaling about 55GB; as I finish writing this issue roughly 49 minutes later, only about 6.5GB has been copied.

I've been watching #4512 but the conversation there seems to be trending toward issues in older kernels, not the 4.2.0 kernel running here.

Any help, or guidance, would be appreciated.

@bdaroz (Author) commented Jul 11, 2016

In my searching I also found this discussion from a person with what appears to be the same problem, but no solution:

https://forum.proxmox.com/threads/server-stalls-with-zfs.26721/

@behlendorf added the Type: Performance label on Jul 12, 2016
@dweeezil (Contributor) commented:

@bdaroz Any chance you're burning all your CPU doing the encryption? Maybe run perf top and see what's going on.

@bdaroz (Author) commented Jul 12, 2016

@dweeezil Thanks for looking at this... There's a perf top output at the top of my original comment. Short answer: no, encryption isn't burning all the CPU when the problem happens.

Long answer: When the problem is not happening and I/O maxes out around 150MB/s, the CPU does load up above 90% across all cores. When the problem does happen, I/O throughput drops to around 5MB/s, the CPU has quite a bit of idle time (see the top output above, showing about 60% idle), and perf top shows a substantial number of calls to native_queued_spin_lock_slowpath. The only encryption call I recognize in the perf top output is at the bottom of the snippet, with some 0.3% of samples, as opposed to >30% for native_queued_spin_lock_slowpath.

@dweeezil (Contributor) commented:

@bdaroz Oops, missed your perf output. This looks like a memory management issue, based on the functions shown there. Without knowing anything else about your test setup, such as where the source of the files you're copying lives, it's really hard to guess what's going on. Your next step ought to be to run perf top -g and expand the stacks of some of the top CPU users to see what's actually causing all the spinning.
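
For anyone following along, capturing those call stacks is standard perf usage rather than anything ZoL-specific; a minimal sketch:

# sample all CPUs with call graphs for 30 seconds, then browse the report
perf record -a -g -- sleep 30
perf report -g
# or watch live and expand entries interactively
perf top -g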

My best guess at the moment is that this is caused by too much dirty data for the capability of the vdevs (and that's just a total out-of-left-field guess).
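
If that dirty-data guess turns out to be right, one low-risk experiment (an assumption, not a fix confirmed in this thread) would be to shrink the dirty-data ceiling at runtime and re-run the copy:

# current ceiling in bytes; the tunables dump above shows 1678526873 (~1.6GB)
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# try roughly half of that, e.g. 800MiB
echo 838860800 > /sys/module/zfs/parameters/zfs_dirty_data_max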

@bdaroz (Author) commented Jul 12, 2016

@dweeezil Thanks. I did note in the opening paragraph that the files (ranging from 3-10GB each) are being copied from a volume on the storage pool to another directory on the same volume. It also occurs when copying files between volumes on the same pool.

I'll look at perf top -g next time I catch it in the act and post back. Thanks again.

@bdaroz (Author) commented Jul 14, 2016

Here's a somewhat expanded view of perf top -g, though to be honest I'm not sure quite what I'm looking for here.

Samples: 2M of event 'cycles', Event count (approx.): 661970552859
  Children      Self  Shared Object                   Symbol                                                                                                                                                                                                                                                                                                                                             
-   29.36%    29.36%  [kernel]                        [k] native_queued_spin_lock_slowpath
   - native_queued_spin_lock_slowpath
      - 71.14% _raw_spin_lock_irq
         - shrink_inactive_list
           shrink_lruvec
           shrink_zone
           do_try_to_free_pages
           try_to_free_pages
           __alloc_pages_nodemask
           alloc_pages_current
           new_slab
           __slab_alloc
           kmem_cache_alloc
         - spl_kmem_cache_alloc
            - 72.51% zio_create
               - 75.22% zio_vdev_child_io
                  - 78.92% vdev_raidz_io_start
                       zio_vdev_io_start
                       zio_nowait
                       vdev_mirror_io_start
                     - zio_vdev_io_start
                        + 63.93% zio_execute
                        + 23.88% zio_ready
                        + 12.19% zio_nowait
                  - 21.08% vdev_mirror_io_start
                     - zio_vdev_io_start
                        + 67.79% zio_execute
                        + 20.71% zio_ready
                        + 11.51% zio_nowait
               + 13.08% zio_vdev_delegated_io
               + 6.05% zio_write
               + 2.63% zio_read
                 1.02% zio_free_sync
                 0.98% zio_null
               + 0.71% zio_write_phys
            + 16.97% lz4_compress_zfs
            + 7.09% zio_data_buf_alloc
              3.43% zio_buf_alloc
      - 28.61% _raw_spin_lock
         - 99.09% kvm_mmu_notifier_clear_flush_young
              __mmu_notifier_clear_flush_young
           0.58% __page_check_address
              page_referenced_one
+   23.17%     0.39%  [kernel]                        [k] __mmu_notifier_clear_flush_young

Also, I should be able to upgrade this server from 16GB to 32GB of ECC RAM sometime in the next 24-48 hours, if we are looking at an ARC allocation issue.

@bdaroz (Author) commented Jul 14, 2016

I also bumped up (set, actually) the zfs_arc_max value from 0 to about 12GB; with 16GB of RAM it would have defaulted to about 8GB. (Don't mind the diagonal line from 0 to 12GB; that value isn't polled that frequently.) The value was changed at about 12:15am on the graph. If ZFS were starved for ARC, one would expect the in-use amount to jump up considerably after the setting change. The server had at least 1.5GB of free RAM plus some buffers available the entire time, and I also shut down some memory-intensive processes (java) to push free RAM over the 3GB mark. ZFS never went on a RAM grab, and performance was intermittently abysmal throughout this period.

[chart: ARC size over time; zfs_arc_max raised from the default to ~12GB around 12:15am]
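
For context, a runtime change like that is typically made through the module parameters; a sketch with 12GiB expressed in bytes (the exact value here is illustrative):

# takes effect immediately on a running system
echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max
# persist it across reboots
echo "options zfs zfs_arc_max=12884901888" >> /etc/modprobe.d/zfs.conf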

@bdaroz (Author) commented Jul 14, 2016

One other interesting graph... The pool was mostly quiet until 12:01PM, when a 15GB file was moved between two volumes on the storage pool. It completed at 12:56PM. During that time the ARC shed, at one point, 7GB of RAM, going from about 10GB used to a low of ~2.5GB before slowly climbing back up.

[chart: ARC size dropping from ~10GB to ~2.5GB during the 12:01PM-12:56PM file move]
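
A graph like this can also be reproduced without a monitoring stack by polling the ARC size directly; a minimal loop, assuming the arcstats layout shown later in this thread:

# print a timestamped ARC size (bytes) once per second
while sleep 1; do
    echo "$(date +%T) $(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)"
done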

@dweeezil (Contributor) commented:

@bdaroz Since this does seem to be a memory management issue, please post the arcstats (/proc/spl/kstat/zfs/arcstats) while this is happening. It sounds like there must be a bunch of other memory-hungry userland processes (you mentioned java), so there's likely a good bit of fighting for memory going on. You mention setting zfs_arc_max, but in this case it might be a lot more useful to set zfs_arc_min to something a lot higher than its very low default of 32MiB.
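
Concretely, that suggestion amounts to something like the following (the 4GiB floor is an illustrative value, not a recommendation from this thread):

# capture the ARC statistics while the stall is happening
cat /proc/spl/kstat/zfs/arcstats
# raise the ARC floor well above the 32MiB default, e.g. to 4GiB
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min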

@bdaroz (Author) commented Jul 26, 2016

Good or bad, the system was upgraded to 32GB of RAM and the problem has largely subsided. There are still periods of slower throughput, but we're looking at a loss of at most 25%-30%, not 95%+. Given how long this machine was "fine" on 0.6.4 with no substantive changes other than the ZFS upgrade, something in the code does seem to have changed to create this. Unfortunately, with it no longer happening at that severity and duration, I don't know how much use I can be for diagnostics.

Perhaps some of the work being done in #4880 and #4512 will help.

@bdaroz (Author) commented Jul 27, 2016

No sooner do I post that than, 30 minutes later, performance goes to complete crap again... Here's /proc/spl/kstat/zfs/arcstats:

root@tyche:~# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 91 4368 143357928572 966772984640833
name                            type data
hits                            4    356749883
misses                          4    73637920
demand_data_hits                4    287163427
demand_data_misses              4    1989306
demand_metadata_hits            4    40188127
demand_metadata_misses          4    49147364
prefetch_data_hits              4    26139437
prefetch_data_misses            4    22104810
prefetch_metadata_hits          4    3258892
prefetch_metadata_misses        4    396440
mru_hits                        4    46166118
mru_ghost_hits                  4    3544800
mfu_hits                        4    281185701
mfu_ghost_hits                  4    1921456
deleted                         4    27321519
mutex_miss                      4    549
evict_skip                      4    21009
evict_not_enough                4    967
evict_l2_cached                 4    0
evict_l2_eligible               4    1459570099200
evict_l2_ineligible             4    180475055616
evict_l2_skip                   4    0
hash_elements                   4    1067889
hash_elements_max               4    1308887
hash_collisions                 4    13987373
hash_chains                     4    115221
hash_chain_max                  4    6
p                               4    15723703936
c                               4    16848345088
c_min                           4    33554432
c_max                           4    16848345088
size                            4    16850167784
hdr_size                        4    415743880
data_size                       4    14735588864
metadata_size                   4    885912576
other_size                      4    812922464
anon_size                       4    10921984
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    12872051200
mru_evictable_data              4    12125305344
mru_evictable_metadata          4    98547200
mru_ghost_size                  4    2568221696
mru_ghost_evictable_data        4    505816576
mru_ghost_evictable_metadata    4    2062405120
mfu_size                        4    2738528256
mfu_evictable_data              4    2606089216
mfu_evictable_metadata          4    17219584
mfu_ghost_size                  4    14278662144
mfu_ghost_evictable_data        4    12973140480
mfu_ghost_evictable_metadata    4    1305521664
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    339
memory_indirect_count           4    6944
arc_no_grow                     4    0
arc_tempreserve                 4    294912
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    2114578920
arc_meta_limit                  4    12636258816
arc_meta_max                    4    2467053232
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    526508032

@feisuzhu commented Jul 8, 2017

Running into the same issue here; any more updates?

@jakubfijolek commented:

I'm also having exactly the same issue. @bdaroz Did you find a solution?

@ItsEcholot commented:

This is driving me crazy... It gets so bad for me that any large file transfer over SMB stalls and runs into a timeout. Anything bigger than 1-2GB can't really be transferred because of that.

@bdaroz (Author) commented Sep 5, 2018

@jakubfijolek Sorry, no. Since I upgraded the RAM significantly, and later moved to newer ZoL versions, I haven't had this reoccur.

@ItsEcholot commented:

Using atop, I have noticed that one hard disk in my pool seems to have a higher average I/O time than the others, and thus a much higher busy percentage than the other disks.

Is it possible that a single slower drive is dragging the whole pool down to such low performance?

@gmelikov (Member) commented Sep 6, 2018

@ItsEcholot Yes; in RAIDZ, for example, you'll get the IOPS of the slowest device.

@ItsEcholot commented:

@gmelikov Alright, thanks. I will look into buying a replacement drive and hope that improves my situation.

@gmelikov (Member) commented Sep 6, 2018

@ItsEcholot You can try to offline it temporarily to test that theory.
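
For reference, that test is a pair of zpool commands; the pool and device names below are placeholders for the actual ones:

# take the suspect disk offline, then re-run the transfer test
zpool offline <pool> <suspect-disk>
# bring it back afterwards; the pool resilvers the writes it missed
zpool online <pool> <suspect-disk>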

@ItsEcholot commented:

@gmelikov Just did; it really seems to help. Write speeds are still very slow (10MB/s), but it's worlds better than before, especially because transfers don't seem to freeze anymore.

Gonna order a new HDD to replace it. Also, maybe my RAID card is hurting performance. Would getting a real HBA / JBOD card improve performance noticeably, or is it not worth it? Also, would I have to transfer the files off the pool and then back again when replacing the RAID card, or could I do it without losing data in the pool?

@gmelikov (Member) commented Sep 6, 2018

@ItsEcholot The RAID card might be a cause too.

> Also would I have to transfer the files from the pool and then back again when replacing the RAID card or could I do it without losing data in the pool?

It depends on your RAID card: if it does passthrough, then it's OK; if not, send/recv. If you have more questions, feel free to use our mailing lists: https://github.com/zfsonlinux/zfs/wiki/Mailing-Lists
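
A minimal sketch of the send/recv route (pool names are placeholders; -R preserves child datasets, snapshots, and properties):

# snapshot everything, then replicate the whole pool to the new one
zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | zfs recv -F newpool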

@ItsEcholot commented:

@gmelikov Alright, thanks a lot for your answers; this has helped me a lot.
