ARC is not used for zvol #7897

Closed

aaronjwood opened this issue Sep 13, 2018 · 18 comments
Labels
Type: Performance Performance improvement or performance problem

Comments

@aaronjwood

aaronjwood commented Sep 13, 2018

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  18.04
Linux Kernel          4.15.0-34-generic
Architecture          x64
ZFS Version           0.7.5-1ubuntu16.3
SPL Version           0.7.5-1ubuntu1

Describe the problem you're observing

I'm exposing a zvol via iSCSI to a Windows machine. The zvol is in a second pool of mine and is almost 1 TB in size. I'm seeing that any reads or writes to the zvol continually decrease the ARC size while increasing the buffered and inactive memory amounts.

Here's an overview of my memory usage for the past hour (I started using the zvol at the point where the ARC size started to decrease):
[screenshot: memory usage graph, 2018-09-12 11:54 PM]

And some of my ZFS stats for the same period (you can see that my ARC hit ratio has dropped to 80%; it normally never falls below 90%):
[screenshot: ZFS/ARC stats graph, 2018-09-12 11:54 PM]

Note that there are no ARC hits or misses while the Linux buffered/inactive counts are growing.

Describe how to reproduce the problem

1. Create a zvol in a pool.
2. Expose the zvol to a Windows machine via iSCSI.
3. Format the block device with NTFS in Windows.
4. Start using the block device in Windows and watch the ARC size go down while the buffered and inactive amounts go up (rough commands below).
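
Roughly, as commands (the pool/zvol names, the size, and the IQN below are placeholders for illustration; tgt is the stock Ubuntu package):

# create the ~1 TB zvol (size is approximate)
zfs create -V 950G ssd/iscsi
# export it with tgt
tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2018-09.local:zvol
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/zvol/ssd/iscsi
tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
# then connect from the Windows iSCSI initiator, format with NTFS, and generate I/O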

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Sep 13, 2018
@behlendorf
Contributor

@aaronjwood what you're observing is the result of the generic kernel block layer double-caching what's already in the ARC. This additional cache is the space you see consumed as buffered and inactive. As that cache grows, the ARC will automatically reduce its own size in order to leave enough free space available on the system. This additional cache is also likely responsible for your reduced hit rate: not only does it reduce the size of the ARC, but since it handles cache hits without notifying ZFS, the ARC may get a distorted view of which blocks are most important to keep cached.

Unfortunately, the kernel doesn't provide a way I'm aware of to disable or limit this additional layer of caching. I've tagged this as a performance issue we do want to investigate. Thanks for filing such a clear issue so we can track this.
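
A quick way to watch the two caches side by side while this is happening (standard ZoL and Linux locations):

# ARC size in bytes as ZFS reports it
awk '$1 == "size" {print "ARC:", $3}' /proc/spl/kstat/zfs/arcstats
# the generic page cache the block layer is filling
grep -E '^(Buffers|Cached|Inactive):' /proc/meminfo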

@GregorKopka
Contributor

I don't see this effect on a server with kernel 4.4.6, zfs 0.6.5 and LIO iSCSI exporting zvols to ~20 diskless Windows clients (~1TB logicalused in total) using backstore/block (emulate_write_cache=0 in LIO and sync=disabled on the zvols).
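
For reference, the two settings mentioned above look roughly like this (the dataset and backstore names are made up):

zfs set sync=disabled tank/zvols/client01
targetcli /backstores/block/client01 set attribute emulate_write_cache=0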

The same machine is also serving TB-sized databases accessed by these clients (postgres and, sadly, also legacy flat files on SMB shares) and the usual SMB shares in a Windows environment (user profiles with folder redirection for %appdata% and such, business data dumps, ...).

top:
top - 10:53:17 up 346 days, 16:34,  2 users,  load average: 0,81, 0,42, 0,29
Tasks: 1973 total,   1 running, 1972 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,1 us,  2,1 sy,  0,0 ni, 96,2 id,  1,4 wa,  0,0 hi,  0,1 si,  0,0 st
KiB Mem : 65768864 total,  5867324 free, 11384472 used, 48517068 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  5270844 avail Mem

atop:
MEM | tot    62.7G | free    5.6G | cache 440.9M | dirty   0.0M | buff    0.0M | slab   45.8G |
SWP | tot     2.0G | free    2.0G |              |              | vmcom   1.9G | vmlim  33.4G |

cat /proc/meminfo
MemTotal:       65768864 kB
MemFree:         5998768 kB
MemAvailable:    5402300 kB
Buffers:               4 kB
Cached:           451532 kB
SwapCached:            0 kB
Active:           929540 kB
Inactive:         199456 kB
Active(anon):     906668 kB
Inactive(anon):   193876 kB
Active(file):      22872 kB
Inactive(file):     5580 kB
Unevictable:        7900 kB
Mlocked:            9384 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:               156 kB
Writeback:             0 kB
AnonPages:        685360 kB
Mapped:           377248 kB
Shmem:            416744 kB
Slab:           47921108 kB
SReclaimable:      89292 kB
SUnreclaim:     47831816 kB
KernelStack:       32208 kB
PageTables:        33252 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    34981580 kB
Committed_AS:    2101820 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
AnonHugePages:         0 kB
DirectMap4k:        8460 kB
DirectMap2M:     2955264 kB
DirectMap1G:    63963136 kB


arcstats 60
    time     c  arcsz  read  miss  hit%  dread  dmis  dh%  mread  mmis  mh%  pread  pmis  ph%  l2read  l2hit%   mfu  mfug   mru  mrug  mtxmis  eskip
10:43:33   48G    48G     0     0   100      0     0  100     0     0    0      0     0    0       0       0     0     0     0     0       0      0
10:44:33   48G    48G   327     7    97    313     6   98   100     2   97     14     1   89       7       0   280     0    26     1       0      0
10:45:33   48G    48G   202     6    96    187     3   97    77     2   96     15     2   82       6       0   155     0    28     0       0      0
10:46:33   48G    48G   196     8    95    189     3   97    74     2   97      6     4   31       8       0   160     0    25     5       0      0
10:47:33   48G    48G   139     3    97    132     2   97    66     1   97      6     0   93       3       1   108     0    21     0       0      0
10:48:33   48G    48G   189     3    98    180     2   98    76     1   97      9     0   95       3       0   156     0    20     0       0      0
10:49:33   48G    48G   166    14    91    147     3   97    76     2   96     18    10   45      14       0   123     0    20     3       0      0
10:50:34   48G    48G   139     4    96    137     3   97    66     2   96      1     1   42       4       0   112     0    21     1       0      0
10:51:34   48G    48G   482     5    98    477     4   99    78     2   97      5     0   88       5       0   447     0    25     1       0      0
10:52:34   48G    48G   191     3    98    188     2   98    70     1   98      3     0   87       3       0   160     0    24     1       0      0

Running that system with zfs_arc_max=52000000000 (to avoid the OOM killer, as zfs 0.6 has problems shrinking the ARC on demand). Peaks with ARC hit% below 90 are possible since the databases don't fit into RAM (most of the data is cold, and the parts that become hot on a given day get reliably cached).

@shodanshok
Contributor

The proper method to avoid double buffering when writing to a ZVOL is to use the O_DIRECT flag or, as @GregorKopka did, to use emulate_write_cache=0 (which, as far as I know, should disable I/O buffering).

@aaronjwood how are you sharing the disk via iSCSI? What iSCSI stack are you using?
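
If you want to sanity-check the O_DIRECT path independently of the iSCSI stack, something like the following should leave Buffers/Cached in /proc/meminfo flat (it is destructive, so use a scratch zvol; the path is a placeholder):

# direct write, bypassing the page cache (WARNING: overwrites the zvol)
dd if=/dev/zero of=/dev/zvol/ssd/scratch bs=1M count=1024 oflag=direct
# compare with the same command without oflag=direct and watch these counters
grep -E '^(Cached|Inactive):' /proc/meminfo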

@aaronjwood
Author

aaronjwood commented Oct 5, 2018

I'm using tgt installed from my distro's repos, but I think it's still LIO underneath since STGT was replaced back in 2010 (I think?). It looks like O_DIRECT is not available for me yet: #7823 (comment)

@shodanshok
Contributor

shodanshok commented Oct 6, 2018

Yes, I think you are using LIO underneath. ZVOLs already support the O_DIRECT flag, which bypasses the system page cache but not the ARC. Try using emulate_write_cache as @GregorKopka suggested above.

@aaronjwood
Author

aaronjwood commented Oct 6, 2018

I added write-cache off in my tgt config but it didn't seem to change anything:

<target myname>
        backing-store /dev/zvol/ssd/iscsi
        initiator-name myothername
        write-cache off
</target>

I don't see any emulate_write_cache setting to tweak with tgt. I'm assuming write-cache is the same thing here.

@shodanshok
Contributor

shodanshok commented Oct 7, 2018

From what I read here, write-cache should be about the default setting for the physical device cache (only applicable when exporting a physical device, of course).

Let's try another approach: can you try to disable the device cache from within Windows? (Take a look here or here.)

@aaronjwood
Author

Hmm...it looks like write caching is already off for my iSCSI drive.

@aaronjwood
Author

The write-cache setting I used in my tgt config seems to control this setting from the Windows side as well. I reverted my config and now write caching shows as on from the Windows side.

@shodanshok
Contributor

shodanshok commented Oct 8, 2018

Interesting. I'll do some tests in the coming days; let's see if I can replicate this.

@aaronjwood
Author

I was able to resolve it and match the behavior @GregorKopka is seeing. I ditched tgt and switched to using targetcli while following this. Setting emulate_write_cache=0 seemed to do the trick and my graphs are back to showing what I had originally expected:
[screenshot: memory usage graph]
[screenshot: ZFS stats graph]
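
For anyone who finds this later, the targetcli-fb setup was roughly the following (the backstore name, zvol path, and IQN are placeholders; portal/ACL setup omitted):

targetcli /backstores/block create name=iscsi dev=/dev/zvol/ssd/iscsi
targetcli /backstores/block/iscsi set attribute emulate_write_cache=0
targetcli /iscsi create iqn.2018-10.local.host:ssd
targetcli /iscsi/iqn.2018-10.local.host:ssd/tpg1/luns create /backstores/block/iscsi
targetcli saveconfig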

@GregorKopka
Contributor

@aaronjwood There are two flavours of targetcli (the original targetcli and the fork at https://github.com/open-iscsi/targetcli-fb); the latter works better (at least on the systems where I have iSCSI active), as it doesn't have the former's problems with loading/saving configurations.

@aaronjwood
Author

Thanks, good to know. I had installed targetcli-fb so I guess I'll stick with that :)

@shodanshok
Contributor

Excellent! If the problem is solved, remember to close the issue ;)

@aaronjwood
Author

I'm not sure if it's right to close it since the issue exists if anyone uses a zvol in ways other than with iSCSI. Is there some generic way to do what emulate_write_cache=0 does for block devices in the kernel?

@shodanshok
Contributor

In my opinion, it is not a zvol-related problem. Writing to any block device without O_DIRECT will result in heavy buffering from the host kernel. The proper solution to avoid double-buffering is using O_DIRECT from the application writing to the zvol. I strongly suspect that emulate_write_cache=0 does exactly that, opening the underlying backing dev (a zvol, in this case) for direct access.

@aaronjwood
Author

Makes sense; after doing some more reading on O_DIRECT, it sounds like a valid scenario. @behlendorf, did you want to leave this open since you mentioned it needed investigation, or should I close it? At least in the case of iSCSI there is a way to solve the problem.

@behlendorf
Contributor

Let's close it out since the root cause was identified and the behavior is as expected.
