
ARC memory overhead not accounted #16557

Open
shodanshok opened this issue Sep 23, 2024 · 3 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@shodanshok
Contributor

shodanshok commented Sep 23, 2024

System information

Type Version/Name
Distribution Name Rocky Linux
Distribution Version 9.4
Kernel Version 5.14.0-427.22.1.el9_4.x86_64
Architecture x86_64
OpenZFS Version 2.2.6-1

Describe the problem you're observing

When stat-ing many files, available memory shrinks much faster than the ARC grows. For example:

# create test dataset with 100k files
zfs destroy tank/fsmark; zfs create -o compression=lz4 -o xattr=off tank/fsmark
fs_mark -k -S0 -D10 -N1000 -n 100000 -d /tank/fsmark/

# reset ARC via export/import
zpool export tank; zpool import tank

# get initial ARC statistics via "arcstat 1"
    time  read  ddread  ddh%  dmread  dmh%  pread  ph%   size      c  avail
15:33:37     0       0     0       0     0      0    0   5.7M   1.7G   3.1G

# use find to read inode metadata (ie: stat)
find /tank/fsmark/ -ctime -1 | wc -l

# when done, check ARC statistics again - notice how ARC increased by about 700M but avail decreased by 1.3G
    time  read  ddread  ddh%  dmread  dmh%  pread  ph%   size      c  avail
15:33:47  416K       0     0    413K    99   2.4K   13   706M   1.7G   1.8G

# arc_summary shows ARC at ~700M, leaving ~600M (1.3G - 700M) unaccounted, seemingly "lost"
ARC size (current):  38.6 %  707.2 MiB

# slabtop --sort=c shows the following
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
300048 300048 100%    1.13K  21432       14    342912K zfs_znode_cache
300216 300216 100%    0.96K  37527        8    300216K dnode_t
 10296  10296 100%   16.00K   5148        2    164736K zio_buf_comb_16384
302992 302981  99%    0.50K  37874        8    151496K kmalloc-512
310540 310540 100%    0.38K  31054       10    124216K dmu_buf_impl_t 
300045 300045 100%    0.26K  20003       15     80012K sa_cache
  9432   9432 100%    8.00K   2358        4     75456K kmalloc-8k
124026 124026 100%    0.19K   5906       21     23624K dentry
324160 324160 100%    0.06K   5065       64     20260K lsm_inode_cache
 20364  19919  97%    0.65K   1697       12     13576K inode_cache   

Describe how to reproduce the problem

stat many files and observe how available memory shrinks much faster than the ARC grows.
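As a rough sanity check, the "lost" figure follows directly from the two arcstat samples above: the drop in avail minus the ARC growth. A quick calculation with the observed numbers (this is just an illustration of the arithmetic, not a measurement tool):

```shell
# Values taken from the arcstat samples above:
# avail dropped 3.1G -> 1.8G while ARC size grew 5.7M -> 706M.
awk 'BEGIN {
    avail_drop = (3.1 - 1.8) * 1024      # MiB of available memory lost
    arc_growth = 706 - 5.7               # MiB of ARC growth
    printf "unaccounted: ~%.0f MiB\n", avail_drop - arc_growth
}'
# prints: unaccounted: ~631 MiB
```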

Include any warning/errors/backtraces from the system logs

None.

@shodanshok shodanshok added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 23, 2024
@amotin
Member

amotin commented Sep 23, 2024

I am not sure this is really a ZFS problem, or at least not ZFS alone. Each time you stat a new file, several structures are allocated: dnode, SA, dentry, etc. ZFS already accounts dnodes (which are the biggest consumer) and their backing dbufs as part of the ARC; see the arc_summary output in 2.3, which I recently updated to show it. I am not sure whether SA structures are accounted to the ARC, but that may indeed need some thought. dentry and some others are allocated by the Linux kernel and are out of ZFS's scope, so even if everything else were perfect, available memory would still shrink by more than the ARC grows.
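A rough back-of-the-envelope check against the slabtop listing in the original report supports this: treating zfs_znode_cache and sa_cache as not ARC-accounted (an assumption inferred from the arcstats fields, not a verified list), their slab usage alone covers most of the gap:

```shell
# Slab cache sizes (KiB) taken from the slabtop output in the report.
# dnode_t and dmu_buf_impl_t correspond to ARC-accounted arcstats fields
# (dnode_size, dbuf_size); zfs_znode_cache and sa_cache appear not to.
awk 'BEGIN {
    znode = 342912; sa = 80012           # presumably not ARC-accounted
    dnode = 300216; dbuf = 124216        # ARC-accounted
    printf "not in ARC: ~%.0f MiB\n", (znode + sa) / 1024
    printf "in ARC:     ~%.0f MiB\n", (dnode + dbuf) / 1024
}'
# prints: not in ARC: ~413 MiB / in ARC: ~414 MiB
```

The ~413 MiB of znode/sa slabs accounts for the bulk of the ~600M gap; the remainder plausibly sits in kmalloc-512, dentry, and other kernel-side caches.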

@shodanshok
Contributor Author

shodanshok commented Sep 23, 2024

I really think it is ZFS-related, as doing the same on an XFS mountpoint shows much less overhead. The difference seems massive. Example:

# create 100k files on the root XFS filesystem and drop caches
fs_mark -k -S0 -D10 -N1000 -n 100000 -d /opt/fsmark/
sync; echo 3 > /proc/sys/vm/drop_caches

# get mem/cache stats
free -m
               total        used        free      shared  buff/cache   available
Mem:            3659         348        3402           5          65        3311
Swap:           2083           6        2077

# stat files via find
find /opt/fsmark/ -ctime -1 | wc -l

# show mem/cache stats, notice how little additional memory is used (~120M)
free -m
               total        used        free      shared  buff/cache   available
Mem:            3659         464        3219           5         199        3195
Swap:           2083           6        2077

# slabtop --sort=c shows the following
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
100410 100410 100%    1.06K   6694       15    107104K xfs_inode
121128 121128 100%    0.19K   5768       21     23072K dentry
 20328  19379  95%    0.65K   1694       12     13552K inode_cache
121088 121088 100%    0.06K   1892       64      7568K lsm_inode_cache
113472 113456  99%    0.06K   1773       64      7092K kmalloc-64
100224 100224 100%    0.06K   1566       64      6264K kmalloc-rcl-64
 28384  28354  99%    0.12K    887       32      3548K kernfs_node_cache
 44160  44160 100%    0.06K    690       64      2760K ebitmap_node
   536    481  89%    4.00K     67        8      2144K kmalloc-4k
 90780  90780 100%    0.02K    534      170      2136K avtab_node

@shodanshok
Contributor Author

shodanshok commented Sep 23, 2024

Well, I just noticed a surprising behavior: xattr=off behaves the same as xattr=on (the directory-based implementation). So the tests reported in the first post really apply to xattr=on as well. Is that expected behavior? I thought xattr=off (with the corresponding noxattr mount option) would be the highest-performing mode.

With xattr=sa, ZFS shows much better results: stat-ing the same 100k files leaves the ARC at ~250M, with available memory decreasing by only ~400M. More details:

cat arcstats | grep size | grep -v l2
size                            4    257063224
compressed_size                 4    22286848
uncompressed_size               4    73855488
overhead_size                   4    62395904
hdr_size                        4    1105680
data_size                       4    512
metadata_size                   4    84682240
dbuf_size                       4    40656672
dnode_size                      4    98568264
bonus_size                      4    32038080
anon_size                       4    0
mru_size                        4    82289152
mru_ghost_size                  4    0
mfu_size                        4    2393600
mfu_ghost_size                  4    0
uncached_size                   4    0
arc_raw_size                    4    0
abd_chunk_waste_size            4    11776
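Incidentally, in this snapshot the size counter is exactly the sum of its component fields (hdr, data, metadata, dbuf, dnode, bonus, and abd_chunk_waste), which can be checked from the reported numbers:

```shell
# Sum the component fields from the arcstats output above; the total
# should match the reported size (257063224 bytes).
awk 'BEGIN {
    total = 1105680 + 512 + 84682240 + 40656672 \
          + 98568264 + 32038080 + 11776
    print total
}'
# prints: 257063224
```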

slabtop --sort=c
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
100030 100030 100%    1.13K   7145       14    114320K zfs_znode_cache
100176 100176 100%    0.96K  12522        8    100176K dnode_t
  3582   3567  99%   16.00K   1791        2     57312K zio_buf_comb_16384
102896 102880  99%    0.50K  12862        8     51448K kmalloc-512
103730 103730 100%    0.38K  10373       10     41492K dmu_buf_impl_t
100020 100020 100%    0.26K   6668       15     26672K sa_cache
  3180   3180 100%    8.00K    795        4     25440K kmalloc-8k
201664 201664 100%    0.12K   6302       32     25208K kmalloc-128
121590 121590 100%    0.19K   5790       21     23160K dentry
 20268  19490  96%    0.65K   1689       12     13512K inode_cache

Comparing the slabtop output between ZFS and XFS, dnode_t is very similar in size to xfs_inode. On the ZFS side, zfs_znode_cache alone consumes ~100M, and then there are the various buffers (zio_buf_comb, dmu_buf_impl_t), but those are expected.
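Estimating per-file slab overhead from the two listings (ZFS with xattr=sa vs XFS, ~100k files each; attributing each cache entirely to these files is an approximation) makes the difference concrete:

```shell
# Per-file slab overhead, using the cache sizes (KiB) reported above.
awk 'BEGIN {
    files = 100000
    # ZFS (xattr=sa): znode + dnode + dbuf + sa caches
    zfs = 114320 + 100176 + 41492 + 26672
    # XFS: xfs_inode cache
    xfs = 107104
    printf "ZFS: ~%.1f KiB/file\n", zfs / files
    printf "XFS: ~%.1f KiB/file\n", xfs / files
}'
# prints: ZFS: ~2.8 KiB/file / XFS: ~1.1 KiB/file
```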
