
l2arc_noprefetch documentation is unclear - what does this do? #10464

Closed
recklessnl opened this issue Jun 15, 2020 · 31 comments · Fixed by #10743
Labels
Type: Documentation Indicates a requested change to the documentation Type: Question Issue for discussion

Comments

@recklessnl

recklessnl commented Jun 15, 2020

System information

Distribution Name: Debian
Distribution Version: 10
Linux Kernel: 5.4.14
Architecture: x64
ZFS Version: 0.8.4
SPL Version: 0.8.4-pve1

Describe the problem you're observing

I'm trying to tune L2ARC for maximum performance, but I'm having trouble understanding how exactly it operates in ZFS. The device that I want to use as an L2ARC cache vdev is an Intel P4608 enterprise SSD, which is an x8 PCIe3 device. It features 2 separate 3.2TB volumes, each with x4 lanes, and I want to stripe both of these together for the combined speed and IOPS (this would use x8 PCIe lanes). You can view additional stats about this drive here to give you an idea of the performance it is capable of.
RAM on this system is 512GB in total. I want to stripe this device as a cache vdev, so total L2ARC would be ~6.4TB.

Random reads and smaller reads will surely be much faster compared to the pool of disks, but I'm confused about sequential reads, and the docs do not explain this properly. I do want sequential cached reads to be pulled from these drives as well, because I can't see the hard drive pool outperforming this SSD when striped. I'm planning to use zpool add poolname cache ssd1 ssd2 as the command; this will stripe the SSDs together instead of creating a JBOD pool, right?

Additionally, I'm seeing information that you need to set the l2arc_noprefetch tunable to 0 in order to properly allow sequential reads, but is this how it actually works? Does it not do sequential reads unless you set it to 0 (the default is 1), or am I not understanding it correctly?

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZFS%20on%20Linux%20Module%20Parameters.html#l2arc-noprefetch

I'm also wondering what other performance tuning I should do in order to get the most out of L2ARC with modern hardware and hope you guys can give me some pointers.

@behlendorf behlendorf added Type: Documentation Indicates a requested change to the documentation Type: Question Issue for discussion labels Jun 16, 2020
@shodanshok
Contributor

When issuing sequential reads, much data is prefetched - loaded before the actual demand read. These prefetched buffers are stored in the MRU list and tagged with a "prefetched" flag. If one of these buffers is referenced by a demand read, the "prefetched" flag is cleared (and if multiple references happen, the buffer is moved to the MFU list).

By default, any buffer flagged as "prefetched" is not eligible for L2ARC, so that one-shot sequential reads do not pollute the L2ARC. Repeated sequential reads will gradually flag more buffers as eligible for L2ARC (clearing the "prefetched" flag), and so will random reads of these prefetched buffers.

The rationale is that HDDs are quite fast for sequential reads, so it is better to use the available L2ARC for random reads. This is especially true for large pools (ie: 12+ vdevs), where the combined sequential read speed can easily be >1GB/s. However, if you have a large and fast L2ARC such as your Intel drive, you can set l2arc_noprefetch=0 to ignore the "prefetched" flag and always cache any buffer that is about to be evicted from the ARC.
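
As a concrete example, here is a minimal sketch of toggling the tunable on Linux (the modprobe.d file name below is just an illustration; any *.conf file in that directory works):

# allow prefetched (streaming) buffers to be written to and read from L2ARC, at runtime
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

# verify the current value
cat /sys/module/zfs/parameters/l2arc_noprefetch

# make the setting persistent across module reloads/reboots
echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf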

I agree that the docs are not very clear on this point.

@richardelling
Contributor

Since I wrote the module parameters doc, I'm a bit biased, but I'm always looking to improve.

In the module parameters doc, the range says:
0=write prefetched but unused buffers to cache devices, 1=do not write prefetched but unused buffers to cache devices

There is no relationship between "sequential" and "prefetch" access patterns. They are orthogonal ideas. Sequential most often refers to contiguous LBA regions on a HDD (there is no equivalent for SSD). Prefetch is done on an object's ordered blocks. It is rare that prefetches are also sequential.

With this knowledge, how would we write this document better?

@shodanshok
Contributor

@richardelling "sequential", in my reply above, refers to contiguous reads as done by the user application - ie: the same a cp or dd does. I was not speaking about physically sequential LBAs. Is this view incorrect?

@richardelling
Contributor

@shodanshok Yes, I think your interpretation of sequential as in the object (a file is an object in a dataset) is correct. This is where the ZFS prefetcher works: traversing the list of blocks in an object.

@recklessnl
Author

recklessnl commented Jun 17, 2020

Thank you for the replies, and thank you for clearing up some of the confusion @shodanshok.
I think with 2.0 coming up and introducing persistent L2ARC, plus the SSD revolution still developing with PCIe 4.0, there will be more interest in L2ARC in the coming years, so it's a good time to update some of the documentation.

I'm not as versed in this as you guys are, so I can give the perspective of an intermediate user. I found the documentation surrounding this confusing enough to create an issue here, and in my opinion it would help to add an example to the docs that explains in layman's terms what this setting is actually doing.

L2ARC was first designed at a time when very fast NVMe SSDs didn't exist. The default recommendation for L2ARC these days should be a fast NVMe SSD, preferably several striped together, like I'm planning to do with the Intel P4608. However, the default setting would prevent the SSD from delivering sequential reads of bigger files, which I do want cached there. Some users might not want that, but I feel most will with newer technology.

The main thing I've learned is that this setting should be disabled for large, NVMe-based flash cache devices, because these will likely always outperform even a very large number of striped rust disks. However, if you don't want it to serve larger sequential reads, then this setting should be set to 1. I might not be fully understanding it right, but I feel the documentation should at least give an example in layman's terms like I tried to do here. It would also help with understanding how the L2ARC works in general.

@richardelling
Contributor

@recklessnl I think you missed an important point. If the data is touched, then it would not be tagged as "prefetch" and therefore would be eligible for L2ARC.

Also, there are many other considerations for L2ARC that are much more important than prefetching tunables. However, that is beyond scope for an issue.

Lastly, L2ARC exists because the cost of RAM >> cost of SSD. That cost difference shrinks over time, so you are always better off investing in RAM if the cost difference isn't large. Back when L2ARC was first being developed, systems had 2-4GB of RAM and SSDs were 32GB. Today, many systems can easily hold 1.5TB of RAM. So for L2ARC to be cost effective, your working set size needs to be > 1.5TB, which is not common for cache-friendly workloads.

Also, the ARC tunables are per-node, not per-pool, so decisions about tunables cannot be based on the configuration of a single pool.

@recklessnl
Author

recklessnl commented Jun 18, 2020

Also, there are many other considerations for L2ARC that are much more important than prefetching tunables.

I'm very curious if you could give me some pointers on this; regardless of this specific issue, it's good information to have. Would you share some of the more important tunables? I would appreciate it (and I also asked about this in the OP, so it is somewhat relevant).

As far as the prefetching goes, thanks for clearing it up more, this makes more sense now.

@adamdmoss
Contributor

l2arc_noprefetch=1 is absolutely disastrous for l2arc reads too, so I think there's another angle... or a terrible bug.

@richardelling
Contributor

@adamdmoss l2arc_noprefetch has no real effect on non-prefetch reads. What behaviour do you see that indicates otherwise?

@adamdmoss
Contributor

adamdmoss commented Jun 21, 2020

@adamdmoss l2arc_noprefetch has no real effect on non-prefetch reads. What behaviour do you see that indicates otherwise?

# echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch                    
# drop;precache /zfs4/250b/SteamLibrary/steamapps/common/BorderlandsPreSequel
0:09.59 elapsed (sys:2.74 user:0.07)
# echo 1 > /sys/module/zfs/parameters/l2arc_noprefetch
# drop;precache /zfs4/250b/SteamLibrary/steamapps/common/BorderlandsPreSequel
0:44.18 elapsed (sys:3.15 user:0.12)
# drop;precache /zfs4/250b/SteamLibrary/steamapps/common/BorderlandsPreSequel
0:44.01 elapsed (sys:3.14 user:0.08)
# echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
# drop;precache /zfs4/250b/SteamLibrary/steamapps/common/BorderlandsPreSequel
0:09.47 elapsed (sys:2.67 user:0.07)
# drop;precache /zfs4/250b/SteamLibrary/steamapps/common/BorderlandsPreSequel
0:09.66 elapsed (sys:2.69 user:0.08)
# echo 1 > /sys/module/zfs/parameters/l2arc_noprefetch             
# drop;precache /zfs4/250b/SteamLibrary/steamapps/common/BorderlandsPreSequel
0:43.13 elapsed (sys:3.14 user:0.07)

... and so on. (drop = echo 3 > /proc/sys/vm/drop_caches, precache = read data [in this case 8GB already warm in l2arc] at max rate)

It's trivially reproducible here.
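
For anyone who wants to reproduce something similar, a rough sketch of the test above (the precache helper isn't shown in this thread, so the read command below is only an illustration of "read everything at max rate"):

# drop the Linux page cache so reads are not served from memory
echo 3 > /proc/sys/vm/drop_caches

# read every file under the directory at full speed, discarding the data;
# with the data already warm in L2ARC this times how fast cached blocks come back
time find /zfs4/250b/SteamLibrary/steamapps/common/BorderlandsPreSequel \
    -type f -exec cat {} + > /dev/null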

@richardelling
Contributor

That seems like a contrived experiment. What about real life where folks don't go around dropping caches?

@adamdmoss
Contributor

The cache is dropped so that the l2arc gets hit rather than the arc, that being the whole point of the test...?

@richardelling
Contributor

The question at hand is whether prefetched but unused data should be sent to L2ARC. Obviously, during prefetch, there is only speculation that the data will be used, but no real confidence. There is some period of time between the data being prefetched and its eviction from ARC. That time is based on many variables, such as the size of the MRU and the churn rate. If the MRU size is small and the churn rate is high, then caching prefetched data makes sense. However, when you're in that mode, life isn't very pleasant and there are better cures for the problem (better than kicking the can down to L2ARC).

One method to observe how well the prefetcher is working is to monitor the prefetch hit rate in arcstats and the zfetchstats. If the prefetch hit rate is high and the MRU size is low, then it is probably a good idea to enable prefetched data caching in L2ARC.
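
A minimal sketch of inspecting those counters on Linux (kstat field names can vary slightly between ZFS versions):

# prefetch effectiveness and MRU size from the ARC kstats
grep -E '^(prefetch_data_hits|prefetch_data_misses|mru_size)' /proc/spl/kstat/zfs/arcstats

# prefetcher (zfetch) statistics
cat /proc/spl/kstat/zfs/zfetchstats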

@shodanshok
Contributor

@adamdmoss from what I know, l2arc_noprefetch=1 avoids reading prefetched buffers from L2ARC (even when they already are in the L2ARC). This is by design, and based on the very same idea written above: on a large pool, sequential reads from L2ARC can be slower than reading from the main pool. However, I agree that modern SSDs (especially NVMe ones) change the balance of things, and setting l2arc_noprefetch=0 can be useful in some cases. But you have to try for yourself, as no silver bullet exists.
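
One way to see which side of that balance a given system is on is to compare the L2ARC hit/miss counters before and after the workload while toggling the tunable, e.g. (a rough sketch using the Linux kstat interface):

# snapshot the L2ARC counters, run the sequential workload, then check again
grep -E '^l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats
# ... run the workload ...
grep -E '^l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats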

@richardelling
Contributor

@shodanshok not quite. l2arc_noprefetch controls whether prefetched blocks are written to L2ARC. Reads are unaffected by l2arc_noprefetch.

@shodanshok
Contributor

shodanshok commented Jun 23, 2020

@richardelling it seems that prefetched reads are affected by l2arc_noprefetch, indeed. Have a look here:

zfs/module/zfs/arc.c

Lines 6051 to 6062 in ae7b167

/*
 * Read from the L2ARC if the following are true:
 * 1. The L2ARC vdev was previously cached.
 * 2. This buffer still has L2ARC metadata.
 * 3. This buffer isn't currently writing to the L2ARC.
 * 4. The L2ARC entry wasn't evicted, which may
 *    also have invalidated the vdev.
 * 5. This isn't prefetch and l2arc_noprefetch is set.
 */
if (HDR_HAS_L2HDR(hdr) &&
    !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
    !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {

As you can see, if l2arc_noprefetch=1, prefetched reads are not issued to the L2ARC device.
@adamdmoss's tests seem to confirm that (and they show the same results I obtained in the past when analyzing l2arc_noprefetch).
Am I missing something? Thanks.

@recklessnl
Author

recklessnl commented Jun 29, 2020

Good point @shodanshok, and I think this reinforces the point of this issue: the documentation regarding these L2ARC parameters needs to be improved, both for experts like you and for more intermediate users like me.

What I'd also like to confirm is that cache devices will automatically get striped when I run zpool add poolname cache ssd1 ssd2.

@richardelling
Contributor

l2arc_noprefetch arrived in the first L2ARC commit back in 2007. When a block finished
being read from the pool, the buffer was tagged to not be stored in L2ARC. Later,
when l2arc_write_buffers() was called, if the tag was there, the block was not written
to L2ARC. So it wasn't a matter of trying to read the block from the L2ARC; the block
would never be there to begin with. Since that time, much has changed, but the basic
logic remains.

Pro tip: evict_l2_ineligible in arcstats counts how much data is ineligible for L2ARC
(for various reasons) and evicted from ARC. So if your system shows little or no
evict_l2_ineligible then l2arc_noprefetch won't help.

You should see the data striped over all of the cache devices. Try watching
zpool iostat -v during your workload.
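
A short sketch of both checks (the pool and device names are taken from the original question and are only illustrative):

# cumulative amount of ARC-evicted data that was ineligible for L2ARC
grep '^evict_l2_ineligible' /proc/spl/kstat/zfs/arcstats

# add both SSDs as cache devices; L2ARC feed writes are spread across them
zpool add poolname cache ssd1 ssd2

# confirm reads hit both cache devices during the workload
zpool iostat -v poolname 5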

@shodanshok
Contributor

l2arc_noprefetch arrived in the first L2ARC commit back in 2007. When a block finished
being read from the pool, the buffer was tagged to not be stored in L2ARC. Later,
when l2arc_write_buffers() was called, if the tag was there, the block was not written
to L2ARC. So it wasn't a matter of trying to read the block from the L2ARC; the block
would never be there to begin with. Since that time, much has changed, but the basic
logic remains.

While generally true, if l2arc_noprefetch was 0 (and so prefetched buffers were written to L2ARC) and it is later changed to 1, the prefetched buffers already in L2ARC will be ignored and re-loaded from disks when needed (at least this is my understanding).

Pro tip: evict_l2_ineligible in arcstats counts how much data is ineligible for L2ARC
(for various reasons) and evicted from ARC. So if your system shows little or no
evict_l2_ineligible then l2arc_noprefetch won't help.

I did not know about evict_l2_ineligible, thank you for sharing!

@bghira

bghira commented Jul 1, 2020

I worked with Veeam a couple of years ago to improve the performance of synthetic merges on ZFS using an object-storage-backed vdev, and it was required to set l2arc_noprefetch=0 as well as some other tweaks to ensure the L2ARC fills up and stays filled - otherwise, synthetic merge operations would fail because they would take too long.

@richardelling
Contributor

@misterbigstuff interesting perspective. Does that mean Veeam also requires L2ARC?

@bghira

bghira commented Jul 1, 2020

not on sufficiently fast storage, which balloons the cost quite a lot.

@gamanakis
Contributor

I was looking at the code in arc.c because of #10710.
l2arc_noprefetch definitely has to do with writing buffers to L2ARC, as @richardelling suggested. In arc_read_done(), if the buffer read from ARC is a prefetch and l2arc_noprefetch=1, then the flag marking it as eligible for L2ARC is cleared.

Meaning prefetched buffers are not written to L2ARC if l2arc_noprefetch=1 and they have already been read from ARC (otherwise arc_read_done() will not be called and the L2ARC flag will not be cleared).

@gamanakis
Contributor

gamanakis commented Aug 14, 2020

I wonder though if this behavior (if I have it right) does the L2ARC an injustice. Say we have a prefetched buffer: it is read from ARC, and its L2ARC eligibility flag is cleared.

However, shouldn't this buffer be cached in L2ARC? If it was read from ARC, then it is no longer a prefetch.

Edit: In the whole module/zfs/arc.c this is the only place the L2ARC eligibility flag is cleared.

@adamdmoss
Contributor

Also, in module/zfs/arc.c the ARC_FLAG_PREFETCH flag is cleared by add_reference() (supposedly when a prefetch is actually used), which implies that

if (l2arc_noprefetch && HDR_PREFETCH(hdr))
	arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);

... won't then actually clear the ARC_FLAG_L2CACHE flag.

@gamanakis
Contributor

Yes, exactly. The same thing happens in arc_access().

@adamdmoss
Contributor

adamdmoss commented Aug 14, 2020

arc_read() has a mismatch between comments and behavior though, from a quick glance.

* Read from the L2ARC if the following are true:
...
* 5. This isn't prefetch and l2arc_noprefetch is set.

...

			if (HDR_HAS_L2HDR(hdr) &&
			    !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
			    !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {

  • Note that !(l2arc_noprefetch && HDR_PREFETCH(hdr)) actually means 'this isn't a prefetch OR l2arc_noprefetch is NOT set', rather than the comment's 'This isn't a prefetch and l2arc_noprefetch is set'.

(edit: I think I prefer the interpretation as it exists in code rather than as it exists in the comment.)

@adamdmoss
Contributor

adamdmoss commented Aug 14, 2020

To be explicitly clear, this is the version of the code which would match the comment:

			if (HDR_HAS_L2HDR(hdr) &&
			    !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
			    (l2arc_noprefetch && !HDR_PREFETCH(hdr))) {

(I'm not saying this is better or tested, just that it's what matches the comment. 😄 - I think the comment is wrong and the code is right, but I'm not 100% sure of the real intent.)

@gamanakis
Contributor

@adamdmoss you are correct, the code in arc.c means read from L2ARC if this isn't a prefetch or if l2arc_noprefetch is not set.

I also think that the code is the intended behavior, not what the comment says.

@adamdmoss
Contributor

@adamdmoss you are correct, the code in arc.c means read from L2ARC if this isn't a prefetch or if l2arc_noprefetch is not set.

I also think that the code is the intended behavior, not what the comment says.

I gave the comment-matching code a quick spin and it was completely missing l2arc everywhere for noprefetch=0, as might be guessed. But it was fun to verify anyway. :)
