l2arc_noprefetch documentation is unclear - what does this do? #10464
Comments
When issuing sequential reads, much data is prefetched, i.e. loaded before the actual demand read. These prefetched buffers are stored in the MRU list and tagged with a "prefetched" flag. If such a buffer is then referenced by a demand read, the "prefetched" flag is cleared (and if multiple references happen, the buffer moves to the MFU list). By default, any buffer still carrying the "prefetched" flag is not eligible for the L2ARC, so that one-shot sequential reads do not pollute it. Repeated sequential reads will gradually clear the "prefetched" flag and make more buffers eligible for the L2ARC, and random reads that hit these prefetched buffers will do the same.
The rationale is that HDDs are quite fast for sequential reads, so it is better to reserve the available L2ARC for random reads. This is especially true for large pools (e.g. 12+ vdevs), where the combined sequential read speed can easily exceed 1 GB/s. However, if you have a large and fast L2ARC such as your Intel drive, you can set l2arc_noprefetch=0.
I agree that the docs are not very clear on this point.
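(For reference, a minimal sketch of how this tunable is usually changed on ZFS on Linux; the sysfs path is the standard module-parameter location, and the modprobe.d file name is just a common convention:)

# Check the current value (1 = skip prefetched buffers, the default). Needs root.
cat /sys/module/zfs/parameters/l2arc_noprefetch
# Stop special-casing prefetched buffers; takes effect immediately, lost on reboot.
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
# Make the change persistent across reboots (file name is a common convention).
echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf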
Since I wrote the module parameters doc, I'm a bit biased, but I am always looking to improve. In the module parameters doc, the range says: ... There is not a relationship between "sequential" and "prefetch" access patterns. They are orthogonal ideas. Sequential most often refers to contiguous LBA regions on an HDD (there is no equivalent for SSD). Prefetch is done on an object's ordered blocks. It is rare that prefetches are also sequential. With this knowledge, how would we write this document better?
@richardelling "sequential", in my reply above, refers to contiguous reads as done by the user application - i.e. the same as sequential access within a single file (an object).
@shodanshok Yes, I think your interpretation of sequential as in the object (a file is an object in a dataset) is correct. This is where the ZFS prefetcher works: traversing the list of blocks in an object.
Thank you for the replies, and thank you for clearing up some of the confusion @shodanshok. I'm not as versed in this as you guys are, so I can give the perspective of an intermediate user. I found the documentation surrounding this confusing enough to create an issue here, and in my opinion it would help to add an example to the docs too, explaining in layman's terms what this setting is actually doing.
L2ARC was first designed at a time when very fast NVMe SSDs didn't exist. The default recommendation for L2ARC these days should be a fast NVMe SSD, preferably striped as I'm planning to do with the Intel P4608. However, this default setting would prevent the SSD from properly serving sequential reads of bigger files, which I do want cached on there. Some users might not want that, but I feel most will with newer technology.
The main thing I've learned is that this setting should be disabled for large, NVMe-based flash cache devices, because these will likely always outperform even a very large number of striped rust disks. However, if you don't want the L2ARC to serve larger sequential reads, then this setting should be set to 1. I might not be fully understanding it right, but I feel the documentation should at least give an example in layman's terms like I tried to do here. It would help with understanding how the L2ARC works in general, too.
@recklessnl I think you missed an important point. If the data is touched, then it is no longer tagged as "prefetch" and therefore is eligible for L2ARC. Also, there are many other considerations for L2ARC that are much more important than prefetching tunables; however, that is beyond the scope of this issue.
Lastly, L2ARC exists because the cost of RAM >> cost of SSD. That cost difference shrinks over time, so you are always better off investing in RAM if the cost difference isn't large. Back when L2ARC was first being developed, systems had 2-4GB of RAM and SSDs were 32GB. Today, many systems can easily hold 1.5TB of RAM, so for L2ARC to be cost effective your working set size needs to be > 1.5TB, which is not common for cache-friendly workloads. Finally, the ARC tunables are per-node, not per-pool, so tunables cannot be decided based on the configuration of a single pool.
I'm very curious if you could give me some pointers on this; regardless of this specific issue, it's good information to have. Would you share some of the more important tunables? I would appreciate it (I also asked about this in the OP, so it is somewhat relevant). As far as the prefetching goes, thanks for clearing it up, this makes more sense now.
@adamdmoss
... and so on. (It's trivially reproducible here.)
That seems like a contrived experiment. What about real life where folks don't go around dropping caches?
The dropped cache is so the l2arc gets hit rather than the arc, that being the whole point of the test...?
The question at hand is whether prefetched but unused data should be sent to the L2ARC. Obviously, during prefetch there is only speculation that the data will be used, with no real confidence. There is some period of time between the data being prefetched and its eviction from the ARC. That time depends on many variables, such as the size of the MRU and the churn rate. If the MRU size is small and the churn rate is high, then caching prefetched data makes sense. However, when you're in that mode, life isn't very pleasant and there are better cures for the problem (better than kicking the can down to the L2ARC). One method to observe how well the prefetcher is working is to monitor the prefetch hit rate in arcstats and to monitor the zfetchstats. If the prefetch hit rate is high and the MRU size is low, then it is probably a good idea to enable caching of prefetched data in the L2ARC.
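(As a rough illustration of that suggestion on Linux; the exact counter names vary between ZFS releases, so treat these as examples:)

# ARC prefetch hit/miss counters.
grep prefetch /proc/spl/kstat/zfs/arcstats
# The file-level prefetcher's own statistics (hits, misses, streams).
cat /proc/spl/kstat/zfs/zfetchstats
# MRU/MFU sizes, to judge how quickly prefetched buffers would be evicted.
grep -E '^(mru|mfu)_size' /proc/spl/kstat/zfs/arcstats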
@adamdmoss from what I know, ...
@shodanshok not quite.
@richardelling it seems that prefetched reads are affected by this check in arc.c (lines 6051 to 6062 in ae7b167):
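(Roughly, the referenced check in arc_read() looks like the following; this is a paraphrase rather than an exact quote of those lines, so the wording in ae7b167 may differ slightly.)

/* Paraphrase of the arc_read() L2ARC-read check, not a verbatim quote. */
/*
 * Read from the L2ARC if the following are true:
 * 1. The L2ARC vdev was previously cached.
 * 2. This buffer still has L2ARC metadata.
 * 3. This buffer isn't currently writing to the L2ARC.
 * 4. The L2ARC entry wasn't evicted, which may also have invalidated the vdev.
 * 5. This isn't prefetch and l2arc_noprefetch is set.
 */
if (HDR_HAS_L2HDR(hdr) &&
    !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
    !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
	/* ... the read is issued to the L2ARC device ... */
} else {
	/* ... otherwise the read falls through to the main pool ... */
}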
As you can see, if l2arc_noprefetch=1, prefetched reads are not issued to the L2ARC device. @adamdmoss's tests seem to confirm that (and they show the same results I obtained in the past when analyzing l2arc_noprefetch). Am I missing something? Thanks.
Good point @shodanshok, and I think this reinforces the point of this issue: the documentation regarding these L2ARC parameters needs to be improved, both for experts like you and for more intermediate users like me. What I'd also like to confirm is that cache devices will automatically get striped when I run the zpool add command from my original post.
Pro tip: You should see the data striped over all of the cache devices. Try watching ...
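(For example, and not necessarily the tool the previous comment had in mind: zpool iostat shows per-vdev activity, with the cache devices listed in their own section, so you can see whether reads and writes are spread over both of them.)

zpool iostat -v poolname 1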
While generally true, if ...
I did not know about ...
I worked with Veeam a couple of years ago to improve the performance of synthetic merges on ZFS using an object-storage-backed vdev, and it was required to set ...
@misterbigstuff interesting perspective. Does that mean Veeam also requires L2ARC?
Not on sufficiently fast storage, which balloons the cost quite a lot.
I was looking at the code in arc.c because of #10710. Meaning prefetched buffers are not written to L2ARC if l2arc_noprefetch is set.
I wonder though if this behavior (if I have it right) does the L2ARC an injustice. Say we have a prefetched buffer: it is read from the ARC, and its L2ARC eligibility flag is cleared. However, shouldn't this buffer be cached in the L2ARC? If it was read from the ARC, then it is no longer a prefetch. Edit: In the whole ...
Also in ...
... won't then actually clear the ...
Yes, exactly. The same thing happens in arc_access().
...
(edit: I think I prefer the interpretation as it exists in the code rather than as it exists in the comment.)
To be explicitly clear, this is the version of the code which would match the comment:
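(Presumably that version changes the last condition so that it reads the comment's point 5 literally; a reconstruction along these lines, not the exact diff being discussed:)

/* Hypothetical rewrite matching the comment's point 5 literally. */
if (HDR_HAS_L2HDR(hdr) &&
    !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
    (!HDR_PREFETCH(hdr) && l2arc_noprefetch)) {
	/* read from the L2ARC device */
}

Note that with l2arc_noprefetch=0 this condition can never be true, which matches the observation a couple of comments below that the comment-matching code never hits the L2ARC.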
(I'm not saying this is better or tested, just that it's what matches the comment. 😄 - I think the comment is wrong and the code is right, but I'm not 100% sure of the real intent.)
@adamdmoss you are correct. I also think that the code is the intended behavior, not what the comment says.
I gave the comment-matching code a quick spin and it was completely missing the l2arc everywhere for noprefetch=0, as might be guessed. But it was fun to verify anyway. :)
System information
Describe the problem you're observing
I'm trying to tune the L2ARC for maximum performance, but I'm having trouble understanding how exactly it operates in ZFS. The device that I want to use as an L2ARC cache vdev is an Intel P4608 enterprise SSD, which is an x8 PCIe 3.0 device. It features 2 separate pools of 3.2TB, each with x4 lanes, and I want to stripe both of these together for the combined speed and IOPS (this would use x8 PCIe lanes). You can view additional stats about this drive here to get an idea of the performance it is capable of.
RAM on this system is 512GB in total. I want to stripe this device as a cache vdev, so total L2ARC would be ~6.4TB.
Random reads and smaller reads will surely be much faster compared to the pool of disks, but I'm confused about sequential reads, and the docs do not explain this properly. I do want sequential cached reads to be pulled from these drives as well for now, because I can't see the hard drive pool outperforming this SSD when striped. I'm planning to use
zpool add poolname cache ssd1 ssd2
as the command; this will stripe the SSDs together instead of creating a JBOD pool, right?
Additionally, I'm seeing information that you need to set the l2arc_noprefetch tunable to 0 in order to properly allow sequential reads, but is this how it actually works? Does it not do sequential reads unless you set that to 0 (the default is 1), or am I not understanding it correctly?
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZFS%20on%20Linux%20Module%20Parameters.html#l2arc-noprefetch
I'm also wondering what other performance tuning I should do in order to get the most out of L2ARC with modern hardware, and I hope you guys can give me some pointers.