[metrics] add lru cache age metric #536
Conversation
(force-pushed a7bcf9b to 0c4d9ee)
Hi, thanks for the PR. This sounds like something worth measuring, but I have a couple of suggestions...
cache/disk/disk.go (outdated)

```diff
@@ -156,8 +157,15 @@ func New(dir string, maxSizeBytes int64, opts ...Option) (Cache, error) {
 	f := filepath.Join(dir, c.FileLocation(kind, value.legacy, hash, value.size, value.random))
 	ts, err := atime.Stat(f)
```
Could you try moving this into the `removeFile` function? It runs in another goroutine without holding the LRU index lock, and has a semaphore to limit the number of OS threads we spawn when there are eviction spikes.
cache/disk/lru.go (outdated)

```go
gaugeCacheLastEvictAtime: prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "bazel_remote_disk_cache_last_evicted_atime",
	Help: "The access time of the most recently evicted item from the LRU cache.",
```
I wonder if a better metric would be the amount of time since the last evicted item was last touched, i.e. "now - atime of the evicted item"? That way a graph of the metric over time would be a flat line when things are fairly consistent, instead of a line with a constant positive gradient.
I assume that most bazel-remote deployments use the `relatime` mount option for performance reasons; it is also the default on Linux. In other words, the filesystem atime is often updated only once per day (see the mount(8) man page).

I think the metric "now - oldest atime" is still useful and a good idea! I think it should be reported as a number of seconds, because the second is the base unit for Prometheus, but with a comment referring to `relatime` and explaining that the accuracy is typically days.

However, I would like to avoid the extra overhead of a filesystem stat whenever a file is evicted, since filesystem accesses are one of the main bottlenecks in my deployments, and it seems like overkill considering the typical accuracy of days. What do you think about having a periodic task, scheduled once per minute, that stats the oldest entry in the LRU and updates a Prometheus gauge?
(force-pushed 0c4d9ee to cb79ed1)
Thanks @ulrfa and @mostynb, those are great and helpful notes. Given the concern about putting the atime stat in the critical path, I reworked the gauge to fetch the atime of the next-to-evict item on demand during a metric scrape, so it is always up to date without statting files on every eviction. Since this metric is always current and never potentially stale, I was able to incorporate the
What if we used `ctime` instead of `atime`? Also, how frequently are Prometheus gauges sampled?
ctime is certainly a valuable fallback to atime, but when atime is available I think it is a much more valuable metric. Since the LRU should ideally be sorted almost exactly by atime, it gives a good representation of the age of the cache. ctime would not be nearly as valuable: it might often show a much older time, be quite a bit jumpier, and could easily have large swings in value.

I've been considering the primary use case to be a cache that is at capacity and therefore actively evicting as new items come in. I do think this makes the most sense as the use case to optimize for. Would you like to add both ctime and atime metrics, or maybe put one behind a flag?

The Prometheus gauge func is only called during an active scrape, so it depends on the frequency of the scraper, which is outside the scope of the client providing the metric. The default scrape interval is 30s, I believe.
I think the mu mutex in disk.go must be locked while accessing the LRU data structure, and not released until after the stat system call finishes, in order to prevent the file from being deleted between the LRU access and the stat call.

Then we also need to avoid any risk of deadlock between our mu mutex and other potential locks in the prometheus library. Does the prometheus library make any explicit guarantee that it holds no internal locks while invoking our gauge callback? Even if prometheus claimed such a guarantee today, I would sleep better if we did not rely on it. I think the easiest way to avoid the risk of deadlock is to have our own periodic task scheduled once per minute, instead of a callback from prometheus.
I also prefer atime instead of ctime, for that reason.

With the relatime mount option, the atime is always updated at least when the ctime is updated, so I don't see a use case for ctime for this metric. I would not recommend deploying bazel-remote with the noatime mount option, because bazel-remote uses the atime when reloading the LRU after a restart, but if someone is doing that and is also interested in this metric, then perhaps that could be a future improvement outside of this pull request? What do you think @mostynb?
(force-pushed cb79ed1 to 477a5d6)
Great, thanks for the direction. I updated the flow back to a standard metric that updates once a minute, and ensured we lock the LRU mu during that process. (Given this constraint, it made more sense to me to push the metric into disk itself instead of lru.) I think this metric will be very valuable in helping users discover the proper sizing for their cache. Please let me know if there is anything else you would like to see changed :-)
(force-pushed 477a5d6 to ed49e8f)
Thanks @joeljeske, the implementation looks good to me, and I think this metric is useful. Let's see what @mostynb says.
(force-pushed 0531a74 to 37b3143)
@mostynb should be ready for review now. This last push was to green up the tests; I had missed the "empty cache" case, which caused an error on initial startup, but everything is green now. Thanks!
Sorry it took me a while to get back to this PR.
If possible, I would prefer to settle on a single metric and avoid adding another flag.
If we have an accurate atime, then that would give a nice accurate metric. But my concern is that "now - atime" is useless if using noatime, and possibly too coarse (1 day) if using relatime with a high-churn cache. I can imagine that some long-running bazel-remote instances might be using the noatime mount option, because we currently only use the atime on startup, not during normal operation of the cache.

If we used "now - ctime" we'd still have a meaningful and fairly accurate value with either the noatime or relatime mount options, at the cost of a more confusing alert condition on the metric: you'd need to watch for situations where the oldest cache item is frequently not very old (and how do you define "frequently"?).

Alternatively, if we stored the last access time in the LRU index, it would always be accurate regardless of the filesystem mount options, except immediately after cache startup, since it's initialised with atime. I have one cache with about 10 million items; if we stored int64 unix times in the cache, that would add 76M to the index (ignoring potential struct-packing overhead). That's not an unreasonable amount of extra memory usage, but I'm unsure whether I would prefer this over the other options.

I will add some comments to the code, but I'd like to think through the options mentioned above a bit longer before merging this (both of your opinions are welcome).
cache/disk/lru.go (outdated)

```diff
@@ -280,3 +280,13 @@ func (c *SizedLRU) removeElement(e *list.Element) {
 func roundUp4k(n int64) int64 {
 	return (n + BlockSize - 1) & -BlockSize
 }
+
+// Get the back item of the LRU cache.
+func (c *SizedLRU) PeakOldestCacheItem() (Key, lruItem) {
```
This doesn't seem to be used outside the disk package, so it doesn't need to be exported. How about calling it `getTailItem`, since the least recently used item might not be the oldest item in the index (and we already use `lruItem` for the element type, so `getLRUItem` might be confusing)?
cache/disk/disk.go (outdated)

```go
Name: "bazel_remote_disk_cache_age_seconds",
Help: "The file `atime` of oldest item in the LRU cache. Depending on filemount opts (e.g. relatime), the resolution may be meausured in 'days' and not accurate to the second",
```
I don't think "age" is the right word to describe "now - atime" of the least recently used item. If we do end up using atime, what do you think of the term "longest item idle time"?
cache/disk/disk.go (outdated)

```go
	go c.pollCacheAge()
}

// Update metric every minute with the current age of the cache
```
nitpick: "Update metric every minute with the idle time of the least recently used item in the cache" (assuming we stick with "now - atime")
cache/disk/disk.go (outdated)

```go
	}
}

// Get the back item in the cache, and store its atime in a metric
```
nitpick: "Get the idle time of the least-recently used item in the cache, and store the value in a metric" (assuming we stick with "now - atime")
cache/disk/disk.go (outdated)

```go
ts, err := atime.Stat(f)

if err != nil {
	log.Printf("ERROR: failed to determine cache age: %v, unable to stat %s", err, f)
```
If this is the case, we'll set the metric to 0.0 below, which might trigger alarms. I guess we should set a bool here, and only update the metric below if the bool is true.
That's fine with me, to only update it on success. I was thinking that if the cache was consistently unable to stat the file, then we would not want to keep broadcasting a potentially very old previous value. I was also thinking that most users would consume an average of this metric over time (30m?), so occasional and rare `0` values would be naturally filtered out.
Thank you for the feedback and thoughtful responses.

atime vs ctime

Let me explain my hesitations about ctime. In a bazel-managed repo & cache, some actions and cache entries are rather static, whereas some are frequently changing. Consider rules_go, where the Go compiler wrapper itself is compiled and cached as an action within bazel. This action result should be frequently accessed within every build and thus never evicted. In a right-sized cache, it will only be evicted when rules_go is upgraded, or perhaps bazel itself is upgraded; at that time it will eventually be evicted as it is no longer accessed. When it becomes the next-to-evict item, it will have a very old ctime but a recently accessed atime. In a given bazel-managed repo, there are likely many actions that fall along the spectrum from "frequently invalidated" to "never invalidated". Even consider a leaf package in this repo such as

Due to this, I would think that when measuring ctime it would be very difficult to get an accurate view of the age of the least-recently-accessed item.

FS atime or user-land atime

A user-land atime could be a fine solution; however, I was concerned with the overhead for folks that do not care about this metric. I don't have a great view of the average consumer's cache size & number of items in cache. This could be costly, so I'll defer that decision to you. I had seen some recent issues regarding memory size and I did not want to contribute to that :)

With FS-managed atime, I would first argue that relatime would not result in too-coarse data. With Bazel, a right-sized cache is likely on the order of 1 week of cached content. Dropping much lower than that would likely result in active items being evicted prematurely: if you are only retaining 1 day of content, there is likely a sizable ratio of items being re-computed to repopulate previously evicted cache entries.

Additionally, when items are being populated and evicted in less than 1 day, the atime would reflect the ctime and still be a valuable datapoint. This is all variable depending on installation needs and subject to the age gap between various source-control branches; however, I would still reason that 1 day of coarseness is an acceptable trade-off. Lastly, since this is an additional metric and not core functionality, it could be documented that noatime is not compatible with this metric and that this metric reports
I'll wait to apply the review comments until a strategy forward is decided 👍
Yes, that would be confusing and hard to define "frequently". For me it feels hard to draw conclusions based on ctime.

I agree.

I guess the overhead (request latency and memory consumption) would be insignificant for most scenarios, even for FindMissingBlobs, where a single request could affect lots of entries. But I don't know for sure and would prefer keeping it as lean as possible, without additional information in the LRU index. It could also be confusing if this metric suddenly changed just because bazel-remote was restarted. (Trying to store the LRU index atimes persistently would cause unnecessary complexity, in my opinion.)

Sounds reasonable to me.

For my use cases, 1-day accuracy is enough. I just checked two of my cache instances; the oldest atime was about 4 days ago.
This is a very good point, and valid enough for me to feel strongly against a user-land atime. What are your thoughts @mostynb?
OK, let's go with "now - atime" and document the limitations.
(force-pushed 37b3143 to 904e79a)
(force-pushed 904e79a to 1a5a3da)
Looks good; this can land once the metric description has been fixed.

Thanks for persevering with this. Merged, with the documentation update and a more detailed commit message.
Thanks so much! I hope it's useful. What is your typical release cadence? I'd like to start using this metric.
Development isn't active enough at the moment to have a regular release frequency, so it depends a bit on how important the updates are. I'll try to take a look at #539 over the weekend (I didn't manage to get to it this week) and see if it's worth landing that before making the next release.
Thanks @joeljeske and @mostynb!

@mostynb do you think we could cut a new release soon? 😃
This PR introduces a new metric, `bazel_remote_disk_cache_longest_item_idle_time_seconds`, that tracks the age of the next-to-evict item using the FS atime. This allows a cache monitor to see the relative lifespan of a cache, which is useful when tuning the cache size. This only functions for owned disk cache items, not items proxied to another cache or S3, but I think that is how the rest of these metrics currently function.