-
Notifications
You must be signed in to change notification settings - Fork 107
Conversation
Prior to this change we were only counting the number of bytes that the actual data occupied. However, there is a lot of overhead in our cache system that also needs to be tracked. We were seeing large differences in the total cache size used and and live heap allocations. These changes attempt to track most of the accounting overhead. The numbers used to track overhead are estimates based on data type sizes. * Added stats for Flat Acccounting, the LRU, and Chunks * Added benchmark for LRU allocations Resolves: #1086 See also: #963
mdata/cache/accnt/flat_accnt.go
Outdated
@@ -231,6 +254,11 @@ func (a *FlatAccnt) eventLoop() { | |||
} | |||
} | |||
|
|||
// // totalUsed returns the sum of cacheSizeUsed, cacheOverheadChunk, cacheOverheadFlat, and cacheOverheadLru | |||
// func (a *FlatAccnt) totalUsed() uint64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i suspect that's not supposed to get merged, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I forgot that was in there. I had some plans for it, but then ended up not using it. I will take care of that.
mdata/cache/accnt/flat_accnt.go
Outdated
lenChunks := len(met.chunks) | ||
totalFlat := uint64((lenChunks * flatChunkSize) + flatAccntMetSize) | ||
totalChunk := uint64((lenChunks * ccacheMetChunkSize) + ccacheCacheMetSize) | ||
totalLru := uint64((lenChunks * lruItemSize)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double brackets are not necessary here
mdata/cache/accnt/flat_accnt.go
Outdated
@@ -251,12 +284,16 @@ func (a *FlatAccnt) delMet(metric schema.AMKey) { | |||
} | |||
|
|||
cacheSizeUsed.DecUint64(met.total) | |||
cacheOverheadFlat.DecUint64(totalFlat) | |||
cacheOverheadLru.DecUint64(totalLru) | |||
cacheOverheadChunk.DecUint64(totalChunk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's not important at all: it seems kind of mixed up to me that this function will first delete all the chunks of the metric, then update the related stats about those chunks, and then delete the metric from the metrics map. i think it should either first do all the deletes, and then update the stats, or the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at it now, it does look a little weird. I will change it to update the stats and then delete everything.
mdata/cache/accnt/stats.go
Outdated
@@ -33,6 +33,9 @@ var ( | |||
|
|||
cacheSizeMax = stats.NewGauge64("cache.size.max") | |||
cacheSizeUsed = stats.NewGauge64("cache.size.used") | |||
cacheOverheadChunk = stats.NewGauge64("cache.overhead.chunk") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be helpful to have a comment describing what each of those new metrics measure.
mdata/cache/accnt/flat_accnt.go
Outdated
lruItemSize = 76 | ||
|
||
// k: 4 bytes + v: 8 bytes (map[uint32]uint64) | ||
flatChunkSize = 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this naming pattern inconsistent? I think if the name of the variable below is flatAccntMetSize
, then the name of flatChunkSize
should actually be flatAccntChunkSize
, then their prefix would match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be flatAccntMetChunkSize
which is long. All of those names are actually pretty long, open to suggestions for abbreviations.
What about something like famChunkSize
, famSize
, ccmSize
, and ccmChunkSize
?
I think lruItemSize
is already short enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, you're suggestions look good, and they're consistent
mdata/cache/accnt/flat_accnt.go
Outdated
totalChunk += ccacheMetChunkSize | ||
// this func is called from the event loop so lru will be touched with new EvictTarget | ||
totalLru += lruItemSize | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary empty line
reorder stats updates add comments to stats variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Better let @Dieterbe take a look before merging, as he was involved in the discussions this was based on
you should run
see https://github.com/grafana/metrictank/blob/master/docs/CONTRIBUTING.md |
also, add the new metrics to the dashboard |
@@ -89,3 +91,28 @@ func TestLRUDelete(t *testing.T) { | |||
t.Fatalf("Expected lru to contain %d items, but have %d / %d", expectedSize, len(lru.items), lru.list.Len()) | |||
} | |||
} | |||
|
|||
func BenchmarkLRUGrowth(b *testing.B) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a comment describing what's the goal of this would be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tweaks needed as mentioned.
i'm not going to dig too deep into this. if it looks good to both robert and mauro, then it's good for me
update metrics.md
dashboards/main/metrictank.json
Outdated
{ | ||
"refId": "F", | ||
"target": "alias(sumSeries(metrictank.stats.$environment.$instance.cache.overhead.lru.gauge64), 'lru overhead')", | ||
"refCount": 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can combine these 3 into one query and use aliasByNode()
to set the right alias for all of them
Prior to this change we were only counting the number of bytes
that the actual data occupied. However, there is a lot of overhead
in our cache system that also needs to be tracked. We were seeing
large differences in the total cache size used and and live heap
allocations.
These changes attempt to track most of the accounting overhead.
The numbers used to track overhead are estimates based on
data type sizes.
Resolves: #1086
See also: #963