WIP: Automatic detection of metric interval #849
Force-pushed 5b8a3e6 to c57b9b7
```go
Interval   int
DataPoints []*schema.MetricData
Last       time.Time
pos        int
```
the values going into the map should be as small & pointerless as possible, for two reasons: memory usage and GC workload.
we don't need the full MetricData of each datapoint, we don't need a lock for each individual record (the logic is so concise that a lock on just the map is enough), pos is trivially implied (check whether the values in the array are 0), and time.Time is a fairly bloated structure (in particular, it holds a pointer to a timezone).
thus, instead of a sync.Map holding these items, I suggest one of these two:
```go
map[string][4]uint32
```

```go
type record struct {
	interval uint32
	points   [3]uint32
}

map[string]record
```
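To illustrate the suggestion above, here is a minimal sketch (not the PR's actual code) of how such a compact, pointer-free record could be used; note how the zero value of the struct doubles as an empty record, so no explicit `pos` field is needed:

```go
package main

import "fmt"

// record is the compact, pointer-free map value suggested above:
// 16 bytes total with no pointers, so the GC never has to scan it.
type record struct {
	interval uint32
	points   [3]uint32 // buffered timestamps; a zero value means "slot unused"
}

func main() {
	m := map[string]record{}
	r := m["some.metric"] // the zero value is a valid empty record
	// "pos" is implied: the first zero slot is the next free one.
	for i, ts := range []uint32{100, 110} {
		r.points[i] = ts
	}
	m["some.metric"] = r
	fmt.Println(m["some.metric"].points) // [100 110 0]
}
```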
input/input.go (outdated diff):
```go
// To protect against dropped metrics and out-of-order metrics, use the smallest delta as the interval.
// Because we have 3 datapoints, it doesn't matter what order they are in. The smallest delta is always
// going to be correct (unless there are out-of-order and dropped metrics).
```
to be more precise, the out-of-orderness must be constrained to the window of 3, for the smallest delta to be correct.
consider this sequence of timestamps:
40, 60, 80, 30, 50, 70
any window of 3 will result in an outcome of 20s interval, which is not correct.
note that it's common to have out-of-orderness beyond 3 points. (default reorderbuffer is 20 which I think is not always enough)
note also that currently the reorderbuffer requires the interval to be known for it to properly bucketize values.
i want to think a bit more about how we can make the reorderbuffer work better together with interval detection.
especially because interval detection can work better if it's done after data is reordered.
(maybe a separate ROB implementation for unknown-interval data?)
this is an interesting problem.
correct me if i'm wrong, but your heuristic allows for a resolution change once every 24h, and assumes that out of 3 points, 2 of them are subsequent and exhibit the "real" interval (barring some timing artifacts causing some error, as you point out). it seems the level of correctness is tunable, simply based on how many points we can track and inspect.
to a degree. Metrics don't become queryable until we determine the interval, as the interval is used when generating the MetricId. So we can't add the metricDef to the index until we know the correct interval. However, once we have received the first 3 points, we can then use the ROB to periodically re-determine the interval. And with a larger ROB, the accuracy of the interval detection improves.
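The smallest-delta heuristic under discussion can be sketched as follows (a minimal illustration, not the PR's actual code; `detectInterval` is a hypothetical name). It shows how every window of 3 from the sequence 40, 60, 80, 30, 50, 70 yields 20, even though the underlying series is spaced 10s apart:

```go
package main

import (
	"fmt"
	"sort"
)

// detectInterval sorts 3 buffered timestamps and returns the smaller of
// the two deltas, per the heuristic in the diff comment above.
func detectInterval(ts []uint32) uint32 {
	s := append([]uint32(nil), ts...) // copy so the caller's slice is untouched
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	d1, d2 := s[1]-s[0], s[2]-s[1]
	if d1 < d2 {
		return d1
	}
	return d2
}

func main() {
	// windows from the problematic sequence 40, 60, 80, 30, 50, 70:
	// each yields 20, but the series actually covers every 10s step.
	fmt.Println(detectInterval([]uint32{40, 60, 80})) // 20
	fmt.Println(detectInterval([]uint32{80, 30, 50})) // 20
}
```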
Force-pushed 65c69af to 79d620d
@Dieterbe this is ready for first review. It still has local modifications in vendor/gopkg.in/raintank/schema.v1, but that will be addressed soon. You can use the docker/docker-cloud stack to test. You will need to build a docker image for tsdb-gw using the optimizeKafkaMsg branch of tsdb-gw first.
- move aggmetrics/aggmetric/reorderbuffer into their own package called memorystore
- move notifier into its own package
- update raintank/schema to support 2 types of metric: the existing MetricData and a new MetricPoint. Both implement a new DataPoint interface
- update the input plugins and index to process schema.DataPoints instead of schema.MetricData
- use global variables in the mdata package instead of local fields being passed all over the place. These services and variables don't change after startup, so there is no need to pass them around:
  - BackendStore: Store
  - MemoryStore: AggMetrics
  - Cache: cache.CachePusher
  - Idx: idx.MetricIndex
  - DropFirstChunk: bool
- remove the intervalGetter logic from the carbon input plugin. Interval will now be autodetected
- create metricBuffer, which implements mdata.Metric. DataPoints that are received without enough information to update the index (either because it is a MetricPoint message or the Interval is not known) are first placed into a metricBuffer. Once the missing information is received, the buffered points are moved into an AggMetric
- add a docker-cloud stack that uses tsdb-gw
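The commit message above mentions a DataPoint interface implemented by both MetricData and MetricPoint. A minimal sketch of what such an interface could look like (the method names and fields here are assumptions for illustration, not the PR's actual definitions):

```go
package main

import "fmt"

// DataPoint is a hypothetical sketch of the common abstraction the commit
// message describes, covering both full MetricData and compact MetricPoint.
type DataPoint interface {
	KeyBySeries() string
	Timestamp() uint32
	Val() float64
}

// MetricPoint is the compact form: id, time, and value only, no metadata.
type MetricPoint struct {
	Id    string
	Time  uint32
	Value float64
}

func (p MetricPoint) KeyBySeries() string { return p.Id }
func (p MetricPoint) Timestamp() uint32   { return p.Time }
func (p MetricPoint) Val() float64        { return p.Value }

func main() {
	// Ingestion code can handle either message type through the interface.
	var dp DataPoint = MetricPoint{Id: "abc", Time: 60, Value: 1.5}
	fmt.Println(dp.KeyBySeries(), dp.Timestamp(), dp.Val()) // abc 60 1.5
}
```

The allocation cost of passing values through such an interface is exactly what the later comments in this thread debate.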
Force-pushed 79d620d to fece21d
what's with the …
MetricData.Validate() rejects metrics that have an Interval == 0, but not if it is less than 0. Though we can obviously change that.
Add benchmarks to compare decoding performance of MetricData vs MetricPoint messages:

```
go test -run none -bench . -benchmem -benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/grafana/metrictank/input/kafkamdm
BenchmarkDecodeMetricData-4    10000000    773 ns/op    352 B/op    10 allocs/op
BenchmarkDecodeMetricPoint-4   30000000    246 ns/op     80 B/op     2 allocs/op
PASS
ok      github.com/grafana/metrictank/input/kafkamdm    16.335s
```
* split different serializers into different files for easy file diffing
* follow benchmark best practice of having each iteration represent 1 metric to be serialized, and rely on automated conversion instead of a hardcoded amount of 3000
* show size on a per-metric basis, which is more useful
* simplify error handling
I added some commits, cleaning up the old benchmarks to be a better reference, adding the schema as discussed in comment #199 (comment) and onwards (minus the byte payload header which is a message-specific implementation detail, and the effect on it is the same irrespective of what new serialization format we'll use anyway), and added an experimental simplistic manual serializing/deserializing step. the format without orgid is codenamed MetricPointId1, the one with orgid is MetricPointId2 here's how it compares:
conclusion: the data payload is much smaller, because MessagePack adds metadata and attribute names (it writes strings like "Id", "Time", etc. preceding the actual fields). mine doesn't do that and is exactly the 28 bytes we were aiming for. it's almost too good to be true; maybe there's a bug somewhere, but I added unit tests that validate the encoding works, and they succeed. so i'm pretty happy with this experiment. let's avoid the overhead of interfaces and function calls, we don't need them.

do you plan to clean up this PR? otherwise I can create a new one with my stuff.
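A 28-byte fixed layout like the one described above would be a 16-byte id, a 4-byte timestamp, and an 8-byte value. The sketch below illustrates the idea; the field order and little-endian choice are assumptions for illustration, not necessarily the PR's actual MetricPointId1 wire format:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// point mirrors the compact MetricPoint idea: 16-byte id, uint32 time, float64 value.
type point struct {
	Id    [16]byte
	Time  uint32
	Value float64
}

// encode writes a point into exactly 28 bytes: no field names, no framing,
// which is where the savings over MessagePack come from.
func encode(p point) []byte {
	buf := make([]byte, 28)
	copy(buf[0:16], p.Id[:])
	binary.LittleEndian.PutUint32(buf[16:20], p.Time)
	binary.LittleEndian.PutUint64(buf[20:28], math.Float64bits(p.Value))
	return buf
}

// decode is the exact inverse of encode.
func decode(buf []byte) point {
	var p point
	copy(p.Id[:], buf[0:16])
	p.Time = binary.LittleEndian.Uint32(buf[16:20])
	p.Value = math.Float64frombits(binary.LittleEndian.Uint64(buf[20:28]))
	return p
}

func main() {
	p := point{Time: 1500000000, Value: 123.45}
	copy(p.Id[:], "0123456789abcdef")
	b := encode(p)
	fmt.Println(len(b), decode(b) == p) // 28 true
}
```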
Force-pushed 4eb3c92 to 88692fa
all of these changes to raintank/schema need to be moved to raintank/schema#15
Without using interfaces, how do you plan on passing datapoints through the metrictank ingestion path? Are you just going to create two variants of every function that the datapoints pass through: one that takes the full MetricData and one that takes the optimized data structure?
yes, that's more or less what I was thinking of. will have to prototype it to see if it can be done in a way that's not too ugly. my main concern is really just allocations in the ingestion path; we've seen those get expensive fast, with effects on various other characteristics (like GC-induced latency spikes).
closing this as we decided not to pursue this for now.
closes #741
If a metric is received with an Interval field set to 0 or less, then we
need to automatically detect the interval. This is achieved by buffering
the last 3 points in memory. We then sort the points by timestamp
and take the time difference between the first 2 and the last 2. The smaller
difference is used as the interval. Metrics are delayed from being processed
until the Interval is known, so when a new series is sent it will take 3x the interval
before any data is queryable. The interval is re-calculated for the series every 24 hours.

Future work is needed to adjust the detected interval to known valid intervals, e.g. 1s, 5s, 10s, 30s, 1m, 5m, etc.
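The detection step plus the proposed snapping to known intervals might look like this. This is a sketch only: the `detect`/`snap` names and the candidate-interval table are assumptions for illustration, not the PR's code.

```go
package main

import (
	"fmt"
	"sort"
)

// knownIntervals are candidate resolutions (in seconds) to snap to,
// per the "1s, 5s, 10s, 30s, 1m, 5m, etc." suggestion above.
var knownIntervals = []uint32{1, 5, 10, 30, 60, 300, 600, 1800, 3600}

// detect sorts the 3 buffered timestamps and returns the smaller of
// the two deltas, as the PR description specifies.
func detect(ts [3]uint32) uint32 {
	s := ts[:] // ts is a copy of the caller's array, so sorting it is safe
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	d1, d2 := s[1]-s[0], s[2]-s[1]
	if d1 < d2 {
		return d1
	}
	return d2
}

// snap rounds a detected interval to the nearest known interval.
func snap(d uint32) uint32 {
	best := knownIntervals[0]
	for _, k := range knownIntervals {
		if absDiff(k, d) < absDiff(best, d) {
			best = k
		}
	}
	return best
}

func absDiff(a, b uint32) uint32 {
	if a > b {
		return a - b
	}
	return b - a
}

func main() {
	// a 60s series with 1s of timing jitter on the later points
	d := detect([3]uint32{60, 121, 183})
	fmt.Println(d, snap(d)) // 61 60
}
```

Snapping absorbs small timing artifacts (a raw delta of 61s becomes a clean 60s interval), at the cost of misclassifying series whose true interval is not in the table.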