
Metricpoint batch format #1002

Closed
Dieterbe opened this issue Aug 19, 2018 · 8 comments · Fixed by #2011

Comments

@Dieterbe
Contributor

Dieterbe commented Aug 19, 2018

#876 introduced the MetricPoint format, which helped us reduce resources, but not to the extent we hoped.
The hypothesis is that kafka per-message overhead has become significant, and that we should now batch multiple MetricPoint (or MetricPoint-like) messages into a single kafka message.
This will likely reduce kafka disk space and network IO (and perhaps some kafka CPU), but probably not ingest speed. The experiments and numbers linked in #876 lead us to believe this.

@woodsaj
Member

woodsaj commented Sep 6, 2018

I have created #1032 with some benchmarks comparing the performance of 1 metricPoint per kafka message to 10, 50 and 100 metricPoints per message.

anthony:~/go/src/github.com/grafana/metrictank/input/kafkamdm$ go test -v -run none -bench Snappy  -benchtime 2s -cpu=8 -benchmem
goos: linux
goarch: amd64
pkg: github.com/grafana/metrictank/input/kafkamdm
Benchmark1PointPerMessageSnappy-8        2000000              1327 ns/op             838 B/op          6 allocs/op
Benchmark10PointPerMessageSnappy-8      20000000               267 ns/op             244 B/op          0 allocs/op
Benchmark50PointPerMessageSnappy-8      20000000               181 ns/op             196 B/op          0 allocs/op
Benchmark100PointPerMessageSnappy-8     20000000               165 ns/op             190 B/op          0 allocs/op
PASS
ok      github.com/grafana/metrictank/input/kafkamdm    71.806s

The improvement is significant.
Batching the messages also yields about a 35% saving in kafka disk space (when using Snappy compression). This would also translate to a 35% reduction in kafka network traffic.
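
To illustrate where such a saving can come from, here is a rough sketch (not the actual #1032 benchmark, and a simplification: kafka's own per-message overhead and how the producer/broker compress record batches are not modeled). It compares the Snappy-compressed size of N single-point payloads against one payload holding the same N points concatenated, assuming the github.com/golang/snappy package; the 28-byte point layout is a made-up stand-in for the real MetricPoint encoding.

// Rough illustration only: compare compressed size of single-point vs
// batched payloads. Not metrictank code; the point layout is fake.
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"

	"github.com/golang/snappy"
)

// fakePoint builds a 28-byte stand-in for an encoded MetricPoint.
func fakePoint(i int) []byte {
	p := make([]byte, 28)
	binary.LittleEndian.PutUint64(p[0:8], rand.Uint64())          // id, part 1
	binary.LittleEndian.PutUint64(p[8:16], rand.Uint64())         // id, part 2
	binary.LittleEndian.PutUint64(p[16:24], uint64(i))            // value
	binary.LittleEndian.PutUint32(p[24:28], uint32(1536192000+i)) // timestamp
	return p
}

func main() {
	const n = 100
	singleTotal, batched := 0, []byte{}
	for i := 0; i < n; i++ {
		p := fakePoint(i)
		singleTotal += len(snappy.Encode(nil, p)) // one payload per point
		batched = append(batched, p...)           // one payload for all points
	}
	fmt.Printf("%d payloads of 1 point: %d compressed bytes\n", n, singleTotal)
	fmt.Printf("1 payload of %d points: %d compressed bytes\n", n, len(snappy.Encode(nil, batched)))
}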

@Dieterbe
Contributor Author

Dieterbe commented Sep 6, 2018

Hmm, only 35%? I would have expected more. As a reminder, going from a typical MD message (203B for a fakemetrics message) down to a MetricPoint message (28B) is a reduction of 86% (leaving only 14%).
With MetricPoint v1 we saw disk usage reduced by about half; now we shave another 35% off, so across both optimizations disk usage comes down to about 32% of what it used to be (65% of 50%), not 14%. Granted, my "about half" was a bit rough, but it still looks like we have kafka overhead.

Not sure how relevant https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines and https://www.cloudera.com/documentation/kafka/1-2-x/topics/kafka_performance.html still are, but it seems the best overall throughput is achieved with messages of 1kB up to 1MB (?).
Since a MetricPoint is 28~32 bytes, I would be interested in seeing tests with up to 50k points per message.

I imagine a producer like fakemetrics or carbon-relay-ng (when writing to kafka) would build up a batch until either 50k points (or whatever we find to be optimal) have accumulated or 1 second has passed (see the sketch at the end of this comment).
A producer like tsdb-gw would do the same, but is also capped by the number of metrics it receives in its ingest path (as each ingest request needs to be individually acked).

Note that we can save up to 14.3% more (4B out of 28B) if we batch points together based on their timestamp: while the producers are building their batches, they could keep a separate batch per timestamp seen. But then we have to worry about the - unlikely - scenario of a sender using many different timestamps, which could create too many small batches. So for now let's not worry about this additional optimization.
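
As referenced above, a minimal sketch of what such a size-or-time batcher could look like (one per partition in practice). The names and structure here are illustrative, not actual metrictank, fakemetrics or carbon-relay-ng code; flush would hand the concatenated payload to the kafka producer.

// Illustrative sketch of a producer-side batcher: flush when maxPoints is
// reached or when maxWait elapses, whichever comes first.
package main

import (
	"fmt"
	"sync"
	"time"
)

type batcher struct {
	mu        sync.Mutex
	buf       []byte // concatenated encoded MetricPoints
	count     int    // points currently buffered
	maxPoints int    // size-based flush trigger
	maxWait   time.Duration
	flush     func(payload []byte) // e.g. produce one kafka message
}

func newBatcher(maxPoints int, maxWait time.Duration, flush func([]byte)) *batcher {
	b := &batcher{maxPoints: maxPoints, maxWait: maxWait, flush: flush}
	go func() {
		// time-based flush trigger, independent of the size trigger
		for range time.Tick(maxWait) {
			b.mu.Lock()
			b.flushLocked()
			b.mu.Unlock()
		}
	}()
	return b
}

func (b *batcher) add(encodedPoint []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buf = append(b.buf, encodedPoint...)
	b.count++
	if b.count >= b.maxPoints {
		b.flushLocked()
	}
}

func (b *batcher) flushLocked() {
	if b.count == 0 {
		return
	}
	payload := make([]byte, len(b.buf))
	copy(payload, b.buf)
	b.flush(payload)
	b.buf, b.count = b.buf[:0], 0
}

func main() {
	b := newBatcher(100, 500*time.Millisecond, func(p []byte) {
		fmt.Printf("flushing %d bytes\n", len(p))
	})
	for i := 0; i < 250; i++ {
		b.add(make([]byte, 28)) // stand-in for an encoded MetricPoint
	}
	time.Sleep(time.Second) // let the time trigger flush the remaining points
}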

@woodsaj
Member

woodsaj commented Sep 6, 2018

We are never going to be able to put thousands of points in a single message in production environments. In production environments we have:
a) 2, 4 or 16 tsdb-gw's
b) 8, 32 or 128 Kafka partitions.

We definitely don't want to buffer messages for more than 500ms, as we can't send a 200 back to carbon-relay-ng until the points have been committed to kafka.

So that means the most we can batch is (rate / 2) / (tsdb-gws * partitions).
E.g. for a medium instance with 100K metrics/second, the most we could put in a kafka message is (100,000 / 2) / (4 * 32) = 390.

There is no change in disk saving between 100 points per message and 500 points per message.
So it is best to just target between 100 and 500 points per message if possible.
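
For reference, that back-of-envelope estimate as a tiny, purely illustrative helper; rate / 2 reflects the 500ms flush window, and it assumes an even spread of traffic across gateways and partitions.

package main

import "fmt"

// maxPointsPerMessage: with a 500ms flush window, each (gateway, partition)
// pair accumulates at most half a second's worth of its share of the ingest
// rate before it has to flush.
func maxPointsPerMessage(ratePerSec, gateways, partitions int) int {
	return (ratePerSec / 2) / (gateways * partitions)
}

func main() {
	fmt.Println(maxPointsPerMessage(100000, 4, 32)) // 390
}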

@Dieterbe
Contributor Author

Dieterbe commented Sep 10, 2018

This is why I specifically mentioned producers like carbon-relay-ng (for those who want to write from carbon-relay-ng directly to kafka; pretty sure we have customers with on-prem production environments wanting to do this, or already doing it) and fakemetrics for benchmarking.

Furthermore, your math is based on average throughput and equal distributions.
Consider that carbon-relay-ng uses a flushMaxNum of 5000k for grafanaNet (tsdbgw) and 10k for kafka-mdm, that these values can be tweaked at will, and that traffic can be (and often is) bursty.
And since we shard via hashing, we can expect slightly uneven distributions in general, and very unbalanced distributions in the worst case (rare, but not non-existent).

So it is possible to have thousands of metrics to produce into a given partition at once.
Your suggestion of a 100-point cap - which currently means about 3kB per message - is fine, with the caveat that a burst of metrics may result in multiple batches to a given partition at once (e.g. within one tsdb-gw ingest request). I suspect that with a higher cap (e.g. in the thousands) we could squeeze out some more savings (in terms of ingestion latency, or possibly a few % of kafka cpu usage), but this is higher-hanging fruit, maybe not worth spending time on now.

PS: the carbon-relay-ng<->tsdbgw interaction has always been (or used to be) very latency-sensitive.
I'm not sure if this is entirely solved by grafana/carbon-relay-ng#272 (probably it is), but a 500ms cap sounds better than 1s for sure, even just for optics.

So that means the most we can batch is (rate / 2) / (tsdb-gws * partitions)

I would advise against deriving the batch size from a preconceived rate combined with the max-flush-wait condition (the rate/2 coming from the 500ms clause).
Since we will use one flush trigger based on size and one based on time, we can let each be defined and triggered independently, as their functions are complementary.
In other words: the max wait should be based on the max wait we want, and the max batch size should be whatever max batch we're willing to support. Each condition will be applied as needed; there is no need to set one based on the other.

@woodsaj
Member

woodsaj commented Sep 10, 2018

I don't think we should cap the number of points per message at 100. My comment "target between 100 and 500 points per message" should probably have been written as "optimize for 100 to 500 points per message".

The primary reason I commented was to ensure we don't waste time benchmarking use cases that would never happen in production, i.e.:

I would be interested in seeing tests with up to 50k points per message

@stale

stale bot commented Apr 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Apr 4, 2020
stale bot closed this as completed on Apr 11, 2020
@shanson7
Collaborator

shanson7 commented Oct 7, 2021

I think there is still benefit to this. A simple batching scheme of just concatenating MetricPoint messages together should be sufficient.

@shanson7
Collaborator

I'm working on this to see if it will reduce resource usage on both the producer side and the consumer (metrictank) side. I have a PoC running that looks promising. Before I roll it out to our staging cluster, it would be good to get some tentative agreement on the batch format. The PR lives at bloomberg#111 and the format is very basic.

Essentially, instead of the single format byte of the existing messages, it takes 2 format bytes. The first byte is FormatMetricPointArray and the second byte is one of the existing formats, FormatMetricPoint or FormatMetricPointWithoutOrg. Following these two bytes is a simple concatenation of points encoded according to the specified per-point format.
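
To illustrate the framing (the authoritative definition is in bloomberg#111), here is a minimal sketch. The constant values, the 28-byte point size and the helper names are placeholders, not metrictank's actual definitions.

// Sketch of the proposed framing: byte 0 marks a point array, byte 1 names
// the per-point format, and the rest is the points back to back. The decoder
// assumes a fixed per-point size (28B, the without-org size from above).
package main

import "fmt"

const (
	formatMetricPointArray      byte = 0xfa // placeholder value
	formatMetricPoint           byte = 0xfb // placeholder value
	formatMetricPointWithoutOrg byte = 0xfc // placeholder value

	pointSize = 28 // placeholder fixed size of one encoded point
)

// encodeBatch frames already-encoded points into one kafka message payload.
func encodeBatch(pointFormat byte, encodedPoints [][]byte) []byte {
	out := make([]byte, 0, 2+len(encodedPoints)*pointSize)
	out = append(out, formatMetricPointArray, pointFormat)
	for _, p := range encodedPoints {
		out = append(out, p...)
	}
	return out
}

// decodeBatch splits a payload back into fixed-size point slices.
func decodeBatch(payload []byte) (pointFormat byte, points [][]byte, err error) {
	if len(payload) < 2 || payload[0] != formatMetricPointArray {
		return 0, nil, fmt.Errorf("not a MetricPoint array message")
	}
	body := payload[2:]
	if len(body)%pointSize != 0 {
		return 0, nil, fmt.Errorf("payload length %d is not a multiple of %d", len(body), pointSize)
	}
	for i := 0; i < len(body); i += pointSize {
		points = append(points, body[i:i+pointSize])
	}
	return payload[1], points, nil
}

func main() {
	pts := [][]byte{make([]byte, pointSize), make([]byte, pointSize)}
	msg := encodeBatch(formatMetricPointWithoutOrg, pts)
	format, decoded, _ := decodeBatch(msg)
	fmt.Printf("format=%#x points=%d msgBytes=%d\n", format, len(decoded), len(msg))
}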
