
Sarama update #906

Merged: 10 commits from sarama_update into master, May 15, 2018

Conversation

replay (Contributor) commented May 4, 2018

Updates the Sarama library from 1.10.1 to 1.16.0.
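For context, a minimal sketch of a consumer under the newer sarama releases, which negotiate protocol features based on an explicitly configured Kafka version rather than auto-detecting it. The broker address, topic, and pinned version below are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	// sarama does not auto-detect the broker version; pin it to the
	// oldest broker this consumer must support (assumption: 0.10.0.1,
	// one of the versions tested below).
	config.Version = sarama.V0_10_0_1

	// Broker address and topic are placeholders.
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Consume one partition from the oldest available offset.
	pc, err := consumer.ConsumePartition("mdm", 0, sarama.OffsetOldest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		log.Printf("partition=%d offset=%d", msg.Partition, msg.Offset)
	}
}
```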

@Dieterbe Dieterbe added this to the 0.9.0 milestone May 4, 2018
@replay replay requested a review from Dieterbe May 7, 2018 13:17
Dieterbe (Contributor) commented May 7, 2018

Have you done any testing of this? What kind? With what Kafka version?

replay (Contributor, Author) commented May 7, 2018

Yeah, tested with Kafka 0.11.0.2 and 0.10.0.1, and it seems to work fine with both.
I ran it on my workstation in the docker env and fed it fakemetrics data.
I also tested replaying the same data from Kafka 0.10.0.1 with both branches; performance seems pretty similar. The new version appears slightly slower, but the difference is quite small:

master: https://snapshot.raintank.io/dashboard/snapshot/gXmIVHiTTDQSsJVXWddyboijdl2jdufg
sarama_update: https://snapshot.raintank.io/dashboard/snapshot/oUUFr6AThkdamF3d9fGF7PjI1u2aOBKu

Dieterbe (Contributor) commented May 8, 2018

Can we strip out the 13MiB of testdata from github.com/pierrec/lz4 (i.e. rewrite the commit that introduced it)?

scripts/build.sh (outdated)
@@ -1,4 +1,8 @@
 #!/bin/bash

+set -ex
Dieterbe (Contributor) commented on scripts/build.sh:

I don't think scripts should print out every single thing they do; that would only be useful during development of the script.

@@ -285,7 +285,7 @@ partition-scheme = bySeries
 # offset to start consuming from. Can be one of newest, oldest, last or a time duration
 # When using a duration but the offset request fails (e.g. Kafka doesn't have data so far back), metrictank falls back to `oldest`.
 # Should match your kafka-mdm-in setting
-offset = last
+offset = oldest
Dieterbe (Contributor) commented:

This results in a minute-long stream of

metrictank_1    | 2018/05/08 09:17:42 [W] stats dialing localhost:2003 failed: dial tcp 127.0.0.1:2003: connect: connection refused. will retry
metrictank_1    | 2018/05/08 09:17:43 [W] stats dialing localhost:2003 failed: dial tcp 127.0.0.1:2003: connect: connection refused. will retry
statsdaemon_1   | 2018/05/08 09:17:43 WARN: dialing metrictank:2003 failed: dial tcp 172.19.0.11:2003: getsockopt: connection refused. will retry

followed by:

metrictank_1    | 2018/05/08 09:18:43 [W] kafka-cluster: Processing metricPersist backlog has taken too long, giving up lock after 1m0s.

Dieterbe (Contributor) commented:

Which is strange. For benchmarks that consume backfilled data this is of course a useful change, but it's not necessarily something we should commit, or at least not with the commit message 'update sarama'.

Dieterbe (Contributor) commented May 8, 2018

So in your run it went from about 880kHz to 830kHz (6%).
I can confirm similar results: for me it went from 600kHz to 540kHz (10%), along with an increase in memory used (90MiB to 200MiB).

Dieterbe (Contributor) commented May 9, 2018

Note: when I change DefaultHandler.ProcessMetricPoint to comment out both the adding to the index and the adding of the data to the tank, ingestion improves from 540kHz to 1MHz, so there are things we can do to counteract the slower sarama; not sure yet what. When I change AggMetrics.GetOrCreate to leverage the RWLock, ingestion does not improve.
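For reference, "leveraging the RWLock" here means roughly the double-checked pattern below; this is a simplified sketch with stand-in types, not the actual metrictank code:

```go
package main

import "sync"

// AggMetric is a stub standing in for metrictank's real type.
type AggMetric struct{ key string }

func newAggMetric(key string) *AggMetric { return &AggMetric{key: key} }

type aggMetrics struct {
	sync.RWMutex
	metrics map[string]*AggMetric
}

func (ms *aggMetrics) GetOrCreate(key string) *AggMetric {
	// Fast path: most points belong to an existing series, so a read
	// lock lets concurrent consumers proceed without serializing.
	ms.RLock()
	m, ok := ms.metrics[key]
	ms.RUnlock()
	if ok {
		return m
	}

	// Slow path: take the write lock and re-check, since another
	// goroutine may have created the entry in the meantime.
	ms.Lock()
	defer ms.Unlock()
	if m, ok := ms.metrics[key]; ok {
		return m
	}
	m = newAggMetric(key)
	ms.metrics[key] = m
	return m
}
```

If ingestion is dominated by sarama itself, as the measurements above suggest, cheapening this lock won't move the needle, which matches the observation that the RWLock change didn't help.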

Furthermore:

  • disabling the pressure metrics does not improve throughput when the above logic is commented out (at that point sarama is the bottleneck)
  • however, with the idx/tank logic reinstated, disabling the pressure metrics gives us about a 5-6% performance increase

Dieterbe (Contributor) commented May 9, 2018

Backfilling MetricData on my system: 350kHz.
Everything else the same, but with MetricPoint data: 530kHz.
So while the Kafka perf regression is unfortunate, at least the format upgrade seems to make up for it.

replay and others added 8 commits May 15, 2018 16:19
seems to have a negligible / non-existent benefit
-> looks like sarama is the bottleneck, not this
we don't really seem to need them.
this gives us about 5% ingest throughput improvement;
perhaps we can reinstate them when we do batched operations
Dieterbe (Contributor) commented:

@replay please approve my changes.

schema := Schemas.Get(schemaId)
m = NewAggMetric(ms.store, ms.cachePusher, k, schema.Retentions, schema.ReorderWindow, &agg, ms.dropFirstChunk)
ms.Metrics[key] = m
metricsActive.Set(len(ms.Metrics))
ms.Unlock()
replay (Contributor, Author) commented May 15, 2018:

Couldn't the .Unlock() happen before metricsActive.Set()? IIRC that's thread-safe; just the len() probably needs to be inside the locked block.
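A sketch of the suggested reordering, assuming the stats gauge is safe to update without holding the lock (len() on the map still races with concurrent writes, so it stays inside the locked block):

```go
ms.Metrics[key] = m
active := len(ms.Metrics) // read the map size while still locked
ms.Unlock()
metricsActive.Set(active) // gauge update is thread-safe, no lock needed
```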

}
agg := Aggregations.Get(aggId)
schema := Schemas.Get(schemaId)
replay (Contributor, Author) commented May 15, 2018:

The instantiation of the schema.AMKey, plus the Aggregations.Get() and Schemas.Get() calls, could all happen a little further up: after we know that we need to create an entry, but before we acquire the write lock, to keep the write-lock hold time short.
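Roughly, that reordering could look like this. This is a sketch against the snippet above, with a re-check under the lock added on the assumption that duplicate creation must be avoided:

```go
// Resolve config and build the new AggMetric before locking, so the
// write lock only covers the map mutation.
agg := Aggregations.Get(aggId)
schema := Schemas.Get(schemaId)
m := NewAggMetric(ms.store, ms.cachePusher, k, schema.Retentions, schema.ReorderWindow, &agg, ms.dropFirstChunk)

ms.Lock()
if existing, ok := ms.Metrics[key]; ok {
	m = existing // another goroutine won the race; discard ours
} else {
	ms.Metrics[key] = m
}
active := len(ms.Metrics)
ms.Unlock()
metricsActive.Set(active)
```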

@woodsaj woodsaj deleted the sarama_update branch May 16, 2018 05:32
@Dieterbe Dieterbe mentioned this pull request Sep 19, 2018