
Sarama update #906

Merged: 10 commits from sarama_update into master, May 15, 2018

Conversation

replay (Contributor) commented May 4, 2018

Updates the Sarama library from 1.10.1 to 1.16.0.
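For context, a minimal sketch of a consumer under the newer sarama releases, which negotiate protocol features based on an explicitly configured Kafka version rather than auto-detecting it. The broker address, topic, and pinned version below are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	// sarama does not auto-detect the broker version; pin it to the
	// oldest broker this consumer must support (assumption: 0.10.0.1,
	// one of the versions tested below).
	config.Version = sarama.V0_10_0_1

	// Broker address and topic are placeholders.
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Consume one partition from the oldest available offset.
	pc, err := consumer.ConsumePartition("mdm", 0, sarama.OffsetOldest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		log.Printf("partition=%d offset=%d", msg.Partition, msg.Offset)
	}
}
```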

@Dieterbe Dieterbe added this to the 0.9.0 milestone May 4, 2018
@replay replay requested a review from Dieterbe May 7, 2018 13:17
Dieterbe (Contributor) commented May 7, 2018

Have you done any testing of this? What kind? With what Kafka version?

replay (Contributor, Author) commented May 7, 2018

Yeah, tested with Kafka 0.11.0.2 and 0.10.0.1, and it seems to work fine with both.
I ran it on my workstation in the docker env and fed it fakemetrics data.
I also tested replaying the same data from Kafka 0.10.0.1 with both branches; performance seems pretty similar. The new version appears slightly slower, but the difference is quite small:

master: https://snapshot.raintank.io/dashboard/snapshot/gXmIVHiTTDQSsJVXWddyboijdl2jdufg
sarama_update: https://snapshot.raintank.io/dashboard/snapshot/oUUFr6AThkdamF3d9fGF7PjI1u2aOBKu

Dieterbe (Contributor) commented May 8, 2018

Can we strip out the 13MiB of testdata from github.com/pierrec/lz4 (i.e. rewrite the commit that introduced it)?

scripts/build.sh (outdated)
@@ -1,4 +1,8 @@
 #!/bin/bash

+set -ex
Dieterbe (Contributor) commented on scripts/build.sh:

I don't think scripts should print out every single thing they do; that would only be useful during development of the script.

@@ -285,7 +285,7 @@ partition-scheme = bySeries
 # offset to start consuming from. Can be one of newest, oldest, last or a time duration
 # When using a duration but the offset request fails (e.g. Kafka doesn't have data so far back), metrictank falls back to `oldest`.
 # Should match your kafka-mdm-in setting
-offset = last
+offset = oldest
Dieterbe (Contributor) commented:

This results in a minute-long stream of

metrictank_1    | 2018/05/08 09:17:42 [W] stats dialing localhost:2003 failed: dial tcp 127.0.0.1:2003: connect: connection refused. will retry
metrictank_1    | 2018/05/08 09:17:43 [W] stats dialing localhost:2003 failed: dial tcp 127.0.0.1:2003: connect: connection refused. will retry
statsdaemon_1   | 2018/05/08 09:17:43 WARN: dialing metrictank:2003 failed: dial tcp 172.19.0.11:2003: getsockopt: connection refused. will retry

followed by:

metrictank_1    | 2018/05/08 09:18:43 [W] kafka-cluster: Processing metricPersist backlog has taken too long, giving up lock after 1m0s.

Dieterbe (Contributor) commented:

Which is strange. For benchmarks that consume backfilled data this is of course a useful change, but it's not necessarily something we should commit, or at least not with the commit message 'update sarama'.

Dieterbe (Contributor) commented May 8, 2018

So in your run it went from about 880kHz to 830kHz (6%).
I can confirm similar results: for me it went from 600kHz to 540kHz (10%), along with an increase in memory used (90MiB to 200MiB).

Dieterbe (Contributor) commented May 9, 2018

Note: when I change DefaultHandler.ProcessMetricPoint to comment out both the adding to the index and the adding of the data to the tank, ingestion improves from 540kHz to 1MHz, so there are things we can do to counteract the slower sarama; not sure yet what. When I change AggMetrics.GetOrCreate to leverage the RWLock, ingestion does not improve.
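For reference, "leveraging the RWLock" here means roughly the double-checked pattern below; this is a simplified sketch with stand-in types, not the actual metrictank code:

```go
package main

import "sync"

// AggMetric is a stub standing in for metrictank's real type.
type AggMetric struct{ key string }

func newAggMetric(key string) *AggMetric { return &AggMetric{key: key} }

type aggMetrics struct {
	sync.RWMutex
	metrics map[string]*AggMetric
}

func (ms *aggMetrics) GetOrCreate(key string) *AggMetric {
	// Fast path: most points belong to an existing series, so a read
	// lock lets concurrent consumers proceed without serializing.
	ms.RLock()
	m, ok := ms.metrics[key]
	ms.RUnlock()
	if ok {
		return m
	}

	// Slow path: take the write lock and re-check, since another
	// goroutine may have created the entry in the meantime.
	ms.Lock()
	defer ms.Unlock()
	if m, ok := ms.metrics[key]; ok {
		return m
	}
	m = newAggMetric(key)
	ms.metrics[key] = m
	return m
}
```

If ingestion is dominated by sarama itself, as the measurements above suggest, cheapening this lock won't move the needle, which matches the observation that the RWLock change didn't help.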

Furthermore:

  • disabling the pressure metrics does not improve throughput when the above logic is commented out (at that point sarama is the bottleneck)
  • however, with the idx/tank logic reinstated, disabling the pressure metrics gives us about a 5-6% performance increase

Dieterbe (Contributor) commented May 9, 2018

Backfilling MetricData on my system: 350kHz.
Everything else the same, but with MetricPoint data: 530kHz.
So while the Kafka perf regression is unfortunate, at least the format upgrade seems to make up for it.

replay and others added 8 commits May 15, 2018 16:19
seems to have a negligible / non-existent benefit
-> looks like sarama is the bottleneck, not this
we don't really seem to need them.
this gives us about 5% ingest throughput improvement;
perhaps we can reinstate them when we do batched operations
Dieterbe (Contributor) commented:

@replay please approve my changes.

schema := Schemas.Get(schemaId)
m = NewAggMetric(ms.store, ms.cachePusher, k, schema.Retentions, schema.ReorderWindow, &agg, ms.dropFirstChunk)
ms.Metrics[key] = m
metricsActive.Set(len(ms.Metrics))
ms.Unlock()
replay (Contributor, Author) commented May 15, 2018:

Couldn't the .Unlock() happen before metricsActive.Set()? IIRC that's thread-safe; just the len() probably needs to be inside the locked block.
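A sketch of the suggested reordering, assuming the stats gauge is safe to update without holding the lock (len() on the map still races with concurrent writes, so it stays inside the locked block):

```go
ms.Metrics[key] = m
active := len(ms.Metrics) // read the map size while still locked
ms.Unlock()
metricsActive.Set(active) // gauge update is thread-safe, no lock needed
```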

}
agg := Aggregations.Get(aggId)
schema := Schemas.Get(schemaId)
replay (Contributor, Author) commented May 15, 2018:

The instantiation of the schema.AMKey, plus the Aggregations.Get() and Schemas.Get() calls, could all happen a little further up: after we know that we need to create an entry, but before we acquire the write lock, to keep the write-lock hold time short.
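Roughly, that reordering could look like this. This is a sketch against the snippet above, with a re-check under the lock added on the assumption that duplicate creation must be avoided:

```go
// Resolve config and build the new AggMetric before locking, so the
// write lock only covers the map mutation.
agg := Aggregations.Get(aggId)
schema := Schemas.Get(schemaId)
m := NewAggMetric(ms.store, ms.cachePusher, k, schema.Retentions, schema.ReorderWindow, &agg, ms.dropFirstChunk)

ms.Lock()
if existing, ok := ms.Metrics[key]; ok {
	m = existing // another goroutine won the race; discard ours
} else {
	ms.Metrics[key] = m
}
active := len(ms.Metrics)
ms.Unlock()
metricsActive.Set(active)
```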

@woodsaj woodsaj deleted the sarama_update branch May 16, 2018 05:32
@Dieterbe Dieterbe mentioned this pull request Sep 19, 2018