kafka offset #236
Comments
To get the offset you want, use https://godoc.org/github.com/Shopify/sarama#Broker.GetAvailableOffsets
i was too vague, i meant: how should we control or choose the value? how do we decide on a value? should it be manually specified in the config? should MT try to figure it out by itself? are those int32s in the response map unix timestamps?
i think that the "replay" period should be driven by config options. But rather than a new config option, i think it should just be based on the ChunkSize. The response map is OffsetResponseBlock.Offsets; it contains the offset that you want to set in config.Consumer.Offsets.Initial. OffsetResponseBlock.Offsets is an array because the OffsetRequest is a request for all offsets before the specified timestamp, but you can use maxOffsets to limit the response to 1 offset.
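A minimal sketch of that request against sarama's Broker API, limiting the response to a single offset with maxOffsets = 1. The broker address, topic name ("mdm") and partition are placeholders, and the ChunkSize value is an assumed example; this needs a reachable Kafka broker to actually run.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Placeholder broker address; adjust for your cluster.
	broker := sarama.NewBroker("localhost:9092")
	if err := broker.Open(sarama.NewConfig()); err != nil {
		log.Fatal(err)
	}
	defer broker.Close()

	// Replay period derived from ChunkSize (assumed 2h here for illustration),
	// expressed as a millisecond timestamp as the offset API expects.
	chunkSize := 2 * time.Hour
	ts := time.Now().Add(-chunkSize).UnixNano() / int64(time.Millisecond)

	// Request all offsets before ts, but cap the response at 1 offset.
	req := &sarama.OffsetRequest{}
	req.AddBlock("mdm", 0, ts, 1) // maxOffsets = 1

	resp, err := broker.GetAvailableOffsets(req)
	if err != nil {
		log.Fatal(err)
	}
	block := resp.GetBlock("mdm", 0)
	if block.Err != sarama.ErrNoError {
		log.Fatal(block.Err)
	}
	// The single returned offset is the value the comment above suggests
	// feeding into config.Consumer.Offsets.Initial.
	fmt.Println(block.Offsets[0])
}
```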
so after looking at the code, setting the offset is not really possible when using bsm/sarama-cluster. The initial offset will be used the very first time metricTank starts up, but on all future startups, sarama-cluster will start from the last committed offset. This is not really surprising, given our use case is non-standard. The solution is not complex, but it requires forking sarama-cluster and modifying https://github.com/bsm/sarama-cluster/blob/master/consumer.go#L529-L572 to suit our needs.
maybe we shouldn't use sarama-cluster to begin with. What value does sarama-cluster provide that we can't get from sarama itself?
from the sarama GoDoc https://godoc.org/github.com/Shopify/sarama
As we don't intend to use consumer-group rebalancing (for now) or offset tracking, we don't need sarama-cluster.
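For comparison, consuming a partition with plain sarama is already straightforward; a hedged sketch, with broker address, topic ("mdm") and partition as placeholders and a live broker assumed:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Without sarama-cluster we choose the partition and the starting
	// offset ourselves: no group rebalancing, no offset tracking.
	pc, err := consumer.ConsumePartition("mdm", 0, sarama.OffsetNewest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		fmt.Printf("partition=%d offset=%d key=%s\n", msg.Partition, msg.Offset, msg.Key)
	}
}
```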
ok. as it seems like a lot of the more in-depth kafka functionality is hidden when you use sarama-cluster, I'll try with sarama directly. also there's https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index which will, in the future, make it easier to find exactly the right offset to consume from in situations like "I want to start consuming as of 180 minutes ago" (as is the case in our prod setup). It also mentions we can sort of do this now, but at log-segment granularity, which has some caveats. I think for the near term it's best to just stick with the default of NewestOffset and keep our old approach, where a node needs to warm up in real time. This will let us roll out kafka much quicker, and we can do the "speed up warming up a fresh instance by seeking back to an older offset" work afterwards.
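The "sort of do this now" path is exposed in sarama as Client.GetOffset, which resolves a timestamp to an offset; on pre-KIP-33 brokers the answer is only log-segment granular. A sketch, again assuming a placeholder broker, topic ("mdm") and partition:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Resolve "180 minutes ago" to an offset. Without KIP-33 the broker
	// answers per log segment, so this may point earlier than requested.
	ts := time.Now().Add(-180*time.Minute).UnixNano() / int64(time.Millisecond)
	offset, err := client.GetOffset("mdm", 0, ts)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("start consuming from offset", offset)
}
```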
@woodsaj any ideas on how to tell MT which kafka offset to use at startup? in general I think we'd want to load a couple hours of data at startup, and in disaster scenarios maybe more (e.g. to rewrite faulty data in cassandra)