This repository was archived by the owner on Aug 23, 2023. It is now read-only.

kafka offset #236

Closed
Dieterbe opened this issue Jul 10, 2016 · 8 comments

Comments

@Dieterbe
Contributor

@woodsaj any ideas on how to tell MT which kafka offset to use at startup? In general I think we'd want to load a couple of hours of data at startup, and in disaster scenarios maybe more (e.g. to rewrite faulty data in cassandra).

@woodsaj
Member

woodsaj commented Jul 11, 2016

To get the offset you want, use https://godoc.org/github.com/Shopify/sarama#Broker.GetAvailableOffsets

@Dieterbe
Contributor Author

I was too vague; I meant how should we control or choose the value? How do we decide on a value? Should it be manually specified in the config? Should MT try to figure it out by itself? Are those int32's in the response map unix timestamps?

@woodsaj
Member

woodsaj commented Jul 11, 2016

I think that the "replay" period should be driven by config options. But rather than a new config option, I think it should just be based on the ChunkSize.

The response map

type OffsetResponse struct {
    Blocks map[string]map[int32]*OffsetResponseBlock
}

is map[<topic>]map[<partition>]*OffsetResponseBlock

OffsetResponseBlock.Offsets contains the offset that you want to set in config.Consumer.Offsets.Initial.

OffsetResponseBlock.Offsets is an array because the OffsetRequest asks for all offsets before the specified timestamp, but you can use maxOffsets to limit the response to 1 offset.
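
A minimal sketch of that lookup, assuming a topic named "metrics", partition 0, a placeholder broker address, and a two-hour replay window standing in for the ChunkSize-derived value:

package main

import (
	"fmt"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Connect directly to a broker (address is a placeholder).
	broker := sarama.NewBroker("kafka:9092")
	cfg := sarama.NewConfig()
	if err := broker.Open(cfg); err != nil {
		panic(err)
	}
	defer broker.Close()

	// Ask for the single latest offset written before "now minus 2h".
	// Kafka expects the time in milliseconds; maxOffsets=1 limits the
	// response to one offset per block.
	replayFrom := time.Now().Add(-2*time.Hour).UnixNano() / int64(time.Millisecond)
	req := &sarama.OffsetRequest{}
	req.AddBlock("metrics", 0, replayFrom, 1)

	resp, err := broker.GetAvailableOffsets(req)
	if err != nil {
		panic(err)
	}

	block := resp.GetBlock("metrics", 0)
	if block == nil || len(block.Offsets) == 0 {
		panic("no offset returned")
	}

	// block.Offsets[0] is the value referred to above, i.e. the offset
	// you would want the consumer to start from.
	fmt.Println("start offset:", block.Offsets[0])
}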

@woodsaj
Member

woodsaj commented Jul 11, 2016

So after looking at the code, setting the offset is not really possible when using bsm/sarama-cluster.

The initial offset will be used the very first time metricTank starts up, but on all future startups, sarama-cluster will start from the last committed offset.

This is not really surprising given that our use case is non-standard.

The solution to this is not complex, but requires forking sarama-cluster and modifying https://github.com/bsm/sarama-cluster/blob/master/consumer.go#L529-L572 to suit our needs.
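
For context, this is roughly how a sarama-cluster consumer gets wired up (group ID, topic, and broker address are placeholders); the Initial offset below is only consulted when the group has no committed offset yet, which is the limitation described above:

package main

import (
	"log"

	"github.com/Shopify/sarama"
	cluster "github.com/bsm/sarama-cluster"
)

func main() {
	cfg := cluster.NewConfig()
	// Only used the very first time the group starts (no committed offset);
	// on every later startup sarama-cluster resumes from the committed offset.
	cfg.Consumer.Offsets.Initial = sarama.OffsetOldest

	// Group ID, topic and broker address are placeholders.
	consumer, err := cluster.NewConsumer([]string{"kafka:9092"}, "metrictank", []string{"metrics"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	for msg := range consumer.Messages() {
		log.Printf("partition=%d offset=%d", msg.Partition, msg.Offset)
		consumer.MarkOffset(msg, "") // commit, so the next startup resumes here
	}
}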

@Dieterbe
Contributor Author

Dieterbe commented Jul 16, 2016

Maybe we shouldn't use sarama-cluster to begin with.
I think we want to explicitly configure which partitions to consume (and not auto-balance); otherwise data for given metrics might automatically move to other MT instances, which we don't want.
And we want more control over which offsets to start from.

What value does sarama-cluster provide that we can't get from sarama itself?
I actually still haven't been able to properly figure this out, which is why I opened bsm/sarama-cluster#54; see also bsm/sarama-cluster#41.
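
For comparison, a rough sketch of the plain-sarama approach with explicitly pinned partitions (topic, partition list, broker address and start offset are all placeholder assumptions):

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	consumer, err := sarama.NewConsumer([]string{"kafka:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Explicitly pin this instance to specific partitions (no auto-balancing)
	// and choose the start offset ourselves.
	partitions := []int32{0, 1}
	startOffset := sarama.OffsetNewest // or an absolute offset obtained via an OffsetRequest

	for _, p := range partitions {
		pc, err := consumer.ConsumePartition("metrics", p, startOffset)
		if err != nil {
			log.Fatal(err)
		}
		go func(pc sarama.PartitionConsumer) {
			for msg := range pc.Messages() {
				log.Printf("partition=%d offset=%d", msg.Partition, msg.Offset)
			}
		}(pc)
	}

	select {} // block forever; a real program would wait on a shutdown signal
}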

@woodsaj
Member

woodsaj commented Jul 17, 2016

From the sarama GoDoc https://godoc.org/github.com/Shopify/sarama:

To consume messages, use the Consumer. Note that Sarama's Consumer implementation does not currently support automatic consumer-group rebalancing and offset tracking. For Zookeeper-based tracking (Kafka 0.8.2 and earlier), the https://github.com/wvanbergen/kafka library builds on Sarama to add this support. For Kafka-based tracking (Kafka 0.9 and later), the https://github.com/bsm/sarama-cluster library builds on Sarama to add this support.

As we don't intend to use consumer-group re-balancing (for now) or offset tracking, we don't need sarama-cluster.

@Dieterbe
Contributor Author

Dieterbe commented Jul 17, 2016

OK, since it seems like a lot of the more in-depth kafka functionality is hidden when you use sarama-cluster, I will try with sarama directly. Also, there's https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index which will in the future make it easier to find exactly the right offset to consume from in situations like "I want to start consuming as of 180 minutes ago" (as is the case in our prod setup). The KIP also mentions we can sort of do this now, but at log segment granularity, which has some caveats.

I think for the near term it's best to just stick to the default of NewestOffset and keep our old approach where a node needs to warm up in real time. This will let us roll out kafka much quicker, and we can do the "speed up warming up a fresh instance by seeking back to an older offset" work afterwards.
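
For future reference, sarama's Client already exposes the timestamp-based lookup mentioned above (at log-segment granularity until KIP-33 lands); a hedged sketch with placeholder topic, partition and broker address:

package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	client, err := sarama.NewClient([]string{"kafka:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Near-term plan: start at the newest offset and warm up in real time.
	newest, err := client.GetOffset("metrics", 0, sarama.OffsetNewest)
	if err != nil {
		log.Fatal(err)
	}

	// Later: seek back ~180 minutes. Kafka resolves the timestamp (in ms)
	// at log-segment granularity until KIP-33 adds a time-based index.
	ts := time.Now().Add(-180*time.Minute).UnixNano() / int64(time.Millisecond)
	replay, err := client.GetOffset("metrics", 0, ts)
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("newest=%d replayFrom=%d", newest, replay)
}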
