This repository was archived by the owner on Aug 23, 2023. It is now read-only.

fix how the kafka offsets get defined in the notifier #1350

Merged
merged 5 commits into master from prevent_crashloop_on_empty_kafka_partition
Jun 20, 2019

Conversation

replay
Contributor

@replay replay commented Jun 18, 2019

Today we saw an instance crashlooping due to an invalid partition offset. I haven't been able to reproduce the issue and verify this fix yet, but I think it would make sense to make this change anyway.

When the offset gets set to 0, as it was previously, doesn't that practically guarantee an error due to an invalid offset?

@replay replay requested a review from fkaleo June 18, 2019 19:00
@replay replay force-pushed the prevent_crashloop_on_empty_kafka_partition branch 2 times, most recently from b3479e1 to 8a3411b on June 18, 2019 19:03
@replay replay force-pushed the prevent_crashloop_on_empty_kafka_partition branch from 8a3411b to eb2da37 on June 18, 2019 19:06
}
if startOffset < 0 {
// happens when OffsetOldest or an offsetDuration was used and there is no message in the partition
startOffset = 0
Contributor Author

To confirm: I don't think we want to set this to 0, because it would just result in an error from sarama: https://github.com/Shopify/sarama/blob/master/consumer.go#L397
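
For context, this is roughly the check sarama performs when a partition consumer starts (paraphrased from the linked consumer.go, not the exact code; checkStartOffset is an illustrative name):

// A concrete start offset has to lie within [oldest, newest]. On a partition
// whose old segments have already been deleted, oldest > 0, so a hardcoded
// startOffset of 0 falls below the range and ConsumePartition fails with
// sarama.ErrOffsetOutOfRange.
func checkStartOffset(startOffset, oldest, newest int64) error {
    if startOffset == sarama.OffsetOldest || startOffset == sarama.OffsetNewest {
        return nil
    }
    if startOffset < oldest || startOffset > newest {
        return sarama.ErrOffsetOutOfRange
    }
    return nil
}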

Contributor

You're right, this change makes sense. However, we also need to update consumePartition() a little: with this patch, the variable lastReadOffset has an incorrect value when startOffset is negative. lastReadOffset needs to be set to a value compatible with what getLastAvailableOffset() returns when there is no message in the partition (which I think is -1).
This should be tested with an empty partition to make sure we don't hit the timeout unnecessarily.

Couple of other issues:

  • it's not offsetTime that should be set to sarama.OffsetOldest but startOffset.
  • offsetTime should be logged instead of offsetDuration

Finally, I think that if c.client.GetOffset fails we probably want to log.Fatalf anyway. If we don't fail now, the same error will very likely happen a little later, the next time we call c.client.GetOffset.
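
A minimal sketch of both points (the wrapper function and the exact wiring into consumePartition() are assumptions; only the names from this thread are real):

// sketch only: resolve the start offset once, fail hard if Kafka can't answer,
// and keep lastReadOffset consistent with getLastAvailableOffset() on an
// empty partition
func resolveOffsets(client sarama.Client, topic string, partition int32, offsetTime int64) (startOffset, lastReadOffset int64) {
    startOffset, err := client.GetOffset(topic, partition, offsetTime)
    if err != nil {
        // retrying later would very likely fail the same way
        log.Fatalf("kafka-cluster: failed to get offset for %s:%d: %s", topic, partition, err)
    }
    if startOffset < 0 {
        // no message in the partition: -1 matches what getLastAvailableOffset()
        // is expected to return in that case
        return startOffset, -1
    }
    // assumption: the last read offset is the one just before where we start
    return startOffset, startOffset - 1
}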

Contributor

It looks like I hadn't clicked 'submit review' yesterday...

We will first try to use the offset supplied by the configuration.
If that doesn't work we will try all other offset possibilities,
in priority from custom, oldest, and then newest. If we still
can't find a valid offset we will just crash.
processBacklog.Add(1)
go c.consumePartition(topic, partition, startOffset, processBacklog)

// in case we did not originally have a valid offset, we need to re-check here
Member

This comment is a bit confusing. For clarity I think you should check and fail if the offset is still not valid, e.g.

// if we still don't have a valid offset, we can't proceed.
if !validOffset {
   log.Fatalf("kafka-cluster: tried all fallbacks, could not find a valid offset for topic: %s using %s\n", topic, offsetStr)
}
processBacklog.Add(1)
go c.consumePartition(topic, partition, startOffset, processBacklog)

var validOffset bool
var offsetFromDuration int64

offsets := make([]offset, 0, 3)
Member

I think this would be easier to follow by using a map[string]offset, keyed by the offsetName, e.g.

offsets := map[string]offset{
    "oldest": {offsetTime: sarama.OffsetOldest},
    "newest": {offsetTime: sarama.OffsetNewest},
    "custom": {offsetTime: offsetFromDuration},
}

To iterate over them in order you can use

for _, name := range []string{"oldest", "newest", "custom"} {
    // Go doesn't allow assigning to a struct field inside a map value,
    // so copy the entry, update it, and write it back
    o := offsets[name]
    o.offsetStart, o.offsetError = c.client.GetOffset(topic, partition, o.offsetTime)
    offsets[name] = o
}

@woodsaj
Member

woodsaj commented Jun 19, 2019

I think this code might need a large comment block explaining how GetOffset works and the different scenarios, e.g. a new partition with no messages, a partition that has messages but is new, a partition that is old but has no messages, etc.

Requesting offsetOldest or offsetTime will return the offset number of a message that exists in the log.
Requesting OffsetNewest will return the offset number that will be assigned to the next message Kafka receives.

| scenario | offsetOldest | offsetNewest | offsetTime |
| --- | --- | --- | --- |
| new empty partition | error | 0 | error |
| new with messages | 0 | validOffset | validOffset, or error if offsetTime is earlier than the first message |
| existing with messages | validOffset | validOffset | validOffset |
| existing with no messages | error | validOffset | error |

I am pretty sure this table is accurate, but we should do some simple testing to verify.
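
One way to do that simple verification, sketched as a throwaway program (the broker address and topic name are placeholders; this assumes the same sarama client API used elsewhere in this PR):

package main

import (
    "fmt"
    "time"

    "github.com/Shopify/sarama"
)

func main() {
    // placeholders: point this at a test broker/topic set up for each scenario
    client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
    if err != nil {
        panic(err)
    }
    defer client.Close()

    // offsetTime is a timestamp in milliseconds, here "1 hour ago"
    offsetTime := time.Now().Add(-time.Hour).UnixNano() / int64(time.Millisecond)

    queries := []struct {
        name string
        time int64
    }{
        {"offsetOldest", sarama.OffsetOldest},
        {"offsetNewest", sarama.OffsetNewest},
        {"offsetTime", offsetTime},
    }
    for _, q := range queries {
        off, err := client.GetOffset("test-topic", 0, q.time)
        fmt.Printf("%-12s -> offset=%d err=%v\n", q.name, off, err)
    }
}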

Add block comment to describe scenarios
Add more error messages
@robert-milan
Contributor

It's looking pretty good so far. We do need to verify the information in the various scenarios.

@replay
Contributor Author

replay commented Jun 19, 2019

Should we make this logic generic, in a shared function, so it can be reused in the kafka input?

}

// get all of the offsets
for _, name := range []string{"newest", "custom", "oldest"} {
Contributor Author

I assume in most cases there won't be an error when querying the offset from Kafka. So is it necessary to always query it 3 times per partition, even when there's no error? Wouldn't it be more efficient to only query the other possible offsets when the initial one returns an error?
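
For illustration, a sketch of that lazier variant (findStartOffset and offsetTimes are made-up names; only GetOffset and the log.Fatalf style come from this thread):

// try the configured offset first and only fall back to the others
// (in priority order) when the previous attempt returned an error
func findStartOffset(client sarama.Client, topic string, partition int32, offsetTimes []int64) int64 {
    var startOffset int64
    var err error
    for _, t := range offsetTimes {
        startOffset, err = client.GetOffset(topic, partition, t)
        if err == nil {
            return startOffset
        }
    }
    log.Fatalf("kafka-cluster: tried all fallbacks, could not find a valid offset for %s:%d: %s", topic, partition, err)
    return 0 // unreachable, log.Fatalf exits
}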

Contributor

I did think about that, but in this case I don't think it matters. In the grand scheme of things, when we're talking about consuming the partitions, a few milliseconds or seconds won't make a difference AFAIK. I'm up for it though.

I tried to keep the logic as it is, while removing unnecessary calls to
get offsets and duplicate code
@replay
Contributor Author

replay commented Jun 19, 2019

I pushed another commit. I tried to keep the logic unmodified while removing unnecessary calls to get the offset from Kafka, as well as duplicate code.

@robert-milan robert-milan merged commit 8bcd6c8 into master Jun 20, 2019
@robert-milan robert-milan deleted the prevent_crashloop_on_empty_kafka_partition branch June 20, 2019 07:10
@fkaleo
Contributor

fkaleo commented Jun 20, 2019

OK, so with that patch merged we are hitting timeouts again when there is no message in the partition:
2019-06-20 07:59:40.613 [WARNING] kafka-cluster: Processing metricPersist backlog has taken too long, giving up lock after 5m0s.
The first row of the table (new empty partition) is probably incorrect: most likely no error is returned, but a negative offset instead.
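
One way that could be handled, sketched here purely for illustration (this is not necessarily what #1352 does):

// inside consumePartition(): a negative startOffset means the partition is
// empty, so there is no backlog to replay; mark it done right away instead of
// waiting for the 5m0s timeout, and consume whatever arrives next
if startOffset < 0 {
    processBacklog.Done()
    startOffset = sarama.OffsetNewest
}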

@fkaleo
Contributor

fkaleo commented Jun 20, 2019

Possible fix: #1352
