Rolling upgrade from 0.10 to 0.11 causes unknown magic byte errors #1021

Closed
bgreenlee opened this issue Jan 11, 2018 · 3 comments
Comments

@bgreenlee
Contributor

bgreenlee commented Jan 11, 2018

Versions

Sarama Version: f0c3255
Kafka Version: 0.10.1.0 -> 0.11.0.1
Go Version: 1.9.2

Configuration

(happy to provide more information if this is necessary)

Logs

Relevant Sarama error included below.

Problem Description

We just upgraded one of our Kafka clusters from 0.10.1.0 (Confluent 3.1) to 0.11.0.1 (Confluent 3.3.1). During the final rolling restart, which changes log.message.format.version from 0.10.1-IV2 to 0.11.0, our Go clients started throwing "error decoding packet: unknown magic byte (2)" errors. Each time this happened, the client would disconnect from the broker and reconnect, picking up from where it left off, so it would never advance.

This had us stumped for a while, as the clients worked fine on another cluster running 0.11. I dug through the Sarama source and think I know what is happening. Sarama pulls a block of records from Kafka, and then calls setTypeFromMagic to determine if they are legacy (0.10) messages or 0.11 records. It assumes that every message in that block is of the same type. But that is not necessarily going to be the case if a message format upgrade is in progress.
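
To make the magic-byte check concrete, here is a minimal sketch of how a client can tell the two formats apart. It is not Sarama's setTypeFromMagic; the names peekMagic and formatName are hypothetical. It relies only on the Kafka wire layout, where the magic byte sits at byte 16 of an entry in both the legacy message format (magic 0 or 1) and the 0.11 record batch format (magic 2).

```go
package main

import (
	"errors"
	"fmt"
)

// magicOffset is the position of the magic byte in both on-wire layouts:
//   legacy entry: offset(8) + messageSize(4) + crc(4) + magic(1)
//   record batch: baseOffset(8) + batchLength(4) + partitionLeaderEpoch(4) + magic(1)
const magicOffset = 16

var errShort = errors.New("buffer too short to contain a magic byte")

// peekMagic (hypothetical helper) returns the magic byte of the entry
// that starts at buf[0], without consuming any bytes.
func peekMagic(buf []byte) (int8, error) {
	if len(buf) <= magicOffset {
		return 0, errShort
	}
	return int8(buf[magicOffset]), nil
}

// formatName (hypothetical helper) maps a magic byte to the format it implies.
func formatName(magic int8) string {
	switch magic {
	case 0, 1:
		return "legacy message"
	case 2:
		return "record batch"
	default:
		return "unknown"
	}
}

func main() {
	// A fake entry whose only meaningful byte is the magic byte.
	entry := make([]byte, magicOffset+1)
	entry[magicOffset] = 2

	magic, err := peekMagic(entry)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(formatName(magic))
}
```

If a check like this runs only once per fetched block, everything after the first entry is decoded with whatever format the first entry had, which would match the failure described above when a block straddles the format change.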

The "fix" for us was to restart the clients so they would start consuming from the end of the partitions. Luckily all of our Go clients could tolerate skipping messages.

@eapache
Contributor

eapache commented Jan 12, 2018

Hmm, this is a bit confusing to me since the framing should be different for the two record formats so I'm not sure how they would get mixed in a response. Maybe this is another function of having multiple batches in a stream (aka #1022 cc @wladh)? Except if that was the case I would expect that bug to mask this one.

@wladh
Contributor

wladh commented Jan 12, 2018

It's not masked by #1022 because the legacy messages don't have that issue. A message set decodes messages in a loop until the buffer is completely processed. So if the first messages are legacy, followed by records, then you'd see this behaviour.
I'll take this scenario into consideration when fixing #1022.
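
The loop described above can be sketched as follows, with the format decided per entry instead of once per buffer. This is an illustration only, not the change made in #1022 or #1023; splitByFormat and entry are hypothetical names. It assumes the Kafka wire layout in which both formats start with an 8-byte offset and a 4-byte length covering the rest of the entry, with the magic byte at byte 16.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	headerLen   = 12 // offset(8) + length(4), shared by both formats
	magicOffset = 16 // position of the magic byte within an entry
)

// splitByFormat (hypothetical) walks buf entry by entry and reports the
// format of each one. A truncated trailing entry is silently dropped,
// since a fetch response may end with a partial entry.
func splitByFormat(buf []byte) ([]string, error) {
	var kinds []string
	for len(buf) >= headerLen {
		length := int(int32(binary.BigEndian.Uint32(buf[8:12])))
		if length < magicOffset-headerLen+1 {
			return kinds, fmt.Errorf("entry length %d too small to hold a magic byte", length)
		}
		end := headerLen + length
		if end > len(buf) {
			break // partial trailing entry
		}
		switch magic := int8(buf[magicOffset]); magic {
		case 0, 1:
			kinds = append(kinds, "legacy")
		case 2:
			kinds = append(kinds, "batch")
		default:
			return kinds, fmt.Errorf("unknown magic byte (%d)", magic)
		}
		buf = buf[end:]
	}
	return kinds, nil
}

// entry (hypothetical test helper) builds a zeroed entry with the given
// magic byte and a body of bodyLen bytes after the 12-byte header.
func entry(magic byte, bodyLen int) []byte {
	b := make([]byte, headerLen+bodyLen)
	binary.BigEndian.PutUint32(b[8:12], uint32(bodyLen))
	b[magicOffset] = magic
	return b
}

func main() {
	// One legacy message followed by one record batch, as could happen
	// while the message format upgrade is in progress.
	buf := append(entry(1, 10), entry(2, 20)...)
	kinds, err := splitByFormat(buf)
	fmt.Println(kinds, err)
}
```

A real decoder would hand each entry to the matching legacy or record batch decoder rather than just labelling it, but the dispatch point is the same.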

@eapache
Contributor

eapache commented Jan 22, 2018

Should be fixed by #1023.

eapache closed this as completed Jan 22, 2018
ghost pushed a commit to hyperledger/fabric that referenced this issue Nov 13, 2018
Update to 1.19 and pick up the following bug fixes:

1. IBM/sarama#1021 (for FAB-11977)
2. IBM/sarama#1087 (for FAB-12827)

FAB-11977 #done
FAB-12827 #done

Change-Id: I85f89aeabb619a084902dc9e76491b981848c752
Signed-off-by: Kostas Christidis <kostas@christidis.io>
ghost pushed a commit to hyperledger/fabric that referenced this issue Nov 13, 2018
Update to 1.19 and pick up the following bug fixes:

1. IBM/sarama#1021 (for FAB-11977)
2. IBM/sarama#1087 (for FAB-12827)

FAB-11977 #done
FAB-12827 #done

Change-Id: Ifc73cbc4d205e9ce1e19c403666c4420a5538b0c
Signed-off-by: Kostas Christidis <kostas@christidis.io>
ghost pushed a commit to hyperledger/fabric that referenced this issue Nov 15, 2018
Update to 1.19 and pick up the following bug fixes:

1. IBM/sarama#1021 (for FAB-11977)
2. IBM/sarama#1087 (for FAB-12827)

FAB-11977 #done
FAB-12827 #done

Change-Id: I3be588a3f293079971af5c20c72c1b32bf613968
Signed-off-by: Kostas Christidis <kostas@christidis.io>
Shuo93 pushed a commit to Shuo93/fabric that referenced this issue Feb 6, 2019
Update to 1.19 and pick up the following bug fixes:

1. IBM/sarama#1021 (for FAB-11977)
2. IBM/sarama#1087 (for FAB-12827)

FAB-11977 #done
FAB-12827 #done

Change-Id: I3be588a3f293079971af5c20c72c1b32bf613968
Signed-off-by: Kostas Christidis <kostas@christidis.io>