
Fix "broker received out of order sequence" when brokers die #1661

Merged
merged 2 commits into IBM:master from ktsanaktsidis/fix_seq_out_of_order on May 4, 2020

Conversation


@KJTsanaktsidis commented Apr 9, 2020

When the following three conditions are satisfied, the producer code can skip message sequence numbers and cause the broker to complain that the sequences are out of order:

  • config.Producer.Idempotent is set
  • The producer loses, and then regains, its connection to a broker
  • The client code continues to attempt to produce messages whilst the broker is unavailable.

For every message the client attempts to send while the broker is unavailable, the transaction manager's sequence number is incremented; however, these messages eventually fail and return an error to the caller. When the broker re-appears and another message is published, its sequence number is higher than the last one the broker remembers, because the sequence numbers consumed while the broker was down were never seen by it. For example, if the broker last saw sequence 5 and sequences 6 through 8 were burned on messages that failed while it was down, the next successful publish arrives with sequence 9, which the broker rejects. Thus, from the broker's perspective, it's seeing out-of-order sequence numbers.

The fix to this has a few parts:

  • Don't obtain a sequence number from the transaction manager until we're sure we want to try publishing the message
  • Affix the producer ID and epoch to the message once the sequence is generated
  • Increment the transaction manager epoch (and reset all sequence numbers to zero) when we permanently fail to publish a message. That represents a sequence that the broker will never see, so the only safe thing to do is to roll over the epoch number.
  • Ensure we don't publish message sets that contain messages from multiple transaction manager epochs.
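
To make the moving parts concrete, here's a minimal sketch of that scheme in Go. The names (txnManager, getAndIncrementSequence, bumpEpoch) and the shape of the code are illustrative assumptions, not Sarama's actual internals:

```go
// Hypothetical sketch of the scheme above, not Sarama's actual code.
package producer

import "sync"

type topicPartition struct {
	topic     string
	partition int32
}

type txnManager struct {
	mu            sync.Mutex
	producerEpoch int16
	sequences     map[topicPartition]int32
}

func newTxnManager() *txnManager {
	return &txnManager{sequences: make(map[topicPartition]int32)}
}

// getAndIncrementSequence is called only once we are committed to actually
// sending the message, and returns the epoch alongside the sequence so the
// pair can be affixed to the message together.
func (t *txnManager) getAndIncrementSequence(topic string, partition int32) (int32, int16) {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := topicPartition{topic, partition}
	seq := t.sequences[key]
	t.sequences[key]++
	return seq, t.producerEpoch
}

// bumpEpoch is called when a message permanently fails after consuming a
// sequence number: the broker will never see that sequence, so the only safe
// move is to roll over the epoch and reset every sequence to zero.
func (t *txnManager) bumpEpoch() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.producerEpoch++
	for k := range t.sequences {
		t.sequences[k] = 0
	}
}
```

The last bullet then falls out naturally: a message set is only flushed when every message in it was stamped with the same epoch.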

I think this should be a fix for #1430. I tested it with this test harness: https://gist.github.com/KJTsanaktsidis/12a33a9e6e864857b91f639947567ac3 and tried a few things while it was running:

  • Drop broker traffic with iptables, wait for messages to be retried, and then re-enable broker traffic: this should not bump the epoch, because the messages are successfully published after retrying
  • Drop broker traffic with iptables, wait for retries to be exhausted and for Sarama to start reporting errors to the client code, and then re-enable traffic. This should bump the epoch, because publishes returned errors to Sarama's caller after having caused a sequence number increment
  • Cleanly stop and start the kafka broker while the test is running

Seemed to all come out OK.

This is, however, very much still a WIP PR. In particular:
* ~~It needs tests~~ I added a test case that covers this.
* ~~It needs to be used in anger in one of our production systems~~ We've deployed this to production at Zendesk over the last week.
* ~~I couldn't get the e2e tests running on my machine~~ I provided some suggestions about how we could make the e2e tests work more easily on dev machines.
* I very much would love your eyeballs on this to see if this looks like the right solution to you!

@KJTsanaktsidis requested a review from bai as a code owner on April 9, 2020 07:05
@KJTsanaktsidis force-pushed the ktsanaktsidis/fix_seq_out_of_order branch 4 times, most recently from 9c6582c to bb8dd43 on April 14, 2020 05:54
@KJTsanaktsidis force-pushed the ktsanaktsidis/fix_seq_out_of_order branch from bb8dd43 to 9df3038 on April 14, 2020 06:05
@KJTsanaktsidis
Author

Think " CI / Go 1.14.x with Kafka 2.4.0 on Ubuntu" job just needs to be poked, looks like it failed before it even got to running the test suite

@dnwe
Collaborator

dnwe commented Apr 14, 2020

@KJTsanaktsidis thanks for debugging this through and coming up with a solution. The changes that you propose here look good to me. It would be good if we could replicate this in the functional tests with toxiproxy, but that doesn't necessarily have to be done as part of this PR nor before merging.

I've noticed a few people now saying they've been unable to get the functional tests to run locally. I wonder if we should investigate making them simpler to run by using docker or k3s for the brokers rather than the existing vagrant-based setup (/cc @bai)

@KJTsanaktsidis
Author

KJTsanaktsidis commented Apr 14, 2020

FWIW I didn't manage to get the functional tests to run properly either! (EDIT: Just realised I said this in my PR description, which is why you brought it up)

In theory it should be relatively simple to set up a functional testcase for this - all that's needed is to publish a bunch of messages, use toxiproxy to blackhole the broker for a while, and then bring the broker back and check that messages can still be published.
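
Roughly, a sketch of what that test case could look like, assuming the Shopify toxiproxy Go client and a toxiproxy proxy named "kafka1" in front of one broker (the proxy name, addresses, topic, and timings below are all illustrative guesses, not the real test):

```go
// Illustrative sketch only: proxy name, addresses, topic and timings are
// assumptions about the test environment.
package functional

import (
	"testing"
	"time"

	"github.com/Shopify/sarama"
	toxiproxy "github.com/Shopify/toxiproxy/client"
)

func TestFuncIdempotentProducerBrokerOutage(t *testing.T) {
	config := sarama.NewConfig()
	config.Version = sarama.V2_4_0_0
	config.Producer.Idempotent = true
	config.Producer.RequiredAcks = sarama.WaitForAll // required for idempotence
	config.Net.MaxOpenRequests = 1                   // required for idempotence
	config.Producer.Return.Successes = true
	config.Producer.Retry.Max = 1 // exhaust retries quickly while blackholed

	producer, err := sarama.NewSyncProducer([]string{"localhost:29091"}, config)
	if err != nil {
		t.Fatal(err)
	}
	defer producer.Close()

	toxi := toxiproxy.NewClient("localhost:8474")
	proxy, err := toxi.Proxy("kafka1") // assumed proxy name
	if err != nil {
		t.Fatal(err)
	}

	msg := &sarama.ProducerMessage{Topic: "test.1", Value: sarama.StringEncoder("hello")}

	// Publish a bunch of messages while the broker is healthy.
	for i := 0; i < 10; i++ {
		if _, _, err := producer.SendMessage(msg); err != nil {
			t.Fatal(err)
		}
	}

	// Blackhole the broker long enough for retries to exhaust and errors to
	// reach the caller; these sends are expected to fail.
	if err := proxy.Disable(); err != nil {
		t.Fatal(err)
	}
	for i := 0; i < 10; i++ {
		_, _, _ = producer.SendMessage(msg)
	}

	// Bring the broker back: publishing should succeed again, with no
	// "out of order sequence number" errors from the broker.
	if err := proxy.Enable(); err != nil {
		t.Fatal(err)
	}
	time.Sleep(5 * time.Second)
	if _, _, err := producer.SendMessage(msg); err != nil {
		t.Fatalf("publish after broker recovery failed: %v", err)
	}
}
```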

If I get a moment this week I'll see if I can wrestle with vagrant enough to get an integration test going. We're currently working out how to get enough confidence in this change to ship it to production, so integration tests would help us with that anyway.

@bai
Contributor

bai commented Apr 14, 2020

Yeah, I've had that on my TODO but haven't had time to attack it, to be honest. I've been looking into either docker-compose or k3s, personally leaning towards the latter.

I think the current vagrant setup is somewhat outdated.

@KJTsanaktsidis
Author

I think the Vagrantfile won't have worked since it was bumped from trusty to bionic in b8c5f7c - setup_services.sh tries to set up classic upstart/sysv-init style jobs for toxiproxy/zookeeper/kafka.

This could probably be fixed by turning them into systemd units, but you're right that a more wholesale fix to use containers could simplify it a lot.

In any case I'll bodge something on my machine to try and get the tests working and submit a functional test to this PR; actually fixing the local functional test runner is probably a job for a different PR.

@KJTsanaktsidis
Author

@bai @dnwe it was certainly an ordeal, but I got the integration tests running on my machine. What I ended up doing was making a docker-compose file with kafka/zookeeper/toxiproxy, and moving all of the topic creation/seeding stuff to the golang test code itself. I did have a look at k3s for this, but it doesn't seem to have a good macOS story.

This is it: zendesk@0650324

So the tests can be run like this:

docker-compose up -d
# wait for everything to calm down
TOXIPROXY_ADDR="http://localhost:8474" KAFKA_PEERS="localhost:29091,localhost:29092,localhost:29093,localhost:29094,localhost:29095" KAFKA_VERSION="2.4.1" CI="true" DEBUG="true"  make test

Could probably replace the CI runner with the docker compose file with a bit of tweaking too.

In any case, after all that, I did write a functional test and add it to this PR.

@KJTsanaktsidis force-pushed the ktsanaktsidis/fix_seq_out_of_order branch from 955b7df to ca14191 on April 16, 2020 07:36
@@ -96,6 +97,83 @@ func TestFuncProducingToInvalidTopic(t *testing.T) {
safeClose(t, producer)
}

func TestFuncProducingIdempotentWithBrokerFailure(t *testing.T) {

💯

@KJTsanaktsidis
Author

@bai @dnwe Just as a heads up, we've been shipping this change to production at Zendesk slowly over the last week. It's gone off without a hitch, so I think we can say it's had some usage in anger now!

@dnwe
Collaborator

dnwe commented Apr 24, 2020

@KJTsanaktsidis thanks for the update. Great news. Are you running against kafka 2.3/2.4 in the backend?

@KJTsanaktsidis
Author

Ha, unfortunately we're using kafka 2.1.1 in production (and 2.2.2 in our staging environment).

@dnwe
Collaborator

@dnwe left a comment


@KJTsanaktsidis thanks for this fix 👍

@dnwe merged commit f7b64cf into IBM:master on May 4, 2020
@dnwe
Collaborator

dnwe commented May 4, 2020

@KJTsanaktsidis did you want to also propose a PR for your vagrant --> docker-compose migration for the functional tests?

@KJTsanaktsidis
Author

KJTsanaktsidis commented May 5, 2020

@dnwe yeah, if you want to go in that direction I think I can put together a PR that has:

  • A docker-compose file that works on mac, linux & windows for running the functional tests
  • Tweaks to the GitHub Actions CI runners to use compose to run the tests
  • Removal of the vagrant stuff, since it no longer works

This sounds like a job for my next lab day, which is at the end of next week. Sound like a plan?

@dnwe
Collaborator

dnwe commented May 5, 2020

@KJTsanaktsidis that sounds great to me. I think making functional tests against a real kafka easier to run and hence easier to write will be extremely valuable.

@bai / @d1egoaz any thoughts?

@bai
Contributor

bai commented May 5, 2020

I'm 👍 on getting rid of Vagrant and simplifying this setup, and I think docker-compose is a good candidate for this. I'd also suggest k3s, but I assume docker-compose up wins in terms of simplicity, since we don't have to kubectl apply things. Thoughts?

@dnwe
Collaborator

dnwe commented May 5, 2020

Yeah, so we use k3s internally in our CI builds, but in this scenario it doesn't really buy us much beyond docker-compose: we'd invariably be running it within docker (for non-Linux devs), so it would just add an additional layer, and we'd take on the complexity of statefulset yaml or of adopting something like strimzi. All of that makes things harder for devs to run and understand, so I think compose is a reasonable solution.

The alternative would be to use the docker go bindings and drive the start and stop of containers directly from the tests themselves, with no "bring up the cluster" step outside of Go at all.
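
For what it's worth, a minimal sketch of what that might look like with the docker Go SDK (the container name is an assumption about the compose setup, and the exact SDK signatures vary a little between client versions):

```go
// Sketch only: container name and SDK version details are assumptions.
package functional

import (
	"context"
	"testing"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func newDockerClient(t *testing.T) *client.Client {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		t.Fatal(err)
	}
	return cli
}

// stopBroker and startBroker let a test kill and revive a broker container
// directly, with no cluster setup outside of Go.
func stopBroker(t *testing.T, cli *client.Client, name string) {
	timeout := 10 * time.Second
	if err := cli.ContainerStop(context.Background(), name, &timeout); err != nil {
		t.Fatal(err)
	}
}

func startBroker(t *testing.T, cli *client.Client, name string) {
	if err := cli.ContainerStart(context.Background(), name, types.ContainerStartOptions{}); err != nil {
		t.Fatal(err)
	}
}
```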

@bai
Contributor

bai commented May 5, 2020

Makes sense, thanks. So my vote goes for docker-compose then 😄
