Make dead brokers die harder #548

Merged 1 commit on Oct 1, 2015

eapache commented Sep 30, 2015

When a broker gets an error trying to receive a response (either from the
network layer, or from failing to parse the minimal global header), it should
just abandon ship and die. Save that error and return it immediately for any
further requests we might have made.

  • The vast majority of the time the connection is going to be hosed anyways, if
    nothing else by being out-of-sync on correlation IDs (which we don't handle
    and which doesn't seem particularly urgent).
  • All of Sarama's built-in callers (producer/consumer/offset-manager)
    immediately `Close` a broker when they receive one of these errors anyways, so
    all this does is speed up that in the common case.

If one of these errors is recoverable, and if there is user-space code
somewhere which actually tries to recover in one of those cases, then that code
would break.

This neatly satisfies one of the XXX comments I left in about this issue from
way back in 2013. The TODOs about correlation ID matching are still present.

@wvanbergen @4578395263256 another approach to #546, one I am generally happier with. Thoughts?

@wvanbergen

This change looks good to me 👍.

I can't really see any user code out there trying to salvage the connection, especially because the correlation IDs are probably broken after something like this happens.

@wvanbergen

Alternatively, we could put in a circuit breaker. Not sure if it is worth the effort.

eapache commented Oct 1, 2015

A circuit breaker would be neat, but we already have others higher in the stack, so we'd end up with nested breakers on slightly different conditions, which would be odd. Also, this is really not something that can heal without restarting the network connection, so it's not an ideal use.

eapache added a commit that referenced this pull request Oct 1, 2015
eapache merged commit bc4baeb into master Oct 1, 2015
eapache deleted the die-broker-die branch October 1, 2015 13:53