
For REST requests, remember successful fallback hosts for 10 minutes #554

Merged · 4 commits · Nov 12, 2018

Conversation

@SimonWoolf (Member)

See customer discussion in thread: https://ably-real-time.slack.com/archives/CDLTXNZL2/p1541180154010400. If the user's closest datacenter is broken in some way (erroring / timing out), then forcing every REST request to go through it first seems perverse: it increases load on the faulty datacenter and reduces client lib performance. (In the case where REST requests are timing out rather than being rejected outright, the performance hit can be quite severe, since we wait 10s before cutting our losses and moving on to a fallback host.)

Note that this applies to REST requests only; realtime connections still always start by trying the default primary host (once connected they stay connected, so effectively they already get this behaviour naturally).

@paddybyers (Member) commented Nov 5, 2018

So this is about fallback affinity. Questions then:

  • what triggers affinity to a given fallback? it's chosen, or it's chosen and it successfully handles a number of requests;
  • what triggers going back to the default endpoint? Some retried requests succeed, or simply a timeout?
  • are any of the thresholds configurable by the client or do we hard-code them all?

@SimonWoolf (Member, Author)

From your questions I'm not sure you realised this is a PR: they're mostly answered by the proposed spec change. For clarity I'll answer them here anyway.

what triggers affinity to a given fallback? it's chosen, or it's chosen and it successfully handles a number of requests;

In this proposal: it's chosen and succeeds once. Should it ever fail, it's simply unchosen again, so we're back to square one (the default fallback sequence, starting with the default primary host). Any more state than that would, I think, be more complicated than it needs to be.

what triggers going back to the default endpoint?

Either an RSC15d qualifying error occurs using the stored host, or 10 minutes pass.

are any of the thresholds configurable by the client or do we hard-code them all?

The only threshold in this proposal is the 10-minute timeout. I'm not sure we need to make it 'officially' configurable (in the spec), but ably-js will probably make it configurable anyway as an undocumented client option, if only to make it easier to test. Equally happy to make it officially configurable if you think it should be.
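
For illustration, roughly what the proposed behaviour amounts to on the client side, sketched in TypeScript; the names (`preferredHost`, `hostsToTry`, and so on) are invented for this example, not from the spec:

```typescript
// Illustrative sketch only; not the ably-js implementation. Names and helper
// signatures here are assumptions for this example.
const FALLBACK_CACHE_TTL = 10 * 60 * 1000; // 10 minutes, per this proposal

let preferredHost: string | null = null;
let preferredHostExpiry = 0;

function hostsToTry(primary: string, fallbacks: string[]): string[] {
  const cached = preferredHost;
  if (cached && Date.now() < preferredHostExpiry) {
    // A fallback succeeded recently: try it first, then the usual sequence.
    return [cached, primary, ...fallbacks.filter(h => h !== cached)];
  }
  // Default sequence: primary host first, then the (shuffled) fallbacks.
  return [primary, ...fallbacks];
}

function onRequestOutcome(host: string, primary: string, succeeded: boolean): void {
  if (succeeded && host !== primary) {
    // A fallback handled the request: remember it for the next 10 minutes.
    preferredHost = host;
    preferredHostExpiry = Date.now() + FALLBACK_CACHE_TTL;
  } else if (!succeeded && host === preferredHost) {
    // The stored host failed (an RSC15d-qualifying error): forget it and go
    // back to the default sequence starting with the primary host.
    preferredHost = null;
  }
}
```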

@mattheworiordan (Member) left a comment

I think this is largely right and will in most cases solve the problem we saw. However, I fear this proposal assumes a binary outcome from each datacenter, i.e. success / failure, and won't cope well with the reality during an incident, i.e. 99% success from the alternative DC but perhaps not 100% due to migration of data, spikes, etc. I realise equally that adding complexity to the client libraries should be avoided, given the spec is already significantly complex.

@paddybyers can you think of any simple way we could make the solution more robust? One solution (not simple per se) could be to keep track of the number of failed requests per endpoint over the last 60 seconds, and make our first attempt to the endpoint with the fewest failures (by volume), or the primary endpoint if there are none. WDYT?
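
For illustration only, a rough sketch of that idea; the 60-second window comes from this comment, everything else (names, pruning strategy) is assumed:

```typescript
// Rough sketch of "first attempt goes to the endpoint with the fewest
// failures in the last 60 seconds"; names and details are assumptions.
const WINDOW_MS = 60 * 1000;
const failureTimes = new Map<string, number[]>(); // host -> failure timestamps

function recordFailure(host: string): void {
  const times = failureTimes.get(host) ?? [];
  times.push(Date.now());
  failureTimes.set(host, times);
}

function recentFailures(host: string): number {
  const cutoff = Date.now() - WINDOW_MS;
  const times = (failureTimes.get(host) ?? []).filter(t => t >= cutoff);
  failureTimes.set(host, times); // prune entries outside the window
  return times.length;
}

function firstHostToTry(primary: string, fallbacks: string[]): string {
  // Primary wins if it has no recent failures; otherwise pick whichever
  // host has failed least in the window.
  if (recentFailures(primary) === 0) return primary;
  return [primary, ...fallbacks].reduce((best, host) =>
    recentFailures(host) < recentFailures(best) ? host : best
  );
}
```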

@paddybyers (Member) commented Nov 5, 2018

You mean that realtime tracks those statistics, and the library discovers it in some way?

The CircuitBreaker library does decide whether or not to switch based on a statistical sample, not just a single outcome: https://github.com/jhalterman/failsafe/blob/master/README.md#circuit-breakers

I realise this could give rise to more "region-hopping" than having a long timeout, and switching after a single successful attempt, but equally there are clearly situations where taking a statistical sample will give a more sane result. In any case this behaviour is a lot better than the purely random thing we have now.

@mattheworiordan (Member) commented Nov 5, 2018

You mean that realtime tracks those statistics, and the library discovers it in some way?

No, definitely not!

The CircuitBreaker library linked by HS does decide whether or not to switch based on a statistical sample, not just a single outcome: https://github.com/jhalterman/failsafe/blob/master/README.md#circuit-breakers

Yup, sure, but that's a library in one language. We can't depend on platform-specific libraries (they would all have different implementations, which we have to avoid), so we'll have to come up with our own spec & solution, sadly.

I realise this could give rise to more "region-hopping" than having a long timeout, and switching after a single successful attempt, but equally there are clearly situations where taking a statistical sample will give a more sane result. In any case this behaviour is a lot better than the purely random thing we have now.

Well quite. I don't think failing over with just one failed request is right.

@SimonWoolf (Member, Author) commented Nov 6, 2018

this proposal assumes a binary outcome from each datacenter, i.e. success / failure, and won't cope well with the reality during an incident, i.e. 99% success from the alternative DC but perhaps not 100% due to migration of data, spikes, etc.

As discussed briefly on the call yesterday, I think this proposal will actually work OK in that scenario: most of the time it'll store and use the alternate DC; when a request to that fails, it'll discard it and start the fallback sequence again, which will try the main endpoint, fail, try the alternate, succeed, and re-store the alternate.

Clearly it's not the best possible solution, which would be region-aware, involve statistical sampling, and so on. But it's very simple to implement, adds no dependencies, and in most scenarios isn't that much worse than an optimum solution, so in the interests of avoiding client lib complexity I think it's a reasonable balance.

(The main scenario where this fails is if the main DC is timing out and the first alternate tried is coincidentally a CNAME for that same DC, so it times out as well. But the fallback hosts are (already) randomly shuffled for each request, so with luck that will only affect one or two requests before a working DC becomes the first fallback. Though actually, one thought on that: for requests that are idempotent (which will soon include publishes), we could try two fallbacks at the same time if the primary endpoint fails / times out, then store whichever one succeeds first. That would solve the issue, as then at least one of the two will be a DC that isn't the primary.)
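
A sketch of that last idea, assuming a generic `requestTo` helper standing in for whatever the transport layer exposes (nothing here is from the spec):

```typescript
// Sketch: race two (already shuffled) fallback hosts for an idempotent
// request once the primary has failed or timed out. Assumes >= 2 fallbacks.
async function raceTwoFallbacks<T>(
  fallbacks: string[],
  requestTo: (host: string) => Promise<T>
): Promise<{ host: string; result: T }> {
  const [a, b] = fallbacks;
  // Promise.any resolves with the first attempt to succeed and only rejects
  // if both fail, so the winner is the host we would then store.
  return Promise.any([
    requestTo(a).then(result => ({ host: a, result })),
    requestTo(b).then(result => ({ host: b, result })),
  ]);
}
```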

@paddybyers (Member)

Is there not an easy way to exclude any fallback hosts that resolve to the same region as the primary endpoint, and cache that result?

@SimonWoolf (Member, Author)

Is there not an easy way to exclude any fallback hosts that resolve to the same region as the primary endpoint, and cache that result?

That's what I meant by region-aware, but I'm not sure it's worth the trouble. For one thing, in order to know the region you need to have had at least one actual response from the primary endpoint; but the specific scenario in which that information is most useful is when the primary endpoint is timing out. If the lib was only instantiated recently it may well never have seen an actual response from it. So we'd ideally want a strategy that works when we don't have region info, and if we have that, isn't it simpler to just use it all the time?

(For another thing, we've seen before that clients change LBR region surprisingly often)

@mattheworiordan (Member) left a comment

Approved by me. I think we should aim to get a release out soon for at least Java, Go, Ruby and JS, as a patch release on the existing libraries (regardless of major & minor version).

@bbeaudreault

My only concern here is that 10 minutes is sort of a blunt instrument. If our primary cluster is better located or better provisioned, there may be benefit in getting back to it as soon as possible. One thing that I've seen in circuit breaker libraries is that they trickle some small percentage of requests back to the primary over the course of the failure period. If a number of those requests succeed in a row, they switch back early.

I understand the need for cross platform support, and the cost of complexity mirrored across platforms. So the 10 minutes sounds like a great start which could be iterated on in the future if need be. But just wanted to bring up the above.
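
For illustration, the trickle-back idea could look something like this; the probe rate and success threshold are arbitrary numbers for the sketch, not anything agreed here:

```typescript
// Sketch of trickling a small share of requests back to the primary while a
// fallback is cached; PROBE_RATE and SUCCESSES_TO_RESTORE are arbitrary.
const PROBE_RATE = 0.05;        // send ~5% of requests to the primary anyway
const SUCCESSES_TO_RESTORE = 3; // switch back after this many probe successes

let consecutiveProbeSuccesses = 0;

function shouldProbePrimary(): boolean {
  return Math.random() < PROBE_RATE;
}

function onProbeOutcome(succeeded: boolean): boolean {
  // Returns true once enough probes succeed in a row to abandon the cached
  // fallback before the 10-minute timeout expires.
  consecutiveProbeSuccesses = succeeded ? consecutiveProbeSuccesses + 1 : 0;
  return consecutiveProbeSuccesses >= SUCCESSES_TO_RESTORE;
}
```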

@paddybyers (Member)

0dfd55f has been added to enable that value to be configured
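
In ably-js that would presumably surface as a client option along these lines (the option name and import style shown here are assumptions, not confirmed by this thread):

```typescript
// Hypothetical usage: overriding the fallback-host cache duration when
// instantiating the REST client. Option name assumed, not taken from this PR.
import * as Ably from 'ably';

const rest = new Ably.Rest({
  key: 'appId.keyId:keySecret',
  fallbackRetryTimeout: 10 * 60 * 1000, // ms; the proposal's default
});
```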

@mattheworiordan (Member) left a comment

LGTM now.

@bbeaudreault

LGTM
