
For REST requests, remember successful fallback hosts for 10 minutes #554

Merged · 4 commits · Nov 12, 2018

Conversation

@SimonWoolf (Member)

See customer discussion in thread: https://ably-real-time.slack.com/archives/CDLTXNZL2/p1541180154010400. If the user's closest datacenter is broken in some way (erroring / timing out), then forcing every REST request to go through it first seems perverse: it increases load on the faulty datacenter and reduces client lib performance. (In the case where REST requests are timing out rather than being rejected outright, the performance hit can be quite severe, since we wait 10s before cutting our losses and moving on to a fallback host.)

Note that this applies to REST requests only; realtime connections still always start by trying the default primary host (once connected they stay connected, so effectively they already get this behaviour naturally).

@paddybyers (Member) commented Nov 5, 2018

So this is about fallback affinity. Questions then:

  • what triggers affinity to a given fallback? it's chosen, or it's chosen and it successfully handles a number of requests;
  • what triggers going back to the default endpoint? Some retried requests succeed, or simply a timeout?
  • are any of the thresholds configurable by the client or do we hard-code them all?

@SimonWoolf (Member, Author)

From your questions I'm not sure you realised this is a PR: they're mostly answered by the proposed spec change. For clarity I'll answer them here anyway.

what triggers affinity to a given fallback? it's chosen, or it's chosen and it successfully handles a number of requests;

In this proposal: it's chosen and succeeds once. Should it ever fail, it's simply unchosen again, so we're back to square one (the default fallback sequence, starting with the default primary host). Any more state than that would, I think, be more complicated than it needs to be.

what triggers going back to the default endpoint?

Either an RSC15d qualifying error occurs using the stored host, or 10 minutes pass.

are any of the thresholds configurable by the client or do we hard-code them all?

The only threshold in this proposal is the 10-minute timeout. I'm not sure we need to make it 'officially' configurable (in the spec), but ably-js will probably make it configurable anyway as an undocumented client option, if only to make it easier to test. Equally happy to make it officially configurable if you think it should be.
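
For illustration, roughly what the proposed behaviour amounts to on the client side, sketched in TypeScript; the names (`preferredHost`, `hostsToTry`, and so on) are invented for this example, not from the spec:

```typescript
// Illustrative sketch only; not the ably-js implementation. Names and helper
// signatures here are assumptions for this example.
const FALLBACK_CACHE_TTL = 10 * 60 * 1000; // 10 minutes, per this proposal

let preferredHost: string | null = null;
let preferredHostExpiry = 0;

function hostsToTry(primary: string, fallbacks: string[]): string[] {
  const cached = preferredHost;
  if (cached && Date.now() < preferredHostExpiry) {
    // A fallback succeeded recently: try it first, then the usual sequence.
    return [cached, primary, ...fallbacks.filter(h => h !== cached)];
  }
  // Default sequence: primary host first, then the (shuffled) fallbacks.
  return [primary, ...fallbacks];
}

function onRequestOutcome(host: string, primary: string, succeeded: boolean): void {
  if (succeeded && host !== primary) {
    // A fallback handled the request: remember it for the next 10 minutes.
    preferredHost = host;
    preferredHostExpiry = Date.now() + FALLBACK_CACHE_TTL;
  } else if (!succeeded && host === preferredHost) {
    // The stored host failed (an RSC15d-qualifying error): forget it and go
    // back to the default sequence starting with the primary host.
    preferredHost = null;
  }
}
```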

@mattheworiordan (Member) left a comment

I think this is largely right and will in most cases solve the problem we saw. However, I fear this proposal assumes a binary outcome from each datacenter, i.e. success / failure, and won't cope well with the reality during an incident, i.e. 99% success from the alternative DC but perhaps not 100% due to migration of data, spikes, etc. I realise equally that adding complexity to the client libraries should be avoided, given the spec is already significantly complex.

@paddybyers can you think of any simple way we could make the solution more robust? One solution (not simple per se) could be to keep track of the number of failed requests per endpoint over the last 60 seconds, and make our first attempt to the endpoint with the fewest failures (by volume), or the primary endpoint if there are none. WDYT?
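
For illustration only, a rough sketch of that idea; the 60-second window comes from this comment, everything else (names, pruning strategy) is assumed:

```typescript
// Rough sketch of "first attempt goes to the endpoint with the fewest
// failures in the last 60 seconds"; names and details are assumptions.
const WINDOW_MS = 60 * 1000;
const failureTimes = new Map<string, number[]>(); // host -> failure timestamps

function recordFailure(host: string): void {
  const times = failureTimes.get(host) ?? [];
  times.push(Date.now());
  failureTimes.set(host, times);
}

function recentFailures(host: string): number {
  const cutoff = Date.now() - WINDOW_MS;
  const times = (failureTimes.get(host) ?? []).filter(t => t >= cutoff);
  failureTimes.set(host, times); // prune entries outside the window
  return times.length;
}

function firstHostToTry(primary: string, fallbacks: string[]): string {
  // Primary wins if it has no recent failures; otherwise pick whichever
  // host has failed least in the window.
  if (recentFailures(primary) === 0) return primary;
  return [primary, ...fallbacks].reduce((best, host) =>
    recentFailures(host) < recentFailures(best) ? host : best
  );
}
```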

@paddybyers (Member) commented Nov 5, 2018

You mean that realtime tracks those statistics, and the library discovers it in some way?

The CircuitBreaker library does decide whether or not to switch based on a statistical sample, not just a single outcome: https://github.com/jhalterman/failsafe/blob/master/README.md#circuit-breakers

I realise this could give rise to more "region-hopping" than having a long timeout, and switching after a single successful attempt, but equally there are clearly situations where taking a statistical sample will give a more sane result. In any case this behaviour is a lot better than the purely random thing we have now.

@mattheworiordan (Member) commented Nov 5, 2018

You mean that realtime tracks those statistics, and the library discovers it in some way?

No, definitely not!

The CircuitBreaker library linked by HS does decide whether or not to switch based on a statistical sample, not just a single outcome: https://github.com/jhalterman/failsafe/blob/master/README.md#circuit-breakers

Yup, sure, but that's a library in one language. We can't depend on platform-specific libraries (they would all have different implementations, which we have to avoid), so we'll have to come up with our own spec & solution, sadly.

I realise this could give rise to more "region-hopping" than having a long timeout, and switching after a single successful attempt, but equally there are clearly situations where taking a statistical sample will give a more sane result. In any case this behaviour is a lot better than the purely random thing we have now.

Well quite. I don't think failing over with just one failed request is right.

@SimonWoolf (Member, Author) commented Nov 6, 2018

this proposal assumes a binary outcome from each datacenter, i.e. success / failure, and won't cope well with the reality during an incident, i.e. 99% success from the alternative DC but perhaps not 100% due to migration of data, spikes, etc.

As discussed briefly on the call yesterday, I think this proposal will actually work OK in that scenario: most of the time it'll store and use the alternate DC; when a request to that fails, it'll discard it and start the fallback sequence again, which will try the main endpoint, fail, try the alternate, succeed, and re-store the alternate.

Clearly it's not the best possible solution, which would be region-aware, involve statistical sampling, and so on. But it's very simple to implement, adds no dependencies, and in most scenarios isn't that much worse than an optimum solution, so in the interests of avoiding client lib complexity I think it's a reasonable balance.

(The main scenario where this fails is if the main DC is timing out and the first alternate tried is coincidentally a CNAME for that same DC, so it times out as well. But the fallback hosts are (already) randomly shuffled for each request, so with luck that will only affect one or two requests before a working DC becomes the first fallback. Though actually, one thought on that: for requests that are idempotent (which will soon include publishes), we could try two fallbacks at the same time if the primary endpoint fails / times out, then store whichever one succeeds first. That would solve the issue, as then at least one of the two will be a DC that isn't the primary.)
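
A sketch of that last idea, assuming a generic `requestTo` helper standing in for whatever the transport layer exposes (nothing here is from the spec):

```typescript
// Sketch: race two (already shuffled) fallback hosts for an idempotent
// request once the primary has failed or timed out. Assumes >= 2 fallbacks.
async function raceTwoFallbacks<T>(
  fallbacks: string[],
  requestTo: (host: string) => Promise<T>
): Promise<{ host: string; result: T }> {
  const [a, b] = fallbacks;
  // Promise.any resolves with the first attempt to succeed and only rejects
  // if both fail, so the winner is the host we would then store.
  return Promise.any([
    requestTo(a).then(result => ({ host: a, result })),
    requestTo(b).then(result => ({ host: b, result })),
  ]);
}
```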

@paddybyers (Member)

Is there not an easy way to exclude any fallback hosts that resolve to the same region as the primary endpoint, and cache that result?

@SimonWoolf (Member, Author)

Is there not an easy way to exclude any fallback hosts that resolve to the same region as the primary endpoint, and cache that result?

That's what I meant by region-aware, but I'm not sure it's worth the trouble. For one thing, in order to know the region you need to have had at least one actual response from the primary endpoint; but the specific scenario in which that information is most useful is when the primary endpoint is timing out. If the lib was only instantiated recently it may well never have seen an actual response from it. So we'd ideally want a strategy that works when we don't have region info, and if we have that, isn't it simpler to just use it all the time?

(For another thing, we've seen before that clients change LBR region surprisingly often)

@mattheworiordan (Member) left a comment

Approved by me. I think we should aim to get a release out soon for at least Java, Go, Ruby and JS, as a patch release on the existing libraries (regardless of major & minor version).

@bbeaudreault

My only concern here is that 10 minutes is sort of a blunt instrument. If our primary cluster is better located or better provisioned, there may be benefit in getting back to it as soon as possible. One thing that I've seen in circuit breaker libraries is that they trickle some small percentage of requests back to the primary over the course of the failure period. If a number of those requests succeed in a row, they switch back early.

I understand the need for cross platform support, and the cost of complexity mirrored across platforms. So the 10 minutes sounds like a great start which could be iterated on in the future if need be. But just wanted to bring up the above.
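
For illustration, the trickle-back idea could look something like this; the probe rate and success threshold are arbitrary numbers for the sketch, not anything agreed here:

```typescript
// Sketch of trickling a small share of requests back to the primary while a
// fallback is cached; PROBE_RATE and SUCCESSES_TO_RESTORE are arbitrary.
const PROBE_RATE = 0.05;        // send ~5% of requests to the primary anyway
const SUCCESSES_TO_RESTORE = 3; // switch back after this many probe successes

let consecutiveProbeSuccesses = 0;

function shouldProbePrimary(): boolean {
  return Math.random() < PROBE_RATE;
}

function onProbeOutcome(succeeded: boolean): boolean {
  // Returns true once enough probes succeed in a row to abandon the cached
  // fallback before the 10-minute timeout expires.
  consecutiveProbeSuccesses = succeeded ? consecutiveProbeSuccesses + 1 : 0;
  return consecutiveProbeSuccesses >= SUCCESSES_TO_RESTORE;
}
```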

@paddybyers (Member)

0dfd55f has been added to enable that value to be configured
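
In ably-js that would presumably surface as a client option along these lines (the option name and import style shown here are assumptions, not confirmed by this thread):

```typescript
// Hypothetical usage: overriding the fallback-host cache duration when
// instantiating the REST client. Option name assumed, not taken from this PR.
import * as Ably from 'ably';

const rest = new Ably.Rest({
  key: 'appId.keyId:keySecret',
  fallbackRetryTimeout: 10 * 60 * 1000, // ms; the proposal's default
});
```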

@mattheworiordan (Member) left a comment

LGTM now.

@bbeaudreault

LGTM
