fix(balancer) fix accidental ttl=0 switches #56

Tieske · 2018-08-23T14:49:57Z

Some servers will report ttl=0 when they are on the very edge
of their own cached ttl. This should never happen for a record
that has a non-0 ttl.

This fix makes sure we require ttl=0 reported twice in a row before
we switch the loadbalancer.

Fixes #51
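
The check-twice rule described above can be modelled in a few lines. This is only an illustrative Python sketch, not the library's Lua code: the names `confirmed_ttl0` and `trace` are hypothetical, and the very first lookup (where no previous record exists) is deliberately left out.

```python
def confirmed_ttl0(previous_ttl: int, new_ttl: int) -> bool:
    """True only when ttl=0 was reported twice in a row."""
    return new_ttl == 0 and previous_ttl == 0


def trace(ttls):
    """For each answer after the first, report whether ttl=0 is confirmed."""
    return [confirmed_ttl0(prev, new) for prev, new in zip(ttls, ttls[1:])]


# A single stray ttl=0 between normal answers is ignored;
# only the second consecutive ttl=0 counts.
observed = [30, 0, 30, 0, 0]
print(trace(observed))  # [False, False, False, True]
```

In this model a lone ttl=0 sandwiched between non-zero answers never causes a switch, which is the behaviour the description asks for.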

Tieske · 2018-08-23T14:53:22Z

For reference: the related Kong issue is Kong/kong#3641.

hishamhm · 2018-08-23T19:30:26Z

I did read the Kong issue but I'm struggling a bit to follow. Could two queries done in the same second cause this situation even with this check-twice fix?

Tieske · 2018-08-23T20:06:59Z

No, this really is an edge case. It is not that the DNS server counts down from ttl to 0 when it should stop at 1; in that case it would report 0 for a full second. Rather, the remaining ttl as cached by the DNS server is very close to 0 (or actually 0, with a bad comparison like >= 0 instead of > 0). So this case exists only for a very short time.
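
To make the suspected off-by-one concrete, here is a purely hypothetical model of a caching resolver, written in Python for illustration. It is not taken from any real DNS server; the function name, the millisecond clock and the ceiling rounding are all assumptions. It shows how a `>= 0` freshness check can serve ttl=0 for a single instant, while every other moment yields either a positive ttl or a refetch.

```python
def cached_answer_ttl(expires_at_ms: int, now_ms: int):
    """Return the ttl (in whole seconds) served from cache, or None if expired.

    Hypothetical model: the remaining time is rounded up to whole seconds,
    so a healthy entry counts down from ttl to 1.
    """
    remaining_ms = expires_at_ms - now_ms
    if remaining_ms >= 0:  # assumed bad comparison; '> 0' would refetch at 0
        return -(-remaining_ms // 1000)  # ceiling division
    return None  # expired: the server would query upstream again


print(cached_answer_ttl(30_000, 0))       # 30
print(cached_answer_ttl(30_000, 29_999))  # 1
print(cached_answer_ttl(30_000, 30_000))  # 0  (the edge case)
print(cached_answer_ttl(30_000, 30_001))  # None
```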

I'm not sure whether two consecutive queries within e.g. 1ms of each other would be enough to re-trigger the issue. That said, the default behaviour is to "synchronise" queries: a request for name-x while another query for name-x is already in progress will not re-query, but simply uses the results of the query already in progress.
Since those synced queries return the same table, they will not trigger the "there is a new dns record" branch (https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer.lua#L355).
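
The effect of synchronised queries can be sketched as follows. This is an illustrative Python model using asyncio, not the Lua client: `SyncedResolver`, its `_lookup` stand-in and the returned record shape are all hypothetical. The point is only that two overlapping requests for the same name receive the very same result object, so an identity comparison sees no new record.

```python
import asyncio


class SyncedResolver:
    """Hypothetical resolver that shares one in-flight lookup per name."""

    def __init__(self):
        self._in_flight = {}
        self.lookups = 0

    async def _lookup(self, name):
        self.lookups += 1
        await asyncio.sleep(0.01)  # stand-in for network latency
        return [{"name": name, "address": "10.0.0.1", "ttl": 0}]

    async def resolve(self, name):
        task = self._in_flight.get(name)
        if task is None:
            # No query in progress for this name: start one and remember it.
            task = asyncio.ensure_future(self._lookup(name))
            self._in_flight[name] = task
            task.add_done_callback(lambda _: self._in_flight.pop(name, None))
        # Every caller awaits the same task and gets the same list object.
        return await task


async def demo():
    resolver = SyncedResolver()
    first, second = await asyncio.gather(
        resolver.resolve("name-x"), resolver.resolve("name-x")
    )
    return first is second, resolver.lookups


same_table, lookups = asyncio.run(demo())
print(same_table, lookups)  # True 1
```

Because both callers see the identical object, a "did the record change?" check based on identity treats the second result as unchanged and never reaches the ttl=0 comparison.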

From there, if we have 2 different queries, we can safely assume that the latency between the 2 will be at least 10ms (and probably more). My guess (no certainty) is that this is more than enough to bypass the issue.

It worked for the user reporting the problem.

hishamhm · 2018-08-27T13:38:53Z

Makes sense. One last question about this topic (more for the purpose of documenting this logic in this PR history than anything else):

Since the case for ttl=0 existed, I assume there are users who use it explicitly (for "query every time" behavior). Does adding this check for two requests change the behavior for those users? (In other words, will the first or second query for users of ttl=0 behave any differently with this PR applied?)

Tieske · 2018-08-27T13:41:07Z

No, that is exactly the difference between this patch and the one proposed in the Kong issue. Note the or 0) in my patch, which is not in the Kong issue.
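
One plausible reading of the `or 0)` remark, sketched in Python for illustration: when there is no previous record yet, its ttl is treated as 0, so a genuine ttl=0 record is accepted on the very first query. This is an assumption about the patch, not a copy of it, and the names `previous_ttl` and `should_switch` are hypothetical.

```python
def previous_ttl(previous_record):
    """TTL of the last seen record, treating 'no record yet' as 0 (assumed)."""
    if previous_record is None:
        return 0
    return previous_record.get("ttl") or 0


def should_switch(previous_record, new_record):
    """Switch to ttl=0 handling only if the previous ttl was also 0."""
    return new_record["ttl"] == 0 and previous_ttl(previous_record) == 0


# Explicit ttl=0 users: the first and every later query behave as before.
print(should_switch(None, {"ttl": 0}))         # True
print(should_switch({"ttl": 0}, {"ttl": 0}))   # True
# Accidental ttl=0 on a record that normally has ttl=30 is ignored.
print(should_switch({"ttl": 30}, {"ttl": 0}))  # False
```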

hishamhm

Thank you for the clarifications!

Tieske self-assigned this Aug 23, 2018

Tieske requested a review from hishamhm August 23, 2018 14:50

Tieske mentioned this pull request Aug 23, 2018

Upstream targets go unhealthy every few seconds Kong/kong#3641

Closed

hishamhm approved these changes Aug 27, 2018

Tieske force-pushed the fix2-route53 branch from 0efdec8 to 8857539 Compare August 27, 2018 14:50

Tieske force-pushed the fix2-route53 branch from 8857539 to ea8d203 Compare August 27, 2018 14:51

Tieske merged commit c010840 into master Aug 27, 2018

Tieske deleted the fix2-route53 branch August 27, 2018 14:52

onematchfox mentioned this pull request May 28, 2021

Kong upstream health flip-flopping due to TTL=0 handling #131

Open
