This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

fix(balancer) fix accidental ttl=0 switches #56

Merged (1 commit) on Aug 27, 2018

Conversation

@Tieske (Member) commented on Aug 23, 2018

See #51

Some servers will report ttl=0 when they are on the very edge
of their own cached ttl. This should never happen for a record
that has a non-0 ttl.

This fix requires ttl=0 to be reported twice in a row before
we switch the load balancer.

Fixes #51
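
As an illustration of the idea only (hypothetical names, not the actual diff; the real check lives in balancer.lua), the check-twice logic could be sketched like this:

```lua
-- Hypothetical sketch: only switch the balancer to ttl=0 mode when the
-- fresh record AND the previously cached record both report ttl=0. A
-- single spurious ttl=0 on the edge of the server's cache window is
-- ignored, because the previous record still carries its non-0 ttl.
local empty = {}  -- sentinel for "no record available"

local function should_switch_ttl0(old_entry, new_entry)
  local new_ttl = (new_entry or empty).ttl
  local old_ttl = (old_entry or empty).ttl or 0  -- no history counts as ttl 0
  return new_ttl == 0 and old_ttl == 0
end

print(should_switch_ttl0({ ttl = 30 }, { ttl = 0 }))  --> false (spurious edge)
print(should_switch_ttl0({ ttl = 0 },  { ttl = 0 }))  --> true  (genuine ttl=0)
```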

@Tieske (Member, Author) commented on Aug 23, 2018

For reference: the related Kong issue

@hishamhm (Contributor)

I did read the Kong issue but I'm struggling a bit to follow. Could two queries done in the same second cause this situation even with this check-twice fix?

@Tieske (Member, Author) commented on Aug 23, 2018

No, the problem really is an edge case. It is not that the DNS server counts down from ttl to 0 instead of from ttl to 1; in that case it would report 0 for a full second. Rather, it is an edge where the remaining ttl as cached by the DNS server is very close to 0 (or even exactly 0, given a bad comparison like `>= 0` instead of `> 0`). So this case exists only for a very small window of time.
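
To make the edge concrete, a hypothetical sketch of how a caching server could end up reporting ttl=0 (purely illustrative; this is not Route 53's actual logic):

```lua
-- Hypothetical sketch of the server-side edge: a caching resolver
-- serves a record until its cached ttl runs out. With `>=` the record
-- is still served right at the edge, and the fractional remainder
-- truncates to ttl=0 -- the spurious value this fix guards against.
local function remaining_ttl(entry, now)
  local remaining = entry.expires_at - now
  if remaining >= 0 then          -- `> 0` here would avoid reporting ttl=0
    return math.floor(remaining)  -- e.g. 0.4s remaining reports as ttl=0
  end
  return nil                      -- expired: would re-resolve upstream
end

print(remaining_ttl({ expires_at = 100 }, 99.6))  --> 0  (the edge case)
print(remaining_ttl({ expires_at = 100 }, 98.0))  --> 2
```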

I'm not sure whether two consecutive queries within e.g. 1 ms of each other would be enough to re-trigger the issue. That said, the default behaviour is to "synchronise" queries: requesting name-x while another query for name-x is already in progress will not re-query, but simply use the results of the query already in progress.
Since those synchronised queries return the same table, they will not trigger the "there is a new dns record" branch (https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer.lua#L355).
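
A simplified sketch of that synchronisation (the real client's code differs; `do_dns_query` here is a stand-in, shown below for illustration):

```lua
-- Simplified sketch of query synchronisation: concurrent lookups for
-- the same name share one in-flight query and all receive the *same*
-- result table, so an identity check for "a new dns record" cannot
-- fire between them.
local inflight = {}  -- name -> list of callbacks waiting on the query

-- stand-in for the real asynchronous lookup (assumption, illustration only)
local function do_dns_query(name, cb)
  cb({ name = name, ttl = 30 })
end

local function resolve(name, callback)
  if inflight[name] then
    table.insert(inflight[name], callback)  -- join the in-flight query
    return
  end
  inflight[name] = { callback }
  do_dns_query(name, function(record)
    local waiters = inflight[name]
    inflight[name] = nil
    for _, cb in ipairs(waiters) do
      cb(record)  -- every waiter gets the identical table
    end
  end)
end
```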

From there, if we have two distinct queries, we can safely assume the latency between them will be at least 10 ms (and probably more). My guess (no certainty) is that this is more than enough to bypass the issue.

It worked for the user reporting the problem.

@hishamhm (Contributor)

Makes sense. One last question on this topic (more for the purpose of documenting this logic in the PR history than anything else):

Since the ttl=0 case already existed, I assume there are users who use it explicitly (for "query every time" behaviour). Does adding this check for two requests change the behaviour for those users? (In other words, will the first or second query for users of ttl=0 behave any differently with this PR applied?)

@Tieske (Member, Author) commented on Aug 27, 2018

No, that is exactly the difference between this patch and the one proposed in the Kong issue. Note the `or 0)` in my patch, which is not in the Kong issue.
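
One plausible reading of that difference, as a hypothetical sketch (again not the actual diff): with an `or 0` default on the previous ttl, a name that has never been queried counts as "previous ttl 0", so an explicit ttl=0 record still switches on its very first query.

```lua
local empty = {}

-- with `or 0`: a missing previous record counts as "previous ttl 0",
-- so an explicit ttl=0 record switches on its very first query
local function switch_with_default(old_entry, new_ttl)
  local old_ttl = (old_entry or empty).ttl or 0
  return new_ttl == 0 and old_ttl == 0
end

-- without `or 0` (the Kong-issue variant lacked it): the first query
-- sees a nil previous ttl, the comparison fails, and in this sketch the
-- switch for genuine ttl=0 users is delayed by one query
local function switch_without_default(old_entry, new_ttl)
  local old_ttl = (old_entry or empty).ttl
  return new_ttl == 0 and old_ttl == 0
end

print(switch_with_default(nil, 0))     --> true  (behaviour unchanged)
print(switch_without_default(nil, 0))  --> false (one query delayed)
```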

@hishamhm (Contributor) left a comment:

Thank you for the clarifications!

Successfully merging this pull request may close these issues:

Temporary ttl=0 returned on edge case by Route 53 (#51)