Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(resty.dns.client): fix the UDP sockets leak #10691

Merged
merged 13 commits into from
Apr 25, 2023

Conversation

liverpool8056
Copy link
Contributor

Summary

In resty.dns.client module when resolver:new() is called, a UDP socket is created. In 2.x, the UDP socket is released quickly, while in 3.x, it has to take several seconds to release. If the ttl of dns responses is shorter than the time it waits, the leak will happen. In this PR, a forcible release is added.

FTI-4962

Checklist

Full changelog

  • [Implement ...]

Issue reference

FTI-4962

@liverpool8056 liverpool8056 force-pushed the fix/udp-socket-leak-in-dns-client branch 2 times, most recently from 82f6e1c to 6e1abe8 Compare April 18, 2023 12:16
@@ -702,6 +702,14 @@ local function parseAnswer(qname, qtype, answers, try_list)
return true
end

local function close_socks(resolver)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have an upstream PR for this? if not can we send one (for lua-resty-dns)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no upstream PR committed yet, actually I think it might not be lua-resty-dnss business, the sockets could be destroyed in 2.8, and I don't know why not being destroyed in 3.x

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, then maybe we should look for the root-cause first, before fixing it.

are we keeping them cached somewhere? such that they stay alive instead of getting GC'ed?

Copy link
Member

@Tieske Tieske Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also; an upstream PR for always closing the sockets would be a sanity item imo. So would still make sesne to me.

since we're now fiddling with that libs internals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems not being cached somewhere, the resolver is created in ngx.timer, I'm not sure if the timer matters. However the logic of dns queries is similar between in 2.x and 3.x, so I have no idea why it just happens in 3.x.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ADD-SP and I did further investigations and found that this is caused by timer-ng lib. After replacing the timer-ng with the native timer to schedule the dns query, this issue never happens. I'm not sure but it seems like there are some differences between timer-ng and native time over garbage collection.

Even if the issue is caused by timer-ng, it still makes sense to me that the sockets should be released explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very strange, we try to make timer-ng run full GC after each handler and this issue does not appear.

So there is no leak, but it looks like the GC has slowed down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tieske I committed an upstream PR: openresty/lua-resty-dns#64
Can we just take this PR as a workaround so that this issue could be fixed quickly? If the PR for resty-dns is released, we can revert this PR and make a better one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on OpenResty's claim at here, there are three ways to release socket objects:

  1. Call the close() method explicitly.
  2. Released by the Lua GC.
  3. Current client HTTP request finishes processing.

Since lua-resty-dns does not explicitly close socket objects or expose the close() method to users, the relevant socket objects can only be released through methods 2 and 3.

The reason why the native timer is not problematic is that once the timer ends, the fake request is considered finished (effectively fulfilling the condition of method 3). However, timer_ng reuses the existing ngx_timer to execute tasks, so the expected condition of method 3 did not occur and it was only dependent on method 2. I believe this is why those UDP socket objects were leaked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just take this PR as a workaround so that this issue could be fixed quickly? If the PR for resty-dns is released, we can revert this PR and make a better one.

Yes, that'd work. Please create an item to track this. eg a PR based on the proposed changes in resty-dns, or an internal Jira ticket, whatever works for you.

kong/resty/dns/client.lua Outdated Show resolved Hide resolved
kong/resty/dns/client.lua Outdated Show resolved Hide resolved
kong/resty/dns/client.lua Outdated Show resolved Hide resolved
@liverpool8056 liverpool8056 force-pushed the fix/udp-socket-leak-in-dns-client branch from f711900 to f238569 Compare April 19, 2023 05:02
@liverpool8056 liverpool8056 requested review from Tieske and ms2008 April 20, 2023 03:26
Copy link
Contributor

@ms2008 ms2008 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Thijs mentioned it seems more appropriate to add a new close() method to the upstream project lua-resty-dns, and then we call it here.

But overall this looks great; nice test BTW.

kong/resty/dns/client.lua Outdated Show resolved Hide resolved
Copy link
Contributor

@vm-001 vm-001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I saw an upstream PR already out there, and this is just a temporary solution as it will switch to calling the convenient close function after the upstream one merged and released.

Copy link
Contributor

@ADD-SP ADD-SP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the better way is write a patch to lua-resty-dns until they merge our fix and release a new version.

I don't want to do something in the core that should be done by an external dependency, which would reduce maintainability.

@@ -145,5 +145,60 @@ for _, strategy in helpers.each_strategy() do
assert.response(r).has.status(503)
end)
end)
describe("udp sockets", function()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, we should write tests for all fixes, but this issue is caused by third-party dependencies and I'm not sure how we should do it.

I think it's too heavy for introducing an integration test for this purpose, I'd prefer to use unit tests to cover it.

The perfect thing to do would be to have lua-resty-dns cover the issue, but until then I think it would be better to use unit tests instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that even if we change this pr to a patch, the test case is still coverable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the issue can be covered, but does it make more sense to move the tests to unit tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that makes more sense, just a suggestion.

Copy link
Contributor

@ADD-SP ADD-SP Apr 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is NOT A BLOCKER for this PR.

@ADD-SP
Copy link
Contributor

ADD-SP commented Apr 24, 2023

@Tieske FYI, we plan to write a patch to the lua-resty-dns instead of changing the core code.

@liverpool8056
Copy link
Contributor Author

I think the better way is write a patch to lua-resty-dns until they merge our fix and release a new version.

makes sense to me

ADD-SP
ADD-SP previously requested changes Apr 24, 2023
Copy link
Contributor

@ADD-SP ADD-SP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spec/02-integration/05-proxy/05-dns_spec.lua

If someone who has no context for this issue read this test, it is hard to understand why we write this test.

So could you please add some comments to explain more?

@Tieske
Copy link
Member

Tieske commented Apr 24, 2023

FYI, we plan to write a patch to the lua-resty-dns instead of changing the core code.

My 2cts: the upstream is moving slowly, so the patch is going to exist for quite a while probably. I personally find patches cumbersome to maintain, and the code is less clear, because the code is not in-line. You already need to know the patch exists to be able to reason about what you actually read in the code.
So imho maintainability is better if we apply the temporary fix to the core.

That said; it's not my call, so feel free to ignore.

liverpool8056 and others added 13 commits April 25, 2023 11:13
In `resty.dns.client` module a resolver is created for domain query.
When `resolver:new()` is called, a UDP socket is created. In 2.x, the
UDP socket is released quickly, while in 3.x, it has to take several
seconds to release. If the ttl of dns reponses is shorter than the time
it waits, the leak will happen. In this PR, a forcible release is added.

FTI-4962
Co-authored-by: Thijs Schreijer <thijs@thijsschreijer.nl>
Co-authored-by: Zijing Zhang <50045289+pluveto@users.noreply.github.com>
Co-authored-by: Vinicius Mignot <vinicius.mignot@gmail.com>
@ms2008 ms2008 force-pushed the fix/udp-socket-leak-in-dns-client branch from e759f88 to 5cf355a Compare April 25, 2023 03:13
@windmgc windmgc merged commit f46078c into master Apr 25, 2023
@windmgc windmgc deleted the fix/udp-socket-leak-in-dns-client branch April 25, 2023 04:17
liverpool8056 added a commit that referenced this pull request May 4, 2023
…cessary

For those sockets created inside the schedule of timer-ng, the sockets
must be closed explicitly. More details, please refer to:
[#10691](#10691)
liverpool8056 added a commit that referenced this pull request May 16, 2023
…cessary

For those sockets created inside the schedule of timer-ng, the sockets
must be closed explicitly. More details, please refer to:
[#10691](#10691)
windmgc pushed a commit that referenced this pull request May 17, 2023
…cessary (#10783)

* fix(loggly&tcp-log&udp-log&reports): close sockets explicitly when necessary

For those sockets created inside the schedule of timer-ng, the sockets
must be closed explicitly. For more details, please refer to:
[#10691](#10691)


Co-authored-by: Chrono <chrono_cpp@me.com>
Co-authored-by: Jun Ouyang <ouyangjun1999@gmail.com>
liverpool8056 added a commit that referenced this pull request Mar 6, 2024
#10691)

* fix(resty.dns.client): fix the leak of UDP sockets in resty.dns.client

In `resty.dns.client` module a resolver is created for domain query.
When `resolver:new()` is called, a UDP socket is created. In 2.x, the
UDP socket is released quickly, while in 3.x, it has to take several
seconds to release. If the ttl of dns reponses is shorter than the time
it waits, the leak will happen. In this PR, a forcible release is added.

FTI-4962
FTI-5803

---------

Co-authored-by: Thijs Schreijer <thijs@thijsschreijer.nl>
Co-authored-by: Zijing Zhang <50045289+pluveto@users.noreply.github.com>
Co-authored-by: Vinicius Mignot <vinicius.mignot@gmail.com>
lhanjian pushed a commit that referenced this pull request Dec 23, 2024
KAG-5729, KAG-5736, KAG-5755, KAG-5770, KAG-5757

(cherry picked from commit #13853)

Co-authored-by: Chrono <chrono_cpp@me.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants