DNS: timeout on resolving address. #40
Comments
Which version of logstash is being used with version 3.0.7? Also, are you pointing the filter at an internal or local DNS server? Do you have something like dnsmasq installed? Are there any other network-related exceptions being logged?
The filter just uses the system's resolvers; we run PowerDNS on "remote" internal servers. This problem also clogs the pipeline, since all requests suddenly are marked as timed out and retried until failed (adding +1 second per event).
I was seeing this behaviour on 3.0.7 (Logstash 6.2.3). I have upgraded to v3.0.9 and am still seeing the behaviour - I get many of these timeout errors. I suspect that if the solution above was to introduce a cache, that would not solve the problem - I am using this to process proxy logs and have >100k IPs that are resolved, so I can believe it is overflowing it. I am also running PDNS servers, so I will investigate there as well.
Same issue with 3.0.9. It keeps timing out until I restart logstash.
Logstash is using JRuby 9.1.13. Could it relate to this issue? It may explain why the DNS resolution works properly for some time and then blocks.
very good find @nithril! [edit] I tested that the patch applies cleanly to logstash 6.2.4:

Patch applied. I will be able to check next week.
Unfortunately it does not solve the issue. The FD count remains stable, no leak.
@nithril can you confirm the exact versions that trigger this? We know that >= 3.0.7 on logstash 6.2.2 triggers, but it'd be great if we could narrow this down. For example, version 3.0.7 didn't add anything new, so the issue isn't coming from it. It'd be great if we, using the same logstash version, could find a dns filter version < 3.0.7 where the timeouts don't happen. Also, you may want to know that version 3.0.10 fixes a serious concurrency issue that greatly(!) speeds up performance when the cache is used.
I'm happy to try some older versions, but it's hard to trigger - it seems to start occurring after an indeterminate amount of time.
The dns filter version is 3.0.9. All the timeout errors come from the same call. The interesting part is that the CPU is stuck at 100% (one core), and jstack seems to confirm where the process is stuck.
If this is the case then I suggest attempting another monkeypatch to resolv.rb to see if the problem goes away. While the commit you mentioned addressed the leak, there were about 4 other commits that touched the structure of resolv.rb and also mention a leak. @nithril if you could just replace the local logstash resolv.rb file with https://raw.githubusercontent.com/jruby/jruby/ruby-2.5/lib/ruby/stdlib/resolv.rb we'd be able to check if the latest jruby resolv.rb fixes this problem. Thanks for your patience so far.
The resolution does not work, not sure why. Is there a way to display jruby compilation/interpreter errors?
I have yet to be able to reproduce the problem locally. I tried on:
My test setup is this:

$ cat apache_log | bin/logstash -f dns.conf

What I see is that there are a lot of errors like

But I am assuming these are somewhat normal; more importantly, at no time do I see the pipeline stop. @nithril @jake2184 it would be great if you could run the same test with the same data and report what you see here. On my side I will try with LS 5 to see if the same timeouts actually occur.
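For reference, a dns.conf in the spirit of the test command above might look like the following sketch (the grok pattern, field names, and filter options here are assumptions, not the exact file used in the test):

```
input { stdin { } }
filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
  dns {
    reverse => ["clientip"]
    action  => "replace"
  }
}
output { stdout { codec => dots } }
```

The dots codec prints one dot per processed event, which makes stalls and 0-EPS periods visible at a glance.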
I added "5.6.3 with 3.0.6" above, which shows a similar behaviour.
The pipeline doesn't stop completely, it's still running and logging failures. The issue is that due to the endless timeouts, it cannot keep up with the incoming data stream. In my case this is only around 20-30 events a second. The queue (I have a persistent queue configured for logstash) then grows until I restart logstash. It's worth noting that I don't believe this is caused by an actual DNS timeout, as hinted at by @nithril. We run a high-availability internal DNS system, and are only resolving internal IP addresses. If we had anywhere near as many 'timeouts' as Logstash is recording we would be having many other problems all over our system.
@jake2184 thanks for the clarifications - I was in fact looking for a pipeline stall. To help correctly diagnose and/or test potential solutions I would like to find a combination that does not show the timeout behaviour using my minimal setup + data as shared above. Do you know of an LS+DNS combination that was known to be working correctly, and could you run the above config+data to verify on your end that it behaves the same? (I will test it too locally but knowing it produces the same result on your side will help)
I am also getting 'DNS: timeout on resolving address' when DNS returns NXDOMAIN. Does this mean it isn't accepting NXDOMAIN as a valid response, so it retries? Apologies for not being too helpful with the debugging; I've had a look at the Ruby source for the filter and the resolv library, but I don't know Ruby so can't tell whether there are any problems.
@jake2184 I am trying locally and NXDOMAIN is handled correctly; it does not result in a timeout. You really don't need to know Ruby to help here; the important thing for me at this point is to be able to reproduce the problem and reproduce when the problem does not occur. For that I'll repeat my question: Do you know of an LS+DNS combination that was known to be working correctly, and could you run the above config+data to verify on your end that it behaves the same? (I will test it too locally but knowing it produces the same result on your side will help)
I've taken a bit of time to set up a test environment - sorry, I should have done this earlier when you're so willing to help. I am using my own config file and data, editing the input/output to be stdin, and removing the DNS filter cache parameters. I am now able to reliably cause the problem. As indicated in the above bug report, I am almost certain that something is not being freed in resolv.rb. I am able to process just under 65536 events before the pipeline doesn't process any more and prints error messages. The exact number (of dots from the output codec, which I assume is accurate) is around 64100-64400, so roughly 65536 - 1024? I have tested the following versions: Manual fixes I have attempted on LS 6.2.3 and DNS 3.0.9 that have not worked:
Have had a quick look at the Apache log you are testing with. It has <40,000 unique IP addresses, so it's possible you're not doing enough DNS lookups to trigger the error, if you have any form of caching enabled?
@jake2184 thanks a lot for your follow-up! This is super useful.
Oh, this is interesting. OK, I will build a >64k unique IP addresses file, use that, and report back my findings.
Not at the filter level - but I locally installed dnsmasq using the default configuration, which I assume does caching. I guess that should not be relevant, though, since the dns filter still has to call the resolver, which will hit dnsmasq.
This is really mysterious. I built a file with ~3M unique IPs, one per line, and ran this test:
and at the same time measured the pipeline throughput.
All my tests processed a few 100Ks of events to make sure I am well over the 64k mark. Note that the config above does not use any caching. This suggests that something else is really going on. Could it be caching related? I will try another testing round with extreme caching enabled to make sure those caches grow huge, something like:
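The throughput-measurement tool used above is not shown in the thread; one way to sample pipeline EPS is to poll Logstash's monitoring API. The sketch below assumes the default API endpoint at localhost:9600 (the endpoint path and the 5-second interval are assumptions):

```ruby
require 'json'
require 'net/http'

# Events/second between two samples of a monotonically increasing
# "events out" counter.
def eps(prev_out, cur_out, interval_s)
  (cur_out - prev_out) / interval_s.to_f
end

# Sampling loop against the monitoring API (assumes the default endpoint
# at localhost:9600); commented out so the sketch stays self-contained.
#
# prev = nil
# loop do
#   stats = JSON.parse(Net::HTTP.get(URI('http://localhost:9600/_node/stats/events')))
#   out = stats['events']['out']
#   puts "#{eps(prev, out, 5)} eps" if prev
#   prev = out
#   sleep 5
# end

puts eps(10_000, 12_500, 5)   # => 500.0
```

Watching this counter over time is what makes the "periods of 0 EPS" described below directly measurable.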
damn - I just realized that my IP deduping was wrong and my IPs are not unique. Fixing that and re-running tests.
My test file now contains 68648 unique IPs. My test config is:
So basically this is without any caching.
Overall the EPS differences here are not really important, in the sense that the same behaviour was witnessed in all these scenarios: the pipeline was steadily processing, with many periods of 0 EPS because of the DNS timeouts, but it never completely stalled. So my conclusion is similar to the one above: I can't seem to reproduce the stalling as something that happens after some time, and I can't reproduce a scenario that works fast for a while first. Is that normal because many of my IPs are in fact "normally" timing out? Should I try with only IPs known to be valid? I will check that.
I ran my same tests again (LS 6.2.3 + DNS 3.0.9) and measured using the script above. The output was: As you can see, I can rapidly do many DNS lookups for the first 60K-ish events, and then there is a drastic drop-off. It looks like it does still process events in intermittent bursts, perhaps as whatever resource is tied up gets freed? However, it can process 0 events for 30 seconds. Version info: I don't think it's an individual system config problem, as I see the issue on both our production system and the test one I am running. I get the feeling it is going to be tricky for you to help much if you cannot generate the problem! Very grateful for any and all time spent assisting.
I figured out last night how to test that - stay tuned 🤔
BOOM 💥 I can reproduce. Here's what I did:
This setup allows me to generate an exact number of always-resolvable IP addresses. The result: after EXACTLY 65534 unique address resolutions, the resolver systematically times out. Now that I have a reproduction setup, let's work on the solution. I will now officially re-open this issue.
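The exact reproduction setup isn't shown above, but generating an exact number of unique, sequential IPv4 addresses can be sketched as follows (the 10.0.0.0 starting range and helper names are illustrative assumptions; making them "always resolvable" would come from something like a local wildcard dnsmasq zone):

```ruby
# Convert a 32-bit integer to dotted-quad IPv4 notation.
def int_to_ip(n)
  [24, 16, 8, 0].map { |shift| (n >> shift) & 255 }.join('.')
end

# Emit `count` sequential addresses starting at `base`. The 10.x range
# below is an arbitrary private-network choice for illustration.
def sequential_ips(base, count)
  start = base.split('.').map(&:to_i).inject(0) { |acc, octet| (acc << 8) | octet }
  (0...count).map { |i| int_to_ip(start + i) }
end

p sequential_ips('10.0.0.254', 4)
# => ["10.0.0.254", "10.0.0.255", "10.0.1.0", "10.0.1.1"]
```

Integer arithmetic (rather than string manipulation of the last octet) keeps the addresses unique across octet rollovers, which matters when the goal is to cross the 65536-unique-address mark.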
Good news: I can reproduce this problem outside of logstash.
It fails on the JRuby versions listed above. As suggested by @jsvd, the fix IS indeed this patch, as seen in the jruby diff.
The correct fix will be to upgrade to a JRuby version >= 9.1.16.0, but until we can do that I will probably be looking at creating a monkey patch.
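For illustration of the fix pattern only (the class and method names below are invented, not the real resolv.rb API): the essence of the JRuby >= 9.1.16.0 change is that the 16-bit request id gets released even when a query times out, so the id space can no longer drain.

```ruby
require 'timeout'

# Toy 16-bit request-id pool (invented API, for illustration only).
class RequestIdPool
  def initialize
    @free = (0...0x10000).to_a   # 65536 possible ids
  end

  def size
    @free.size
  end

  # Allocate an id, yield it, and ALWAYS return it to the pool. The
  # `ensure` is the crux of the fix: it runs even when the block raises
  # (e.g. a Timeout::Error from an unresponsive DNS server), so timed-out
  # queries no longer leak their ids.
  def with_id
    id = @free.pop
    raise 'request id space exhausted' unless id
    begin
      yield id
    ensure
      @free.push(id)
    end
  end
end

pool = RequestIdPool.new
70_000.times do                  # more queries than there are ids
  begin
    pool.with_id { raise Timeout::Error, 'simulated dead DNS server' }
  rescue Timeout::Error
  end
end
puts pool.size   # => 65536: nothing leaked despite every query "timing out"
```

Without the `ensure`, each timed-out query would permanently consume one id, and the pool would be empty after 65536 failures, matching the observed stall just under 65536 events.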
All this effort just because we patched the wrong file. Thank you @colinsurprenant 👍
submitted #45 and also elastic/logstash/issues/9750 |
@nithril yeah really - I wish I had clued in earlier. At least now we have a solid setup to reproduce and stress test any scenarios with this filter and that also gave me an idea on how to make the input generator plugin more flexible. |
Brilliant! Thanks for all your help + time. When can we expect the pull request to be merged and released (for the plugin)? |
@jake2184 the PR is being reviewed, progress is visible there
Thank you very much! I will be able to test in 2-4 weeks :/

> On Mon, Jun 18, 2018, Colin Surprenant wrote:
> logstash-filter-dns v3.0.11 is published. @jake2184 @nithril would love to know if that works for you.
Can confirm this fixes my test; I can happily resolve 5K+ events a second :) Thanks very much.
Great stuff! Thanks for the update @jake2184 |
@colinsurprenant I'm seeing this behaviour on logstash-filter-dns v3.0.12 in Logstash 6.7.1. Here are my DNS filter configurations; there are a few more instances of the plugin running in other config files. From 2-general/400000-ad-smb.conf:

  dns {
    resolve => ["[@metadata][destination][address]"]
    action => "replace"
    failed_cache_size => 1000
    failed_cache_ttl => 60
    hit_cache_size => 10000
    hit_cache_ttl => 300
  }

  dns {
    reverse => ["[@metadata][ip]"]
    action => "replace"
    failed_cache_size => 1000
    failed_cache_ttl => 60
    hit_cache_size => 10000
    hit_cache_ttl => 300
  }

I'm wondering if the multiple instances are somehow conflicting. After a certain time, almost all the lookups time out, slowing down the pipelines significantly. I'm also wondering whether the hit cache is getting so large that the timeout of 500ms (default) is being hit before actually trying our resolving name servers. Anyway, I'm going to try downgrading to logstash-filter-dns v3.0.11 to see if I can replicate it there. Another option would be to avoid using the caches altogether.
@berglh thanks for the heads-up. In fact, we've seen other reports of this; it seems like a regression in 6.7+. Will report back when I find more.
@colinsurprenant I've tested some more this morning and tried the following:
The behaviour continued in the first two on LS 6.7.1. Some other notes about my environment:
Do you want me to open an issue or comment on an open one?
@berglh Thanks for the follow-up. I believe I found the source of the problem. I will need to validate tomorrow, but there seems to be a leak in the new version of the resolver code that shipped starting with 6.7: a request id cache fills up, which leads to an infinite loop that is aborted by the timeout. For now, as you noticed, the only workaround is to downgrade to 6.6 (6.6.2 as of now); just downgrading the dns filter will not help.
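As a toy model of why a full request-id table surfaces as timeouts rather than an explicit error (invented code, not the actual resolver): an allocator that retries random 16-bit ids spins forever once every id is marked in-use, and only the surrounding query timeout aborts it.

```ruby
require 'set'

# Retry-style allocator (invented, simplified): pick random 16-bit ids
# until one is free. `max_attempts` stands in for the query timeout that
# eventually aborts the otherwise-endless loop.
def allocate_request_id(in_use, max_attempts: 500_000)
  max_attempts.times do
    id = rand(0x10000)
    return id unless in_use.include?(id)
  end
  nil   # aborted: surfaces to the caller as "DNS: timeout on resolving address"
end

leaked = Set.new(0...0x10000)   # every id marked in-use and never released
p allocate_request_id(leaked)   # => nil, after spinning
p allocate_request_id(Set.new)  # a free id, found on the first try
```

A loop like this would also explain the earlier observation of one core pinned at 100% while lookups appear to time out.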
@berglh Yeah, we should open a new issue and follow up there...
After updating to 3.0.7, the plugin starts spamming

DNS: timeout on resolving address. {:field=>"host", :value=>"172.20.54.3"}

errors in the logs after some time. If I do a manual lookup on the indexer, the IP resolves fine. So it seems like some kind of cache or memory buffer is breaking lookups permanently. It seems that after 1 timeout, no new DNS queries are even made (I checked with tcpdump).
If I restart the indexer, the plugin works fine for a while until it breaks again.
Config: