dns resolution for plugins using async mode does not consider all the DNS servers on the host #5862

Open
rawahars opened this issue Aug 10, 2022 · 7 comments
Labels
exempt-stale help wanted long-term Long term issues (exempted by stale bots) Windows Bugs and requests about Windows platforms

Comments

@rawahars
Contributor

rawahars commented Aug 10, 2022

Bug Report

Describe the bug

In fluent-bit, plugins can use async mode for better performance. This is the default setting, so many plugins use it.

Consider a scenario in which the host has multiple DNS servers [x.x.x.x, y.y.y.y], where the first (primary) DNS server (x.x.x.x) does not resolve the endpoint but the secondary DNS server (y.y.y.y) resolves it correctly. In such cases, plugins using async mode attempt DNS resolution with the primary server only and fail without ever trying the secondary.

The resulting error is:

[ warn] [net] getaddrinfo(host='www.google.com', err=12): Timeout while contacting DNS servers

This is contrary to how DNS resolution should work. The expected behaviour is to try every server in the DNS server list before reporting an error.

Note: This scenario works correctly for plugins in which async mode is disabled.

To Reproduce

  • Example log message if applicable:
[ warn] [net] getaddrinfo(host='www.google.com', err=12): Timeout while contacting DNS servers
  • Steps to reproduce the problem:

    The issue can be reproduced with the following steps:

    • Create a new Linux VM or just create a new container using cr.fluentbit.io/fluent/fluent-bit:1.9.6-debug

      docker run -it cr.fluentbit.io/fluent/fluent-bit:1.9.6-debug bash
      
    • Inside the container or VM, change the DNS setting to include an invalid DNS Server as the primary DNS server.

      vi /etc/resolv.conf
      # Change the file to include a new invalid nameserver. For example-
      # search us-west-2.compute.internal
      # nameserver 10.0.0.6 --> Invalid DNS Server
      # nameserver 10.0.0.2 --> Valid DNS Server
      
    • Start fluent-bit with http output plugin to send logs to a remote server.

      export FLB_LOG_LEVEL=debug
      ./fluent-bit -i dummy -o http://www.google.com:443 -p tls=on -p tls.verify=off
      

      Using http://www.google.com:443 is the easiest way to reproduce the issue, because fluent-bit fails at DNS resolution before anything else happens. The same issue applies to the Kinesis Streams and Kinesis Firehose plugins as well.

      ./fluent-bit -i dummy \
      -o firehose \
      -p "region=us-west-2" \
      -p "delivery_stream=example-stream"
      

      We could not use an HTTP benchmarking server on localhost, because the reproduction needs a real DNS lookup. Such a server can be set up on another machine and used here to reproduce the issue.

Expected behavior

DNS resolution should be attempted with the secondary DNS server before erroring out.

For the first example (www.google.com), the logs cannot be delivered because it is not a valid destination, so we get an error 405 from the Google server. That response means DNS resolution worked.
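The expected behaviour can be sketched in C as follows. This is only an illustration of "try every nameserver before giving up", not fluent-bit's or c-ares's actual code: the names `dns_status`, `dns_query_fn`, `resolve_with_fallback` and `mock_query` are hypothetical, and the mock resolver simply treats 10.0.0.2 (the valid server from the reproduction steps) as the only server that answers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this sketch; not fluent-bit or c-ares values. */
enum dns_status { DNS_OK = 0, DNS_TIMEOUT = 1 };

/* Hypothetical callback type: query one nameserver for one host name. */
typedef enum dns_status (*dns_query_fn)(const char *nameserver,
                                        const char *host);

/*
 * Try each configured nameserver in order and stop at the first success.
 * Returns the index of the server that answered, or -1 if all of them failed.
 */
int resolve_with_fallback(const char *const *nameservers, size_t count,
                          const char *host, dns_query_fn query)
{
    size_t i;

    assert(query != NULL);

    for (i = 0; i < count; i++) {
        if (query(nameservers[i], host) == DNS_OK) {
            return (int) i;
        }
        /* This server failed: move on to the next one instead of aborting. */
    }

    return -1;
}

/* Stand-in resolver for demonstration: only 10.0.0.2 answers. */
enum dns_status mock_query(const char *nameserver, const char *host)
{
    (void) host;
    return strcmp(nameserver, "10.0.0.2") == 0 ? DNS_OK : DNS_TIMEOUT;
}
```

With the nameserver list from the reproduction steps (10.0.0.6 first, then 10.0.0.2), `resolve_with_fallback` would report the second server as the one that answered, whereas the behaviour described in this report corresponds to stopping after the first entry.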

Screenshots

Your Environment

  • Version used: 1.9.6
  • Configuration: Dummy input plugin and any output plugin using http endpoint and async mode
  • Environment name and version (e.g. Kubernetes? What version?): Docker
  • Server type and version:
  • Operating System and version: Linux and Windows OS
  • Filters and plugins: Dummy input plugin and any output plugin using http endpoint and async mode

Additional context

@rawahars
Contributor Author

Based on my debugging, in the failure mode (async mode enabled) we follow the codepath here. This calls flb_net_getaddrinfo, which in turn calls the c-ares library for DNS resolution.

In the happy path, when using plugins with async mode disabled, the codepath followed is this one. This calls the getaddrinfo API for DNS resolution.

I also tried compiling the plugins with async mode disabled using upstream->flags &= ~(FLB_IO_ASYNC);. With that change, the plugins resolve correctly via the secondary DNS server.
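For readers unfamiliar with the idiom, the statement above clears a single bit in a flags field and leaves the others untouched. A minimal standalone sketch, assuming made-up flag values (the `SKETCH_*` macros and `sketch_upstream` struct are hypothetical; the real definitions live in fluent-bit's headers):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; not fluent-bit's real values. */
#define SKETCH_IO_TLS   (1u << 1)
#define SKETCH_IO_ASYNC (1u << 3)

/* Hypothetical stand-in for an upstream object that carries I/O flags. */
struct sketch_upstream {
    unsigned int flags;
};

/* Clear only the async bit; every other flag keeps its value. */
void sketch_disable_async(struct sketch_upstream *u)
{
    assert(u != NULL);
    u->flags &= ~SKETCH_IO_ASYNC;
}

/* Return 1 if the async bit is set, 0 otherwise. */
int sketch_is_async(const struct sketch_upstream *u)
{
    assert(u != NULL);
    return (u->flags & SKETCH_IO_ASYNC) != 0;
}
```

An upstream created with both the TLS and async bits set would keep TLS enabled after `sketch_disable_async` and only lose the async bit.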

@leonardo-albertovich
Collaborator

Yes, due to some constraints in the original design, the async DNS client aborts after the first resolution error. A refactor is on the way, but we don't have an ETA yet. If you are interested in contributing, let me know; I wrote the code, so I can probably help if you have any questions.

@PettitWesley
Contributor

@leonardo-albertovich How come the setting I see here doesn't fix it: https://github.com/fluent/fluent-bit/blob/1.9/src/flb_upstream.c#L43

Does that not do anything?

@leonardo-albertovich
Collaborator

What that setting does is select between c-ares, which is asynchronous, and the default system resolver, which is synchronous. If you can accept those DNS queries blocking, it should be fine. If you need to minimize the overhead, you can use a non-authoritative local caching DNS server, as most modern distributions do (in the end that would behave the same for both async and sync, since the real query wouldn't be performed by fluent-bit).

@PettitWesley
Contributor

So we determined that setting this is a valid workaround for the problem:

net.dns.mode LEGACY

For Windows container users, all outputs will need this option set.
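In a classic-format configuration file, the workaround might look like the sketch below. It mirrors the http output from the reproduction steps; the `Match *` pattern is an assumption, and every other output would need the same `net.dns.mode` line.

```
[OUTPUT]
    Name          http
    Match         *
    Host          www.google.com
    Port          443
    tls           On
    tls.verify    Off
    net.dns.mode  LEGACY
```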

I am wondering if we could contribute a new environment variable that acts as a global setting for net.dns.mode, or a Service-level setting: something that gives a better user experience because the config is only set once.

We could hide this new setting or env var behind a new CMake flag that defaults to Off. For example, I could enable it in the AWS distro, and if the rest of the community does not want the setting, it won't be built by default.

@leonardo-albertovich
Collaborator

net.dns.mode can be set in the [SERVICE] section and overridden on a per-plugin basis if desired; all of the DNS settings support that.

jta added a commit to observeinc/manifests that referenced this issue Sep 16, 2022
@github-actions github-actions bot added the Stale label Nov 23, 2022
@fluent fluent deleted a comment from github-actions bot Nov 28, 2022
@PettitWesley PettitWesley added help wanted Windows Bugs and requests about Windows platforms long-term Long term issues (exempted by stale bots) exempt-stale and removed Stale status: waiting-for-triage labels Nov 28, 2022
@elafontaine

Why is this flagged as Windows? I had the same problem on Flatcar, so I'm assuming this applies to any OS, right?
