dns resolution for plugins using async mode does not consider all the DNS servers on the host #5862

rawahars · 2022-08-10T08:43:11Z

Bug Report

Describe the bug

In fluent-bit, plugins can use async mode to improve performance. This is the default setting, so many plugins use it.

Consider a host with multiple DNS servers [x.x.x.x, y.y.y.y] where the first/primary DNS server (x.x.x.x) cannot resolve the endpoint but the secondary DNS server (y.y.y.y) resolves it correctly. In such cases, plugins using async mode attempt DNS resolution with the primary DNS server only and fail without ever trying the secondary.

The resulting error is:

[ warn] [net] getaddrinfo(host='www.google.com', err=12): Timeout while contacting DNS servers

This is not how DNS resolution should work. The expected behaviour is to try every server in the DNS server list before reporting an error.

Note: This scenario works correctly for plugins where async mode is disabled.
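
As a diagnostic aid, the expected fallback order can be sketched in shell. This is a minimal sketch and not fluent-bit's resolver logic: the function names `list_nameservers`, `query_server`, and `resolve_with_fallback` are made up for illustration, and `query_server` assumes `dig` is installed. It tries each nameserver from a resolv.conf-style file in order and stops at the first one that answers.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of fluent-bit.

# Print the nameserver addresses from a resolv.conf-style file, in file order.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Ask one specific server for a name; succeeds only if dig prints an answer.
# Assumes dig is available; lines starting with ';' are dig diagnostics.
query_server() {
    dig +short +time=2 +tries=1 "@$1" "$2" 2>/dev/null | grep -qv '^;'
}

# Try every configured server in order and stop at the first success.
resolve_with_fallback() {
    host=$1
    conf=${2:-/etc/resolv.conf}
    for ns in $(list_nameservers "$conf"); do
        if query_server "$ns" "$host"; then
            echo "resolved $host via $ns"
            return 0
        fi
        echo "no answer for $host from $ns, trying next server" >&2
    done
    echo "all DNS servers failed for $host" >&2
    return 1
}
```

Running `resolve_with_fallback www.google.com` on an affected host shows which configured server, if any, can answer the query, which helps confirm that the failure is specific to the async resolver.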

To Reproduce

Example log message if applicable:

[ warn] [net] getaddrinfo(host='www.google.com', err=12): Timeout while contacting DNS servers

Steps to reproduce the problem:

The issue can be reproduced with the following steps:
- Create a new Linux VM or just create a new container using cr.fluentbit.io/fluent/fluent-bit:1.9.6-debug
```
docker run -it cr.fluentbit.io/fluent/fluent-bit:1.9.6-debug bash
```
- Inside the container or VM, change the DNS setting to include an invalid DNS Server as the primary DNS server.
```
vi /etc/resolv.conf
# Change the file to include a new invalid nameserver. For example-
# search us-west-2.compute.internal
# nameserver 10.0.0.6 --> Invalid DNS Server
# nameserver 10.0.0.2 --> Valid DNS Server
```
- Start fluent-bit with http output plugin to send logs to a remote server.
```
export FLB_LOG_LEVEL=debug
./fluent-bit -i dummy -o http://www.google.com:443 -p tls=on -p tls.verify=off
```
  Using http://www.google.com:443 is the easiest way to reproduce the issue, because fluent-bit fails at DNS resolution before anything else happens. The same issue applies to the Kinesis Streams and Kinesis Firehose plugins as well.
```
./fluent-bit -i dummy \
-o firehose \
-p "region=us-west-2" \
-p "delivery_stream=example-stream"
```
  We could not use an HTTP benchmarking server on localhost, because the repro needs to exercise DNS resolution. Such a server can be set up on another machine and used here to reproduce the issue.
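
For a scripted repro, the resolv.conf edit from the second step can also be done without an editor. The following is a minimal sketch: `write_repro_resolv_conf` is a hypothetical helper name, and the addresses are the example values from the step above (10.0.0.6 as the invalid server, 10.0.0.2 as the valid one), which would need to be adapted to the actual environment.

```shell
#!/bin/sh
# Sketch only: write_repro_resolv_conf is a hypothetical helper name.
# Usage: write_repro_resolv_conf /etc/resolv.conf
write_repro_resolv_conf() {
    target=$1
    if [ -z "$target" ]; then
        echo "usage: write_repro_resolv_conf <path>" >&2
        return 2
    fi
    # Keep a backup so the original DNS settings can be restored afterwards.
    if [ -f "$target" ]; then
        cp "$target" "$target.bak"
    fi
    # Overwrite the file contents in place.
    # The first nameserver is the invalid one, the second is the valid one.
    cat > "$target" <<'EOF'
search us-west-2.compute.internal
nameserver 10.0.0.6
nameserver 10.0.0.2
EOF
}
```

Calling `write_repro_resolv_conf /etc/resolv.conf` inside the container or VM applies the change, and copying the `.bak` file back restores the previous settings.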

Expected behavior

DNS resolution should happen with the secondary DNS Server before erroring out.

For the first example (www.google.com), the logs cannot be delivered because it is not a valid destination, so we get a 405 error from the Google server. Receiving that response means that DNS resolution worked.

Screenshots

Your Environment

Version used: 1.9.6
Configuration: Dummy input plugin and any output plugin using http endpoint and async mode
Environment name and version (e.g. Kubernetes? What version?): Docker
Server type and version:
Operating System and version: Linux and Windows OS
Filters and plugins: Dummy input plugin and any output plugin using http endpoint and async mode

Additional context

rawahars · 2022-08-10T08:48:51Z

Based on my debugging, in the failing case (plugins using async mode) we follow the codepath here. This calls flb_net_getaddrinfo, which in turn uses the c-ares library for DNS resolution.

In the happy path, when using plugins with async mode disabled, the codepath followed is this one. This calls the getaddrinfo API for DNS resolution.

I also tried compiling the plugins with async mode disabled, using upstream->flags &= ~(FLB_IO_ASYNC);. The plugins then work properly, with DNS resolution succeeding via the secondary DNS server.

leonardo-albertovich · 2022-08-10T11:48:35Z

Yes, due to some constraints in the original design, the async DNS client aborts after the first resolution error. A refactor is on the way, but we don't have an ETA yet. If you are interested in contributing, let me know. I wrote the code, so I can probably help if you have any questions.

PettitWesley · 2022-08-12T19:20:44Z

@leonardo-albertovich How come the setting I see here doesn't fix it: https://github.com/fluent/fluent-bit/blob/1.9/src/flb_upstream.c#L43

Does that not do anything?

leonardo-albertovich · 2022-08-12T20:15:33Z

That setting selects between c-ares, which is asynchronous, and the default system resolver, which is synchronous. If you can accept those DNS queries blocking, it should be fine. If you need to minimize the overhead, you can use a non-authoritative local caching DNS server, as most modern distributions do (in the end that would behave the same for both async and sync, since the real query wouldn't be performed by fluent-bit).

PettitWesley · 2022-08-23T21:32:51Z

So we determined that setting this is a valid workaround for the problem:

net.dns.mode LEGACY

For Windows container users, all outputs will need this option set.
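
Applied to the repro from the bug report, the workaround might look like the following. This is a sketch under assumptions: the file name is arbitrary, the classic-format keys mirror the command-line options used in the repro, and only `net.dns.mode LEGACY` itself comes from this thread.

```shell
#!/bin/sh
# Sketch only: the file name is an arbitrary choice.
# net.dns.mode LEGACY selects the synchronous system resolver for this output.
cat > fluent-bit-legacy-dns.conf <<'EOF'
[INPUT]
    Name  dummy

[OUTPUT]
    Name          http
    Match         *
    Host          www.google.com
    Port          443
    tls           On
    tls.verify    Off
    net.dns.mode  LEGACY
EOF

# Then start fluent-bit with the generated file, for example:
#   ./fluent-bit -c fluent-bit-legacy-dns.conf
```

Every output that has to resolve a hostname would need the same `net.dns.mode` line in its own [OUTPUT] section.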

I am wondering if we could consider contributing a new environment variable that is like a global setting for net.dns.mode or a Service level setting. Something to provide a better user experience so that the config is only set once.

We could hide this new setting or env var behind a new CMake flag which would default to Off. So for example, I can enable it in the AWS distro, but if the rest of the community does not want to use the setting, it won't be built by default.

leonardo-albertovich · 2022-08-24T07:09:17Z

net.dns.mode can be set in the [SERVICE] section and overridden on a per-plugin basis if desired. All of the DNS settings support that.

elafontaine · 2024-01-05T14:32:15Z

Why is this flagged as Windows? I had the same problem on Flatcar, so I'm assuming this applies to any OS, right?

rawahars added the status: waiting-for-triage label Aug 10, 2022

rawahars mentioned this issue Aug 15, 2022

Added support for running integration tests on Windows. aws/aws-for-fluent-bit#411

Merged

jta added a commit to observeinc/manifests that referenced this issue Sep 16, 2022

fix(logs): fallback to LEGACY dns resolver

e79bbb9

Workaround for fluent/fluent-bit#5862

jta added a commit to observeinc/manifests that referenced this issue Sep 16, 2022

fix(logs): fallback to legacy DNS resolver

29ebcdd

Workaround for fluent/fluent-bit#5862

jta mentioned this issue Sep 16, 2022

fix(logs): fallback to legacy DNS resolver observeinc/manifests#68

Merged

jta added a commit to observeinc/manifests that referenced this issue Sep 16, 2022

fix(logs): fallback to legacy DNS resolver (#68)

9231edb

Workaround for fluent/fluent-bit#5862

github-actions bot added the Stale label Nov 23, 2022

fluent deleted a comment from github-actions bot Nov 28, 2022

PettitWesley added the help wanted, Windows, long-term, and exempt-stale labels and removed the Stale and status: waiting-for-triage labels Nov 28, 2022
