Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core/dns): new dns client library #12305

Merged
merged 126 commits into from
Jul 16, 2024
Merged

feat(core/dns): new dns client library #12305

merged 126 commits into from
Jul 16, 2024

Conversation

chobits
Copy link
Contributor

@chobits chobits commented Jan 8, 2024

Summary

This feature will only be considered for default activation in version 4.0 at the earliest. For now, it is just an alternative to the DNS library, so we are focusing on making it complete and simple. Compatibility with the old DNS library is not being considered at this time.

  • Introduced global caching for DNS records across workers, significantly reducing the query load on DNS servers.
  • Introduced observable statistics for the new DNS client, and a new Admin API /status/dns to retrieve them.
  • Deprecated the dns_no_sync option. Multiple DNS queries for the same name will always be synchronized (even across workers). This remains functional with the legacy DNS client library.
  • Deprecated the dns_not_found_ttl option. It uses the dns_error_ttl option for all error responses. This option remains functional with the legacy DNS client library.
  • Deprecated the dns_order option. By default, SRV, A, and AAAA are supported. Only names in the SRV format (_service._proto.name) enable resolving of DNS SRV records.

Checklist

Issue reference

Fix KAG-3220

For reviewers

  1. README.md (include APIs and logic of _M:resolve())
  2. current error report from the upper caller balancer subsystem:
2024/02/28 16:35:23 [error] 77335#0: *1336 [kong] init.lua:405 [aws-lambda] Lambda:invoke() failed to connect to 'https://lambda.ab-cdef-1.amazonaws.com:10001': 
[cosocket] DNS resolution failed: dns server error: 3 name error. 
Tried: [["lambda.ab-cdef-1.amazonaws.com:A","dns server error: 3 name error"]], 
client: 127.0.0.1, server: kong, request: "GET /get?key1=some_value1 HTTP/1.1", host: "lambda15.test", request_id: "3333b77722d0bc23a44fad31953ccf0b"

if query DNS server failed, also added "Query Time: ms" into error log, like

DNS server error: failed to receive reply from UDP server 198.51.100.0:53: timeout, Query Time: 136ms
  1. Design doc: link

@thibaultcha
Copy link
Member

thibaultcha commented Jan 26, 2024

Overall I cannot stress hard enough that mlcache users in an application as complex as Kong should only use :get(), which is an extremely, extremely powerful, clean, and performant interface that should (if properly used) cover the great majority of caching use-cases and guarantee fantastic performance, flexibility, and stability. Most of the mlcache problems that I have seen in the Gateway (in the past or recently) and other large applications that use it at other companies are actually due to improper usage of the library. Only using :get() (or :get_bulk()) should be enough, and should guarantee very few errors, and if there are errors they can easily be mitigated with fallbacks and/or retries in the code itself. I strongly feel like this should be heavily taught to mlcache users especially Kong Gateway developers who often misunderstand how mlcache works and how it should be used.

I am afraid that by misusing mlcache, developers will slowly lose trust in the library (at their own fault) and thus try to rewrite a solution that will eventually not be as stable and powerful as mlcache really is if mlcache is used properly. This is a dangerous pattern that always had me concerned and this concern keeps increasing as I keep seeing mlcache being misused and incorrectly blamed. Perhaps the documentation is not clear enough that the write operations (set(), delete(), update()) should only be used when users really know what they are doing. An mlcache user who truly needs these write functions and who really knows what they are doing would not use these functions without also implementing opts.ipc and using a non-message-based IPC mechanism, which is the very minimum for using these write functions in a production OpenResty application. But again, with most use-cases only get() or :get_bulk() should be necessary, and they are written to be extremely efficient, 100% JIT-compatible, very flexible, and most of all extremely resilient to spurious errors. Only using :get() avoids the need for any IPC, which is the prominent source of errors encountered by users and developers, and the original primary goal behind mlcache development. Since the errors originate from the "mlcache" namespace, users tend to blame mlcache when in fact the library is actually improperly used and mlcache is not at fault. The logic of mlcache is extremely sound and has no known-bug (besides minor ones such as the l1_serializer recently reported), most of these errors are due to improperly configured timeouts, or improper use and/or over-confidence in the default bundled IPC mechanism (because it is messaged-based and not suitable to production applications).

If I were still working in the Gateway, I would absolutely forbid the use of mlcache write functions, perhaps even going as far as removing them from the module when it is being loaded. Either that, or I would make sure to provide an extremely resilient IPC mechanism that can be attached to all mlcache instances instead of the default messaged-based shm mechanism, as documented in the mlcache docs.

@Tieske
Copy link
Member

Tieske commented Jul 11, 2024

  • Introduced global caching for DNS records across workers, significantly reducing the query load on DNS servers.

This can be switched off, by reverting to the old client right?

  • Introduced observable statistics for the new DNS client, and a new Admin API /status/dns to retrieve them.

Nice

  • Deprecated the dns_no_sync option. Multiple DNS queries for the same name will always be synchronized (even across workers). This remains functional with the legacy DNS client library.

This was added as an explicit request from @subnetmarco in a very early version. But I haven't seen it used. Other than during some debugging sessions. So to me this seems ok to remove.

  • Deprecated the dns_not_found_ttl option. It uses the dns_error_ttl option for all error responses. This option remains functional with the legacy DNS client library.

I mentioned it before on confluence somewhere, but this is a bad idea imho. dns_error_ttl (when the server responds with some internal error) should be short (1-2 secs), whereas dns_not_found_ttl should be long (at least 30-60 seconds) because it requires user interaction to resolve.

This is mainly because the former is to be looked at by Kong admins, but the latter will mostly be generated by user errors. Scenario:

  • error ttl is 1 second
  • a user in a big enterprise somewhere pushes a bad hostname through a pipeline and it gets deployed
  • all Kong nodes in the cluster now start to throw errors in the logs every 1 second
  • nothing the Kong admin can do about it, he/she most likely doesn't even know who owns the bad config anyway.
  • Deprecated the dns_order option. By default, SRV, A, and AAAA are supported. Only names in the SRV format (_service._proto.name) enable resolving of DNS SRV records.

This seems like a valid change, but also a breaking change, yet I have no idea how much impact it is going to have. So also here, revert to the old client if it breaks right?

@chobits
Copy link
Contributor Author

chobits commented Jul 11, 2024

Hi @Tieske

...
This seems like a valid change, but also a breaking change, yet I have no idea how much impact it is going to have. So also here, revert to the old client if it breaks right?

In terms of compatibility, following discussions with Datong and Saju, this one is aimed at being simple and usable, not needing to be fully compatible with the original one. It will only be switched to when users encounter issues with the original DNS client now, and it won't be enabled by default until at least version 4.0.

To achieve the goal, it directly mirrors the logic of the nginx/openresty/Golang DNS resolver, only keeping a few necessary features, such as stale ttl, error ttl and resolv.conf/hosts parsing.

When issues arise, we can easily explain to customers that the implementation of other industry software is the same way. For known or potential issues, we don't plan on making further fixes. Instead, we'll let users who need those features/bugfixes switch directly to the new one.

Copy link
Member

@dndx dndx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the implementation is clear, logical and easy to follow. Good job @chobits .

There are some questions and suggestions however, but nothing too concerning from my eyes.

kong/dns/README.md Outdated Show resolved Hide resolved
kong/dns/README.md Outdated Show resolved Hide resolved
kong/dns/README.md Show resolved Hide resolved
kong/dns/client.lua Show resolved Hide resolved
kong/dns/client.lua Outdated Show resolved Hide resolved
kong/dns/utils.lua Show resolved Hide resolved
kong/dns/utils.lua Outdated Show resolved Hide resolved
kong/dns/utils.lua Show resolved Hide resolved
kong/dns/utils.lua Show resolved Hide resolved
kong/templates/nginx_kong.lua Outdated Show resolved Hide resolved
* Record: ~80 bytes json string, e.g., `{address = "127.0.0.1", name = <domain>, ttl = 3600, class = 1, type = 1}`.
* Domain: ~36 bytes string, e.g., `example<n>.long.long.long.long.test`. Domain names with lengths between 10 and 36 bytes yield similar results.

The results of ) are as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is )? A mistake?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to fix it in #13389

@dndx dndx merged commit fd10d6e into master Jul 16, 2024
27 checks passed
@dndx dndx deleted the refactor/dns_client branch July 16, 2024 21:24
@team-gateway-bot
Copy link
Collaborator

Cherry-pick failed for master, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git remote add upstream https://github.com/kong/kong-ee
git fetch upstream master
git worktree add -d .worktree/cherry-pick-12305-to-master-to-upstream upstream/master
cd .worktree/cherry-pick-12305-to-master-to-upstream
git checkout -b cherry-pick-12305-to-master-to-upstream
ancref=$(git merge-base bc27ffd414fae5c18a84dbf3c53ee4bba2b4cc8a 1f0bc17dc388a8285cb6e4993e2d6ce0533dfe6f)
git cherry-pick -x $ancref..1f0bc17dc388a8285cb6e4993e2d6ce0533dfe6f

@github-actions github-actions bot added the incomplete-cherry-pick A cherry-pick was incomplete and needs manual intervention label Jul 16, 2024
@AndyZhang0707 AndyZhang0707 removed the incomplete-cherry-pick A cherry-pick was incomplete and needs manual intervention label Jul 23, 2024
ADD-SP pushed a commit that referenced this pull request Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.