bug: dns resolution did not resume immediately after the dns server resumes #10093
Comments
two upstreams:
the first upstream did not resume immediately (Lines 422 to 445 in 7770483)
and it seems the (Lines 210 to 239 in 7770483)
the comment (from ce4d8fb)
while the apisix/apisix/utils/upstream.lua (Lines 90 to 112 in 7770483)
|
Hi, thank you for your report. Could you give us a more detailed description of your configuration?
|
I have reproduced it @jiangfucheng, by just mocking a dns failure in lua (not 100%, you should retry a few times). apisix config:
register route:
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
            not ipmatcher.parse_ipv6(host) then
            -- mock: until the deadline, randomly pretend the DNS query returned no records
            if host == "httpbin.org" and ngx_now() <= 1693542089 then
                if random(1,10) % 2 == 0 then
                    core.log.error("101 empty records")
                    return new_nodes
                end
                -- return nil, "failed to query the DNS server: timeout11"
            end
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end
            -- ... (rest of the original function unchanged)
do requests
the log: when 503 happens, there are no dns error logs
the diff: (screenshot) |
@wklken Hi, could you paste the APISIX configuration? It's hard to reproduce this bug without the detailed configuration. |
@jiangfucheng I'm still trying to reproduce it in the apisix docker-compose, no success yet. I will update the issue when I succeed. (it can only be reproduced on our image built from |
@jiangfucheng Finally, it has been reproduced on the apisix docker-compose. It took a lot of time, and I added some scripts to help. Here are the steps; can you please help to investigate this? (this is the problem that is blocking our production release)
Reproduce steps:
1. use the docker-compose
# conf/config.yaml
apisix:
  node_listen: 9080 # APISIX listening port
  enable_ipv6: false
  router:
    http: radixtree_uri_with_parameter
nginx_config:
  worker_processes: 4

# docker-compose.yaml (excerpt)
services:
  apisix:
    image: "apache/apisix:3.2.1-centos"
2. enter the container and make some changes to apisix/utils/upstream.lua to mock the dns failure
-- add at line 22 of apisix/utils/upstream.lua
local random = math.random

-- add the mock right before the core.resolver.parse_domain(host) call,
-- so the resulting function looks like this:
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
            not ipmatcher.parse_ipv6(host) then
            if host == "httpbin.org" and ngx_now() <= 1695089345 then
                if random(1,10) % 2 == 0 then
                    core.log.error("101 empty records")
                    return new_nodes
                end
            end
            local ip, err = core.resolver.parse_domain(host)
#!/bin/bash
# create.sh: register the service and the route via the Admin API
API_KEY="edd1c9f034335f136f87ad84b625c8f1"
ROUTE_ID="dns_route"
SERVICE_ID="dns_service"

curl http://127.0.0.1:9180/apisix/admin/services/${SERVICE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
    "upstream": {
        "nodes": [
            {
                "host": "httpbin.org",
                "port": 80,
                "weight": 100,
                "priority": 1
            }
        ],
        "type": "roundrobin",
        "scheme": "http",
        "pass_host": "node"
    }
}'
curl http://127.0.0.1:9180/apisix/admin/routes/${ROUTE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
    "uri": "/api/test/prod/dns22",
    "methods": [
        "GET"
    ],
    "plugins": {
        "proxy-rewrite": {
            "method": "GET",
            "uri": "/get"
        }
    },
    "upstream": {
        "nodes": [
            {
                "host": "httpbin.org",
                "port": 80,
                "weight": 100,
                "priority": 1
            }
        ],
        "type": "roundrobin",
        "scheme": "http",
        "pass_host": "node"
    },
    "service_id": "dns_service",
    "status": 1
}'
#!/bin/bash
# refresh the mock deadline (now + 30s), reload apisix and re-create the service/route
now=$(date "+%s")
echo "now is: ${now}"
A=$(date -d "+30 seconds" "+%s")
echo "will change the condition to <= ${A}"
sed -i -r "s/<= ([0-9]+) then/<= ${A} then/g" apisix/utils/upstream.lua
echo $?
echo "change done"
apisix reload
bash -x create.sh
3. outside the container, add the check script
#!/bin/bash
date
url="http://0.0.0.0:9080/api/test/prod/dns22"
echo "start bench wrk"
wrk -c2 -t2 -d35s ${url}
date
echo "sleep 5 s"
sleep 5
echo "check the status code 10 times"
for ((i=1; i<=10; i++))
do
status_code=$(curl --write-out %{http_code} --silent --output /dev/null $url)
echo "status=$status_code"
if [ "${status_code}" -eq "503" ]
then
echo "503 show"
exit
fi
done

4. reproduce
|
More clues for diagnosing the issue:
|
I added some logs, and the screenshots make it clearer: it will never recover from 503 until (screenshot); at apisix/init.lua (screenshot); and it would be assigned to empty nodes here (screenshot): apisix/apisix/utils/upstream.lua Lines 64 to 87 in 9b2031a
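To make that failure mode concrete, here is a self-contained toy in plain Lua (not APISIX source; the names are made up): a helper that returns an empty table on DNS failure produces a result that is indistinguishable from "this domain has no records", so whatever consumes it ends up with zero nodes and the balancer can only report "no valid upstream node".

-- toy only: mimics a resolver helper that returns {} instead of nil + err on failure
local function resolve_nodes_toy(nodes, dns_ok)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        if not dns_ok then
            return new_nodes            -- failure path: empty table, looks like a valid result
        end
        new_nodes[#new_nodes + 1] = { host = "34.0.0.1", port = node.port }
    end
    return new_nodes
end

local configured = { { host = "httpbin.org", port = 80 } }
local resolved = resolve_nodes_toy(configured, false)   -- DNS is down
print(#resolved)  --> 0: nothing is left for the load balancer to pick from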
@jiangfucheng @Revolyssup Could someone please verify this? |
@wklken Thanks for your detailed description. Based on your description, I think there are two issues that need to be answered.
apisix/apisix/utils/upstream.lua Lines 64 to 86 in 9b2031a
I think it was caused by the code above. For the above two issues, I think we can just fix the return value in that function. For other issues, I will continue to debug the code and then give my opinion. Also, I have a question: in the above description, you commented ... (I have been quite busy with work recently, and I will help you troubleshoot this issue as soon as possible in my spare time) |
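For illustration, a minimal sketch of what "fix the return value" could look like (an assumption about the shape of the fix, not an actual change): return nil plus an error on failure, so the caller can tell a failed lookup apart from a domain with no records and keep its last known-good nodes.

-- sketch: make "resolution failed" distinguishable from "domain has no records"
local function parse_nodes_sketch(nodes, resolve)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        local ip, err = resolve(node.host)
        if not ip then
            -- previously: return new_nodes  (an empty table that looks like a valid result)
            return nil, "failed to resolve " .. node.host .. ": " .. (err or "unknown")
        end
        new_nodes[#new_nodes + 1] = { host = ip, port = node.port, domain = node.host }
    end
    return new_nodes
end

-- caller side: fall back to the cached nodes when resolution fails
local cached = { { host = "34.0.0.1", port = 80 } }
local nodes, err = parse_nodes_sketch({ { host = "httpbin.org", port = 80 } },
                                      function() return nil, "timeout" end)  -- simulated DNS failure
if not nodes then
    nodes = cached          -- keep serving with the last known-good addresses
end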
@jiangfucheng, in the log you can see that @wklken has added, I think, some error logs on his side. @wklken can you show the snippet of |
@jiangfucheng The only caching culprit I see currently is this:
But the modification of _M.user_routes is only done directly by etcd. |
@wklken Just to confirm, it only happens when the route has a service id, right? |
@wklken I think parse_domain_for_nodes is called in subsequent requests, but there are no nodes to iterate over, so you don't get the log. |
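(A tiny self-contained check of that point, not APISIX code: iterating an empty nodes table never enters the loop body, so a log call placed inside it can never fire.)

local nodes = {}                         -- what the route ends up holding
for _, node in ipairs(nodes) do
    print("resolving ", node.host)       -- never reached while nodes is empty
end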
Yes, the route has an upstream and has a service_id (the service has an upstream too) |
You are right, it's my mistake, |
Agree with this opinion. |
@wklken In the first request, did you get a log similar to |
My log level is |
@wklken Before empty nodes came, did you find this? What was the last time this log came? |
@Revolyssup 1 second before the empty nodes |
cc: @kingluo |
@jiangfucheng we need your help to confirm whether that fix would be right, thanks. (for our production, we need to reduce the risk) According to your answer above, can we just add a check before
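For illustration only, a minimal sketch of the kind of guard being asked about here (hypothetical names, not the actual patch): only overwrite the previously resolved nodes when the new resolution actually produced something.

-- hypothetical guard around the result of a parse_domain_for_nodes-style helper
local function apply_resolved_nodes(upstream, new_nodes)
    if new_nodes == nil or #new_nodes == 0 then
        -- resolution failed or returned nothing: keep the last known-good nodes
        return upstream.nodes, "dns returned no nodes, keeping previous ones"
    end
    upstream.nodes = new_nodes
    return new_nodes
end

-- usage sketch
local upstream = { nodes = { { host = "34.0.0.1", port = 80 } } }
local nodes, warn = apply_resolved_nodes(upstream, {})   -- simulate an empty DNS result
print(#nodes, warn)   --> 1   dns returned no nodes, keeping previous ones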
|
@Revolyssup the issue should not be closed before the official repo fixes the bug, right? And we still don't know the root cause. Our own repo needs a fix for production, so we patched it. It will resolve the |
@wklken Yep, I understand. The patch still doesn't fix the original issue; it's kind of a hack |
@sheharyaar ok, I will recheck it later |
@sheharyaar Unfortunately, the bug is not fixed by #10722. Follow the reproduction steps here: #10093 (comment)
|
Ok, I will have a look |
@wklken I have retried the reproduction steps several times but still cannot reproduce the issue. I do not get 503 after the normal operation is resumed. |
Today I tried to reproduce this issue in all the ways possible, along with the exact steps given in the issue, but the problem doesn't occur, and from the code logic it also doesn't look like the error can occur. The parse_domain_for_nodes function will never get an empty nodes list, which seems to be what causes the problem. route.value.upstream.nodes is never directly updated, so parse_domain_for_nodes will always get httpbin.org in nodes, and as soon as the dns server resumes, the node value is retrieved and route.dns_value is updated. |
@wklken I have followed your exact reproduction steps multiple times but I never got the error. |
@Revolyssup did you follow the steps here: #10093 (comment)? And the apisix version should be 3.2.1; we just patched it with https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch, it has been in our production for about 10 months, and it works. |
Yes, I had followed the steps. |
@wklken Just want to confirm the behaviour after this patch. Basically you have covered the case that nodes will never be set as empty. |
It’s always showing a 503 error even if the DNS is okay. |
@wklken I meant to ask the behavior in your patch. After your patch, you do not get the 503 after dns server resumes. Right? |
@Revolyssup Yes, with the patch, no 503 shows |
I haven't been able to reproduce it, but from the error, from the way your patch fixes it, and from the logs you showed above, here is the conclusion so far: somehow the nodes in the original saved configuration are set to empty. Since this goes away after a reload, etcd still has the correct value. Since I can't reproduce it, I am currently going through the code to find the root cause. |
A cache issue is unlikely because api_ctx is newly created for each request. So for the next request, a new api_ctx will be created, and the matched route should contain the correct route configuration with the nodes field. |
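To picture that argument with toy code (not APISIX internals): the per-request api_ctx itself cannot carry anything over between requests; stale data could only survive in a longer-lived shared object that the ctx merely points at, such as a cached route or upstream.

local cached_route = { nodes = { { host = "httpbin.org", port = 80 } } }  -- long-lived, shared

local function handle_request()
    local api_ctx = {}                      -- fresh table for every request
    api_ctx.matched_route = cached_route    -- but it points at shared, cached state
    return api_ctx
end

-- a write that only touches api_ctx is gone on the next request;
-- a write that reaches the shared route persists across requests:
handle_request().matched_route.nodes = {}
print(#handle_request().matched_route.nodes)   --> 0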
Since this cannot be reproduced after trying multiple times, I am closing the issue. If anyone else is able to reproduce this bug, please reopen it with the exact steps you followed. |
Current Behavior
In our production, the dns server reloads frequently.
While the dns resolution is not ok, all the requests fail with:
failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received
But when the dns server is stable again, the dns resolution of apisix does not resume immediately; the requests get the error
[lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node
for hours, until we do apisix reload, and then the dns turns ok.
And from the log: on one apisix instance with 4 worker processes, 3 processes got the dns failure and 1 process got no errors at all; all of its requests are ok.
How does client.resolve in apisix/core/dns/client.lua detect that the dns server is ok again?
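To make the question concrete, here is what a raw lookup looks like with OpenResty's lua-resty-dns (illustration only; APISIX's core.dns.client wraps a resolver like this and adds caching and its own error strings, and the 127.0.0.1 nameserver below is just an assumption). It must run inside an OpenResty request context because it uses cosockets.

local resolver = require("resty.dns.resolver")

local r, err = resolver:new{
    nameservers = { "127.0.0.1" },   -- the DNS server that is being reloaded (assumed address)
    retrans = 3,                     -- retransmissions on timeout
    timeout = 2000,                  -- 2s per attempt
}
if not r then
    ngx.log(ngx.ERR, "failed to create resolver: ", err)
    return
end

local answers, err = r:query("httpbin.org", { qtype = resolver.TYPE_A })
if not answers then
    -- transport-level failure, e.g. a query timeout
    ngx.log(ngx.ERR, "failed to query the DNS server: ", err)
    return
end
if answers.errcode then
    -- the server answered, but with an error or no usable records
    ngx.log(ngx.ERR, "dns server error: ", answers.errcode, ": ", answers.errstr)
    return
end
for _, ans in ipairs(answers) do
    ngx.say(ans.name, " ", ans.address or ans.cname)
end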
Expected Behavior
The dns resolution resumes immediately.
Error Logs
when the dns server is reloading
when the dns server is stable (no dns error logs, why?)
So, when the dns server is stable, parse_domain succeeds with no error logs, but handle_upstream still gets "no valid upstream node"; and after apisix reload, there are no error logs. Does set_upstream fail?
Steps to Reproduce
For now, I cannot reproduce it, but it has occurred three times in our production.
Any clues or advice on making the dns server reload frequently?
Environment
apisix version: 3.2.1
uname -a:
openresty -V or nginx -V: nginx version: openresty/1.21.4.1
curl http://127.0.0.1:9090/v1/server_info:
luarocks --version: