
bug: dns resolution did not resume immediately after the dns server resume #10093

Closed
wklken opened this issue Aug 28, 2023 · 43 comments
Labels: bug (Something isn't working), can't reproduce

@wklken

wklken commented Aug 28, 2023

Current Behavior

In our production environment, the dns server reloads frequently.
While the dns server is reloading, all requests fail and dns resolution reports: failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But after the dns server becomes stable again, dns resolution in apisix does not resume immediately: requests keep failing with [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we run apisix reload, after which dns works again.

From the logs, on one apisix instance with 4 worker processes, 3 processes kept failing on dns while 1 process got no errors and served all requests fine.


apisix/core/dns/client.lua

function _M.resolve(self, domain, selector)
    local client = self.client

    local answers, err = client.resolve(domain)
    if not answers then
        return nil, "failed to query the DNS server: " .. err
    end

How does client.resolve detect that the dns server is ok again?

Expected Behavior

dns resolution should resume immediately once the dns server is back

Error Logs

when the dns server is reloading

[lua] upstream.lua:79: parse_domain_for_nodes(): dns resolver domain: test.com error: failed to query the DNS server: dns client error: 101 empty record received
[lua] resolver.lua:80: parse_domain(): failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received

when the dns server is stable (no dns error logs, why?)

[lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node

So when the dns server is stable, parse_domain succeeds (no error logs), but handle_upstream still gets no valid upstream node.

And after apisix reload, there are no error logs at all. Two hypotheses:

  1. Does the dns client parse succeed but return empty new_nodes? (a cache?)
  2. Does the dns client parse succeed and return new_nodes, but set_upstream fails?

Steps to Reproduce

For now, I cannot reproduce it, but it has occurred three times in our production.
Any clues or advice on how to make the dns server reload frequently?

Environment

  • APISIX version (run apisix version): 3.2.1
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V): nginx version: openresty/1.21.4.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@wklken
Author

wklken commented Aug 31, 2023

There are two upstreams:

  1. route.upstream
  2. route.service_id -> service.upstream

The first one (route.upstream) did not resume immediately.

apisix/apisix/init.lua

Lines 422 to 445 in 7770483

if up_id then
    local upstream = apisix_upstream.get_by_id(up_id)
    if not upstream then
        if is_http then
            return core.response.exit(502)
        end

        return ngx_exit(1)
    end

    api_ctx.matched_upstream = upstream
else
    if route.has_domain then
        local err
        route, err = parse_domain_in_route(route)
        if err then
            core.log.error("failed to get resolved route: ", err)
            return core.response.exit(500)
        end

        api_ctx.conf_version = route.modifiedIndex
        api_ctx.matched_route = route
    end

and it seems parse_domain_in_route does not notice that the dns server has recovered

apisix/apisix/init.lua

Lines 210 to 239 in 7770483

local function parse_domain_in_route(route)
    local nodes = route.value.upstream.nodes
    local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
    if not new_nodes then
        return nil, err
    end

    local up_conf = route.dns_value and route.dns_value.upstream
    local ok = upstream_util.compare_upstream_node(up_conf, new_nodes)
    if ok then
        return route
    end

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed
    local parent = route.value.upstream.parent
    if parent then
        route.value.upstream.parent = nil
    end
    route.dns_value = core.table.deepcopy(route.value)
    if parent then
        route.value.upstream.parent = parent
        route.dns_value.upstream.parent = parent
    end

    route.dns_value.upstream.nodes = new_nodes
    core.log.info("parse route which contain domain: ",
                  core.json.delay_encode(route, true))
    return route
end

the comment (from ce4d8fb)

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed

while parse_domain_in_up does modify the modifiedIndex:

function _M.parse_domain_in_up(up)
    local nodes = up.value.nodes
    local new_nodes, err = parse_domain_for_nodes(nodes)
    if not new_nodes then
        return nil, err
    end

    local ok = compare_upstream_node(up.dns_value, new_nodes)
    if ok then
        return up
    end

    if not up.orig_modifiedIndex then
        up.orig_modifiedIndex = up.modifiedIndex
    end
    up.modifiedIndex = up.orig_modifiedIndex .. "#" .. ngx_now()

    up.dns_value = core.table.clone(up.value)
    up.dns_value.nodes = new_nodes

    core.log.info("resolve upstream which contain domain: ",
                  core.json.delay_encode(up, true))
    return up
end
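
To illustrate the difference, here is a minimal standalone sketch (illustration only, not APISIX code; the cache shape and names are assumptions) of why a version key that never changes can pin a stale resolution result, and why the timestamp suffix in parse_domain_in_up avoids it:

-- Illustration only: a cache keyed by a version string, the pattern that
-- conf_version-based lookups follow.
local cache = {}  -- version -> resolved upstream

local function get_upstream(version, resolve)
    local hit = cache[version]
    if hit then
        return hit  -- same key: return the cached (possibly stale) result
    end
    local value = resolve()
    cache[version] = value
    return value
end

-- DNS was failing when the route was first resolved:
local first = get_upstream("532500", function() return { nodes = {} } end)

-- DNS has recovered, but the key is unchanged, so the empty node list
-- keeps being served from the cache:
local second = get_upstream("532500", function()
    return { nodes = { "54.175.87.239" } }
end)
assert(second == first and #second.nodes == 0)

-- parse_domain_in_up's trick: rewrite the key with a timestamp suffix,
-- forcing a cache miss and a fresh resolution.
local third = get_upstream("532500#" .. os.time(), function()
    return { nodes = { "54.175.87.239" } }
end)
assert(#third.nodes == 1)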

@jiangfucheng
Member

Hi, thank you for your report. Could you give us a more detailed description of your configuration?

  1. Which DNS server are you running locally? We also need its related configuration.
  2. Could you paste the related APISIX configuration?

@wklken
Author

wklken commented Sep 1, 2023

I have reproduced it, @jiangfucheng.

I just mocked a dns failure in lua (it does not reproduce 100% of the time; retry a few times)

apisix config:

  1. start apisix with 4 processes
  2. change the nginx error log level to info

register route:

  1. add a route /get with one upstream, one node like httpbin.org; the parsed data in log is "nodes":[{"host":"54.175.87.239","priority":1,"domain":"httpbin.org","port":80,"weight":100,"upstream_host":"httpbin.org"}]
  2. change apisix/utils/upstream.lua to mock a random dns error; note that 1693542089 is now() + 60:
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then

            if host == "httpbin.org" and ngx_now() <= 1693542089 then
                if random(1, 10) % 2 == 0 then
                    core.log.error("101 empty records")
                    return new_nodes
                end
                -- return nil, "failed to query the DNS server: timeout11"
            end

            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end
            -- (the rest of the function is unchanged)
  3. apisix reload

do requests

  1. wrk -c2 -t2 -d80s http://{}/get
  2. after wrk finishes, run curl http://{}/get about 10 times; you get a mix of 200 and 503

the log

When a 503 occurs, there are no dns logs before the no valid upstream node error:

2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] plugin.lua:630: merge_service_route():   route conf: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332}
2023/09/01 04:24:21 [error] 140701#140701: *15534047 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node

The diff of the route conf between a 503 and a 200:

[screenshot]

@wklken
Author

wklken commented Sep 1, 2023

route.value.upstream.nodes is empty, so new_nodes is empty too (no dns resolution happens); compare_upstream_node then returns true and parse_domain_in_route returns early

[screenshot]

so route.value.upstream.nodes is being set to an empty value somewhere in the code.
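
A hypothetical simplification of compare_upstream_node (the real implementation differs) shows why an empty cached node list and an empty freshly-parsed list compare as equal, so the early-return path is taken on every request and dns_value is never refreshed:

-- Simplified stand-in for compare_upstream_node: node lists are "equal"
-- when they have the same length and matching entries.
local function compare_upstream_node(up_conf, new_nodes)
    local old_nodes = up_conf and up_conf.nodes or {}
    if #old_nodes ~= #new_nodes then
        return false
    end
    for i = 1, #new_nodes do
        local a, b = old_nodes[i], new_nodes[i]
        if a.host ~= b.host or a.port ~= b.port or a.weight ~= b.weight then
            return false
        end
    end
    return true
end

-- Two empty lists are trivially equal, so nothing is ever repaired.
assert(compare_upstream_node({ nodes = {} }, {}) == true)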

@jiangfucheng
Member

@wklken Hi, could you paste the APISIX configurations? It's hard to reproduce this bug without detailed configurations.

@wklken
Author

wklken commented Sep 8, 2023

@jiangfucheng I'm still trying to reproduce it with the apisix docker-compose, no success yet. I will update the issue when I succeed. (So far it can only be reproduced on our image built from apache/apisix:3.2.1-centos.)

@wklken
Author

wklken commented Sep 19, 2023

@jiangfucheng Finally, it has been reproduced on the apisix docker-compose setup. It took a lot of time, and I added some scripts to help.

Here are the steps; can you please help investigate this? (This is the problem blocking our production release.)


Reproduce steps:

1. use the docker-compose

  1. git clone https://github.com/apache/apisix-docker.git
  2. vim apisix_conf/master/config.yaml, change the router and nginx_config.worker_processes:
apisix:
  node_listen: 9080              # APISIX listening port
  enable_ipv6: false

  router:
    http: radixtree_uri_with_parameter

nginx_config:
  worker_processes: 4
  3. vim docker-compose-master.yaml, change the image to 3.2.1-centos:
services:
  apisix:
    image: "apache/apisix:3.2.1-centos"
  4. run docker-compose -p docker-apisix -f docker-compose-master.yaml up to start apisix and etcd

2. enter the container and make some changes to mock the dns failure

all in the container

  1. docker exec -it dockerapisix_apisix_1 /bin/bash
  2. check that conf/config.yaml is right (4 worker processes) and that the apisix version is 3.2.1
  3. vi apisix/utils/upstream.lua and add code to mock the dns failure; 50% of calls will return 101 empty records:
-- to line 22

local random = math.random

-- add before core.resolver.parse_domain(host)

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            if host == "httpbin.org" and ngx_now() <= 1695089345 then
                if random(1,10) % 2 == 0 then
                        core.log.error("101 empty records")
                        return new_nodes
                end
            end
            local ip, err = core.resolver.parse_domain(host)
            -- (the rest of the function is unchanged)
  4. add a script create.sh to register the route and service; check that the API_KEY is right:
#!/bin/bash

API_KEY="edd1c9f034335f136f87ad84b625c8f1"
ROUTE_ID="dns_route"
SERVICE_ID="dns_service"

curl http://127.0.0.1:9180/apisix/admin/services/${SERVICE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
  "upstream": {
    "nodes": [
      {
        "host": "httpbin.org",
        "port": 80,
        "weight": 100,
        "priority": 1
      }
    ],
    "type": "roundrobin",
    "scheme": "http",
    "pass_host": "node"
  }
}'

curl http://127.0.0.1:9180/apisix/admin/routes/${ROUTE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
  "uri": "/api/test/prod/dns22",
  "methods": [
    "GET"
  ],
  "plugins": {
    "proxy-rewrite": {
      "method": "GET",
      "uri": "/get"
    }
  },
  "upstream": {
    "nodes": [
      {
        "host": "httpbin.org",
        "port": 80,
        "weight": 100,
        "priority": 1
      }
    ],
    "type": "roundrobin",
    "scheme": "http",
    "pass_host": "node"
  },
  "service_id": "dns_service",
  "status": 1
}'
  5. add a script update_lua.sh to push the mock's timestamp forward quickly:
#!/bin/bash

now=$(date "+%s")
echo "now is: ${now}"

A=$(date -d "+30 seconds" "+%s")
echo "will change the condition to <= ${A}"
sed -i -r "s/<= ([0-9]+) then/<= ${A} then/g"  apisix/utils/upstream.lua
echo $?
echo "change done"

apisix reload
  6. do the registration:
bash -x create.sh
  7. check that the url is ok (from outside the container):
curl -vv http://0.0.0.0:9080/api/test/prod/dns22

3. outside the container, add a check script

all outside the container

  1. install wrk
  2. add a script start_and_check.sh to run the benchmark and then check for a 503:
#!/bin/bash

date

url="http://0.0.0.0:9080/api/test/prod/dns22"

echo "start bench wrk"
wrk -c2 -t2 -d35s ${url}

date

echo "sleep 5 s"
sleep 5

echo "check the status code 10 times"

for ((i=1; i<=10; i++))
do
    status_code=$(curl --write-out %{http_code} --silent --output /dev/null $url)
    echo "status=$status_code"
    if [ "${status_code}" -eq "503" ]
    then
        echo "503 show"
        exit
    fi
done

4. reproduce

Open two windows and run steps 1 and 2 below at the same time.

  1. in the container, run:
$ ./update_lua.sh 
now is: 1695089267
will change the condition to <= 1695089297
0
change done
/usr/local/openresty//luajit/bin/luajit ./apisix/cli/apisix.lua reload

WARNING: using fixed Admin API token has security risk.
Please modify "admin_key" in conf/config.yaml .


2023/09/19 02:07:47 [notice] 157#157: signal process started
  2. outside the container, run:
$ bash start_and_check.sh

Tue Sep 19 10:08:36 CST 2023
start bench wrk
Running 35s test @ http://0.0.0.0:9080/api/test/prod/dns22
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.22ms   80.61ms 823.04ms   95.99%
    Req/Sec     6.91k   807.96     7.77k    95.97%
  461582 requests in 35.00s, 202.93MB read
  Non-2xx or 3xx responses: 461575
Requests/sec:  13187.64
Transfer/sec:      5.80MB
Tue Sep 19 10:09:11 CST 2023
sleep 5 s
check the status code 10 times
status=503
503 show
  3. if a 503 shows, it has been reproduced! You can run curl -vv http://0.0.0.0:9080/api/test/prod/dns22 a few times to check:
curl -vv http://0.0.0.0:9080/api/test/prod/dns22
* About to connect() to 0.0.0.0 port 9080 (#0)
*   Trying 0.0.0.0...
* Connected to 0.0.0.0 (0.0.0.0) port 9080 (#0)
> GET /api/test/prod/dns22 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 0.0.0.0:9080
> Accept: */*
> 
< HTTP/1.1 503 Service Temporarily Unavailable
< Date: Tue, 19 Sep 2023 02:21:58 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 269
< Connection: keep-alive
< Server: APISIX/3.2.1
< 
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>openresty</center>
<p><em>Powered by <a href="https://apisix.apache.org/">APISIX</a>.</em></p></body>
</html>
* Connection #0 to host 0.0.0.0 left intact
  4. if all responses are 200, do steps 1 and 2 again (each round takes about 50 seconds, and a 503 does not show every time, but after about 10 tries the probability is high).

@wklken
Author

wklken commented Sep 19, 2023

More clues for debugging:

  1. it involves route.upstream: the problem occurs when the route has its own upstream and is also related to a service that has an upstream
    • route without upstream - ok
    • route with upstream, without service_id - ok
  2. with 4 workers, only some of them return 503
  3. after apisix reload, all requests return 200
  4. the 503 produces only 1 log line: 2023/09/19 02:34:52 [error] 854#854: *112694 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 172.31.0.1, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "0.0.0.0:9080"; it did not call parse_domain, maybe some cache?
  5. the error log of a 503 after dns is ok:
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:629: merge_service_route(): service conf: {"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":{},"has_domain":true,"value":{"update_time":1695037858,"id":"dns_service","upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":[{"port":80,"weight":100,"priority":1,"host":"httpbin.org"}],"parent":{"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":"table: 0x7f66620322b8","has_domain":true,"value":"table: 0x7f66620684b0"}},"create_time":1695037858}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:630: merge_service_route():   route conf: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [error] 50022#50022: *5076712 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"

@wklken
Author

wklken commented Sep 19, 2023

I added some logs; the picture is clearer now: it never recovers from 503 until apisix reload

[screenshot]

at apisix/init.lua

[screenshot]

and it gets assigned empty nodes here:

[screenshot]

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.table.insert(new_nodes, node)
        end
    end

    return new_nodes
end
_M.parse_domain_for_nodes = parse_domain_for_nodes


@jiangfucheng @Revolyssup Could someone please verify this?

@Revolyssup Revolyssup moved this from 📋 Backlog to 🏗 In progress in Apache APISIX backlog Sep 19, 2023
@jiangfucheng
Member

@wklken Thanks for your detailed description. Based on your description, I think there are two issues that need to be answered.

  1. Why does APISIX print the error log failed to set upstream: no valid upstream node after domain parsing failed?
    I think you already found the reason for this: the parse_domain_for_nodes() function should return nil, err if the dns parse failed.

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.table.insert(new_nodes, node)
        end
    end

    return new_nodes
end
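
A sketch of that suggested change (an assumption about the fix's shape, not merged code): propagate the resolver failure to the caller, so that callers such as parse_domain_in_route can keep the previous dns_value instead of caching an empty node list.

-- Sketch: return nil, err on resolution failure instead of silently
-- returning a shorter (possibly empty) node list.
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if not ip then
                -- surface the failure to the caller
                return nil, "failed to resolve domain: " .. host
                            .. ", error: " .. (err or "unknown")
            end
            local new_node = core.table.clone(node)
            new_node.host = ip
            new_node.domain = host
            core.table.insert(new_nodes, new_node)
        else
            core.table.insert(new_nodes, node)
        end
    end

    return new_nodes
end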

  2. Why does the http_access_phase() function print the matched route with an empty nodes field in your second request?

    apisix/apisix/init.lua

    Lines 494 to 504 in 9b2031a

    if route.has_domain then
        local err
        route, err = parse_domain_in_route(route)
        if err then
            core.log.error("failed to get resolved route: ", err)
            return core.response.exit(500)
        end

        api_ctx.conf_version = route.modifiedIndex
        api_ctx.matched_route = route
    end

I think it was caused by the line api_ctx.matched_route = route, which changed the route config in memory. (Note that I think it changed not just a copied table but the original route table.)

For the above two issues, I think just fixing the return value of the parse_domain_for_nodes function is enough.

For the other questions, I will continue debugging the code and share my findings.

One more question: in the description above, you said handle_upstream never calls the parse_domain_for_nodes function in the second request. How did you reach that conclusion? As far as I can see, parse_domain_for_nodes does not record any logs that would indicate whether it has been executed.

(I have been quite busy with work recently, and I will help you troubleshoot this issue as soon as possible in my spare time)

@Revolyssup Revolyssup self-assigned this Sep 19, 2023
@Revolyssup
Contributor

@jiangfucheng, in the log you can see that @wklken has added, I think, some error logs on his side (101 empty records) inside the parse_domain_for_nodes function.

@wklken Can you show the snippet of the parse_domain_for_nodes function where this log is?

@Revolyssup
Contributor

Revolyssup commented Sep 19, 2023

@jiangfucheng api_ctx.matched_route is per-request, so even if it is set wrong, the subsequent request will set it again. In the next request, you can see that api_ctx.matched_route has empty nodes in its upstream after it is populated by the match() function.
The whole issue boils down to: why does the matched_route log have no nodes? It is calculated for each request from etcd. I haven't been able to find the exact flow that explains how the nodes={} from a previous upstream ends up in api_ctx.matched_route's nodes when api_ctx is freshly created per request, nor where exactly the caching happens while matching the route.

The only caching culprit I see currently is this

cached_router_version = user_routes.conf_version

But _M.user_routes is only modified directly by etcd.
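
For readers following along, that guard works roughly like this (a standalone paraphrase with assumed names, not the actual router code): the radixtree router is rebuilt only when etcd bumps conf_version, so route matching itself never re-reads route values unless etcd changed them.

local cached_router_version
local uri_router

-- stand-in for building the radixtree router from route values
local function create_radixtree_uri_router(routes)
    return { routes = routes }
end

local function match(user_routes, api_ctx)
    if not cached_router_version
            or cached_router_version ~= user_routes.conf_version then
        uri_router = create_radixtree_uri_router(user_routes.values)
        cached_router_version = user_routes.conf_version
    end
    -- the cached uri_router then matches api_ctx against its routes
    return uri_router
end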

@Revolyssup
Contributor

@wklken Just to confirm, it only happens when the route has service id, right?

@Revolyssup
Contributor

@wklken I think parse_domain_for_nodes is called in subsequent requests, but there are no nodes to iterate over, so you don't get the log.
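
That would also explain the missing logs: both the mocked 101 empty records branch and the real resolver call live inside the loop body, and iterating an empty list executes the body zero times:

-- Once nodes is {}, parse_domain_for_nodes has nothing to iterate:
-- no resolver call, no error log, and new_nodes stays empty.
local nodes = {}
for _, node in ipairs(nodes) do
    error("never reached for an empty node list")
end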

@Revolyssup Revolyssup added bug Something isn't working and removed can't reproduce labels Sep 19, 2023
@wklken
Author

wklken commented Sep 20, 2023

@wklken Just to confirm, it only happens when the route has service id, right?

Yes, the route has an upstream and a service_id (and the service has an upstream too).

@jiangfucheng
Member

@jiangfucheng api_ctx.matched_route is per-request, so even if it is set wrong, the subsequent request will set it again. In the next request, you can see that api_ctx.matched_route has empty nodes in its upstream after it is populated by the match() function. The whole issue boils down to: why does the matched_route log have no nodes? It is calculated for each request from etcd. I haven't been able to find the exact flow that explains how the nodes={} from a previous upstream ends up in api_ctx.matched_route's nodes when api_ctx is freshly created per request, nor where exactly the caching happens while matching the route.

The only caching culprit I see currently is this

cached_router_version = user_routes.conf_version

But _M.user_routes is only modified directly by etcd.

You are right, my mistake: api_ctx is created when a new request comes in; it is not reused.

@jiangfucheng
Member

@wklken I think parse_domain_for_nodes is called in subsequent requests, but there are no nodes to iterate over, so you don't get the log.

Agree with this opinion.

@Revolyssup
Contributor

Revolyssup commented Sep 20, 2023

@wklken In the first request, did you get a log similar to route items: or insert uri route: ???

@wklken
Author

wklken commented Sep 20, 2023

@wklken In the first request, did you get a log similar to route items: or insert uri route: ???

My log level is info, and there is no route items: or insert uri route: log around where the first empty nodes appear.

@Revolyssup
Contributor

@wklken Before the empty nodes appeared, did you see this log? When was the last time it appeared?

@wklken
Author

wklken commented Sep 20, 2023

2023/09/19 06:46:16 [info] 51339#51339: *5157502 [lua] route.lua:73: create_radixtree_uri_router(): insert uri route: {"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":["GET"],"update_time":1695037859,"upstream":{"type":"roundrobin","pass_host":"node","scheme":"http","nodes":[{"priority":1,"port":80,"host":"httpbin.org","weight":100}],"parent":{"has_domain":true,"createdIndex":543015,"value":{"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":"table: 0x7f66620fc778","update_time":1695037859,"upstream":"table: 0x7f6661b40ed0","priority":0,"plugins":"table: 0x7f666236e638"},"modifiedIndex":543015,"orig_modifiedIndex":543015,"key":"/bk-gateway-apisix/routes/dns_route","update_count":0,"clean_handlers":{}},"hash_on":"vars"},"priority":0,"plugins":{"proxy-rewrite":{"method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"}}}\

..... a lot of other insert uri route

2023/09/19 06:46:17  empty nodes  

@Revolyssup it appeared 1 second before the empty nodes.

@Revolyssup
Contributor

cc: @kingluo

@wklken
Author

wklken commented Sep 28, 2023

#10093 (comment)

For the above two issues, I think just fixing the return value of the parse_domain_for_nodes function is enough.

@jiangfucheng we need your help to confirm whether that fix would be right, thanks. (For our production, we need to reduce the risk.)

According to your answer above, can we just add a check before return new_nodes in local function parse_domain_for_nodes(nodes)?

local function parse_domain_for_nodes(nodes)
    .....

    if #new_nodes == 0 then
      return nil, "no valid nodes after dns resolver domain: " .. host
    end

    return new_nodes
end
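
For reference, below is a complete sketch of the guarded function, assembled from the full version quoted earlier in this thread plus the proposed check (an assumption about the patch's shape, not necessarily its exact content). Note that the error message avoids referencing host, which is local to the loop body and out of scope after the loop:

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.table.insert(new_nodes, node)
        end
    end

    -- guard: never hand back an empty node list; fail loudly instead so
    -- an empty result is not cached as a valid dns_value
    if #new_nodes == 0 then
        return nil, "no valid nodes after dns resolution"
    end

    return new_nodes
end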


@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Apache APISIX backlog Oct 10, 2023
@wklken
Author

wklken commented Oct 10, 2023

@Revolyssup shouldn't the issue stay open until the official repo fixes the bug? We still don't know the root cause.

Our own repo needed a fix for production, so we patched it. The patch does resolve the empty nodes problem!

@Revolyssup
Contributor

@wklken Yep, I understand. The patch still doesn't fix the original issue; it's kind of a hack.

@Revolyssup Revolyssup reopened this Oct 10, 2023
@Revolyssup Revolyssup moved this from ✅ Done to 📋 Backlog in Apache APISIX backlog Oct 10, 2023
@sheharyaar
Contributor

Hi @wklken, can you please recheck whether this is fixed or not? I think it may have been related to #10722.

@wklken
Author

wklken commented Jan 12, 2024

@sheharyaar ok, I will recheck it later

@wklken
Author

wklken commented Jan 23, 2024

@sheharyaar Unfortunately, the bug is not fixed by #10722

Following the reproduction steps here: #10093 (comment):

  1. reproduce: 503 shows
  2. apply the patch, apisix reload
  3. reproduce again: 503 still shows

@sheharyaar
Contributor

Ok, I will have a look

@moonming moonming moved this from 📋 Backlog to 🏗 In progress in Apache APISIX backlog Jul 1, 2024
@Revolyssup
Contributor

@wklken I have retried the reproduction steps several times but still cannot reproduce the issue. I do not get a 503 after normal operation resumes.

@Revolyssup
Contributor

Today I tried to reproduce this issue in every way possible, including the exact steps given in the issue, but the problem does not occur, and from the code logic it also does not look like the error should occur. The parse_domain_for_nodes function should never receive the empty node list that seems to be causing the problem: route.value.upstream.nodes is never directly updated, so parse_domain_for_nodes always gets httpbin.org in nodes, and as soon as the dns server resumes, the node value is resolved and route.dns_value is updated.

@Revolyssup
Contributor

@wklken I have followed your exact reproduction steps multiple times but I never got the error.

@wklken
Author

wklken commented Jul 3, 2024

@Revolyssup did you follow those steps here? #10093 (comment)

and the apisix version should be 3.2.1;


We just patched it (https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch) and it has been in our production for about 10 months; it works.

@Revolyssup
Contributor

@Revolyssup did you follow those steps here? #10093 (comment)

and the apisix version should be 3.2.1;


We just patched it (https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch) and it has been in our production for about 10 months; it works.

Yes, I had followed the steps.

@Revolyssup
Contributor

@wklken Just want to confirm the behaviour after this patch.
So before the patch, when empty nodes were returned, you were intermittently getting 503 errors on requests.
But now that you error out early, you still get a 500 error for "no valid ip found", yet once the dns comes back up, subsequent requests see no error. Right?

Basically, you have covered the case so that nodes will never be set to empty.

@wklken
Author

wklken commented Jul 3, 2024

#10093 (comment)

While the dns server is reloading, all requests fail and dns resolution reports: failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But after the dns server becomes stable again, dns resolution in apisix does not resume immediately: requests keep failing with [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we run apisix reload, after which dns works again.

It’s always showing a 503 error even if the DNS is okay.
I think there may be some cache that makes APISIX unaware of the DNS resume.

@Revolyssup
Contributor

#10093 (comment)

While the dns server is reloading, all requests fail and dns resolution reports: failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But after the dns server becomes stable again, dns resolution in apisix does not resume immediately: requests keep failing with [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we run apisix reload, after which dns works again.

It’s always showing a 503 error even if the DNS is okay. I think there may be some cache that makes APISIX unaware of the DNS resume.

@wklken I meant to ask about the behavior with your patch. After your patch, you do not get the 503 after the dns server resumes, right?

@wklken
Author

wklken commented Jul 3, 2024

@Revolyssup Yes, with the patch, no 503 shows

@Revolyssup
Contributor

I haven't been able to reproduce it, but from the error, the way your patch fixes it, and the logs you showed above, here is the conclusion so far:

Somehow the nodes in the original saved configuration are set to empty. Since the problem goes away after a reload, etcd must still have the correct value. Since I can't reproduce it, I am currently going through the code to find the root cause.

@Revolyssup
Contributor

A cache issue is unlikely because api_ctx is created fresh for each request. So for the next request, a new api_ctx is created and the matched route should contain the correct route configuration, including the nodes field.

@Revolyssup
Contributor

Revolyssup commented Jul 3, 2024

[screenshot from 2024-07-03]
I tried multiple times following the exact docker compose steps given above and am still getting no 503s.

The same is the case when I tried locally.

@Revolyssup
Contributor

Since this cannot be reproduced after multiple attempts, I am closing the issue. If anyone is able to reproduce this bug again, please reopen it with the exact steps you followed.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Apache APISIX backlog Jul 4, 2024