
bug: dns resolution did not resume immediately after the dns server resume #10093

Closed
wklken opened this issue Aug 28, 2023 · 43 comments
Labels: bug (Something isn't working), can't reproduce

@wklken

wklken commented Aug 28, 2023

Current Behavior

In our production environment, the dns server reloads frequently.
While the dns server is reloading, all requests fail and dns resolution reports: failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But after the dns server becomes stable again, dns resolution in apisix does not resume immediately: requests keep failing with [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we run apisix reload, after which dns works again.

From the logs, on one apisix instance with 4 worker processes, 3 processes kept failing on dns while 1 process got no errors and served all requests fine.


apisix/core/dns/client.lua

function _M.resolve(self, domain, selector)
    local client = self.client

    local answers, err = client.resolve(domain)
    if not answers then
        return nil, "failed to query the DNS server: " .. err
    end

How does client.resolve detect that the dns server is ok again?

Expected Behavior

dns resolution should resume immediately once the dns server is back

Error Logs

when the dns server is reloading

[lua] upstream.lua:79: parse_domain_for_nodes(): dns resolver domain: test.com error: failed to query the DNS server: dns client error: 101 empty record received
[lua] resolver.lua:80: parse_domain(): failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received

when the dns server is stable (no dns error logs, why?)

[lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node

So when the dns server is stable, parse_domain succeeds (no error logs), but handle_upstream still gets no valid upstream node.

And after apisix reload, there are no error logs at all. Two hypotheses:

  1. Does the dns client parse succeed but return empty new_nodes? (a cache?)
  2. Does the dns client parse succeed and return new_nodes, but set_upstream fails?

Steps to Reproduce

For now, I cannot reproduce it, but it has occurred three times in our production.
Any clues or advice on how to make the dns server reload frequently?

Environment

  • APISIX version (run apisix version): 3.2.1
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V): nginx version: openresty/1.21.4.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@wklken
Author

wklken commented Aug 31, 2023

There are two upstreams:

  1. route.upstream
  2. route.service_id -> service.upstream

The first one (route.upstream) did not resume immediately.

apisix/apisix/init.lua

Lines 422 to 445 in 7770483

if up_id then
    local upstream = apisix_upstream.get_by_id(up_id)
    if not upstream then
        if is_http then
            return core.response.exit(502)
        end

        return ngx_exit(1)
    end

    api_ctx.matched_upstream = upstream
else
    if route.has_domain then
        local err
        route, err = parse_domain_in_route(route)
        if err then
            core.log.error("failed to get resolved route: ", err)
            return core.response.exit(500)
        end

        api_ctx.conf_version = route.modifiedIndex
        api_ctx.matched_route = route
    end

and it seems parse_domain_in_route does not notice that the dns server has recovered

apisix/apisix/init.lua

Lines 210 to 239 in 7770483

local function parse_domain_in_route(route)
    local nodes = route.value.upstream.nodes
    local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
    if not new_nodes then
        return nil, err
    end

    local up_conf = route.dns_value and route.dns_value.upstream
    local ok = upstream_util.compare_upstream_node(up_conf, new_nodes)
    if ok then
        return route
    end

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed
    local parent = route.value.upstream.parent
    if parent then
        route.value.upstream.parent = nil
    end
    route.dns_value = core.table.deepcopy(route.value)
    if parent then
        route.value.upstream.parent = parent
        route.dns_value.upstream.parent = parent
    end

    route.dns_value.upstream.nodes = new_nodes
    core.log.info("parse route which contain domain: ",
                  core.json.delay_encode(route, true))
    return route
end

the comment (from ce4d8fb)

    -- don't modify the modifiedIndex to avoid plugin cache miss because of DNS resolve result
    -- has changed

while parse_domain_in_up does modify the modifiedIndex:

function _M.parse_domain_in_up(up)
    local nodes = up.value.nodes
    local new_nodes, err = parse_domain_for_nodes(nodes)
    if not new_nodes then
        return nil, err
    end

    local ok = compare_upstream_node(up.dns_value, new_nodes)
    if ok then
        return up
    end

    if not up.orig_modifiedIndex then
        up.orig_modifiedIndex = up.modifiedIndex
    end
    up.modifiedIndex = up.orig_modifiedIndex .. "#" .. ngx_now()

    up.dns_value = core.table.clone(up.value)
    up.dns_value.nodes = new_nodes

    core.log.info("resolve upstream which contain domain: ",
                  core.json.delay_encode(up, true))
    return up
end
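
To illustrate the difference, here is a minimal standalone sketch (illustration only, not APISIX code; the cache shape and names are assumptions) of why a version key that never changes can pin a stale resolution result, and why the timestamp suffix in parse_domain_in_up avoids it:

-- Illustration only: a cache keyed by a version string, the pattern that
-- conf_version-based lookups follow.
local cache = {}  -- version -> resolved upstream

local function get_upstream(version, resolve)
    local hit = cache[version]
    if hit then
        return hit  -- same key: return the cached (possibly stale) result
    end
    local value = resolve()
    cache[version] = value
    return value
end

-- DNS was failing when the route was first resolved:
local first = get_upstream("532500", function() return { nodes = {} } end)

-- DNS has recovered, but the key is unchanged, so the empty node list
-- keeps being served from the cache:
local second = get_upstream("532500", function()
    return { nodes = { "54.175.87.239" } }
end)
assert(second == first and #second.nodes == 0)

-- parse_domain_in_up's trick: rewrite the key with a timestamp suffix,
-- forcing a cache miss and a fresh resolution.
local third = get_upstream("532500#" .. os.time(), function()
    return { nodes = { "54.175.87.239" } }
end)
assert(#third.nodes == 1)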

@jiangfucheng
Member

Hi, thank you for your report. Could you give us a more detailed description of your configuration?

  1. Which DNS server are you running locally? We also need its related configuration.
  2. Could you paste the related APISIX configuration?

@wklken
Author

wklken commented Sep 1, 2023

I have reproduced it, @jiangfucheng.

I just mocked a dns failure in lua (it does not reproduce 100% of the time; retry a few times)

apisix config:

  1. start apisix with 4 processes
  2. change the nginx error log level to info

register route:

  1. add a route /get with one upstream, one node like httpbin.org; the parsed data in log is "nodes":[{"host":"54.175.87.239","priority":1,"domain":"httpbin.org","port":80,"weight":100,"upstream_host":"httpbin.org"}]
  2. change apisix/utils/upstream.lua to mock a random dns error; note that 1693542089 is now() + 60:
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then

            if host == "httpbin.org" and ngx_now() <= 1693542089 then
                if random(1, 10) % 2 == 0 then
                    core.log.error("101 empty records")
                    return new_nodes
                end
                -- return nil, "failed to query the DNS server: timeout11"
            end

            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end
            -- (the rest of the function is unchanged)
  3. apisix reload

do requests

  1. wrk -c2 -t2 -d80s http://{}/get
  2. after wrk finishes, run curl http://{}/get about 10 times; you get a mix of 200 and 503

the log

When a 503 occurs, there are no dns logs before the no valid upstream node error:

2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332
2023/09/01 04:24:20 [info] 140701#140701: *15533385 [lua] plugin.lua:630: merge_service_route():   route conf: {"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":"table: 0x7f7d645c64c8","has_domain":true,"value":{"create_time":1693309224,"methods":["GET"],"service_id":"test.prod.stage-12","update_time":1693539224,"timeout":{"send":150,"read":150,"connect":150},"labels":{"gateway.bk.tencent.com/gateway":"test","gateway.bk.tencent.com/stage":"prod"},"uris":["/api/test/prod/dns","/api/test/prod/dns/"],"status":1,"priority":0,"name":"test-prod-dns","id":"test.prod.2423","plugins":{"bk-proxy-rewrite":{"match_subpath":false,"subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"},"bk-resource-context":{"bk_resource_auth":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false},"bk_resource_id":2423,"bk_resource_name":"dns","bk_resource_auth_obj":{"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false,"verified_user_required":false}}},"upstream":{"nodes":{},"hash_on":"vars","scheme":"http","parent":{"key":"/bk-gateway-apisix/routes/test.prod.2423","modifiedIndex":532500,"clean_handlers":{},"has_domain":true,"value":"table: 0x7f7d9706c708","update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332},"type":"roundrobin","pass_host":"node","timeout":{"send":150,"read":150,"connect":150}}},"update_count":0,"orig_modifiedIndex":532500,"createdIndex":532332}
2023/09/01 04:24:21 [error] 140701#140701: *15534047 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node

The diff of the route conf between a 503 and a 200:

[screenshot]

@wklken
Author

wklken commented Sep 1, 2023

route.value.upstream.nodes is empty, so new_nodes is empty too (no dns resolution happens); compare_upstream_node then returns true and parse_domain_in_route returns early

[screenshot]

so route.value.upstream.nodes is being set to an empty value somewhere in the code.
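
A hypothetical simplification of compare_upstream_node (the real implementation differs) shows why an empty cached node list and an empty freshly-parsed list compare as equal, so the early-return path is taken on every request and dns_value is never refreshed:

-- Simplified stand-in for compare_upstream_node: node lists are "equal"
-- when they have the same length and matching entries.
local function compare_upstream_node(up_conf, new_nodes)
    local old_nodes = up_conf and up_conf.nodes or {}
    if #old_nodes ~= #new_nodes then
        return false
    end
    for i = 1, #new_nodes do
        local a, b = old_nodes[i], new_nodes[i]
        if a.host ~= b.host or a.port ~= b.port or a.weight ~= b.weight then
            return false
        end
    end
    return true
end

-- Two empty lists are trivially equal, so nothing is ever repaired.
assert(compare_upstream_node({ nodes = {} }, {}) == true)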

@jiangfucheng
Member

@wklken Hi, could you paste the APISIX configurations? It's hard to reproduce this bug without detailed configurations.

@wklken
Author

wklken commented Sep 8, 2023

@jiangfucheng I'm still trying to reproduce it with the apisix docker-compose, no success yet. I will update the issue when I succeed. (So far it can only be reproduced on our image built from apache/apisix:3.2.1-centos.)

@wklken
Author

wklken commented Sep 19, 2023

@jiangfucheng Finally, it has been reproduced on the apisix docker-compose setup. It took a lot of time, and I added some scripts to help.

Here are the steps; can you please help investigate this? (This is the problem blocking our production release.)


Reproduce steps:

1. use the docker-compose

  1. git clone https://github.com/apache/apisix-docker.git
  2. vim apisix_conf/master/config.yaml, change the router and nginx_config.worker_processes:
apisix:
  node_listen: 9080              # APISIX listening port
  enable_ipv6: false

  router:
    http: radixtree_uri_with_parameter

nginx_config:
  worker_processes: 4
  3. vim docker-compose-master.yaml, change the image to 3.2.1-centos:
services:
  apisix:
    image: "apache/apisix:3.2.1-centos"
  4. run docker-compose -p docker-apisix -f docker-compose-master.yaml up to start apisix and etcd

2. enter the container and make some changes to mock the dns failure

all in the container

  1. docker exec -it dockerapisix_apisix_1 /bin/bash
  2. check that conf/config.yaml is right (4 worker processes) and that the apisix version is 3.2.1
  3. vi apisix/utils/upstream.lua and add code to mock the dns failure; 50% of calls will return 101 empty records:
-- to line 22

local random = math.random

-- add before core.resolver.parse_domain(host)

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            if host == "httpbin.org" and ngx_now() <= 1695089345 then
                if random(1,10) % 2 == 0 then
                        core.log.error("101 empty records")
                        return new_nodes
                end
            end
            local ip, err = core.resolver.parse_domain(host)
            -- (the rest of the function is unchanged)
  4. add a script create.sh to register the route and service; check that the API_KEY is right:
#!/bin/bash

API_KEY="edd1c9f034335f136f87ad84b625c8f1"
ROUTE_ID="dns_route"
SERVICE_ID="dns_service"

curl http://127.0.0.1:9180/apisix/admin/services/${SERVICE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
  "upstream": {
    "nodes": [
      {
        "host": "httpbin.org",
        "port": 80,
        "weight": 100,
        "priority": 1
      }
    ],
    "type": "roundrobin",
    "scheme": "http",
    "pass_host": "node"
  }
}'

curl http://127.0.0.1:9180/apisix/admin/routes/${ROUTE_ID} -H "X-API-KEY: ${API_KEY}" -X PUT -d '
{
  "uri": "/api/test/prod/dns22",
  "methods": [
    "GET"
  ],
  "plugins": {
    "proxy-rewrite": {
      "method": "GET",
      "uri": "/get"
    }
  },
  "upstream": {
    "nodes": [
      {
        "host": "httpbin.org",
        "port": 80,
        "weight": 100,
        "priority": 1
      }
    ],
    "type": "roundrobin",
    "scheme": "http",
    "pass_host": "node"
  },
  "service_id": "dns_service",
  "status": 1
}'
  5. add a script update_lua.sh to push the mock's timestamp forward quickly:
#!/bin/bash

now=$(date "+%s")
echo "now is: ${now}"

A=$(date -d "+30 seconds" "+%s")
echo "will change the condition to <= ${A}"
sed -i -r "s/<= ([0-9]+) then/<= ${A} then/g"  apisix/utils/upstream.lua
echo $?
echo "change done"

apisix reload
  6. do the registration:
bash -x create.sh
  7. check that the url is ok (from outside the container):
curl -vv http://0.0.0.0:9080/api/test/prod/dns22

3. outside the container, add a check script

all outside the container

  1. install wrk
  2. add a script start_and_check.sh to run the benchmark and then check for a 503:
#!/bin/bash

date

url="http://0.0.0.0:9080/api/test/prod/dns22"

echo "start bench wrk"
wrk -c2 -t2 -d35s ${url}

date

echo "sleep 5 s"
sleep 5

echo "check the status code 10 times"

for ((i=1; i<=10; i++))
do
    status_code=$(curl --write-out %{http_code} --silent --output /dev/null $url)
    echo "status=$status_code"
    if [ "${status_code}" -eq "503" ]
    then
        echo "503 show"
        exit
    fi
done

4. reproduce

Open two windows and run steps 1 and 2 below at the same time.

  1. in the container, run:
$ ./update_lua.sh 
now is: 1695089267
will change the condition to <= 1695089297
0
change done
/usr/local/openresty//luajit/bin/luajit ./apisix/cli/apisix.lua reload

WARNING: using fixed Admin API token has security risk.
Please modify "admin_key" in conf/config.yaml .


2023/09/19 02:07:47 [notice] 157#157: signal process started
  2. outside the container, run:
$ bash start_and_check.sh

Tue Sep 19 10:08:36 CST 2023
start bench wrk
Running 35s test @ http://0.0.0.0:9080/api/test/prod/dns22
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.22ms   80.61ms 823.04ms   95.99%
    Req/Sec     6.91k   807.96     7.77k    95.97%
  461582 requests in 35.00s, 202.93MB read
  Non-2xx or 3xx responses: 461575
Requests/sec:  13187.64
Transfer/sec:      5.80MB
Tue Sep 19 10:09:11 CST 2023
sleep 5 s
check the status code 10 times
status=503
503 show
  3. if a 503 shows, it has been reproduced! You can run curl -vv http://0.0.0.0:9080/api/test/prod/dns22 a few times to check:
curl -vv http://0.0.0.0:9080/api/test/prod/dns22
* About to connect() to 0.0.0.0 port 9080 (#0)
*   Trying 0.0.0.0...
* Connected to 0.0.0.0 (0.0.0.0) port 9080 (#0)
> GET /api/test/prod/dns22 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 0.0.0.0:9080
> Accept: */*
> 
< HTTP/1.1 503 Service Temporarily Unavailable
< Date: Tue, 19 Sep 2023 02:21:58 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 269
< Connection: keep-alive
< Server: APISIX/3.2.1
< 
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>openresty</center>
<p><em>Powered by <a href="https://apisix.apache.org/">APISIX</a>.</em></p></body>
</html>
* Connection #0 to host 0.0.0.0 left intact
  4. if all responses are 200, do steps 1 and 2 again (each round takes about 50 seconds, and a 503 does not show every time, but after about 10 tries the probability is high).

@wklken
Author

wklken commented Sep 19, 2023

More clues for debugging:

  1. it involves route.upstream: the problem occurs when the route has its own upstream and is also related to a service that has an upstream
    • route without upstream - ok
    • route with upstream, without service_id - ok
  2. with 4 workers, only some of them return 503
  3. after apisix reload, all requests return 200
  4. the 503 produces only 1 log line: 2023/09/19 02:34:52 [error] 854#854: *112694 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 172.31.0.1, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "0.0.0.0:9080"; it did not call parse_domain, maybe some cache?
  5. the error log of a 503 after dns is ok:
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] radixtree_uri_with_parameter.lua:53: match(): route match mode: radixtree_uri_with_parameter, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] init.lua:572: http_access_phase(): matched route: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:629: merge_service_route(): service conf: {"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":{},"has_domain":true,"value":{"update_time":1695037858,"id":"dns_service","upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":[{"port":80,"weight":100,"priority":1,"host":"httpbin.org"}],"parent":{"key":"/bk-gateway-apisix/services/dns_service","modifiedIndex":543014,"createdIndex":543014,"clean_handlers":"table: 0x7f66620322b8","has_domain":true,"value":"table: 0x7f66620684b0"}},"create_time":1695037858}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [info] 50022#50022: *5076712 [lua] plugin.lua:630: merge_service_route():   route conf: {"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":{},"update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":{"id":"dns_route","create_time":1695037859,"uri":"/api/test/prod/dns22","priority":0,"status":1,"update_time":1695037859,"service_id":"dns_service","methods":["GET"],"upstream":{"hash_on":"vars","scheme":"http","pass_host":"node","type":"roundrobin","nodes":{},"parent":{"key":"/bk-gateway-apisix/routes/dns_route","modifiedIndex":543015,"createdIndex":543015,"clean_handlers":"table: 0x7f666206ded0","update_count":0,"orig_modifiedIndex":543015,"has_domain":true,"value":"table: 0x7f66622f1d28"}},"plugins":{"proxy-rewrite":{"method":"GET","uri":"/get","use_real_request_uri_unsafe":false}}}}, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"
2023/09/19 06:26:24 [error] 50022#50022: *5076712 [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node, client: 10.244.0.202, server: _, request: "GET /api/test/prod/dns22 HTTP/1.1", host: "bkapi.dev.test.com"

@wklken
Author

wklken commented Sep 19, 2023

I added some logs; the picture is clearer now: it never recovers from 503 until apisix reload

[screenshot]

at apisix/init.lua

[screenshot]

and it gets assigned empty nodes here:

[screenshot]

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.table.insert(new_nodes, node)
        end
    end

    return new_nodes
end
_M.parse_domain_for_nodes = parse_domain_for_nodes


@jiangfucheng @Revolyssup Could someone please verify this?

@Revolyssup Revolyssup moved this from 📋 Backlog to 🏗 In progress in Apache APISIX backlog Sep 19, 2023
@jiangfucheng
Member

@wklken Thanks for your detailed description. Based on your description, I think there are two issues that need to be answered.

  1. Why does APISIX print the error log failed to set upstream: no valid upstream node after domain parsing failed?
    I think you already found the reason for this: the parse_domain_for_nodes() function should return nil, err if the dns parse failed.

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.table.insert(new_nodes, node)
        end
    end

    return new_nodes
end
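
A sketch of that suggested change (an assumption about the fix's shape, not merged code): propagate the resolver failure to the caller, so that callers such as parse_domain_in_route can keep the previous dns_value instead of caching an empty node list.

-- Sketch: return nil, err on resolution failure instead of silently
-- returning a shorter (possibly empty) node list.
local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if not ip then
                -- surface the failure to the caller
                return nil, "failed to resolve domain: " .. host
                            .. ", error: " .. (err or "unknown")
            end
            local new_node = core.table.clone(node)
            new_node.host = ip
            new_node.domain = host
            core.table.insert(new_nodes, new_node)
        else
            core.table.insert(new_nodes, node)
        end
    end

    return new_nodes
end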

  2. Why does the http_access_phase() function print the matched route with an empty nodes field in your second request?

    apisix/apisix/init.lua

    Lines 494 to 504 in 9b2031a

    if route.has_domain then
        local err
        route, err = parse_domain_in_route(route)
        if err then
            core.log.error("failed to get resolved route: ", err)
            return core.response.exit(500)
        end

        api_ctx.conf_version = route.modifiedIndex
        api_ctx.matched_route = route
    end

I think it was caused by the line api_ctx.matched_route = route, which changed the route config in memory. (Note that I think it changed not just a copied table but the original route table.)

For the above two issues, I think just fixing the return value of the parse_domain_for_nodes function is enough.

For the other questions, I will continue debugging the code and share my findings.

One more question: in the description above, you said handle_upstream never calls the parse_domain_for_nodes function in the second request. How did you reach that conclusion? As far as I can see, parse_domain_for_nodes does not record any logs that would indicate whether it has been executed.

(I have been quite busy with work recently, and I will help you troubleshoot this issue as soon as possible in my spare time)

@Revolyssup Revolyssup self-assigned this Sep 19, 2023
@Revolyssup
Contributor

@jiangfucheng, in the log you can see that @wklken has added, I think, some error logs on his side (101 empty records) inside the parse_domain_for_nodes function.

@wklken Can you show the snippet of the parse_domain_for_nodes function where this log is?

@Revolyssup
Contributor

Revolyssup commented Sep 19, 2023

@jiangfucheng api_ctx.matched_route is per-request, so even if it is set wrong, the subsequent request will set it again. In the next request, you can see that api_ctx.matched_route has empty nodes in its upstream after it is populated by the match() function.
The whole issue boils down to: why does the matched_route log have no nodes? It is calculated for each request from etcd. I haven't been able to find the exact flow that explains how the nodes={} from a previous upstream ends up in api_ctx.matched_route's nodes when api_ctx is freshly created per request, nor where exactly the caching happens while matching the route.

The only caching culprit I see currently is this

cached_router_version = user_routes.conf_version

But _M.user_routes is only modified directly by etcd.
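
For readers following along, that guard works roughly like this (a standalone paraphrase with assumed names, not the actual router code): the radixtree router is rebuilt only when etcd bumps conf_version, so route matching itself never re-reads route values unless etcd changed them.

local cached_router_version
local uri_router

-- stand-in for building the radixtree router from route values
local function create_radixtree_uri_router(routes)
    return { routes = routes }
end

local function match(user_routes, api_ctx)
    if not cached_router_version
            or cached_router_version ~= user_routes.conf_version then
        uri_router = create_radixtree_uri_router(user_routes.values)
        cached_router_version = user_routes.conf_version
    end
    -- the cached uri_router then matches api_ctx against its routes
    return uri_router
end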

@Revolyssup
Contributor

@wklken Just to confirm, it only happens when the route has service id, right?

@Revolyssup
Contributor

@wklken I think parse_domain_for_nodes is called in subsequent requests, but there are no nodes to iterate over, so you don't get the log.
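
That would also explain the missing logs: both the mocked 101 empty records branch and the real resolver call live inside the loop body, and iterating an empty list executes the body zero times:

-- Once nodes is {}, parse_domain_for_nodes has nothing to iterate:
-- no resolver call, no error log, and new_nodes stays empty.
local nodes = {}
for _, node in ipairs(nodes) do
    error("never reached for an empty node list")
end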

@Revolyssup Revolyssup added bug Something isn't working and removed can't reproduce labels Sep 19, 2023
@wklken
Author

wklken commented Sep 20, 2023

@wklken Just to confirm, it only happens when the route has service id, right?

Yes, the route has an upstream and a service_id (and the service has an upstream too).

@jiangfucheng
Member

@jiangfucheng api_ctx.matched_route is per-request, so even if it is set wrong, the subsequent request will set it again. In the next request, you can see that api_ctx.matched_route has empty nodes in its upstream after it is populated by the match() function. The whole issue boils down to: why does the matched_route log have no nodes? It is calculated for each request from etcd. I haven't been able to find the exact flow that explains how the nodes={} from a previous upstream ends up in api_ctx.matched_route's nodes when api_ctx is freshly created per request, nor where exactly the caching happens while matching the route.

The only caching culprit I see currently is this

cached_router_version = user_routes.conf_version

But _M.user_routes is only modified directly by etcd.

You are right, my mistake: api_ctx is created when a new request comes in; it is not reused.

@jiangfucheng
Member

@wklken I think parse_domain_for_nodes is called in subsequent requests, but there are no nodes to iterate over, so you don't get the log.

Agree with this opinion.

@Revolyssup
Contributor

Revolyssup commented Sep 20, 2023

@wklken In the first request, did you get a log similar to route items: or insert uri route: ???

@wklken
Author

wklken commented Sep 20, 2023

@wklken In the first request, did you get a log similar to route items: or insert uri route: ???

My log level is info, and there is no route items: or insert uri route: log around where the first empty nodes appear.

@Revolyssup
Contributor

@wklken Before the empty nodes appeared, did you see this log? When was the last time it appeared?

@wklken
Author

wklken commented Sep 20, 2023

2023/09/19 06:46:16 [info] 51339#51339: *5157502 [lua] route.lua:73: create_radixtree_uri_router(): insert uri route: {"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":["GET"],"update_time":1695037859,"upstream":{"type":"roundrobin","pass_host":"node","scheme":"http","nodes":[{"priority":1,"port":80,"host":"httpbin.org","weight":100}],"parent":{"has_domain":true,"createdIndex":543015,"value":{"id":"dns_route","service_id":"dns_service","create_time":1695037859,"uri":"/api/test/prod/dns22","status":1,"methods":"table: 0x7f66620fc778","update_time":1695037859,"upstream":"table: 0x7f6661b40ed0","priority":0,"plugins":"table: 0x7f666236e638"},"modifiedIndex":543015,"orig_modifiedIndex":543015,"key":"/bk-gateway-apisix/routes/dns_route","update_count":0,"clean_handlers":{}},"hash_on":"vars"},"priority":0,"plugins":{"proxy-rewrite":{"method":"GET","use_real_request_uri_unsafe":false,"uri":"/get"}}}\

..... a lot of other insert uri route

2023/09/19 06:46:17  empty nodes  

@Revolyssup it appeared 1 second before the empty nodes.

@Revolyssup
Contributor

cc: @kingluo

@wklken
Author

wklken commented Sep 28, 2023

#10093 (comment)

For the above two issues, I think just fixing the return value of the parse_domain_for_nodes function is enough.

@jiangfucheng we need your help to confirm whether that fix would be right, thanks. (For our production, we need to reduce the risk.)

According to your answer above, can we just add a check before return new_nodes in local function parse_domain_for_nodes(nodes)?

local function parse_domain_for_nodes(nodes)
    .....

    if #new_nodes == 0 then
      return nil, "no valid nodes after dns resolver domain: " .. host
    end

    return new_nodes
end
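
For reference, below is a complete sketch of the guarded function, assembled from the full version quoted earlier in this thread plus the proposed check (an assumption about the patch's shape, not necessarily its exact content). Note that the error message avoids referencing host, which is local to the loop body and out of scope after the loop:

local function parse_domain_for_nodes(nodes)
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ", err)
            end
        else
            core.table.insert(new_nodes, node)
        end
    end

    -- guard: never hand back an empty node list; fail loudly instead so
    -- an empty result is not cached as a valid dns_value
    if #new_nodes == 0 then
        return nil, "no valid nodes after dns resolution"
    end

    return new_nodes
end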


@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Apache APISIX backlog Oct 10, 2023
@wklken
Author

wklken commented Oct 10, 2023

@Revolyssup shouldn't the issue stay open until the official repo fixes the bug? We still don't know the root cause.

Our own repo needed a fix for production, so we patched it. The patch does resolve the empty nodes problem!

@Revolyssup
Contributor

@wklken Yep, I understand. The patch still doesn't fix the original issue; it's kind of a hack.

@Revolyssup Revolyssup reopened this Oct 10, 2023
@Revolyssup Revolyssup moved this from ✅ Done to 📋 Backlog in Apache APISIX backlog Oct 10, 2023
@sheharyaar
Contributor

Hi @wklken, can you please recheck whether this is fixed or not? I think it may have been related to #10722.

@wklken
Author

wklken commented Jan 12, 2024

@sheharyaar ok, I will recheck it later

@wklken
Author

wklken commented Jan 23, 2024

@sheharyaar Unfortunately, the bug is not fixed by #10722

Following the reproduction steps here: #10093 (comment):

  1. reproduce: 503 shows
  2. apply the patch, apisix reload
  3. reproduce again: 503 still shows

@sheharyaar
Contributor

Ok, I will have a look

@moonming moonming moved this from 📋 Backlog to 🏗 In progress in Apache APISIX backlog Jul 1, 2024
@Revolyssup
Contributor

@wklken I have retried the reproduction steps several times but still cannot reproduce the issue. I do not get a 503 after normal operation resumes.

@Revolyssup
Contributor

Today I tried to reproduce this issue in every way possible, including the exact steps given in the issue, but the problem does not occur, and from the code logic it also does not look like the error should occur. The parse_domain_for_nodes function should never receive the empty node list that seems to be causing the problem: route.value.upstream.nodes is never directly updated, so parse_domain_for_nodes always gets httpbin.org in nodes, and as soon as the dns server resumes, the node value is resolved and route.dns_value is updated.

@Revolyssup
Contributor

@wklken I have followed your exact reproduction steps multiple times but I never got the error.

@wklken
Author

wklken commented Jul 3, 2024

@Revolyssup did you follow those steps here? #10093 (comment)

and the apisix version should be 3.2.1;


We just patched it (https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch) and it has been in our production for about 10 months; it works.

@Revolyssup
Contributor

@Revolyssup did you follow those steps here? #10093 (comment)

and the apisix version should be 3.2.1;


We just patched it (https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/002_upstream_parse_domain_for_nodes.patch) and it has been in our production for about 10 months; it works.

Yes, I had followed the steps.

@Revolyssup
Contributor

@wklken Just want to confirm the behaviour after this patch.
So before the patch, when empty nodes were returned, you were intermittently getting 503 errors on requests.
But now that you error out early, you still get a 500 error for "no valid ip found", yet once the dns comes back up, subsequent requests see no error. Right?

Basically, you have covered the case so that nodes will never be set to empty.

@wklken
Author

wklken commented Jul 3, 2024

#10093 (comment)

While the dns server is reloading, all requests fail and dns resolution reports: failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But after the dns server becomes stable again, dns resolution in apisix does not resume immediately: requests keep failing with [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we run apisix reload, after which dns works again.

It’s always showing a 503 error even if the DNS is okay.
I think there may be some cache that makes APISIX unaware of the DNS resume.

@Revolyssup
Contributor

#10093 (comment)

While the dns server is reloading, all requests fail and dns resolution reports: failed to parse domain: test.com, error: failed to query the DNS server: dns client error: 101 empty record received;

But after the dns server becomes stable again, dns resolution in apisix does not resume immediately: requests keep failing with [lua] init.lua:486: handle_upstream(): failed to set upstream: no valid upstream node for hours, until we run apisix reload, after which dns works again.

It’s always showing a 503 error even if the DNS is okay. I think there may be some cache that makes APISIX unaware of the DNS resume.

@wklken I meant to ask about the behavior with your patch. After your patch, you do not get the 503 after the dns server resumes, right?

@wklken
Author

wklken commented Jul 3, 2024

@Revolyssup Yes, with the patch, no 503 shows

@Revolyssup
Contributor

I haven't been able to reproduce it, but from the error, the way your patch fixes it, and the logs you showed above, here is the conclusion so far:

Somehow the nodes in the original saved configuration are set to empty. Since the problem goes away after a reload, etcd must still have the correct value. Since I can't reproduce it, I am currently going through the code to find the root cause.

@Revolyssup
Contributor

A cache issue is unlikely because api_ctx is created fresh for each request. So for the next request, a new api_ctx is created and the matched route should contain the correct route configuration, including the nodes field.

@Revolyssup
Contributor

Revolyssup commented Jul 3, 2024

[screenshot from 2024-07-03]
I tried multiple times following the exact docker compose steps given above and am still getting no 503s.

The same is the case when I tried locally.

@Revolyssup
Contributor

Since this cannot be reproduced after multiple attempts, I am closing the issue. If anyone is able to reproduce this bug again, please reopen it with the exact steps you followed.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Apache APISIX backlog Jul 4, 2024