fix: use warn log for get healthcheck target status failure #10156

shreemaan-abhishek · 2023-09-04T16:31:44Z

Description

Lines 82 to 83 in 35349ef

    core.log.error("failed to get health check target status, addr: ",
                   node.host, ":", port or node.port, ", host: ", host, ", err: ", err)

The above error log reports an error in some cases even if the request is valid and successful. In the initial requests, get_target_status() will always fail as the target status is not yet finalized. It does not make sense to report an error for positive outcomes.

Note that a get_target_status() failure only signifies that the system failed to get the health status; it does not tell us whether the target is actually unhealthy. And when a node really is unavailable or unhealthy, get_target_status() does not report an error; instead, the health checker emits warn logs like this:

2023/09/04 22:13:09 [warn] 14709#16724107: *2442 [lua] healthcheck.lua:650: [healthcheck] (upstream#/apisix/upstreams/1) unhealthy TCP increment (2/2) for '(127.0.0.1:8765)', context: ngx.timer, client: 127.0.0.1, server: 0.0.0.0:9080

Background information:

I encountered this problem while writing a test case for HTTPS-based health checks.

Checklist

I have explained the need for this PR and the problem it solves
I have explained the changes or the new features added to this PR
I have added tests corresponding to this change
I have updated the documentation to reflect this change
I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

monkeyDluffy6017 · 2023-09-05T04:08:13Z

The error log occurs when the upstream is rebuilt because an upstream node was added or deleted. The state is then incorrect, which will lead to unbalanced load. That is a real problem and we should fix it, so the error log level is appropriate here.

shreemaan-abhishek · 2023-09-05T05:02:22Z

"The state is not correct."

Do you mean not correct == unhealthy?

Also, in the beginning, there will be no state as the healthcheck won't be triggered unless the upstream has been accessed.

@monkeyDluffy6017 try running the following test case and check the error logs under servroot; you will find error logs even for positive outcomes.

t/stream-node/upstream-tls.t

use t::APISIX 'no_plan';

no_root_location();

add_block_preprocessor(sub {
    my ($block) = @_;

    if (!$block->http_config) {
        my $http_config = <<'_EOC_';
server {
    listen 8765 ssl;
    ssl_certificate ../../certs/mtls_server.crt;
    ssl_certificate_key ../../certs/mtls_server.key;
    ssl_client_certificate ../../certs/mtls_ca.crt;
    ssl_verify_client on;

    location /index {
        return 200 'gangnam style!';
    }
}
server {
    listen 8766 ssl;
    ssl_certificate ../../certs/mtls_server.crt;
    ssl_certificate_key ../../certs/mtls_server.key;
    ssl_client_certificate ../../certs/mtls_ca.crt;
    ssl_verify_client on;

    location /index {
        return 200 'gangnam style!';
    }
}
_EOC_
        $block->set_value("http_config", $http_config);
    }

    if (!$block->request) {
        $block->set_value("request", "GET /t");
    }

});

run_tests;

__DATA__

=== TEST 7: https health check
--- log_level: info
--- config
    location /t {
        content_by_lua_block {
            local t = require("lib.test_admin")
            local core = require("apisix.core")
            local cert = t.read_file("t/certs/mtls_client.crt")
            local key =  t.read_file("t/certs/mtls_client.key")
            local data = {
                retries = 2,
                checks = {
                    active = {
                        https_verify_certificate = false,
                        healthy = {
                            interval = 2,
                            successes = 1
                        },
                        unhealthy = {
                            interval = 1,
                            http_failures = 2
                        },
                        type = "https",
                        timeout = 5
                    },
                },
                type = "roundrobin",
                nodes = {
                    ["127.0.0.1:8766"] = 1,
                    ["127.0.0.1:8765"] = 1
                },
                scheme = "https",
                tls = {
                    client_cert = cert,
                    client_key = key
                }
            }
            local code, body = t.test('/apisix/admin/upstreams/1',
                 ngx.HTTP_PUT,
                 core.json.encode(data)
            )

            if code >= 300 then
                ngx.status = code
            end

            local code, body = t.test('/apisix/admin/routes/1',
                 ngx.HTTP_PUT,
                 [[{
                        "upstream_id": 1,
                        "uri": "/index"
                }]]
            )

            if code >= 300 then
                ngx.status = code
            end
        }
    }
--- error_code: 200
--- timeout: 15



=== TEST 8: ACCESS
--- pipelined_requests eval
["GET /index", "GET /index"]
--- response_body eval
["gangnam style!", "gangnam style!"]
--- error_code eval
[200, 200]

P.S. I am unable to assert on the error log in the test case, so you will have to check the log manually. If you change the error log to a warn log, the test passes, which shows that the feature works correctly.

leslie-tsang · 2023-09-07T07:12:21Z

APISIX uses a predefined preprocessor to apply the default no_error_log check; passing ignore_error_log may meet your need. @shreemaan-abhishek

monkeyDluffy6017 · 2023-09-07T08:00:26Z

@shreemaan-abhishek If @leslie-tsang's suggestion satisfies your requirement, can this PR be closed?

shreemaan-abhishek · 2023-09-07T08:13:55Z

@monkeyDluffy6017, although the background for this PR was writing a test case, I am still convinced that using an error level log leads to a false negative.

monkeyDluffy6017 · 2023-09-12T08:03:11Z

@shreemaan-abhishek I have fixed this in the following PRs, so the error log should not occur again:
api7/lua-resty-healthcheck#35
api7/lua-resty-healthcheck#36

monkeyDluffy6017 · 2023-09-12T08:10:09Z

By the way, the error log level is as intended: get_target_status should not return an error.

monkeyDluffy6017 · 2023-09-13T06:21:25Z

I will close this PR; feel free to reopen it.

ZhangShangyu · 2023-10-10T07:20:58Z

Quoting @monkeyDluffy6017: "The error log occurs when the upstream is rebuilt as we add or delete an upstream node. The state is not correct, it will lead to unbalanced load. It's a problem, we should fix this, so the error log level is appropriate here."

We also ran into this error log (#9415). In the end we decided to ignore the "target not found" error and treat the node as healthy:

if err == "target not found" then
    core.log.warn("health check target not found, ignore health check, addr: ",
            node.host, ":", port or node.port, ", host: ", host, ", err: ", err)
    up_nodes = transform_node(up_nodes, node)
else
    core.log.error("failed to get health check target status, addr: ",
            node.host, ":", port or node.port, ", host: ", host, ", err: ", err)
end

By the way, this may not lead to unbalanced load, because all nodes are added to up_nodes by default.
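To illustrate the fallback mentioned above, here is a minimal, self-contained sketch in plain Lua. It is not the APISIX implementation: pick_nodes and the stub checker are hypothetical stand-ins, warnings are written to stderr instead of core.log.warn, and the assumption is that the fallback applies when no node is reported healthy.

```lua
-- Hypothetical stand-in for a health checker with no targets registered yet.
local checker = {}
function checker:get_target_status(host, port)
    return nil, "target not found"
end

-- Collect the nodes reported healthy; fall back to every node when none are.
-- (Illustrative helper, not an APISIX function.)
local function pick_nodes(nodes, chk)
    local up_nodes = {}
    for _, node in ipairs(nodes) do
        local ok, err = chk:get_target_status(node.host, node.port)
        if ok then
            up_nodes[#up_nodes + 1] = node
        elseif err then
            io.stderr:write("[warn] failed to get health check target status, addr: ",
                            node.host, ":", node.port, ", err: ", err, "\n")
        end
    end
    if #up_nodes == 0 then
        -- Nothing is known to be healthy: use the full node list as the default.
        for _, node in ipairs(nodes) do
            up_nodes[#up_nodes + 1] = node
        end
    end
    return up_nodes
end

local nodes = {
    { host = "127.0.0.1", port = 8765 },
    { host = "127.0.0.1", port = 8766 },
}
local picked = pick_nodes(nodes, checker)
print(#picked)
```

Under these assumptions both nodes remain eligible for balancing even though every status lookup failed, which is why the failed lookup alone need not skew the load.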

monkeyDluffy6017 · 2023-10-10T08:02:05Z

I reopened this pull request because many people have encountered the same issue. This error should only occur when multiple requests are made concurrently while creating or modifying an upstream. While the first request is in the process of creating a health check object, the second request encounters this error because it cannot access the object that hasn't been created yet. However, this error does not actually impact request forwarding. Lowering the error log level from 'error' to 'warn' is reasonable to prevent any misunderstanding.
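The timing described above can be sketched with a toy checker in plain Lua. Everything here is a hypothetical model (new_checker, add_target, and the targets table are illustrative names, not the lua-resty-healthcheck internals); only the "target not found" error string is taken from this thread.

```lua
-- Hypothetical checker: a target is only known after add_target() has run.
local Checker = {}
Checker.__index = Checker

local function new_checker()
    return setmetatable({ targets = {} }, Checker)
end

function Checker:add_target(host, port)
    self.targets[host .. ":" .. port] = true   -- true means healthy
end

function Checker:get_target_status(host, port)
    local status = self.targets[host .. ":" .. port]
    if status == nil then
        return nil, "target not found"
    end
    return status
end

local checker = new_checker()

-- Request 2 asks for the status while request 1 is still registering targets.
local ok_before, err_before = checker:get_target_status("127.0.0.1", 8765)

-- Request 1 finishes registering the target; later lookups succeed.
checker:add_target("127.0.0.1", 8765)
local ok_after, err_after = checker:get_target_status("127.0.0.1", 8765)

print(ok_before, err_before)
print(ok_after, err_after)
```

In this model the early lookup fails only because the target has not been registered yet, not because the node is unhealthy, which matches the argument for logging it at warn level.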

fix: use warn log for get healthcheck target status failure

615a1c7

shreemaan-abhishek requested review from monkeyDluffy6017 and Revolyssup September 4, 2023 16:32

monkeyDluffy6017 added the discuss label Sep 5, 2023

github-actions bot added the user responded label Sep 7, 2023

monkeyDluffy6017 removed the user responded label Sep 12, 2023

monkeyDluffy6017 closed this Sep 13, 2023

monkeyDluffy6017 reopened this Oct 10, 2023

shreemaan-abhishek linked an issue Oct 10, 2023 that may be closed by this pull request

bug: the exception of health check in special scene #9415

Closed

monkeyDluffy6017 mentioned this pull request Oct 10, 2023

bug: the exception of health check in special scene #9415

Closed

monkeyDluffy6017 approved these changes Oct 10, 2023

monkeyDluffy6017 added approved and removed discuss labels Oct 10, 2023

monkeyDluffy6017 requested a review from soulbird October 10, 2023 08:34

soulbird approved these changes Oct 10, 2023

monkeyDluffy6017 merged commit 7279a34 into apache:master Oct 10, 2023
62 checks passed

shreemaan-abhishek deleted the fix/health-error branch October 10, 2023 08:56

Revolyssup pushed a commit to Revolyssup/apisix that referenced this pull request Oct 15, 2023

fix: use warn log for get_target_status failure (apache#10156)

fab998d
