Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: mostly_healthy is healthy #10639

Merged
merged 1 commit into from
Dec 21, 2023
Merged

fix: mostly_healthy is healthy #10639

merged 1 commit into from
Dec 21, 2023

Conversation

Sn0rt
Copy link
Contributor

@Sn0rt Sn0rt commented Dec 13, 2023

Description

fix: #10664

WeChatWorkScreenshot_8548b234-7416-469a-9b45-4ddf05c50c1f

build a mostly_healthy test case based on test-nginx is too difficult for me. so this PR not include test case

I will add the test case form this metric at the next PR

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@Sn0rt Sn0rt marked this pull request as draft December 13, 2023 05:37
@luoluoyuyu
Copy link
Contributor

Hi @Sn0rt
The APISIX documentation describes the state transition process for health checks, which have a total of 4 states, and the prometheus plugin doesn't take into account the mostly_healty and mostly_unhealty states. I think this can be solved by adding mostly_healty and mostly_unhealty states to the Prometheus indicator. Instead of treating mostly_healty as a health state.

@Sn0rt
Copy link
Contributor Author

Sn0rt commented Dec 13, 2023

Hi @Sn0rt The APISIX documentation describes the state transition process for health checks, which have a total of 4 states, and the prometheus plugin doesn't take into account the mostly_healty and mostly_unhealty states. I think this can be solved by adding mostly_healty and mostly_unhealty states to the Prometheus indicator. Instead of treating mostly_healty as a health state.

The reason why mostly_healthy is regarded as health is because when upstream selects a node, both mostly_healthy and healthy are regarded as health. The core logic reference screenshot is from resty-helath-check. For this reason, the PR is written like this.

local function fetch_health_nodes(upstream, checker)
    local nodes = upstream.nodes
    if not checker then
        local new_nodes = core.table.new(0, #nodes)
        for _, node in ipairs(nodes) do
            new_nodes = transform_node(new_nodes, node)
        end
        return new_nodes
    end

    local host = upstream.checks and upstream.checks.active and upstream.checks.active.host
    local port = upstream.checks and upstream.checks.active and upstream.checks.active.port
    local up_nodes = core.table.new(0, #nodes)
    for _, node in ipairs(nodes) do
        local ok, err = checker:get_target_status(node.host, port or node.port, host)
        if ok then
            up_nodes = transform_node(up_nodes, node)
        elseif err then
            core.log.warn("failed to get health check target status, addr: ",
                node.host, ":", port or node.port, ", host: ", host, ", err: ", err)
        end
    end

    if core.table.nkeys(up_nodes) == 0 then
        core.log.warn("all upstream nodes is unhealthy, use default")
        for _, node in ipairs(nodes) do
            up_nodes = transform_node(up_nodes, node)
        end
    end

    return up_nodes
end

@luoluoyuyu
Copy link
Contributor

@snort Thanks for the answer, the corresponding code link is here https://github.com/api7/lua-resty-healthcheck/blob/master/lib/resty/healthcheck.lua#L684

@Sn0rt Sn0rt marked this pull request as ready for review December 13, 2023 07:50
@Sn0rt Sn0rt marked this pull request as draft December 14, 2023 03:26
@Sn0rt Sn0rt marked this pull request as ready for review December 14, 2023 04:14
apisix/upstream.lua Outdated Show resolved Hide resolved
Signed-off-by: Sn0rt <wangguohao.2009@gmail.com>
@shreemaan-abhishek
Copy link
Contributor

I agree with @luoluoyuyu's #10639 (comment). Even if logically mostly_healthy means healthy... the naming serves a purpose to the admin. They can judge that some nodes are unhealthy.

@Sn0rt
Copy link
Contributor Author

Sn0rt commented Dec 18, 2023

I agree with @luoluoyuyu's #10639 (comment). Even if logically mostly_healthy means healthy... the naming serves a purpose to the admin. They can judge that some nodes are unhealthy.

This is a bugfix. I want to avoid doing meaningless things without clear requirements.
and the user can get the node status is down if get the below metric

apisix_upstream_status{name="/apisix/routes/1",ip="127.0.0.1",port="8765"} 1
apisix_upstream_status{name="/apisix/routes/1",ip="127.0.0.1",port="8766"} 0

@monkeyDluffy6017 monkeyDluffy6017 merged commit b9d2dbf into apache:master Dec 21, 2023
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: Prometheus target node_status not correct
6 participants