Vault agent 1.17.0 still consumes 100% CPU on all cores. #27627

Closed
pieter-lautus opened this issue Jun 27, 2024 · 1 comment

@pieter-lautus

Describe the bug
Many of our servers have experienced high CPU utilisation due to the vault agent process consuming 100% CPU on all cores.

This seems to be an entirely new issue, unrelated to #25497. In those earlier cases the vault-agent's log output was full of errors about failing template renderings being retried in a tight loop; this time the log file shows nothing at all.

I ran strace -ff against the main PID of the vault-agent. It shows that the main thread is idle, but about 8 of its other threads are stuck in loops of futex calls in which FUTEX_WAIT repeatedly fails with EAGAIN:

futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=473, si_uid=3254} ---
rt_sigreturn({mask=[]})                 = 824635260040
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc000110148, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=473, si_uid=3254} ---
rt_sigreturn({mask=[]})                 = 1
futex(0xc002700148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc002700148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc002700148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc002700148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc002700148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc002d28948, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sched_yield()                           = 0
futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=473, si_uid=3254} ---

Unlike #25497, this is at least not a hard busy loop: the threads yield to the scheduler on each pass through the loop.
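To put numbers on the pattern in a trace like the one above, here is a minimal sketch that tallies futex calls by address, operation and outcome, along with sched_yield calls and delivered signals. The function name `summarize` and the regular expression are my own and not part of any tool; the sketch assumes strace lines in the unprefixed form shown above (no `[pid N]` prefix or timestamps). For reference, FUTEX_WAIT returns EAGAIN when the value at the futex address no longer matches the expected value (2 here), which would be consistent with several threads contending on the same word.

```python
import re
from collections import Counter

# Matches strace lines such as:
#   futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
# Assumption: no "[pid N]" prefix and no timestamps in front of the syscall name.
FUTEX_RE = re.compile(
    r"^futex\((0x[0-9a-f]+), (FUTEX_\w+), .*\)\s+= (-?\d+)(?: (E[A-Z]+))?"
)


def summarize(lines):
    """Count futex calls per (address, op, outcome), plus sched_yield calls and signals."""
    futex = Counter()
    other = Counter()
    for line in lines:
        line = line.strip()
        m = FUTEX_RE.match(line)
        if m:
            addr, op, ret, errno = m.groups()
            # Use the errno name as the outcome when present, else the return value.
            futex[(addr, op, errno or ret)] += 1
        elif line.startswith("sched_yield("):
            other["sched_yield"] += 1
        elif line.startswith("--- SIG"):
            # e.g. "--- SIGURG {si_signo=SIGURG, ...} ---" is counted under "SIGURG"
            other[line.split()[1]] += 1
    return futex, other


# A few lines copied from the trace above.
sample = [
    "futex(0xc002e64178, FUTEX_WAKE_PRIVATE, 1) = 0",
    "sched_yield()                           = 0",
    "futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)",
    "--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=473, si_uid=3254} ---",
    "futex(0xc002e64178, FUTEX_WAIT_PRIVATE, 2, NULL) = 0",
]

futex_counts, other_counts = summarize(sample)
for key, count in futex_counts.most_common():
    print(count, *key)
for key, count in other_counts.most_common():
    print(count, key)
```

Run over a full per-thread trace file, a single address dominating the counts (0xc002e64178 in the excerpt above) would suggest the threads are all contending on one lock rather than on several.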

To Reproduce
Investigation ongoing.

Environment:
Vault agent version: 1.17.0
Operating system: Ubuntu 20.04.6 LTS, running inside an LXD container
Vault server configuration file(s):

Agent config:

vault {
    address = "https://some.where.private"
}

auto_auth {
    method "approle" {
        config = {
            role_id_file_path = "/etc/vault-agent/receive/approle-role-id"
            secret_id_file_path = "/etc/vault-agent/receive/approle-secret-id"
            remove_secret_id_file_after_reading = true
        }
    }
}

listener "tcp" {
    address = "127.0.0.1:8201"
    tls_disable = "true"
    role = "metrics_only"
}

telemetry {
    disable_hostname = true
}


template {
    source      = "/etc/vault-agent/template/privkey.pem.ctmpl"
    destination = "/etc/vault-agent/data/privkey.pem"
    perms       = "0640"
    command     = "/usr/bin/sudo -u root -n /etc/vault-agent/lautus-private-ssl-renewal-hook /etc/vault-agent/data/privkey.pem /etc/tau/ssl/private/key.pem no"
}


template {
    source      = "/etc/vault-agent/template/certificate.pem.ctmpl"
    destination = "/etc/vault-agent/data/certificate.pem"
    perms       = "0644"
    command     = "/usr/bin/sudo -u root -n /etc/vault-agent/lautus-private-ssl-renewal-hook /etc/vault-agent/data/certificate.pem /etc/tau/ssl/certs/certificate.pem no"
}


template {
    source      = "/etc/vault-agent/template/chain.pem.ctmpl"
    destination = "/etc/vault-agent/data/chain.pem"
    perms       = "0644"
    command     = "/usr/bin/sudo -u root -n /etc/vault-agent/lautus-private-ssl-renewal-hook /etc/vault-agent/data/chain.pem /etc/tau/ssl/certs/chain.pem no"
}


template {
    source      = "/etc/vault-agent/template/fullchain.pem.ctmpl"
    destination = "/etc/vault-agent/data/fullchain.pem"
    perms       = "0644"
    command     = "/usr/bin/sudo -u root -n /etc/vault-agent/lautus-private-ssl-renewal-hook /etc/vault-agent/data/fullchain.pem /etc/tau/ssl/certs/fullchain.pem restart"
}


@miagilepner
Contributor

Hi, thank you for reporting this. This is a known issue that will be fixed in 1.17.1. Because the fix has already been merged, I'm going to close this GitHub issue. If you find that the fix doesn't solve your problem, or if you are able to profile the agent and see that your problem stems from another cause, please feel free to re-open the issue. Thanks!
