Agent hangs during a connection attempt and delays shutdown #358

Closed
Tracked by #241
vikman90 opened this issue Nov 27, 2024 · 4 comments · Fixed by #384
Assignees
Labels
level/task Task issue module/agent mvp Minimum Viable Product refinement type/bug Bug issue

Comments

@vikman90
Member

Description

When the agent is configured with an incorrect server address (server_url), it becomes unresponsive to shutdown signals (e.g., Ctrl+C). This behavior causes delays in stopping the agent and may be related to the retry mechanism for failed connection attempts.

The issue appears to be influenced by the retry_interval setting, as indicated by the following log entry:

[2024-11-27 16:47:13.017] [wazuh-agent] [warning] [WARN] [https_socket.hpp:56] [AsyncConnect] boost::asio::async_connect returned error code: 113 No route to host
[2024-11-27 16:47:13.018] [wazuh-agent] [warning] [WARN] [http_client.cpp:102] [Co_PerformHttpRequest] Failed to send http request. /api/v1/events/stateless. Retrying in 5 seconds.
^C
[2024-11-27 16:47:16.107] [wazuh-agent] [info] [INFO] [inventory.cpp:57] [Stop] Module stopped.
[2024-11-27 16:47:16.107] [wazuh-agent] [info] [INFO] [logcollector.cpp:51] [Stop] Logcollector stopped
[2024-11-27 16:47:16.123] [wazuh-agent] [info] [INFO] [inventory.cpp:36] [Start] Module finished.
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
[2024-11-27 16:47:23.328] [wazuh-agent] [error] [ERROR] [https_socket.hpp:38] [Connect] Exception thrown: connect: No route to host [system:113 at /root/wazuh-agent/build/vcpkg_installed/x64-linux/include/boost/asio/detail/reactive_socket_service.hpp:587:33 in function 'boost::system::error_code boost::asio::detail::reactive_socket_service<boost::asio::ip::tcp>::connect(implementation_type &, const endpoint_type &, boost::system::error_code &) [Protocol = boost::asio::ip::tcp]']
[2024-11-27 16:47:23.329] [wazuh-agent] [error] [ERROR] [https_socket.hpp:79] [Write] Exception thrown during write: uninitialized (SSL routines) [asio.ssl:167772436]
[2024-11-27 16:47:23.329] [wazuh-agent] [error] [ERROR] [https_socket.hpp:112] [Read] Exception thrown during read: uninitialized (SSL routines) [asio.ssl:167772436]
[2024-11-27 16:47:23.329] [wazuh-agent] [error] [ERROR] [http_client.cpp:227] [AuthenticateWithUuidAndKey] Error parsing token in response: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON.
[2024-11-27 16:47:23.329] [wazuh-agent] [warning] [WARN] [communicator.cpp:32] [SendAuthenticationRequest] Failed to authenticate with the manager. Retrying in 5 seconds.

Steps to Reproduce

  1. Configure the agent with an incorrect server_url.
  2. Start the agent.
  3. Attempt to stop the agent using Ctrl+C.

Observed Behavior

  • The agent does not respond promptly to the shutdown signal.
  • The delay seems correlated with the connection retry interval.

Expected Behavior

  • The agent should respond immediately to the shutdown signal, regardless of its connection state or retry mechanism.
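A common fix for this class of bug is to replace any plain sleep between connection retries with a wait on a condition variable that the shutdown handler can signal. The sketch below illustrates the idea under that assumption; the class and method names are illustrative, not the agent's actual API:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative sketch: an interruptible retry wait. Instead of sleeping for
// the full retry_interval, the retry loop waits on a condition variable that
// Stop() (called from the signal handler path) can wake immediately.
class RetryWaiter
{
public:
    // Returns true if the full interval elapsed (the caller should retry),
    // false if Stop() interrupted the wait (the caller should shut down).
    bool WaitForRetry(std::chrono::seconds interval)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        return !m_cv.wait_for(lock, interval, [this] { return m_stopping; });
    }

    void Stop()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopping = true;
        }
        m_cv.notify_all();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_stopping = false;
};
```

With this pattern a Ctrl+C received mid-wait wakes the retry loop immediately instead of after up to `retry_interval` seconds.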
@vikman90 vikman90 added type/bug Bug issue mvp Minimum Viable Product refinement level/task Task issue module/agent labels Nov 27, 2024
@wazuhci wazuhci moved this to Backlog in Release 5.0.0 Nov 27, 2024
@cborla cborla self-assigned this Nov 28, 2024
@wazuhci wazuhci moved this from Backlog to In progress in Release 5.0.0 Nov 28, 2024
@cborla
Member

cborla commented Nov 29, 2024

Work report

November 28

  • Reproduced the failure case.
    • It reproduces when starting an agent from scratch.
  • Analysis.
    • Identified that the Inventory module is the one that does not respond until all of its events have been enqueued.

@LucioDonda
Member

Update 29/11

  • The Inventory module continues its execution even though the signal handler has received the shutdown request.
  • When it starts (if scan-on-start is enabled), it runs each type of scan (wrapped in try/catch) until all of them finish; only then is the m_stopping boolean honored, the next scan skips the interval wait, and the module finishes.
  • When scan-on-start is not enabled, this behavior does not occur.
  • We should replicate a similar interruptible wait before the first scan.
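A rough sketch of how the stop flag could be checked between scan types, so that a shutdown arriving during scan-on-start aborts the remaining scans instead of running all of them (the names and structure are illustrative stand-ins, not the real Inventory code):

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Illustrative sketch: run each scan type in sequence, but re-check the
// stop flag before starting the next one, so a shutdown request only has
// to wait for the scan currently in progress.
void RunScans(const std::atomic<bool>& stopping,
              const std::vector<std::function<void()>>& scans)
{
    for (const auto& scan : scans)
    {
        if (stopping.load())
        {
            return; // abort remaining scan types on shutdown
        }
        scan();
    }
}
```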

@LucioDonda
Member

Update 02/12

  • As stated above, the problem happens when trying to update the m_stopping boolean while a scan is in progress.
  • If scan-on-start is enabled, this can happen right at startup; otherwise it can happen later.
  • The underlying problem is that each scan checks this boolean, so the expected behavior is that only the scan currently in progress finishes before the module shuts down.
  • That is not what happens: all scans finish before the variable is updated. The time spent in this situation varies depending on the running host.
  • The cause is the mutex lock in the init method:
    std::unique_lock<std::mutex> lock{m_mutex};
    This guards the DBSync access.
  • One approach to fix this is to unlock it when needed; another is to use a separate lock for member access.
    • Still working on the best solution.
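The second approach mentioned above (a separate lock for member access) could look roughly like the sketch below, where the long-held DBSync mutex no longer guards the stop flag, so Stop() never blocks behind a running scan. This is a simplified stand-in, not the real Inventory module:

```cpp
#include <atomic>
#include <mutex>

// Illustrative sketch: separate the long-held DBSync lock from the stop
// flag, so a shutdown request does not have to wait for a scan to finish.
class Inventory
{
public:
    void Scan()
    {
        std::lock_guard<std::mutex> dbLock(m_dbSyncMutex); // long-running work
        // ... the per-item scan loop would periodically call IsStopping() ...
    }

    void Stop()
    {
        m_stopping.store(true); // does not need m_dbSyncMutex
    }

    bool IsStopping() const
    {
        return m_stopping.load(); // readable without the DB lock
    }

private:
    std::mutex m_dbSyncMutex;            // guards DBSync access only
    std::atomic<bool> m_stopping{false}; // guarded by the atomic itself
};
```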

@LucioDonda LucioDonda linked a pull request Dec 3, 2024 that will close this issue
@LucioDonda LucioDonda removed a link to a pull request Dec 3, 2024
@LucioDonda LucioDonda linked a pull request Dec 3, 2024 that will close this issue
@LucioDonda
Member

Update 03/12

  • PR fixing the base issue here
  • The behavior depends directly on the host's specs.
    • In my case, with the changes, the time until the agent responded was cut in half.
    • In a slower VM the difference was not as significant.
  • To improve the response time further, the scan process would have to be interrupted mid-scan, which could generate incomplete messages that would affect the inventory state sent to the manager.
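If mid-scan interruption were ever implemented, one way to avoid sending an inconsistent inventory state is to publish results only when a scan runs to completion and discard partial ones. A minimal illustrative sketch (all names hypothetical):

```cpp
#include <atomic>
#include <utility>
#include <vector>

// Illustrative sketch: check the stop flag per item and report whether the
// scan completed, so partial results are discarded rather than sent to the
// manager as an inconsistent inventory state.
bool ScanItems(const std::atomic<bool>& stopping,
               const std::vector<int>& items,
               std::vector<int>& out)
{
    std::vector<int> partial;
    for (int item : items)
    {
        if (stopping.load())
        {
            return false; // aborted: caller drops the partial results
        }
        partial.push_back(item * 2); // stand-in for real collection work
    }
    out = std::move(partial); // publish only complete results
    return true;
}
```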

@wazuhci wazuhci moved this from In progress to Done in Release 5.0.0 Dec 3, 2024