Agent fails to retry connection after server disconnection #354

vikman90 · 2024-11-26T15:35:29Z

Parent Issue: #241

Description

The Wazuh Agent hangs indefinitely when the server becomes unavailable. The agent is expected to wait for retry_interval (30 seconds by default) and then attempt to reconnect, but it never does.

Root Cause

Upon investigation, we found that when the agent closes a socket, the underlying Boost call throws a boost::system::system_error exception. The exception is not handled, so it terminates the coroutine responsible for managing the reconnection attempts.
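
The failure mode can be sketched with the standard library alone. In this minimal illustration, FakeSocket and SafeClose are hypothetical names, and std::system_error stands in for boost::system::system_error; it is not the agent's actual code, only the general shape of catching the exception at the close call so it cannot escape into the caller.

```cpp
#include <iostream>
#include <string>
#include <system_error>

// Hypothetical stand-in for a socket wrapper whose Close() throws when the
// underlying connection is not open. std::system_error plays the role of
// boost::system::system_error in this sketch.
class FakeSocket
{
public:
    explicit FakeSocket(bool open) : m_open(open) {}

    void Close()
    {
        if (!m_open)
        {
            throw std::system_error(std::make_error_code(std::errc::not_connected), "close");
        }
        m_open = false;
    }

    bool IsOpen() const { return m_open; }

private:
    bool m_open;
};

// Closes the socket and reports a failure instead of letting the exception
// propagate into (and terminate) the caller, for example a reconnection loop.
// Returns true on a clean close; on failure, stores the message in `error`.
bool SafeClose(FakeSocket& socket, std::string& error)
{
    try
    {
        socket.Close();
        return true;
    }
    catch (const std::system_error& e)
    {
        error = e.what();
        std::cerr << "Socket close failed: " << error << '\n';
        return false;
    }
}
```

With a wrapper of this kind, a failed close becomes a logged event, and the calling code can still wait for retry_interval and try again.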

Steps to Reproduce

1. Start the agent and connect it to a server.
2. Terminate the server while the agent is connected.
3. Observe that the agent does not attempt to reconnect after the disconnection.

Expected Behavior

The agent should wait for the configured retry_interval and attempt to reconnect to the server.

Actual Behavior

The agent does not attempt to reconnect and remains idle indefinitely.

Proposed Solution

Prevent the agent from closing the socket if the asynchronous handshake call returned boost::asio::ssl::error::stream_truncated, which means that the socket is not open.


vikman90 · 2024-11-26T16:54:42Z

Work report

November 26

Investigated the issue using mock-server and mitmproxy to analyze agent behavior during server disconnections.
Confirmed that the agent hangs indefinitely when the server is closed.
Added logic to prevent the agent from closing a socket if the TLS handshake is incomplete.
Implemented exception handling for socket closure (Close) to log errors instead of terminating coroutines.
Fixed a timing multiplier issue where the conversion from seconds to milliseconds was being applied twice.
Tested the fix locally to ensure correct behavior and stability.
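
The double conversion mentioned in the list above is the kind of mistake that typed durations make harder to write. The sketch below is illustrative only: BuggyWaitMs and WaitTime are hypothetical names, and it assumes the interval is stored in seconds and converted to milliseconds before waiting, which the report does not spell out.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical illustration of the bug class: a raw integer that has already
// been converted to milliseconds is multiplied by 1000 a second time.
std::int64_t BuggyWaitMs(std::int64_t retryIntervalSeconds)
{
    const std::int64_t intervalMs = retryIntervalSeconds * 1000; // first conversion
    return intervalMs * 1000;                                    // second, unintended conversion
}

// With std::chrono the unit is part of the type, so the conversion happens
// exactly once, at the point where the duration type changes.
std::chrono::milliseconds WaitTime(std::chrono::seconds retryInterval)
{
    return std::chrono::duration_cast<std::chrono::milliseconds>(retryInterval);
}
```

For a 30-second interval, the buggy version yields 30,000,000 ms (roughly 8.3 hours) instead of 30,000 ms, which from the outside would look much like an agent that never retries.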

Next Steps

Identify and reference documentation that supports these changes.
Create and integrate test cases to cover the fix.
Validate the fix in an AWS environment.

November 27

Prevented the agent from shutting down the socket if the SSL handshake did not complete.

TomasTurina · 2024-11-26T23:40:43Z

Update

The issue with coroutines not running in an AWS environment was related to the number of threads allocated to the task manager. The agent requires at least 3 threads to run properly (one for inventory, one for logcollector, and one for the coroutines). Because the task manager was sized using the number of CPUs, this became a problem in the AWS environment, where the machine had only 2 CPUs available. The two threads were used exclusively by the two modules, which do not run as coroutines, so the coroutines could not run.

To fix this, a new setting will be introduced to adjust the number of threads the task manager can handle, and an alert will be issued when this number is exceeded.
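
The constraint itself can be sketched as follows. This is not the change described above (which is a configurable setting); it is only an illustration of why sizing the pool from the CPU count alone falls short on a 2-CPU machine. kMinThreads, ChooseThreadCount, and DefaultThreadCount are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical minimum: one thread each for inventory, logcollector, and the
// coroutines, as described above.
constexpr std::size_t kMinThreads = 3;

// Picks a pool size from the detected CPU count, but never below the minimum
// the agent needs. A detected count of 0 means the value is unknown.
std::size_t ChooseThreadCount(std::size_t detectedCpus, std::size_t minThreads = kMinThreads)
{
    return std::max(detectedCpus, minThreads);
}

// Convenience wrapper that queries the hardware.
std::size_t DefaultThreadCount()
{
    return ChooseThreadCount(std::thread::hardware_concurrency());
}
```

On the 2-CPU machine from the report, ChooseThreadCount(2) would return 3, leaving one thread free for the coroutines after the two modules take theirs.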

jr0me · 2024-11-27T22:01:59Z

Update

A setting was added to change the number of threads.
Try/catch blocks have been added to enclose boost function calls.
Error checking has been added to the Resolve, Connect, Write, and Read steps of the HTTP requests, which at some points were performing certain steps without having validated the previous ones.
Changed the log level in certain parts of the HTTP socket, resolver, and client code.
Performed some refactorings to remove duplicated code.

When using an incorrect server address, the agent will time out after a couple of minutes, but it remains responsive and can still be shut down.

See the following log... (look for ^C)

root@ip-172-31-37-153:~/wazuh-agent# SPDLOG_LEVEL=debug ./build/wazuh-agent 
[2024-11-27 21:58:45.856] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: path.run
[2024-11-27 21:58:45.856] [wazuh-agent] [debug] [DEBUG] [unix_daemon.cpp:89] [createLockFile] Lock file created: /var/run/wazuh-agent.lock
[2024-11-27 21:58:45.856] [wazuh-agent] [info] [INFO] [process_options_unix.cpp:26] [StartAgent] Starting wazuh-agent
[2024-11-27 21:58:45.857] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: path.data
[2024-11-27 21:58:45.865] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: path.data
[2024-11-27 21:58:45.865] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: system
[2024-11-27 21:58:45.866] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: networks
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [inventory.cpp:17] [Start] Starting inventory.
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [inventoryImp.cpp:907] [SyncLoop] Module started.
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [inventoryImp.cpp:890] [Scan] Starting evaluation.
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [logcollector.cpp:19] [Start] Logcollector is disabled
[2024-11-27 21:58:46.008] [wazuh-agent] [error] [ERROR] [inventory.cpp:131] [LogErrorInventory] dbEngine: Empty table metadata.
[2024-11-27 21:58:46.008] [wazuh-agent] [info] [INFO] [inventoryImp.cpp:902] [Scan] Evaluation finished.


[2024-11-27 22:00:56.173] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:69] [AsyncConnect] boost::asio::async_connect returned error code: 110 Connection timed out
[2024-11-27 22:00:56.173] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:37] [Connect] Connect failed: Connection timed out
[2024-11-27 22:00:56.173] [wazuh-agent] [error] [ERROR] [http_client.cpp:222] [PerformHttpRequest] Error: Error connecting to host: Connection timed out.
[2024-11-27 22:00:56.173] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:69] [AsyncConnect] boost::asio::async_connect returned error code: 110 Connection timed out
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [http_client.cpp:245] [AuthenticateWithUuidAndKey] Error: 500.
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [communicator.cpp:31] [SendAuthenticationRequest] Failed to authenticate with the manager. Retrying in 30 seconds.
[2024-11-27 22:00:56.174] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:69] [AsyncConnect] boost::asio::async_connect returned error code: 110 Connection timed out
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [http_client.cpp:100] [Co_PerformHttpRequest] Failed to send http request. /api/v1/events/stateful. Retrying in 30 seconds.
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [http_client.cpp:100] [Co_PerformHttpRequest] Failed to send http request. /api/v1/events/stateless. Retrying in 30 seconds.
[2024-11-27 22:00:56.174] [wazuh-agent] [debug] [DEBUG] [http_client.cpp:103] [Co_PerformHttpRequest] Http request failed: Broken pipe - Broken pipe [system:32 at /root/wazuh-agent/build/vcpkg_installed/x64-linux/include/boost/asio/detail/reactive_socket_send_op.hpp:136:5 in function 'static void boost::asio::detail::reactive_socket_send_op<ConstBufferSequence, Handler, IoExecutor>::do_complete(void*, boost::asio::detail::operation*, const boost::system::error_code&, std::size_t)']
[2024-11-27 22:00:56.174] [wazuh-agent] [debug] [DEBUG] [http_client.cpp:103] [Co_PerformHttpRequest] Http request failed: Broken pipe - Broken pipe [system:32 at /root/wazuh-agent/build/vcpkg_installed/x64-linux/include/boost/asio/detail/reactive_socket_send_op.hpp:136:5 in function 'static void boost::asio::detail::reactive_socket_send_op<ConstBufferSequence, Handler, IoExecutor>::do_complete(void*, boost::asio::detail::operation*, const boost::system::error_code&, std::size_t)']
[2024-11-27 22:00:56.175] [wazuh-agent] [warning] [WARN] [http_client.cpp:100] [Co_PerformHttpRequest] Failed to send http request. /api/v1/commands. Retrying in 30 seconds.
[2024-11-27 22:00:56.175] [wazuh-agent] [debug] [DEBUG] [http_client.cpp:103] [Co_PerformHttpRequest] Http request failed: Broken pipe - Broken pipe [system:32 at /root/wazuh-agent/build/vcpkg_installed/x64-linux/include/boost/asio/detail/reactive_socket_send_op.hpp:136:5 in function 'static void boost::asio::detail::reactive_socket_send_op<ConstBufferSequence, Handler, IoExecutor>::do_complete(void*, boost::asio::detail::operation*, const boost::system::error_code&, std::size_t)']

^C[2024-11-27 22:02:20.847] [wazuh-agent] [info] [INFO] [inventory.cpp:57] [Stop] Module stopped.

[2024-11-27 22:02:20.847] [wazuh-agent] [info] [INFO] [logcollector.cpp:51] [Stop] Logcollector stopped
[2024-11-27 22:02:20.852] [wazuh-agent] [info] [INFO] [inventory.cpp:36] [Start] Module finished.
[2024-11-27 22:03:35.917] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:37] [Connect] Connect failed: Connection timed out
[2024-11-27 22:03:35.917] [wazuh-agent] [error] [ERROR] [http_client.cpp:222] [PerformHttpRequest] Error: Error connecting to host: Connection timed out.
[2024-11-27 22:03:35.917] [wazuh-agent] [warning] [WARN] [http_client.cpp:245] [AuthenticateWithUuidAndKey] Error: 500.
[2024-11-27 22:03:35.917] [wazuh-agent] [warning] [WARN] [communicator.cpp:31] [SendAuthenticationRequest] Failed to authenticate with the manager. Retrying in 30 seconds.
root@ip-172-31-37-153:~/wazuh-agent#

The timeout in these calls is not configurable because the authentication calls don't use the asynchronous Boost.Beast functions. We could change them to use async calls and then set an appropriate timeout.

The Beast docs are clear on this: "For portability reasons, networking does not provide timeouts or cancellation features for synchronous stream operations."
https://www.boost.org/doc/libs/1_70_0/libs/beast/doc/html/beast/using_io/timeouts.html

vikman90 added the mvp Minimum Viable Product refinement label Nov 26, 2024

vikman90 self-assigned this Nov 26, 2024

vikman90 added type/bug Bug issue level/task Task issue module/agent labels Nov 26, 2024

wazuhci moved this to In progress in Release 5.0.0 Nov 26, 2024

wazuhci added this to Release 5.0.0 Nov 26, 2024

This was referenced Nov 26, 2024

MVP Agent refinement (I) #241

Open

Fix agent reconnection issues and timing logic corrections #355

Merged

vikman90 linked a pull request Nov 26, 2024 that will close this issue

Fix agent reconnection issues and timing logic corrections #355

Merged

3 tasks

vikman90 assigned TomasTurina and jr0me Nov 27, 2024

wazuhci moved this from In progress to In review in Release 5.0.0 Nov 28, 2024

TomasTurina closed this as completed in #355 Nov 28, 2024

wazuhci moved this from In review to Done in Release 5.0.0 Nov 28, 2024
