Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent fails to retry connection after server disconnection #354

Closed
Tracked by #241
vikman90 opened this issue Nov 26, 2024 · 3 comments · Fixed by #355
Closed
Tracked by #241

Agent fails to retry connection after server disconnection #354

vikman90 opened this issue Nov 26, 2024 · 3 comments · Fixed by #355
Assignees
Labels
level/task Task issue module/agent mvp Minimum Viable Product refinement type/bug Bug issue

Comments

@vikman90
Copy link
Member

vikman90 commented Nov 26, 2024

Parent Issue: #241

Description

The Wazuh Agent hangs indefinitely when the server becomes unavailable. This issue occurs despite the expectation that the agent should wait for retry_interval (30 seconds by default) and attempt to reconnect.

Root Cause

Upon investigation, we discovered that when the agent closes a socket, the underlying Boost library throws a boost::system::system_error exception by definition. This exception is not handled, resulting in the termination of the coroutine responsible for managing the reconnection attempts.

Steps to Reproduce

  1. Start the agent and connect it to a server.
  2. Terminate the server while the agent is connected.
  3. Observe that the agent does not attempt to reconnect after the disconnection.

Expected Behavior

The agent should wait for the configured retry_interval and attempt to reconnect to the server.

Actual Behavior

The agent does not attempt to reconnect and remains idle indefinitely.

Proposed Solution

Prevent the agent from closing the socket if the asynchronous handshake call returned boost::asio::ssl::error::stream_truncated, which means that the socket is not open.

@vikman90 vikman90 added the mvp Minimum Viable Product refinement label Nov 26, 2024
@vikman90 vikman90 self-assigned this Nov 26, 2024
@wazuhci wazuhci moved this to In progress in Release 5.0.0 Nov 26, 2024
@vikman90 vikman90 linked a pull request Nov 26, 2024 that will close this issue
3 tasks
@vikman90
Copy link
Member Author

vikman90 commented Nov 26, 2024

Work report

November 26

  1. Investigated the issue using mock-server and mitmproxy to analyze agent behavior during server disconnections.
  2. Confirmed that the agent hangs indefinitely when the server is closed.
  3. Added logic to prevent the agent from closing a socket if the TLS handshake is incomplete.
  4. Implemented exception handling for socket closure (Close) to log errors instead of terminating coroutines.
  5. Fixed a timing multiplier issue where the conversion from seconds to milliseconds was being applied twice.
  6. Tested the fix locally to ensure correct behavior and stability.

Next Steps

  1. Identify and reference documentation that supports these changes.
  2. Create and integrate test cases to cover the fix.
  3. Validate the fix in an AWS environment.

November 27

  • Prevented the agent from shutting down the socket if the SSL handshake did not complete.

@TomasTurina
Copy link
Member

Update

The issue with coroutines not running in an AWS environment was related to the number of threads allocated to the task manager. The agent requires at least 3 threads to run properly (one for inventory, one for logcollector, and one for the coroutines). Since the task manager was configured using the number of CPUs, this became an issue in the AWS environment where the machine only had 2 CPUs available. This results in coroutines not being able to run as the two threads were exclusively used by the 2 modules, which do not run as coroutines.

To fix this, a new setting will be introduced to adjust the number of threads the task manager can handle and alerts will be issued when this number is exceeded, if that happens.

@jr0me
Copy link
Member

jr0me commented Nov 27, 2024

Update

A setting was added to change the number of threads.
Try/catch blocks have been added to enclose boost function calls.
Error checking has been added to the Resolve,Connect,Write and Read steps of the http requests, which at some points were performing certain steps without having validated the previous ones.
Changed log level on certain parts of the http socket, resolver and client code.
Performed some refactorings to remove duplicated code.

When using an incorrect sever address, the agent will timeout after a couple of minutes, but it is still responsive and it is possible to shut it down.

See the following log... (look for ^C)

root@ip-172-31-37-153:~/wazuh-agent# SPDLOG_LEVEL=debug ./build/wazuh-agent 
[2024-11-27 21:58:45.856] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: path.run
[2024-11-27 21:58:45.856] [wazuh-agent] [debug] [DEBUG] [unix_daemon.cpp:89] [createLockFile] Lock file created: /var/run/wazuh-agent.lock
[2024-11-27 21:58:45.856] [wazuh-agent] [info] [INFO] [process_options_unix.cpp:26] [StartAgent] Starting wazuh-agent
[2024-11-27 21:58:45.857] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: path.data
[2024-11-27 21:58:45.865] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: path.data
[2024-11-27 21:58:45.865] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: system
[2024-11-27 21:58:45.866] [wazuh-agent] [debug] [DEBUG] [configuration_parser.hpp:93] [GetConfig] Requested setting not found or invalid, default value used. Key not found: networks
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [inventory.cpp:17] [Start] Starting inventory.
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [inventoryImp.cpp:907] [SyncLoop] Module started.
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [inventoryImp.cpp:890] [Scan] Starting evaluation.
[2024-11-27 21:58:45.885] [wazuh-agent] [info] [INFO] [logcollector.cpp:19] [Start] Logcollector is disabled
[2024-11-27 21:58:46.008] [wazuh-agent] [error] [ERROR] [inventory.cpp:131] [LogErrorInventory] dbEngine: Empty table metadata.
[2024-11-27 21:58:46.008] [wazuh-agent] [info] [INFO] [inventoryImp.cpp:902] [Scan] Evaluation finished.


[2024-11-27 22:00:56.173] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:69] [AsyncConnect] boost::asio::async_connect returned error code: 110 Connection timed out
[2024-11-27 22:00:56.173] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:37] [Connect] Connect failed: Connection timed out
[2024-11-27 22:00:56.173] [wazuh-agent] [error] [ERROR] [http_client.cpp:222] [PerformHttpRequest] Error: Error connecting to host: Connection timed out.
[2024-11-27 22:00:56.173] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:69] [AsyncConnect] boost::asio::async_connect returned error code: 110 Connection timed out
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [http_client.cpp:245] [AuthenticateWithUuidAndKey] Error: 500.
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [communicator.cpp:31] [SendAuthenticationRequest] Failed to authenticate with the manager. Retrying in 30 seconds.
[2024-11-27 22:00:56.174] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:69] [AsyncConnect] boost::asio::async_connect returned error code: 110 Connection timed out
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [http_client.cpp:100] [Co_PerformHttpRequest] Failed to send http request. /api/v1/events/stateful. Retrying in 30 seconds.
[2024-11-27 22:00:56.174] [wazuh-agent] [warning] [WARN] [http_client.cpp:100] [Co_PerformHttpRequest] Failed to send http request. /api/v1/events/stateless. Retrying in 30 seconds.
[2024-11-27 22:00:56.174] [wazuh-agent] [debug] [DEBUG] [http_client.cpp:103] [Co_PerformHttpRequest] Http request failed: Broken pipe - Broken pipe [system:32 at /root/wazuh-agent/build/vcpkg_installed/x64-linux/include/boost/asio/detail/reactive_socket_send_op.hpp:136:5 in function 'static void boost::asio::detail::reactive_socket_send_op<ConstBufferSequence, Handler, IoExecutor>::do_complete(void*, boost::asio::detail::operation*, const boost::system::error_code&, std::size_t)']
[2024-11-27 22:00:56.174] [wazuh-agent] [debug] [DEBUG] [http_client.cpp:103] [Co_PerformHttpRequest] Http request failed: Broken pipe - Broken pipe [system:32 at /root/wazuh-agent/build/vcpkg_installed/x64-linux/include/boost/asio/detail/reactive_socket_send_op.hpp:136:5 in function 'static void boost::asio::detail::reactive_socket_send_op<ConstBufferSequence, Handler, IoExecutor>::do_complete(void*, boost::asio::detail::operation*, const boost::system::error_code&, std::size_t)']
[2024-11-27 22:00:56.175] [wazuh-agent] [warning] [WARN] [http_client.cpp:100] [Co_PerformHttpRequest] Failed to send http request. /api/v1/commands. Retrying in 30 seconds.
[2024-11-27 22:00:56.175] [wazuh-agent] [debug] [DEBUG] [http_client.cpp:103] [Co_PerformHttpRequest] Http request failed: Broken pipe - Broken pipe [system:32 at /root/wazuh-agent/build/vcpkg_installed/x64-linux/include/boost/asio/detail/reactive_socket_send_op.hpp:136:5 in function 'static void boost::asio::detail::reactive_socket_send_op<ConstBufferSequence, Handler, IoExecutor>::do_complete(void*, boost::asio::detail::operation*, const boost::system::error_code&, std::size_t)']

^C[2024-11-27 22:02:20.847] [wazuh-agent] [info] [INFO] [inventory.cpp:57] [Stop] Module stopped.

[2024-11-27 22:02:20.847] [wazuh-agent] [info] [INFO] [logcollector.cpp:51] [Stop] Logcollector stopped
[2024-11-27 22:02:20.852] [wazuh-agent] [info] [INFO] [inventory.cpp:36] [Start] Module finished.
[2024-11-27 22:03:35.917] [wazuh-agent] [debug] [DEBUG] [https_socket.hpp:37] [Connect] Connect failed: Connection timed out
[2024-11-27 22:03:35.917] [wazuh-agent] [error] [ERROR] [http_client.cpp:222] [PerformHttpRequest] Error: Error connecting to host: Connection timed out.
[2024-11-27 22:03:35.917] [wazuh-agent] [warning] [WARN] [http_client.cpp:245] [AuthenticateWithUuidAndKey] Error: 500.
[2024-11-27 22:03:35.917] [wazuh-agent] [warning] [WARN] [communicator.cpp:31] [SendAuthenticationRequest] Failed to authenticate with the manager. Retrying in 30 seconds.
root@ip-172-31-37-153:~/wazuh-agent# 

The timeout in these calls is not configurable since the authentication calls don't use the async boost beast functions. We could change this to use async calls and then set an appropriate timeout.

The Beast docs are clear on this: "For portability reasons, networking does not provide timeouts or cancellation features for synchronous stream operations."
https://www.boost.org/doc/libs/1_70_0/libs/beast/doc/html/beast/using_io/timeouts.html

@wazuhci wazuhci moved this from In progress to In review in Release 5.0.0 Nov 28, 2024
@wazuhci wazuhci moved this from In review to Done in Release 5.0.0 Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
level/task Task issue module/agent mvp Minimum Viable Product refinement type/bug Bug issue
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants