Agent fails to retry connection after server disconnection #354
Comments
Work report
November 26
Next Steps
November 27
Update
The coroutines failed to run in the AWS environment because of the number of threads allocated to the task manager. The agent requires at least 3 threads to run properly: one for inventory, one for logcollector, and one for the coroutines. The task manager was sized from the number of CPUs, and the AWS machine had only 2 CPUs available. Both threads were therefore taken exclusively by the two modules, which do not run as coroutines, and the coroutines never got to run. To fix this, a new setting will be introduced to adjust the number of threads the task manager can handle, and alerts will be issued when this number is exceeded.
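The thread arithmetic described above can be sketched as follows. This is only an illustration, not the agent's code: `ResolvePoolSize`, `CapacityWarning`, and `kCoroutineThreads` are hypothetical names, and the assumption that the coroutines need exactly one thread is taken from the "at least 3 threads" figure in the update.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Threads reserved for the coroutines (assumption: one is enough).
constexpr std::size_t kCoroutineThreads = 1;

// Pool size: the explicit setting when present, otherwise the CPU count.
// cpuCount would typically come from std::thread::hardware_concurrency().
inline std::size_t ResolvePoolSize(std::optional<std::size_t> configured, std::size_t cpuCount)
{
    const std::size_t size = configured.value_or(cpuCount);
    return size == 0 ? 1 : size; // hardware_concurrency() may report 0
}

// Returns an empty string when the pool is large enough, otherwise a warning.
// blockingModules counts modules that occupy a thread permanently
// (inventory and logcollector in the report above).
inline std::string CapacityWarning(std::size_t poolSize, std::size_t blockingModules)
{
    assert(poolSize > 0 && "a pool needs at least one thread");
    const std::size_t needed = blockingModules + kCoroutineThreads;
    if (poolSize >= needed)
    {
        return {};
    }
    return "task manager has " + std::to_string(poolSize) +
           " thread(s) but needs " + std::to_string(needed);
}
```

With 2 CPUs and no explicit setting, `CapacityWarning(ResolvePoolSize(std::nullopt, 2), 2)` yields a warning, which corresponds to the AWS case; setting the thread count to 3 clears it.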
Update
A setting was added to change the number of threads. When using an incorrect server address, the agent times out after a couple of minutes, but it remains responsive and can still be shut down. See the following log... (look for
The timeout in these calls is not configurable because the authentication calls don't use the asynchronous Boost.Beast functions. We could change them to use asynchronous calls and then set an appropriate timeout.
Parent Issue: #241
Description
The Wazuh Agent hangs indefinitely when the server becomes unavailable. This issue occurs despite the expectation that the agent should wait for retry_interval (30 seconds by default) and attempt to reconnect.
Root Cause
Upon investigation, we discovered that when the agent closes a socket, the underlying Boost library throws a boost::system::system_error exception by definition. This exception is not handled, resulting in the termination of the coroutine responsible for managing the reconnection attempts.
Steps to Reproduce
Expected Behavior
The agent should wait for the configured retry_interval and attempt to reconnect to the server.
Actual Behavior
The agent does not attempt to reconnect and remains idle indefinitely.
Proposed Solution
Prevent the agent from closing the socket if the asynchronous handshake call returned boost::asio::ssl::error::stream_truncated, which means that the socket is not open.
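The idea behind the proposed solution can be sketched with the standard library alone. Everything here is a stand-in: `FakeSocket`, `SafeClose`, and `ConnectWithRetry` are hypothetical names, and `FakeSocket` is not a Boost.Asio type. It only mimics the behaviour described in the root cause, where closing a socket that is not open throws. The point of the sketch is that the retry loop survives because the close is skipped when the socket is not open and no exception is allowed to escape.

```cpp
#include <cassert>
#include <functional>
#include <system_error>

// Stand-in for a TLS socket; NOT the Boost.Asio type. Like the behaviour
// described in the issue, Close() throws when the socket is not open.
struct FakeSocket
{
    bool open = false;

    void Close()
    {
        if (!open)
        {
            throw std::system_error(std::make_error_code(std::errc::not_connected));
        }
        open = false;
    }
};

// Closes the socket only if it is open and never lets an exception escape.
// Returns true if the socket is closed afterwards.
inline bool SafeClose(FakeSocket& socket) noexcept
{
    if (!socket.open)
    {
        return true; // nothing to close, so skip the throwing call
    }
    try
    {
        socket.Close();
        return true;
    }
    catch (const std::system_error&)
    {
        return false;
    }
}

// Tries to connect up to maxAttempts times. `connect` returns true on success;
// `wait` runs between attempts (in the agent this would be retry_interval).
// Returns the attempt number that succeeded, or 0 if every attempt failed.
inline int ConnectWithRetry(FakeSocket& socket,
                            const std::function<bool(FakeSocket&)>& connect,
                            const std::function<void()>& wait,
                            int maxAttempts)
{
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt)
    {
        if (connect(socket))
        {
            return attempt;
        }
        SafeClose(socket); // cannot terminate the loop with an exception
        if (attempt < maxAttempts)
        {
            wait();
        }
    }
    return 0;
}
```

In real Boost.Asio code the same effect is usually available through the `close` overload that takes a `boost::system::error_code&` instead of throwing; whether the agent adopted that overload or the stream_truncated check is not stated in this issue.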