Improve Robustness of Agent Foreground and Background Execution Modes #288

Closed
Tracked by #241
vikman90 opened this issue Nov 11, 2024 · 3 comments · Fixed by #313 or #349
Labels: level/task (Task issue), module/agent, mvp (Minimum Viable Product), refinement, type/enhancement (Enhancement issue)

Comments

@vikman90
Member

vikman90 commented Nov 11, 2024

Parent Issue: #241

Description

The Wazuh agent currently has issues handling its execution modes when run with --run (foreground) or --start (background) flags. Specifically, launching the agent in foreground with ./wazuh-agent --run can sometimes print the following message:

wazuh-agent already running

This message typically indicates that an instance of the agent is already running. However, it may also appear if the agent's previous process terminated unexpectedly, which leads to unreliable behavior.

Proposed Solution

  1. Separate --run and --start behavior:
    • --run: Should only launch the agent in the foreground without checking if an instance is already running.
    • --start: Should launch the agent in the background and include checks to ensure no other instance of the agent is running.
  2. PID file handling:
    When using --start, the agent should:
    • Check for the existence of a PID file.
    • If a PID file exists, verify if it corresponds to a currently running agent process.
    • If no running process is found or the PID file does not exist, perform a fork and execute the agent in background mode (--run).
  3. Systemd Compatibility:
    • Ensure that the modified behavior aligns with Systemd’s service management for proper control over the agent's lifecycle.
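
The PID file check in step 2 can be sketched as follows. This is a minimal illustration in Python, not the agent's actual code: the function names and the PID file path are hypothetical, and the probe relies on the POSIX convention that sending signal 0 only checks whether a process exists. It does not confirm that the PID belongs to a wazuh-agent process, so a recycled PID can produce a false "already running" result.

```python
import os


def read_pid(pid_path):
    """Return the PID stored in pid_path, or None if the file is missing or malformed."""
    try:
        with open(pid_path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def process_alive(pid):
    """Probe a PID with signal 0: nothing is delivered, only existence is checked."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the PID file is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def may_start(pid_path):
    """True when no live process is recorded in the PID file."""
    return not process_alive(read_pid(pid_path))
```

A caller would run something like `may_start("/path/to/agent.pid")` (path assumed) before forking into the background.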
@vikman90 vikman90 added level/task Task issue type/enhancement Enhancement issue module/agent mvp Minimum Viable Product refinement labels Nov 11, 2024
@wazuhci wazuhci moved this to Backlog in Release 5.0.0 Nov 11, 2024
@wazuhci wazuhci moved this from Backlog to In progress in Release 5.0.0 Nov 12, 2024
@sdvendramini
Member

sdvendramini commented Nov 12, 2024

12/11/2024

I've started reproducing the problem and researching different approaches to solve this issue.

13/11/2024

I have been testing other platforms to see how they work. I'm also running some tests with the procps library to check whether the process is running.

19/11/2024

I've changed the approach. I started implementing the solution using a lock file with the sys/file.h library.

20/11/2024

I have completed the development for Linux and performed tests on both Linux and macOS. Some adjustments were necessary to ensure compatibility with macOS. I need to update the service-related files to reflect the changes made to the execution modes. Draft PR opened.

21/11/2024

I have been testing on macOS and fixing some details for the systemd service.

22/11/2024

I've finished the tests and closed the issue.

@sdvendramini
Member

OpenSearch

During the testing of OpenSearch, it was observed that it is possible to execute another instance of the executable while the OpenSearch service is already running. This behavior appears to create an additional node, which aligns with the fact that OpenSearch is designed as a cluster-based system.

However, it was noticed that the directory containing the PID file becomes empty after launching the second OpenSearch instance. This behavior raises questions about how the process manages resources and whether this is expected in cluster configurations.

To clarify, OpenSearch uses the same executable for all node types, including:

  • Master Node
  • Data Node
  • Client Node (Coordinating Node)

Each OpenSearch instance runs as an independent Java process and can be configured for different roles.

Filebeat

Testing Filebeat revealed that it does not allow multiple instances to run simultaneously with the same configuration. When attempting to execute a second instance of Filebeat while the service is already running, the following error was encountered:

Exiting: /var/lib/filebeat/filebeat.lock: data path already locked by another beat. Please make sure that multiple beats are not sharing the same data path (path.data)  

This error indicates that Filebeat locks the path.data directory, preventing concurrent executions unless a separate data path is specified. Further tests showed that it is possible to launch multiple instances if distinct path.data configurations are provided.
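
The behavior observed here, one holder per data path, can be approximated with an advisory lock on a file inside the data directory. The sketch below is an assumption about how such a scheme might look, not Filebeat's implementation: the class name and the `agent.lock` file name are hypothetical, and it uses `fcntl.flock`, Python's wrapper around the `flock()` call from sys/file.h (POSIX systems only).

```python
import fcntl
import os


class DataPathLock:
    """Exclusive advisory lock on <data_path>/<name> (file name is hypothetical)."""

    def __init__(self, data_path, name="agent.lock"):
        self.path = os.path.join(data_path, name)
        self.fd = None

    def acquire(self):
        """Return True if the lock was taken, False if another holder has it."""
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o640)
        try:
            # LOCK_NB makes the call fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            return False
        self.fd = fd  # keep the descriptor open for as long as the lock is held
        return True

    def release(self):
        """Drop the lock and close the descriptor; safe to call when not held."""
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None
```

Two instances pointed at different data paths both acquire their lock, while a second instance on the same path is refused. Because the kernel releases the lock when the holding process exits for any reason, a leftover lock file by itself does not block the next start.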

While it is technically feasible to run multiple Filebeat instances on the same server, this practice is uncommon. Typically, a single instance is configured to handle data ingestion from multiple sources, streamlining operations.

Conclusions

After analyzing these two products, I believe that wazuh-agent will not behave the same way, as it is designed to run as a single instance. Tests could be done to observe what happens with data persistence when two instances are running simultaneously. Alternatively, it could be worth considering running two instances with different data paths to avoid issues related to this.

If the idea of executing a new instance with --run while the service is running is solely for development purposes, I think the approach described in the issue's description should be sufficient. I just don't see the need to fork with --start, as systemd already handles the process in the background.
Currently, I am working on one implementation for Linux and another for macOS to improve how we verify that the process is running. Once the development is complete, the behavior of an instance started with --run will be tested while the service is active and running with --start.

@vikman90
Member Author

vikman90 commented Nov 18, 2024

Hi @sdvendramini,

Thank you for the detailed analysis. Based on your findings and further discussions, we propose the following adjustments to streamline the behavior of the wazuh-agent and align it more closely with practical use cases:

Proposal

  1. Remove the --run and --start options from the agent CLI.
    These options add unnecessary complexity to the behavior of the agent. Instead, we aim for a simplified and predictable execution model.
  2. Default foreground execution:
    If wazuh-agent is executed without CLI options, it will start the service in the foreground. This makes behavior consistent and reduces confusion.
  3. Prevent multiple instances:
    If wazuh-agent detects that another instance is already running, it will terminate its execution. This ensures we avoid resource conflicts and maintain a single-agent instance, as is typical.

To reliably detect whether another process is running, we suggest implementing a robust mechanism using lockfiles. This approach addresses scenarios where PID files or lockfiles might remain stale, such as:

  • Process crashes (e.g., segmentation faults or out-of-memory errors).
  • Abrupt system shutdowns (e.g., power failures).
  • Unexpected restarts.

Proposed Lockfile Implementation

  1. Create a lockfile when the agent starts. This file will be attached to a living process to claim exclusive ownership of the agent instance.
  2. Validate lockfile ownership: Before starting, the agent will check whether the process associated with the lockfile is still running.
  3. Remove stale lockfiles: If the process is no longer active, the agent will clean up the stale lockfile before proceeding.
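
The three steps above can be sketched as a PID-stamped lockfile. This is an illustrative Python outline under stated assumptions, not the implementation that was merged: `claim_lockfile` and its helpers are hypothetical names, and liveness is probed with signal 0, which does not prove that the owner is a wazuh-agent process.

```python
import os


def _read_owner(path):
    """Return the PID written in the lockfile, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def _alive(pid):
    """True if a process with this PID currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def claim_lockfile(path, retries=1):
    """Try to become the single running instance. Return True on success."""
    for _ in range(retries + 1):
        try:
            # Step 1: O_EXCL makes creation atomic, so only one process can win.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            # Step 2: validate ownership of the existing lockfile.
            owner = _read_owner(path)
            if owner is not None and _alive(owner):
                return False
            # Step 3: the owner is gone, so remove the stale file and retry.
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True
    return False
```

This check-then-remove sequence still has race windows (for example, a second process reading the file before the first has written its PID), which is one argument for a kernel-held lock such as `flock()` instead of a manually cleaned file.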

Let me know if you agree with this approach or have further suggestions. Once finalized, we can proceed with implementing and testing these changes.

Best regards.

@sdvendramini sdvendramini linked a pull request Nov 19, 2024 that will close this issue
@wazuhci wazuhci moved this from In progress to Done in Release 5.0.0 Nov 22, 2024
@TomasTurina TomasTurina linked a pull request Nov 25, 2024 that will close this issue
Projects
Status: Done
2 participants