Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tedge-log-plugin publishes service health status to main device when running on child device #2253

Closed
reubenmiller opened this issue Sep 13, 2023 · 4 comments
Assignees
Labels
bug Something isn't working theme:troubleshooting Theme: Troubleshooting and remote control
Milestone

Comments

@reubenmiller
Copy link
Contributor

Describe the bug

When the tedge-log-plugin is not running as the main device it still reports the service health status to the main device, overwriting the status of the real tedge-log-plugin service status running on the main device.

Looking at the code crates/extensions/tedge_health_ext/src/lib.rs it clearly shows there is no logic to handle running as a child device or main device, and it blindly assumes it is always the main device.

// Connect this actor to MQTT
let subscriptions = vec![
    "tedge/health-check",
    &format!("tedge/health-check/{service_name}"),
]
.try_into()
.expect("Failed to create the HealthMonitorActor topic filter");

To Reproduce

  1. Install thin-edge.io

  2. Run the tedge-log-plugin as a child device (see the integration test under tests/RobotFramework/tests/cumulocity/log/log_operation_child.robot)

  3. Subscribe to all topics (in the background)

    tedge mqtt sub '#' &
    

    Note: Press enter a few times afterwards so you get a prompt again.

  4. Restart the tedge-log-plugin service

    sudo systemctl restart tedge-log-plugin
  5. Inspect the console output. The tedge-log-plugin should publish its health message to either tedge/health/tedge-log-plugin/<child-id> or te/device/<child-id>/service/tedge-log-plugin/status/health (depending if the new MQTT v1 api spec has been implemented or not)

Expected behavior

The tedge-log-plugin should publish to the tedge-log-plugin attached to the child device it is running on, when the service is running as a child device.

Screenshots

The following console output shows when the tedge-log-plugin was restarted on the child device, and that the tedge/health/tedge-log-plugin topic was incorrectly used as the published message also includes the PID of the tedge-log-plugin running on the child device, 5010.

root@12bb2636811b:/setup# tedge mqtt sub '#' &
root@12bb2636811b:/setup# systemctl restart tedge-log-plugin
root@12bb2636811b:/setup# [c8y/s/us] 102,TST_overtake_corn_hybrid_tedge-log-plugin,"service",tedge-log-plugin,"down"
[te/device/main/service/tedge-log-plugin/status/health] {"pid":217,"status":"down"}
[tedge/health/tedge-log-plugin] {"pid":5010,"status":"up","time":1694571559}
[te/device/main/service/tedge-log-plugin/status/health] {"pid":5010,"status":"up","time":1694571559}
[c8y/s/us] 102,TST_overtake_corn_hybrid_tedge-log-plugin,"service",tedge-log-plugin,"up"
[c8y/s/us/TST_hobble_claret_modem] 118,example
^C
root@12bb2636811b:/setup# systemctl status tedge-log-plugin
● tedge-log-plugin.service - thin-edge.io log file retriever
     Loaded: loaded (/lib/systemd/system/tedge-log-plugin.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-09-13 02:19:19 UTC; 20s ago
   Main PID: 5010 (tedge-log-plugi)
      Tasks: 6 (limit: 9515)
     Memory: 716.0K
     CGroup: /docker/12bb2636811b1701fdbf1c934074c959892e5412283c71ef08ba737017f45926/system.slice/tedge-log-plugin.service
             └─5010 /usr/bin/tedge-log-plugin

Environment (please complete the following information):

  • OS [incl. version]: Debian bullseye
  • Hardware [incl. revision]: Docker
  • System-Architecture [e.g. result of "uname -a"]: Linux 12bb2636811b 5.15.68-0-virt #1-Alpine SMP Fri, 16 Sep 2022 06:29:31 +0000 aarch64 GNU/Linux
  • thin-edge.io version [e.g. 0.1.0]: 0.12.1~266+g2eb30d5

Additional context

@reubenmiller reubenmiller added bug Something isn't working theme:troubleshooting Theme: Troubleshooting and remote control labels Sep 13, 2023
@didier-wenzek didier-wenzek self-assigned this Sep 13, 2023
@didier-wenzek didier-wenzek removed their assignment Sep 26, 2023
@didier-wenzek
Copy link
Contributor

There is only one key point to clarify before implementation: How a service can know its topic identifier?

  • Thanks to the tedge config setting mqtt.device_topic_id, a service knows on which device it's running.
  • The service topic identifier can be derived from its parent topic identifier only when the latter follows the default topic scheme.
  • Using tedge config is let appropriate for services as this would require a setting per service.

My proposal is to:

  • add an mqtt_service_topic_id optional argument to all the thin-edge daemon.
  • use mqtt_service_topic_id as the service topic identifier of the daemon when given.
  • derive the service topic identifier of the daemon from mqtt.device_topic_id assuming a default MQTT schema,
    when no mqtt_service_topic_id is given
  • stop with an error when the service topic identifier is not given and cannot be derived.

The mqtt_service_topic_id is then used by the service to:

This approach will have then to be applied to all tedge daemons:

  • tedge-log-plugin
  • tedge-config-plugin
  • tedge-agent
  • tedge-mapper

@didier-wenzek
Copy link
Contributor

@reubenmiller
Copy link
Contributor Author

QA has thoroughly checked the feature and here are the results:

  • Test for ticket exists in the test suite.
  • QA has tested the bug and could not reproduce it anymore.

@reubenmiller reubenmiller self-assigned this Nov 1, 2023
@jarhodes314
Copy link
Contributor

@reubenmiller Given this is now completed, I was surprised to discover that uncommenting

# Uncomment once https://github.com/thin-edge/thin-edge.io/issues/2253 is resolved
# ThinEdgeIO.Service Health Status Should Be Up tedge-configuration-plugin device=child1
failed (see https://github.com/thin-edge/thin-edge.io/actions/runs/7045873655). As I was just uncommenting it out of curiosity, I didn't investigate the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working theme:troubleshooting Theme: Troubleshooting and remote control
Projects
None yet
Development

No branches or pull requests

3 participants