Ordered processing of entity registration messages before any other messages #2466

albinsuresh · 2023-11-17T15:12:58Z

Proposed changes

Spec

Problem statements
Solution proposals
Finalized solution

Impl

Caching of early telemetry messages
Caching of early twin data messages
Handling early child entity registration messages
Persist entity store: Deferred for later

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

github-actions · 2023-11-17T15:44:42Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %	⏱️ Duration
362	0	3	362	100	49m46.662s

didier-wenzek

These thoughts make me wonder if using MQTT is the right tool to register which entities are attached to each topic. And even, if this can be dynamic.

I would consider a declarative approach. To opt out of the default topic scheme, a user would have to provide a description of their custom topic scheme. It can be a set of regex rules or a user program returning a registration message for an entity id.

By a set of regex rules, I mean something along these lines:

ground/rasp/// -> main device
floor[0-9]+/rasp// -> child device of main device
(?<parent> floor[0-9]+)/plc[0-9]+// -> child device of ${parent}/rasp//

docs/src/references/mappers/c8y-mapper.md

reubenmiller · 2023-11-20T12:01:16Z

These thoughts make me wonder if using MQTT is the right tool to register which entities are attached to each topic. And even, if this can be dynamic.

I would consider a declarative approach. To opt out of the default topic scheme, a user would have to provide a description of their custom topic scheme. It can be a set of regex rules or a user program returning a registration message for an entity id.

By a set of regex rules, I mean something along these lines:

ground/rasp/// -> main device

floor[0-9]+/rasp// -> child device of main device

(?<parent> floor[0-9]+)/plc[0-9]+// -> child device of ${parent}/rasp//

Yes, very valid points. Some kind of blocking call (e.g. to an HTTP endpoint) might be more appropriate to register devices (or patterns, nice idea).

docs/src/references/mappers/c8y-mapper.md

albinsuresh · 2023-11-21T07:23:22Z

These thoughts make me wonder if using MQTT is the right tool to register which entities are attached to each topic. And even, if this can be dynamic.

I can't agree more and most "capable devices" probably would prefer such an HTTP API over the clunky MQTT one. But, my understanding was that there would be many limited devices that we'll have to support, that can only handle MQTT and we'll have to provide an MQTT-only solution for them anyway. So, ideally we should have both APIs, but we might be forced to start with MQTT first, as that works for everyone.

Irrespective of the protocol, another key thing that we need to define is what a "successful registration" means. Does it mean that any one mapper has processed that registration message, or that all the connected mappers have processed it? How can we communicate either of those in a cloud-agnostic manner? I've expressed my concerns on this matter in this comment in detail.

I would consider a declarative approach. To opt out of the default topic scheme, a user would have to provide a description of their custom topic scheme. It can be a set of regex rules or a user program returning a registration message for an entity id.

By a set of regex rules, I mean something along these lines:

ground/rasp/// -> main device

floor[0-9]+/rasp// -> child device of main device

(?<parent> floor[0-9]+)/plc[0-9]+// -> child device of ${parent}/rasp//

Once we have the registration message feedback mechanism, on which the device-logic will have to wait, do we still need this template mechanism? Such a mechanism would still be useful to the customer, to avoid the explicit registration messages altogether. But this is not necessary in the context of this problem, right?

One other key point to note is that these issues are not limited to custom topic schemes and the id-generation associated with them. That's definitely one key aspect, but not the only one. Even the default topic scheme suffers from these limitations when explicit registration is mandatory, as in the case of nested child devices. In this case, getting the id wrong is not the only problem: the @parent can be wrong as well. I've added another dedicated sequence diagram for the nested child devices case, explaining the error path.

didier-wenzek · 2023-11-21T09:23:41Z

Once we have the registration message feedback mechanism, on which the device-logic will have to wait, do we still need this template mechanism? Such a mechanism would still be useful to the customer, to avoid the explicit registration messages altogether. But this is not necessary in the context of this problem, right?

We need to choose, sure.

either a protocol over MQTT or HTTP to dynamically register entities
or a static topic scheme used to deduce registration messages from entity identifiers.

I do think the latter would simplify things a lot (the topic<->entity relation being computed)
while being flexible enough (a user doesn't have to list every entity in advance, only to describe the logic behind their topic scheme).

One other key point to note is that these issues are not limited to custom topic schemes and the id-generation associated with them. That's definitely one key aspect, but not the only one. Even the default topic scheme suffers from these limitations when explicit registration is mandatory, as in the case of nested child devices. In this case, getting the id wrong is not the only problem: the @parent can be wrong as well. I've added another dedicated sequence diagram for the nested child devices case, explaining the error path.

Why should registration be mandatory, and why must it be done dynamically using registration messages?

A rule such as this one can be used to derive the external id as well as the parent of a grandchild device,
assuming a scheme where a bunch of PLCs are attached to a Rasp on each floor:

(?<floor> floor[0-9]+)/(?<plc> plc[0-9]+)// -> { "@id": "${floor}:${plc}", "@parent": "${floor}/rasp//" }
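
To make the idea concrete, here is a rough sketch of how such rules could be evaluated in code. It is only an illustration of the declarative approach, not anything that exists in thin-edge: the `Derived` enum, `derive_registration` and `is_numbered` are hypothetical names, the rules are hard-coded with plain string matching instead of a regex engine, entity topic ids are assumed to have four segments (so the main device is written `ground/rasp//` here), and the external id chosen for the per-floor Rasp (`floorN:rasp`) is an assumption of this sketch.

```rust
/// What a declarative rule set could derive from an entity topic id.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Derived {
    MainDevice,
    ChildDevice { external_id: String, parent: String },
}

/// True if `segment` is `prefix` followed by one or more ASCII digits,
/// i.e. the equivalent of the regex `prefix[0-9]+`.
fn is_numbered(segment: &str, prefix: &str) -> bool {
    match segment.strip_prefix(prefix) {
        Some(digits) => !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}

/// Applies the three example rules to a four-segment entity topic id.
/// Returns `None` when no rule matches.
fn derive_registration(topic_id: &str) -> Option<Derived> {
    let segments: Vec<&str> = topic_id.split('/').collect();
    match segments.as_slice() {
        // ground/rasp// -> main device
        ["ground", "rasp", "", ""] => Some(Derived::MainDevice),
        // floor[0-9]+/rasp// -> child device of the main device
        [floor, "rasp", "", ""] if is_numbered(floor, "floor") => Some(Derived::ChildDevice {
            external_id: format!("{floor}:rasp"), // assumed id format
            parent: "ground/rasp//".to_string(),
        }),
        // floor[0-9]+/plc[0-9]+// -> child device of ${floor}/rasp//
        [floor, plc, "", ""] if is_numbered(floor, "floor") && is_numbered(plc, "plc") => {
            Some(Derived::ChildDevice {
                external_id: format!("{floor}:{plc}"),
                parent: format!("{floor}/rasp//"),
            })
        }
        _ => None,
    }
}
```

With these rules, `derive_registration("floor3/plc12//")` yields a child device with external id `floor3:plc12` and parent `floor3/rasp//`, without any registration message being exchanged.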

albinsuresh · 2023-11-21T09:58:00Z

A rule such as this one can be used to derive the external id as well as the parent of a grandchild device,
assuming a scheme where a bunch of PLCs are attached to a Rasp on each floor:

For the external ID, I completely agree that some kind of correlation would definitely exist between the entity topic ID and its external ID, making that mapping definition fairly straight-forward. But, deriving the parent, I'm not fully sure. Looking at a few examples, like devices deployed in factory floors in a hierarchical way, it seems plausible, but I'm just not sure if we can generalise that assumption.

tests/RobotFramework/tasks/debug.robot

docs/src/references/mappers/c8y-mapper.md

didier-wenzek

I'm okay for "Phase 1 - live update" and have a suggestion for an alternative solution to "Phase 1 - on mapper restart".

I see the "Phase 2" proposal as not mature enough. Is this even required?

I also have a question on where to put this content. These are more internal design issues and discussions rather than specifications as implied by the name of the updated file.

docs/src/references/mappers/c8y-mapper.md

didier-wenzek

The current text clarifies well the problem statement/alternative solutions/pros & cons;
but it's not so clear what's the retained solution among all the alternative ideas and combinations.

A summary of the retained solution has to be given in the introduction, followed by the discussion (which can be unchanged).
This content has to be moved under https://github.com/thin-edge/thin-edge.io/tree/main/design/decisions as this is a design discussion and not a reference guide.

docs/src/references/mappers/c8y-mapper.md

didier-wenzek

I'm okay with the proposed design.

This document can be improved with an executive summary of the retained solution.
3 bullet points would be enough:

The c8y mapper persists its entity store on disk.
Messages received from entities not registered yet are cached till the registration is received (with an eviction policy to deal with rogue clients).
Auto-registration has to be disabled if the user opts in to a custom MQTT scheme.

design/decisions/0003-c8y-mapper.md

didier-wenzek · 2023-12-01T08:20:50Z

design/decisions/0003-c8y-mapper.md

+To minimize the message races on a restart, especially between all the entities that were previously registered
+before the mapper went down and any data messages for that entity that may have come while the mapper was down,
+a persistent copy of the entity store is maintained on the disk, and kept up-to-date, while the mapper is live.


We should consider using the new agent.state.path directory for that purpose, as the plan is to have this directory persisted across device restarts during firmware updates.

See #2488

codecov · 2023-12-06T08:36:09Z

Codecov Report

Merging #2466 (2cac9f0) into main (1ef77c9) will decrease coverage by 0.2%.
Report is 33 commits behind head on main.
The diff coverage is 92.9%.

Additional details and impacted files

Files	Coverage Δ
crates/core/tedge_api/src/lib.rs	`100.0% <ø> (ø)`
crates/core/tedge_api/src/mqtt_topics.rs	`86.7% <0.0%> (+1.0%)`	⬆️
crates/core/tedge_api/src/ring_buffer.rs	`97.5% <97.5%> (ø)`
crates/core/tedge_api/src/entity_store.rs	`94.1% <85.9%> (-0.6%)`	⬇️
crates/core/tedge_api/src/pending_entity_store.rs	`96.3% <96.3%> (ø)`
crates/extensions/c8y_mapper_ext/src/converter.rs	`82.4% <90.0%> (+1.0%)`	⬆️

... and 30 files with indirect coverage changes

crates/core/tedge_api/src/partial_entity_store.rs

crates/core/tedge_api/src/entity_store.rs

crates/extensions/c8y_mapper_ext/src/converter.rs

crates/core/tedge_api/src/message_log.rs

crates/core/tedge_api/src/entity_store.rs

crates/common/tedge_config/src/tedge_config_cli/tedge_config.rs

didier-wenzek

I didn't try to understand why most of the system tests are failing.

The mappers would benefit from an EntityStore that takes care of all the caching issues. Could this be done with two independent implementations: one with auto-registration, and another with a cache of pending registrations?

crates/common/mqtt_channel/src/messages.rs

crates/common/tedge_config/src/tedge_config_cli/tedge_config.rs

crates/extensions/c8y_mapper_ext/src/converter.rs

crates/core/tedge_api/src/partial_entity_store.rs

didier-wenzek · 2023-12-09T17:10:47Z

crates/core/tedge_api/src/entity_store.rs

+        Ok(entity_store)
+    }
+
+    pub fn load_from_message_log(&mut self) -> Result<(), InitError> {


Have you considered persisting not a log of registration messages but the entity store itself? Or, more precisely, its main component, i.e. entities: HashMap<EntityTopicId, EntityMetadata>. I think this would be simpler.

albinsuresh · 2023-12-11

My initial plan was to persist the EntityMetadata struct instances in the log. But since the command capability messages were not part of this struct, I was forced to drop that plan (Eventually, we'll have to include them also in the entity store, but doing that now would have been a bigger change and risky).

Then I almost introduced a new PersistentMessage enum wrapper for all persistable messages (reg messages, twin data messages, command metadata etc), but then finally decided to persist the raw MqttMessages itself to cover other metadata messages as well, like MeasurementMetadata, EventMetadata etc which are not part of the entity store at the moment. It felt like a better choice as we could cover any other message types as well in the future (e.g: persist even the telemetry messages on disk instead of the in-memory ring buffer to reduce memory pressure, if needed).

One additional risk that I felt in logging the EntityMetadata struct itself was the possibility of future changes to this struct making the persistent logs between different tedge versions incompatible. Since the raw messages are less likely to change, I felt it's a safer option. But, this is a point worth discussing.

albinsuresh · 2023-12-13

As discussed and agreed offline, we're sticking with the persistence of raw MqttMessages for now. But, this persistence impl won't be included in the current PR anymore. It will be added in a follow-up PR.
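
For readers following this thread, the "log of raw messages" option can be pictured roughly as below. This is only a sketch of the idea, not the code that was deferred to the follow-up PR: `RawMessage`, `append` and `replay` are hypothetical names, the one-line-per-message `topic<TAB>payload` format is an assumption, and payloads are assumed to contain no newlines.

```rust
use std::io::{self, BufRead, Write};

/// A raw message as it would be written to the log: topic and payload only.
/// Hypothetical stand-in for the real MQTT message type.
#[derive(Debug, Clone, PartialEq)]
struct RawMessage {
    topic: String,
    payload: String,
}

/// Appends one message to the log as a single `topic<TAB>payload` line.
/// Assumes topics contain no tab and payloads contain no newline.
fn append(log: &mut impl Write, msg: &RawMessage) -> io::Result<()> {
    writeln!(log, "{}\t{}", msg.topic, msg.payload)
}

/// Reads the log back in order, so that the messages can be fed again
/// through the normal processing path on startup.
/// Lines without a tab separator are skipped.
fn replay(log: impl BufRead) -> io::Result<Vec<RawMessage>> {
    let mut messages = Vec::new();
    for line in log.lines() {
        let line = line?;
        // Split on the first tab only, so tabs inside the payload survive.
        if let Some((topic, payload)) = line.split_once('\t') {
            messages.push(RawMessage {
                topic: topic.to_string(),
                payload: payload.to_string(),
            });
        }
    }
    Ok(messages)
}
```

Because the log stores the messages as received rather than a serialized struct, a format like this does not depend on the layout of EntityMetadata, which is the compatibility argument made above.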

crates/core/tedge_api/src/entity_store.rs

crates/core/tedge_api/src/message_log.rs

didier-wenzek

Approved. The code is clear and far simpler with pending message caching responsibility moved behind the entity store.

As discussed and agreed offline, we're sticking with the persistence of raw MqttMessages for now. But, this persistence impl won't be included in the current PR anymore. It will be added in a follow-up PR.

The pending comment related to storage will then have to be addressed.
Notably, we need to find an agreement on /data/tedge.

didier-wenzek

I confirm my approval

albinsuresh · 2023-12-14T09:01:25Z

Test Plan

This PR fixes the message race issue where entity data messages delivered before the corresponding entity registration messages were dropped. The fix results in the following behaviour:

All data messages (telemetry as well as metadata messages like twin data) received before their respective registration messages are cached in memory and not converted.
- Telemetry messages for all entities are cached in a bounded cache with a capacity of only 100 entries. When the cache is full, the oldest entries are replaced by newer ones.
- Metadata messages are cached in unbounded buffers, as we can't afford to drop such critical data.
All child device registration messages received before those of their parents are also cached.
When a registration message is received, that entity and all its cached child devices are registered, and all their cached data is processed.
Auto-registration must be turned off if explicit registration is used for any entity. Keeping it turned on while using explicit registration can sometimes result in undesired behaviour, like nested child devices getting registered as immediate child devices.
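
The behaviour listed above can be pictured with the following simplified sketch. It is not the code from this PR: `PendingStore`, `Message` and the method names are hypothetical, entities are identified by plain strings, and a registration with no parent is assumed to have nothing to wait for.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Capacity of the shared telemetry cache, as described in the test plan.
const TELEMETRY_CAPACITY: usize = 100;

#[derive(Debug, Clone, PartialEq)]
struct Message {
    entity: String,
    payload: String,
}

/// Hypothetical store holding early messages until their entity is registered.
#[derive(Default)]
struct PendingStore {
    registered: HashSet<String>,
    /// Bounded cache shared by all entities; the oldest entry is evicted when full.
    telemetry: VecDeque<Message>,
    /// Unbounded per-entity buffers for metadata such as twin data.
    metadata: HashMap<String, Vec<Message>>,
    /// Children whose registration arrived before their parent's, keyed by parent.
    waiting_children: HashMap<String, Vec<String>>,
}

impl PendingStore {
    /// Returns the message if it can be processed now, or caches it and returns `None`.
    fn on_telemetry(&mut self, msg: Message) -> Option<Message> {
        if self.registered.contains(&msg.entity) {
            return Some(msg);
        }
        if self.telemetry.len() == TELEMETRY_CAPACITY {
            self.telemetry.pop_front();
        }
        self.telemetry.push_back(msg);
        None
    }

    /// Same as `on_telemetry`, but the cache is never trimmed.
    fn on_metadata(&mut self, msg: Message) -> Option<Message> {
        if self.registered.contains(&msg.entity) {
            return Some(msg);
        }
        self.metadata.entry(msg.entity.clone()).or_default().push(msg);
        None
    }

    /// Registers `entity`, or caches the registration if its parent is unknown.
    /// Returns every cached message that became processable, parents first.
    fn register(&mut self, entity: &str, parent: Option<&str>) -> Vec<Message> {
        if let Some(p) = parent {
            if !self.registered.contains(p) {
                self.waiting_children
                    .entry(p.to_string())
                    .or_default()
                    .push(entity.to_string());
                return Vec::new();
            }
        }
        let mut released = Vec::new();
        let mut queue = VecDeque::from([entity.to_string()]);
        while let Some(id) = queue.pop_front() {
            self.registered.insert(id.clone());
            // Metadata first, then the telemetry cached for this entity, in arrival order.
            released.extend(self.metadata.remove(&id).unwrap_or_default());
            let (ready, rest): (Vec<Message>, Vec<Message>) =
                self.telemetry.drain(..).partition(|m| m.entity == id);
            self.telemetry = rest.into();
            released.extend(ready);
            // Children that were waiting for this entity can now be registered too.
            queue.extend(self.waiting_children.remove(&id).unwrap_or_default());
        }
        released
    }
}
```

In this sketch, registering a parent releases its own cached data first and then that of any children that were waiting for it, which mirrors the order described above.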

More details on the issue and the solution proposals can be found in the design decisions doc included in this PR. Some basic test coverage was also added in device_registration.robot. Persisting the entity store on disk and restoring it on startup, to minimize message races on startup, was deferred for later.

gligorisaev · 2023-12-15T06:28:40Z

QA has thoroughly checked the feature and here are the results:

A test for the ticket exists in the test suite.
QA has tested the function and it's functioning according to the description.

albinsuresh requested review from didier-wenzek and reubenmiller November 17, 2023 15:13

albinsuresh temporarily deployed to Test Pull Request November 17, 2023 15:19 — with GitHub Actions Inactive

didier-wenzek reviewed Nov 20, 2023

albinsuresh had a problem deploying to Test Pull Request November 20, 2023 10:55 — with GitHub Actions Failure

albinsuresh commented Nov 21, 2023

albinsuresh had a problem deploying to Test Pull Request November 21, 2023 07:03 — with GitHub Actions Failure

albinsuresh had a problem deploying to Test Pull Request November 23, 2023 06:35 — with GitHub Actions Failure

albinsuresh commented Nov 23, 2023

albinsuresh requested a review from didier-wenzek November 24, 2023 09:30

didier-wenzek reviewed Nov 24, 2023

albinsuresh marked this pull request as ready for review November 28, 2023 11:29

albinsuresh had a problem deploying to Test Pull Request November 28, 2023 11:35 — with GitHub Actions Failure

albinsuresh requested a review from didier-wenzek November 28, 2023 11:53

didier-wenzek reviewed Nov 29, 2023

albinsuresh had a problem deploying to Test Pull Request November 30, 2023 08:24 — with GitHub Actions Failure

didier-wenzek reviewed Dec 1, 2023

albinsuresh force-pushed the feat/2428/registration-message-processing-priority branch from 2ddf72b to 13564ef Compare December 6, 2023 08:26

albinsuresh had a problem deploying to Test Pull Request December 6, 2023 08:33 — with GitHub Actions Failure

albinsuresh had a problem deploying to Test Pull Request December 7, 2023 06:56 — with GitHub Actions Failure

didier-wenzek reviewed Dec 7, 2023

rina23q mentioned this pull request Dec 7, 2023

Get operation from JSON over MQTT instead of SmartREST #2482

Merged

21 tasks

albinsuresh had a problem deploying to Test Pull Request December 7, 2023 14:54 — with GitHub Actions Failure

albinsuresh had a problem deploying to Test Pull Request December 7, 2023 17:50 — with GitHub Actions Failure

albinsuresh had a problem deploying to Test Pull Request December 8, 2023 17:39 — with GitHub Actions Failure

albinsuresh had a problem deploying to Test Pull Request December 8, 2023 17:51 — with GitHub Actions Failure

albinsuresh commented Dec 8, 2023

albinsuresh commented Dec 8, 2023

didier-wenzek reviewed Dec 9, 2023

albinsuresh force-pushed the feat/2428/registration-message-processing-priority branch from dc64075 to 7abfae0 Compare December 13, 2023 11:43

albinsuresh temporarily deployed to Test Pull Request December 13, 2023 11:49 — with GitHub Actions Inactive

didier-wenzek approved these changes Dec 13, 2023

albinsuresh temporarily deployed to Test Pull Request December 14, 2023 07:56 — with GitHub Actions Inactive

didier-wenzek approved these changes Dec 14, 2023

albinsuresh added 2 commits December 14, 2023 08:32

Entity message ordering problem and solution proposal thin-edge#2428

d01b11d

Handle early entity messages with caching thin-edge#2482

2cac9f0

albinsuresh force-pushed the feat/2428/registration-message-processing-priority branch from bf4a31d to 2cac9f0 Compare December 14, 2023 08:33

albinsuresh temporarily deployed to Test Pull Request December 14, 2023 08:40 — with GitHub Actions Inactive

albinsuresh merged commit b4974b9 into thin-edge:main Dec 14, 2023
18 checks passed

gligorisaev self-assigned this Dec 14, 2023

albinsuresh deleted the feat/2428/registration-message-processing-priority branch December 14, 2023 10:30

reubenmiller mentioned this pull request Dec 15, 2023

registering commands on nested child devices results in duplicate devices when the tedge-mapper-c8y is restarted #2409

Closed

This was referenced Dec 18, 2023

Make entity store persistent #2428 #2522

Merged

Ordered processing of entity registration messages before any other messages #2428

Closed
