Ordered processing of entity registration messages before any other messages #2466
Conversation
These thoughts make me wonder if using MQTT is the right tool to register which entities are attached to each topic. And even whether this can be dynamic.
I would consider a declarative approach. To opt out of the default topic scheme, a user will have to provide a description of their custom topic scheme. It can be a set of regex rules or a user program returning a registration message for an entity id.
By a set of regex rules, I mean something along these lines:
- ground/rasp/// -> main device
- floor[0-9]+/rasp// -> child device of main device
- (?<parent>floor[0-9]+)/plc[0-9]+// -> child device of ${parent}/rasp//
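To make the idea concrete, here is a minimal Rust sketch of how such rules could be evaluated, assuming a hypothetical TopicRule type and the regex crate; the rule set mirrors the examples above, and none of these names come from thin-edge itself:

```rust
use regex::Regex;

/// One declarative rule: a topic-id pattern plus a template for the parent
/// (`None` meaning the matched entity is the main device).
struct TopicRule {
    pattern: Regex,
    parent_template: Option<String>,
}

/// Try each rule in order; the first match decides the parent.
/// Outer `None` = no rule matched; inner `None` = main device.
fn resolve_parent(rules: &[TopicRule], topic_id: &str) -> Option<Option<String>> {
    for rule in rules {
        if let Some(caps) = rule.pattern.captures(topic_id) {
            return Some(rule.parent_template.as_ref().map(|template| {
                // Substitute the named capture group into ${parent}.
                let parent = caps.name("parent").map_or("", |m| m.as_str());
                template.replace("${parent}", parent)
            }));
        }
    }
    None // no rule matched: fall back to explicit registration
}

fn main() {
    let rules = [
        TopicRule {
            pattern: Regex::new(r"^ground/rasp///$").unwrap(),
            parent_template: None, // main device
        },
        TopicRule {
            pattern: Regex::new(r"^floor[0-9]+/rasp//$").unwrap(),
            parent_template: Some("ground/rasp///".into()), // child of main device
        },
        TopicRule {
            // (?P<parent>...) is the regex crate's named-capture syntax
            pattern: Regex::new(r"^(?P<parent>floor[0-9]+)/plc[0-9]+//$").unwrap(),
            parent_template: Some("${parent}/rasp//".into()),
        },
    ];

    let parent = resolve_parent(&rules, "floor2/plc1//");
    assert_eq!(parent, Some(Some("floor2/rasp//".to_string())));
}
```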
Yes, very valid points. Some kind of blocking call (e.g. to an HTTP endpoint) might be more appropriate to register devices (or patterns, nice idea).
I can't agree more, and most "capable devices" probably would prefer such an HTTP API over the clunky MQTT one. But my understanding was that there would be many limited devices that we'll have to support that can only handle MQTT, so we'll have to provide an MQTT-only solution for them anyway. So, ideally we should have both APIs, but we might be forced to start with MQTT first, as that works for everyone.

Irrespective of the protocol, another key thing that we need to define is what "successful registration" means. Does it mean that any one mapper has processed that registration message, or that all the connected mappers have processed it? How can we communicate either of those in a cloud-agnostic manner? I've expressed my concerns on this matter in detail in this comment.
Once we have the registration message feedback mechanism, on which the device logic will have to wait, do we still need this template mechanism? Such a mechanism would still be useful to the customer, to avoid the explicit registration messages altogether. But this is not necessary in the context of this problem, right?

One other key point to note is that these issues are not just limited to custom topic schemes and the id-generation associated with them. That's definitely one key aspect, but not the only one. Even the default topic scheme suffers from these limitations when explicit registration is mandatory, like in the case of nested child devices. Getting the
We need to choose, sure.
I do think the latter would simplify things a lot (the topic<->entity relation being computed).
Why should registration be mandatory and necessarily done dynamically using registration messages? A rule such as this one can be used to derive the external id as well as the parent of a grandchild device.
For the external ID, I completely agree that some kind of correlation would definitely exist between the entity topic ID and its external ID, making that mapping definition fairly straightforward. But for deriving the parent, I'm not fully sure. Looking at a few examples, like devices deployed on factory floors in a hierarchical way, it seems plausible, but I'm just not sure if we can generalise that assumption.
I'm okay with "Phase 1 - live update" and have a suggestion for an alternative solution to "Phase 1 - on mapper restart".
I see the "Phase 2" proposal as not mature enough. Is this even required?
I also have a question on where to put this content. These are more internal design issues and discussions rather than specifications as implied by the name of the updated file.
The current text clarifies well the problem statement / alternative solutions / pros & cons; but it's not so clear what the retained solution is among all the alternative ideas and combinations.
- A summary of the retained solution has to be given in the introduction, followed by the discussion (which can be unchanged).
- This content has to be moved under https://github.com/thin-edge/thin-edge.io/tree/main/design/decisions as this is a design discussion and not a reference guide.
I'm okay with the proposed design.
This document can be improved with an executive summary of the retained solution.
3 bullet points would be enough:
- The c8y mapper persists its entity store on disk.
- Messages received from entities not yet registered are cached till the registration is received, with an eviction policy to deal with rogue clients (see the sketch after this list).
- Auto-registration has to be disabled if the user opts in to a custom MQTT scheme.
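As a rough illustration of the caching bullet above, here is a sketch using a per-entity bounded queue as a simple eviction policy; all type names and the bound are hypothetical, not thin-edge's actual implementation:

```rust
use std::collections::{HashMap, VecDeque};

/// Illustrative bound; a real policy might cap total memory or use a TTL instead.
const MAX_PENDING_PER_ENTITY: usize = 100;

#[derive(Default)]
struct PendingStore {
    /// entity topic id -> raw messages queued until its registration arrives
    pending: HashMap<String, VecDeque<String>>,
}

impl PendingStore {
    /// Cache a message for a not-yet-registered entity, evicting the oldest
    /// entry once the per-entity bound is reached (rogue-client protection).
    fn cache(&mut self, topic_id: &str, message: String) {
        let queue = self.pending.entry(topic_id.to_string()).or_default();
        if queue.len() >= MAX_PENDING_PER_ENTITY {
            queue.pop_front();
        }
        queue.push_back(message);
    }

    /// On registration, drain the cached messages in arrival order.
    fn on_registered(&mut self, topic_id: &str) -> Vec<String> {
        self.pending
            .remove(topic_id)
            .map(|queue| queue.into_iter().collect())
            .unwrap_or_default()
    }
}

fn main() {
    let mut store = PendingStore::default();
    store.cache("floor1/plc1//", r#"{"temperature": 21}"#.into());
    // The registration arrives later: replay the cached messages in order.
    for msg in store.on_registered("floor1/plc1//") {
        println!("processing: {msg}");
    }
}
```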
To minimize the message races on a restart, especially between all the entities that were previously registered before the mapper went down and any data messages for that entity that may have come while the mapper was down, a persistent copy of the entity store is maintained on the disk, and kept up-to-date, while the mapper is live.
We have to consider using the new agent.state.path directory for that purpose, as the plan is to have this directory persisted across device restarts during firmware updates. See #2488.
I didn't try to understand why most of the system tests are failing.
The mappers would benefit from an EntityStore taking charge of all the caching issues. Could this be done with two independent implementations: one with auto-registration, another with a cache of pending registrations?
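A sketch of how that split could look, with a hypothetical trait and type names (the actual EntityStore API in the codebase may differ):

```rust
/// Hypothetical interface: handle one incoming message and return the
/// messages that are ready for processing (possibly none yet).
trait EntityStore {
    fn handle(&mut self, topic_id: &str, message: String) -> Vec<String>;
}

/// Default topic scheme: unknown entities are auto-registered on the fly,
/// so every message is immediately ready.
struct AutoRegisteringStore; // registered-entity state elided

impl EntityStore for AutoRegisteringStore {
    fn handle(&mut self, _topic_id: &str, message: String) -> Vec<String> {
        // Register the entity (and any missing ancestors), then pass through.
        vec![message]
    }
}

/// Custom topic scheme: messages for unregistered entities are cached and
/// only released once an explicit registration is received.
struct CachingStore; // registered entities + pending-message cache elided

impl EntityStore for CachingStore {
    fn handle(&mut self, _topic_id: &str, _message: String) -> Vec<String> {
        // If registered: release this message plus any cached ones, in order;
        // otherwise: cache the message and return nothing for now.
        Vec::new()
    }
}

fn main() {
    let mut store: Box<dyn EntityStore> = Box::new(CachingStore);
    let ready = store.handle("floor1/plc1//", "m1".into());
    assert!(ready.is_empty()); // cached until the registration arrives
}
```

The mapper would then pick one implementation at startup, depending on whether the user opted in to a custom topic scheme.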
```rust
    Ok(entity_store)
}

pub fn load_from_message_log(&mut self) -> Result<(), InitError> {
```
Have you considered persisting not a log of registration messages but the entity store itself? Or, more precisely, its main component, aka entities: HashMap<EntityTopicId, EntityMetadata>. I think this would be simpler.
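For comparison, a minimal sketch of this alternative, assuming EntityMetadata could derive serde's Serialize/Deserialize; the generic M and the String key stand in for EntityMetadata and EntityTopicId here:

```rust
use serde::{de::DeserializeOwned, Serialize};
use std::collections::HashMap;
use std::{fs, io};

/// Persist the whole entity map as a single JSON document,
/// writing to a temp file and renaming for crash safety.
fn save<M: Serialize>(entities: &HashMap<String, M>, path: &str) -> io::Result<()> {
    let json = serde_json::to_string(entities)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    let tmp = format!("{path}.tmp");
    fs::write(&tmp, json)?;
    fs::rename(tmp, path)
}

/// Reload the map on mapper startup.
fn load<M: DeserializeOwned>(path: &str) -> io::Result<HashMap<String, M>> {
    let json = fs::read_to_string(path)?;
    serde_json::from_str(&json).map_err(|e| io::Error::new(io::ErrorKind::Other, e))
}
```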
My initial plan was to persist the EntityMetadata struct instances in the log. But since the command capability messages were not part of this struct, I was forced to drop that plan (eventually we'll have to include them in the entity store as well, but doing that now would have been a bigger and riskier change).

Then I almost introduced a new PersistentMessage enum wrapper for all persistable messages (registration messages, twin data messages, command metadata etc.), but finally decided to persist the raw MqttMessages themselves, to cover other metadata messages as well, like MeasurementMetadata, EventMetadata etc., which are not part of the entity store at the moment. It felt like a better choice as we could cover any other message types in the future too (e.g. persist even the telemetry messages on disk instead of the in-memory ring buffer, to reduce memory pressure if needed).

One additional risk that I felt in logging the EntityMetadata struct itself was the possibility of future changes to this struct making the persistent logs between different tedge versions incompatible. Since the raw messages are less likely to change, I felt it's a safer option. But this is a point worth discussing.
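To make the trade-off concrete, here is a rough sketch of such a raw-message log, with an illustrative MessageLogEntry standing in for thin-edge's actual MqttMessage type, one JSON document per line:

```rust
use serde::{Deserialize, Serialize};
use std::fs::{File, OpenOptions};
use std::io::{self, BufRead, BufReader, Write};

/// Illustrative stand-in for a persisted MQTT message.
#[derive(Serialize, Deserialize)]
struct MessageLogEntry {
    topic: String,
    payload: String,
}

/// Append one message to the log: one JSON document per line.
fn append(log_path: &str, entry: &MessageLogEntry) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(log_path)?;
    let line = serde_json::to_string(entry)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    writeln!(file, "{line}")
}

/// Replay the log on startup, feeding each message back to the entity store
/// exactly as if it had just arrived over MQTT.
fn replay(log_path: &str, mut apply: impl FnMut(MessageLogEntry)) -> io::Result<()> {
    let file = File::open(log_path)?;
    for line in BufReader::new(file).lines() {
        let entry = serde_json::from_str(&line?)
            .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
        apply(entry); // e.g. entity_store.update(entry)
    }
    Ok(())
}
```

Because only raw topic/payload pairs are persisted, the format survives refactorings of the in-memory structs, which is the compatibility argument above.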
As discussed and agreed offline, we're sticking with the persistence of raw MqttMessages for now. But, this persistence impl won't be included in the current PR anymore. It will be added in a follow-up PR.
Approved. The code is clear and far simpler with the pending-message caching responsibility moved behind the entity store.
The pending comment related to storage will then still have to be addressed; notably, we need to find an agreement on /data/tedge.
I confirm my approval
Test Plan

This PR fixes the message race issue where out-of-order delivery of entity data messages, before the entity registrations themselves, results in those data messages getting dropped. The fix results in the following behaviour:
More details on the issue and the solution proposals can be found in the design decisions doc included in this PR. Some basic test coverage was also added in
QA has thoroughly checked the feature and here are the results: