Synchronising/completeness data - how to know you have all messages #35
Question from Arjan Lamers, 22 July 2019: As discussed in the previous working group call, there is a need to define how an application can be sure it has consumed all data. The options are:

0) Don't specify anything.

1a) Use the existing definition. From the application's perspective: the application needs to keep track of the latest 'modified' date X for a given 'source' Y. To query for all new events, the application should query with parameters like 'modified > X' and 'source = Y'.
Pros:
Cons:
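As a minimal sketch of approach 1a (the event shape and helper name are illustrative assumptions, not from the spec), the client keeps a per-source bookmark of the latest 'modified' value it has seen and filters on it:

```python
from datetime import datetime, timezone

def new_events(events, source, last_modified):
    # Approach 1a: return events from `source` modified strictly after
    # the client's bookmark `last_modified`.
    return [e for e in events
            if e["source"] == source and e["modified"] > last_modified]

UTC = timezone.utc
events = [
    {"eventId": "e1", "source": "farm-a", "modified": datetime(2019, 7, 22, 6, 0, tzinfo=UTC)},
    {"eventId": "e2", "source": "farm-a", "modified": datetime(2019, 7, 22, 7, 30, tzinfo=UTC)},
    {"eventId": "e3", "source": "farm-b", "modified": datetime(2019, 7, 22, 7, 45, tzinfo=UTC)},
]

bookmark = datetime(2019, 7, 22, 7, 0, tzinfo=UTC)
fresh = new_events(events, "farm-a", bookmark)
bookmark = max(e["modified"] for e in fresh)  # advance the per-source bookmark
```

After each successful request the client advances its bookmark to the highest 'modified' it received, so the next query resumes where this one left off.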
1b) Use the existing definition with a defined maximum drift.
Pros:
Cons:
2) Make those fields explicit. From the application's perspective: the application needs to keep track of the latest 'offset' X for a given 'offsetScope' Y. To query for all new events, the application should query with parameters like 'offset > X' and 'offsetScope = Y'.
Pros:
Cons:
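A sketch of option 2, assuming a per-scope sequential integer offset (the field names follow Arjan's example; the event shape is hypothetical):

```python
def events_after(log, offset_scope, offset):
    # Option 2: each source ('offsetScope') stamps every change with a
    # strictly increasing integer 'offset'; the client resumes from the
    # highest offset it has already consumed.
    return [e for e in log
            if e["offsetScope"] == offset_scope and e["offset"] > offset]

log = [
    {"eventId": "a", "offsetScope": "device-1", "offset": 1},
    {"eventId": "b", "offsetScope": "device-1", "offset": 2},
    {"eventId": "c", "offsetScope": "device-2", "offset": 1},
]

pending = events_after(log, "device-1", 1)  # resume after offset 1
```

Because offsets are assigned sequentially by the source, a gap in consumed offsets would also tell the client that it has missed data, which timestamps alone cannot do.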
Response from Andrew Cooke, 24/25 July 2019: Here are some further comments on "synchronisation", or "knowing that you have got a complete data set".

For this last reason, for most existing messages, I prefer approach 1a or 1b as explained by Arjan. We have helped a number of organisations implement similar REST-based data sharing, and I usually recommend approach 1b: the consumer requests data from a provider that has been created or modified since 24 hours before the last request it made to that provider, and then resolves duplicates as necessary. However, when dealing with IoT devices, a more rapid flow of data, and/or communication between on-farm systems, this may not be sufficient. We do want the ADE data schema to support multiple uses, not just server-to-server communications with milk recording organisations.

Preliminary work by the Open Geospatial Consortium (OGC) on their standards identified a similar need, and their working group has prototyped using synchronisation headers or data fields, called SYNC.SERVICEID and SYNC.CHECKPOINT, in association with a modified timestamp. This model assumes that each device or service keeps a set of tracked changes in its data model (inserts, updates, deletes), and the "checkpoint" is an ID that points to the current (or a previous) end point of that set of tracked changes. In practice, I don't see IoT devices or many on-farm systems keeping such lists of their own tracked changes (unless they do replication), but the concept is not unlike option 2 below. Up until now, OGC have been using "resultTime" (the equivalent of our "modified") as their method of querying data.

I consider that inaccurate device clocks will not be such a large issue in the future. Almost all IoT communication frameworks (LPWAN, LoRaWAN, 5G) require accurate clock synchronisation to support network communications, so it is built into their protocols, and internet-connected devices mostly use network time services. The main challenge is current in-field devices with manually maintained time settings.

So while I believe we should make best efforts to support solving this problem, we should not introduce too much complexity, and we should be careful about creating mandatory fields that cannot readily be filled by existing systems. If we are to support option 2 below, I favour a separate "Sync" sub-object with SourceID and SyncOffset fields to make this clear. If we are to make this mandatory (and it is probably only useful if it is reliably available), then we should make it clear that existing systems can map their "source" and "modified" fields into SourceID and SyncOffset.
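Andrew's recommended 1b pattern (overlap the query window by 24 hours, then de-duplicate) could look like this sketch; the helper name and event shape are assumptions:

```python
from datetime import datetime, timedelta, timezone

def fetch_with_overlap(events, last_request, overlap=timedelta(hours=24)):
    # Approach 1b: re-request everything modified since `overlap` before
    # the last request, then keep only the newest version of each eventId.
    cutoff = last_request - overlap
    latest = {}
    for e in sorted((e for e in events if e["modified"] >= cutoff),
                    key=lambda e: e["modified"]):
        latest[e["eventId"]] = e  # later versions overwrite earlier ones
    return list(latest.values())

UTC = timezone.utc
events = [
    {"eventId": "e1", "modified": datetime(2019, 7, 20, 12, 0, tzinfo=UTC)},
    {"eventId": "e2", "modified": datetime(2019, 7, 24, 9, 0, tzinfo=UTC)},
    {"eventId": "e2", "modified": datetime(2019, 7, 24, 18, 0, tzinfo=UTC)},  # updated later
]

result = fetch_with_overlap(events, datetime(2019, 7, 25, 8, 0, tzinfo=UTC))
```

The 24-hour overlap trades a little redundant transfer for tolerance of clock skew between consumer and provider; de-duplication by eventId makes the redundancy harmless.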
During the meeting we discussed the potential to use an interval/period query filter rather than an absolute date/time filter when requesting data. For instance, if the current time by your computer's clock was 08:00 UTC and you had last asked for data 3 hours ago, you could ask for data by absolute time (modified since 05:00 UTC). If the clock on the other computer did not match, however, there could be messages that you do not receive. If instead you ask for data for the duration "current time - 03:00", the other computer can apply its own clock and still return the correct set of data. This does not adjust for manual changes to the clock, which the guaranteed sequential offset in method 2 above does address, but it is otherwise very elegant.
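The interval idea can be sketched as follows (hypothetical helper; the point is that the cutoff is computed with the server's clock, so client/server skew does not drop events):

```python
from datetime import datetime, timedelta, timezone

def events_in_last(events, hours, server_now):
    # Relative filter: "modified within the last `hours` hours", evaluated
    # against the server's own clock rather than an absolute client time.
    cutoff = server_now - timedelta(hours=hours)
    return [e for e in events if e["modified"] >= cutoff]

UTC = timezone.utc
events = [
    {"eventId": "old", "modified": datetime(2019, 7, 25, 3, 0, tzinfo=UTC)},
    {"eventId": "new", "modified": datetime(2019, 7, 25, 6, 30, tzinfo=UTC)},
]

# The server's clock reads 08:10, even if the client thinks it is 08:00;
# the relative window is still evaluated consistently on the server side.
hits = events_in_last(events, 3, server_now=datetime(2019, 7, 25, 8, 10, tzinfo=UTC))
```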
Here is Arjan's very good summary of the meeting outcome: Summarizing the workgroup meeting on this 'synchronisation' / 'completeness' decision. Let me know if I misinterpreted something or if there are comments!

The WG decided to opt for scenario 1a. The 'modified' and 'source' fields in the metadata will be compulsory. Data sources are required to make sure that the 'modified' datetime is monotonically increasing (it cannot go back in time). This should not be a problem for devices that synchronise with a central cloud, nor for devices with a local clock. The datetime is already in UTC, so we do not expect problems with DST or time zones. The client can thus keep track of the latest 'modified' datetime it has received per 'source', and use that as a starting point to query.

There is a scenario that we do not cover in this specification: in case a device with a local clock has drifted too much, an operator may decide to reset its clock either forward or backward. If it is reset backward in time, clients may miss data recorded in that correction period. In case of such a hard reset, clients should be prepared to recapture a larger period of time. How to detect this is out of the scope of the spec and is assumed to be a manual process, similar to the repairs needed after hardware failure or other kinds of data loss. In this scenario, the client should be able to rely on the 'eventId' being unique for the 'source'.

The URL scheme should thus allow querying for 'modified >= x and source = y'. Alternatively, a client could query with 'modified in last z hours and source = y'. The latter does not require the client to keep the modified date, but it has to take into account possible drift between the clock of the source and the clock of the client.

The WG also discussed the possibility of allowing for a tolerance period (scenario 1b). No default could be found that is both reasonably efficient and something all vendors can guarantee, so the perceived benefit of defining a period is low. The WG also discussed future extensions: if messages (or rather, devices) are defined for which scenario 1a cannot be implemented, the standard could define a tolerance period (scenario 1b) for those messages, or the standard could be extended with specific synchronisation fields (a Sync group as Andrew suggested below). This will be considered only when such messages present themselves with a use case.

If this summary is approved, we'll need to update the spec to make the fields compulsory.
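The monotonicity requirement on 'modified' can be enforced on the source side with a simple guard; this is a sketch under assumed names, not part of the spec:

```python
from datetime import datetime, timezone

def next_modified(clock_now, last_issued):
    # A source must never issue a 'modified' earlier than one it has
    # already issued, even if its clock is stepped backward; clamp the
    # new stamp to the last issued value.
    return max(clock_now, last_issued)

UTC = timezone.utc
last = datetime(2019, 7, 25, 8, 0, tzinfo=UTC)

# Normal case: the clock moved forward, so the stamp follows the clock.
stamp1 = next_modified(datetime(2019, 7, 25, 8, 5, tzinfo=UTC), last)

# Clock stepped back 10 minutes: the issued stamp does not go backward.
stamp2 = next_modified(datetime(2019, 7, 25, 7, 55, tzinfo=UTC), stamp1)
```

A forward clock reset is harmless to clients; only a backward step threatens the monotonicity guarantee, which is exactly the hard-reset scenario the summary leaves to manual recovery.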
At our working group meeting on 25 July we discussed how a client GETting data from another device or a server can know if it has consumed all data, especially given that clocks of devices might not be perfectly synchronised.
I am creating this issue to capture the discussion so that it is documented for future contributors. I will post the emails between us as comments on the thread so contributors have the narrative that supports the decision we made at the meeting.