This PR improves the activation algorithm of the RX-DO-TX loop that executes the services.
In particular, in nasty cases where the phases have bursts of execution that last much longer than the allocated time budget, the regular RX-DO-TX cycle may recover but without keeping the intended separation time between two consecutive phases.
Since the object EOMtheEMSrunner is the one that does the job, what follows is a description of its behavior, of the phenomenon and of its remedy.
The object EOMtheEMSrunner
The object EOMtheEMSrunner is responsible for executing the three phases of a service offered by the ETH board: RX, DO and TX. They act in the following way:
RX collects the received data from YRI or from CAN service boards;
DO uses the data, for example to execute an outer control loop for the MC service;
TX transmits results to YRI and to CAN service boards.
These three phases must be executed at a given frequency, and each one must be regular in its period and stay within a given time budget. The timing is configurable from the xml file; we use a frequency of 1 kHz and typically assign a budget of 400 us for RX, 300 us for DO and 300 us for TX.
Description of how it works now
The EOMtheEMSrunner achieves this goal using three HW timers and three dedicated threads, one for each phase. The HW timers are started with the same required frequency and each one is offset by the specified time budget. At its expiry, each HW timer sends an RTOS start signal to the thread that runs the phase. The thread starts execution when it has received both the start signal and an RTOS enable signal from its previous thread.
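The two-signal activation described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names `phase_t`, `timer_offset_us`, `phase_signals_t` and `try_activate` are hypothetical and are not taken from the firmware, the budgets are the typical 400/300/300 us values quoted earlier, and the enable signal is assumed to be raised when the previous thread finishes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only, not the firmware API. */
typedef enum { PHASE_RX = 0, PHASE_DO = 1, PHASE_TX = 2, PHASE_COUNT = 3 } phase_t;

/* Typical budgets in microseconds: 400 + 300 + 300 = 1000 us (1 kHz). */
static const uint32_t budget_us[PHASE_COUNT] = { 400u, 300u, 300u };

/* Offset of the HW timer of a phase inside the period:
   the sum of the budgets of the phases that come before it. */
static uint32_t timer_offset_us(phase_t p)
{
    uint32_t offset = 0u;
    for (int i = 0; i < (int)p; i++) {
        offset += budget_us[i];
    }
    return offset;
}

/* The two signals a phase thread waits for. Both stay pending until consumed. */
typedef struct {
    bool start;   /* eSTARTxx: set when the HW timer of the phase expires   */
    bool enable;  /* eENABLExx: set by the previous thread (assumed: at end) */
} phase_signals_t;

/* The thread may run only when both signals are pending.
   On activation both signals are consumed. */
static bool try_activate(phase_signals_t *s)
{
    if (s->start && s->enable) {
        s->start = false;
        s->enable = false;
        return true;
    }
    return false;
}
```

With these budgets the timers of RX, DO and TX expire at 0, 400 and 700 us inside each 1 ms period. A phase whose start signal is pending does not run until the enable signal also arrives, and the other way round.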
To clarify, the DO thread starts only when its HW timer emits the start signal and its previous thread, RX, emits the enable signal. See picture.
Figure. Activation of the RX, DO and TX phases by the two signals eENABLExx and eSTARTxx when the durations of all the phases are within their time budget.
The EOMtheEMSrunner also offers a service that monitors the execution time of each phase against its allocated budget. The duration is computed with a strict policy: it is counted from the emission of the start signal by the HW timer and ends when the thread finishes.
This mechanism surely works if the time budget of each phase is higher than the effectively required time, and also if there are sporadic overflows of a phase into the next execution window. In such a case, the phases are activated in bursts and they realign when their execution time finally returns to normal. See picture.
Figure. Activation of the RX, DO and TX phases when the RX phase lasts longer. The cycle re-synchronizes. Note that the durations are not the effective execution times; rather, they express the time elapsed since the intended start.
Longer phases don't usually happen because the system must be designed in such a way that they do not happen in normal cases. It may however happen in the initialization phase of boards with many joints that the RX phase lasts longer, but I have always observed that the cycle recovers and the execution of the phases stays inside the intended timing window. See below.
Figure. Another example of an excessively long duration of the RX phase (more than 1 ms). The cycle re-synchronizes and stays inside its timing window.
Observed misbehavior
I have recently observed on an amc board that the cycle recovered but produced an anticipation of one phase with respect to its intended start time. I have studied the problem and found situations in which that may happen. See Figure below.
Figure. Three cases of excessively long duration of the phases that put the cycle out of its intended timing window.
In the above situation the control is executed correctly and the service in the board is not harmed, apart from a huge number of timing overflow messages. See the Figure below, where we have a flood of overflow messages for the RX phase (top of Figure) and for the DO phase (bottom of Figure).
Figure. The anticipation of a phase in its timing window is likely to generate continuous reports of excessive duration time because the phase is triggered by a past HW timer expiration and the measure is the sum of the effective duration plus the delay in its activation.
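To make the difference between the two measures concrete, here is a minimal C sketch. The struct and function names are hypothetical, and the timestamps in the usage note are made-up numbers chosen only to show the arithmetic; they are not measurements from a board.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timestamps of one phase execution, in microseconds. */
typedef struct {
    uint32_t timer_expiry_us;  /* when the HW timer emitted the start signal */
    uint32_t thread_start_us;  /* when the thread actually began executing   */
    uint32_t thread_end_us;    /* when the thread finished                    */
} phase_timing_t;

/* Strict policy: counted from the HW timer expiry, so it includes
   any delay between the expiry and the actual activation. */
static uint32_t strict_duration_us(const phase_timing_t *t)
{
    return t->thread_end_us - t->timer_expiry_us;
}

/* Effective duration: only the time the thread spent executing. */
static uint32_t effective_duration_us(const phase_timing_t *t)
{
    return t->thread_end_us - t->thread_start_us;
}

/* A report is generated when the measured duration exceeds the budget. */
static bool over_budget(uint32_t duration_us, uint32_t budget_us)
{
    return duration_us > budget_us;
}
```

For example, with a 400 us budget, a phase whose timer expired at 0 us but that was activated at 300 us and finished at 500 us has an effective duration of 200 us and a strict duration of 500 us. Only the strict measure reports an overflow, even though the work itself fits in the budget.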
Description of the modified EOMtheEMSrunner
The above situation may happen. In practice we do not normally see it, because we don't have floods of diagnostic messages emitted by the boards and, when we do, we try to solve them. Nevertheless, we have to remove the possibility of it happening.
The cause
The cause is that some eSTARTxx events are emitted at moments when they do not contribute to activating the phase, and they stay active until the next eENABLExx, so the phase starts straight away even if it should start slightly later.
The remedy
The remedy is to avoid the emission of eSTARTxx events that are not necessary. I have tested some algorithms and this one does the job:
emit eSTARTx if phase x is not currently in execution and the previous phase y = prev(x) is the last one that finished or is currently in execution.
The following figure shows how timing 6 achieves synchronization quite soon because of some reduced activations.
Figure. The new activation algorithm allows a quick recovery of synchronization for timing 6. Also, the measure of duration takes into account the effective time so that the focus is on the long phase only.
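The emission rule stated above can be written as a small predicate. This is a sketch, not the code of the PR: the type and function names are hypothetical, and it assumes that the phases form a cycle so that prev(RX) = TX, prev(DO) = RX and prev(TX) = DO.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only. */
typedef enum { RUN_NONE = -1, RUN_RX = 0, RUN_DO = 1, RUN_TX = 2 } run_phase_t;

typedef struct {
    run_phase_t running;        /* phase currently in execution, or RUN_NONE */
    run_phase_t last_finished;  /* phase that finished most recently          */
} runner_state_t;

/* Assumed cyclic order RX -> DO -> TX -> RX. */
static run_phase_t prev_phase(run_phase_t x)
{
    return (run_phase_t)(((int)x + 2) % 3);
}

/* Called at the expiry of the HW timer of phase x:
   decides whether eSTARTx should be emitted. */
static bool should_emit_start(const runner_state_t *st, run_phase_t x)
{
    run_phase_t y = prev_phase(x);
    if (st->running == x) {
        return false;  /* phase x is still executing: skip this start */
    }
    return (st->last_finished == y) || (st->running == y);
}
```

With this predicate, while a long RX is still executing, the timer of DO still emits its start (its previous phase is running), whereas the timers of RX and TX are skipped, so no stale start signal remains pending for them.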
The tests
On a dedicated setup
I have tested both the ems and the amc on the lego setup, where every 1 second I simulated an increased execution time that generates the problem. Here is the situation with the current and with the new activation algorithm.
Figure. The new activation algorithm solves the synch on the ems when some nasty bursts of RX-DO-TX much longer than 1 ms happen.
On the robots
Together with @martinaxgloria I have tested the ems and mc4plus on iCubGenova11, where obviously there are no time overflows: it works fine.
We have also tested the amc on ergoCub001 and it works fine as well. It also works fine with the third motor controlled over ICC1:3 rather than over CAN1:3.
Mergeability
After the tests we can safely merge this PR and the associated one: