
Improved EOMtheEMSrunner #496

Merged
merged 5 commits into from
May 28, 2024

@marcoaccame marcoaccame commented May 24, 2024

This PR improves the activation algorithm of the RX-DO-TX loop that executes the services.
In particular, in nasty cases where the phases have bursts of execution that last much longer than the allocated time budget, the regular RX-DO-TX cycle may recover without keeping the intended separation time between two consecutive phases.

Since the object EOMtheEMSrunner is the one that does this job, what follows is a description of its behavior, of the phenomenon and of its remedy.

The object EOMtheEMSrunner

The object EOMtheEMSrunner is responsible for executing the three phases of a service offered by the ETH board: RX, DO and TX, which act in the following way:

  • RX collects the received data from YRI or from CAN service boards;
  • DO uses the data, for example to execute an outer control loop for the MC service;
  • TX transmits results to YRI and to CAN service boards.

These three phases must be executed at a given frequency, and each one must be regular in its period and stay within a given time budget. The timing is configurable from an XML file; we use a frequency of 1 kHz and typically assign a budget of 400 us for RX, 300 us for DO and 300 us for TX.

Description of how it works now

The EOMtheEMSrunner achieves this goal using three HW timers and three dedicated threads, one for each phase. The HW timers are started with the same required frequency and each one is offset by the specified time budget. At its expiry, each HW timer sends an RTOS start signal to the thread that runs the phase.
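As a rough illustration of the timer offsets described above, the sketch below derives the start offset of each phase inside one period from the three budgets quoted earlier. The function name and the dictionary layout are hypothetical; they are not taken from the firmware or from its XML configuration.

```python
PERIOD_US = 1000  # 1 kHz loop

# Budgets quoted in the text, in microseconds.
BUDGET_US = {"RX": 400, "DO": 300, "TX": 300}

def phase_offsets(budgets, period_us):
    """Return the start offset of each phase within one period.

    Each phase is assumed to start where the budget of the previous one ends.
    """
    if sum(budgets.values()) > period_us:
        raise ValueError("budgets exceed the period")
    offsets = {}
    t = 0
    for name, budget in budgets.items():
        offsets[name] = t
        t += budget
    return offsets

offsets = phase_offsets(BUDGET_US, PERIOD_US)
print(offsets)  # {'RX': 0, 'DO': 400, 'TX': 700}
```

With these values, each HW timer would expire once per millisecond, shifted by 0, 400 and 700 us respectively.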

The thread starts execution when it receives both the start signal and also an RTOS enable signal from its previous thread.

To clarify, the DO thread starts only when its HW timer emits the start signal and its previous thread, the RX, emits the enable signal. See picture.
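The two-signal gating can be sketched with a toy model in which each flag is latched until both are present. This is only an illustration of the rule stated above: the class and its attribute names are hypothetical, and the real object uses RTOS events rather than Python booleans.

```python
class PhaseThread:
    """Toy model of one phase thread gated by two latched flags."""

    def __init__(self, name):
        self.name = name
        self.start = False   # set by the phase's HW timer (eSTARTxx)
        self.enable = False  # set by the previous phase (eENABLExx)
        self.runs = 0

    def signal(self, flag):
        """Latch a flag; run the phase once both flags are set."""
        if flag not in ("start", "enable"):
            raise ValueError("unknown flag: " + flag)
        setattr(self, flag, True)
        if self.start and self.enable:
            # Both conditions met: consume the flags and execute the phase.
            self.start = False
            self.enable = False
            self.runs += 1
            return True
        return False

do = PhaseThread("DO")
ran_on_timer = do.signal("start")    # timer expired, RX not finished yet
ran_on_enable = do.signal("enable")  # RX finished: DO can now run
print(ran_on_timer, ran_on_enable, do.runs)  # False True 1
```

The order of the two signals does not matter in this model; the phase runs on whichever arrives second.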

image-20240524144108015

Figure. Activation of the RX, DO and TX phases by the two signals eENABLExx and eSTARTxx when the durations of all the phases are within their time budget.

The EOMtheEMSrunner also offers a service that monitors the execution time of each phase vs its allocated budget. The duration is computed with a strict policy: it is counted from the moment the HW timer emits the start signal until the thread finishes.
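A minimal sketch of this strict policy follows, with made-up numbers for one RX cycle. The function name and the timestamps are hypothetical; only the 400 us budget comes from the text.

```python
RX_BUDGET_US = 400  # budget quoted in the text

def strict_duration_us(timer_expiry_us, finish_us):
    """Duration under the strict policy: counted from the HW timer expiry."""
    return finish_us - timer_expiry_us

# Hypothetical cycle: the timer fires at t=0, the thread is only activated
# at t=150 and then works for 300 us, finishing at t=450.
timer_expiry_us = 0
activation_us = 150
work_us = 300
finish_us = activation_us + work_us

measured = strict_duration_us(timer_expiry_us, finish_us)
over_budget = measured > RX_BUDGET_US
print(measured, over_budget)  # 450 True
```

In this example the phase does only 300 us of work, yet the strict measure reports 450 us because it also counts the activation delay.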

This mechanism surely works if the budget time for each phase is higher than the time effectively required, and also if there are sporadic overflows of a phase into the next execution window. In such a case, the phases are activated in bursts and they realign when their execution time is finally reduced to normality. See picture.

image-20240524144212046

Figure. Activation of the RX, DO and TX phases when the RX phase lasts longer. The cycle re-synchronizes. Note that the durations are not the effective execution times; rather, they express the time elapsed since the intended start.

The longer phases don't usually happen because the system must be designed in such a way that they do not happen in normal cases. It may however happen in the initialization phase of boards with many joints that the RX phase lasts longer, but I have always observed that the cycle recovers and the execution of the phases stays inside the intended timing window. See below.

image-20240524145548763

Figure. Another example of excessively long duration of the RX phase (more than 1 ms). The cycle re-synchronizes and stays inside its timing window.

Observed misbehavior

I have recently observed on an amc board that the cycle recovered but one phase was activated earlier than its intended start time. I have studied the problem and found situations in which that may happen. See Figure below.

image-20240524150011683

Figure. Three cases of excessively long duration of the phases that put the cycle out of its intended timing window.

In the above situation the control is still executed correctly, and no harm would come to the execution of the service in the board apart from a huge number of timing overflow messages. The Figure below shows such a situation, with a flood of overflow messages for the RX phase (top of Figure) and for the DO phase (bottom of Figure).

image-20240524154612388

Figure. The early activation of a phase within its timing window is likely to generate continuous reports of excessive duration, because the phase is triggered by a past HW timer expiration and the measure is the sum of the effective duration plus the delay in its activation.

Description of the modified EOMtheEMSrunner

The above situation may happen. We know it does not normally happen, because we do not see a flood of diagnostic messages emitted by the board, and when we do we try to solve it. Nevertheless, we have to remove the possibility of it happening.

The cause

The cause is that some eSTARTxx events are emitted at moments when they do not contribute to activating the phase, and they stay active until the next eENABLExx, so the phase starts straight away even if it should start slightly later.

image-20240524161740919
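The effect of a stale eSTARTxx can be reproduced with a small event replay. This is a simplified sketch under assumed timings: the timestamps are invented, the DO timer is assumed to expire at 400 us within each 1 ms period, and the latching model is a stand-in for the RTOS events.

```python
PERIOD_US = 1000
DO_OFFSET_US = 400  # assumed DO timer offset within the period

def activation_times(events):
    """Replay (time_us, flag) events and return the times at which DO runs.

    Flags are latched: a start that arrives while the phase cannot run
    stays set until the next enable.
    """
    start = enable = False
    runs = []
    for t, flag in events:
        if flag == "start":
            start = True
        elif flag == "enable":
            enable = True
        if start and enable:
            start = enable = False
            runs.append(t)
    return runs

# Hypothetical overrun: the DO timer fires at 400 us, but RX finishes
# (and enables DO) only at 1250 us.
events = [(400, "start"), (1250, "enable")]
runs = activation_times(events)

# Next slot on the intended DO grid after the enable.
next_slot_us = DO_OFFSET_US + PERIOD_US
early_by_us = next_slot_us - runs[0]
print(runs, early_by_us)  # [1250] 150
```

In this replay DO is activated at 1250 us by the start latched at 400 us, which is 150 us before the slot at 1400 us where it would otherwise begin.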

The remedy

The remedy is to avoid the emission of eSTARTxx that are not necessary. I have tested some algorithms and this one does the job:

  • Emit an eSTARTx if phase x is not currently executing and the previous phase y = prev(x) is either the last one that finished or is currently executing.
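The rule can be written as a small predicate. The sketch below is an interpretation, not the code of the PR: the function names are hypothetical, the state is reduced to "which phase is running" and "which phase finished last", and prev(RX) is assumed to wrap around to TX.

```python
PHASES = ("RX", "DO", "TX")

def prev_phase(x):
    """Previous phase in the cycle; assumes prev(RX) wraps around to TX."""
    return PHASES[(PHASES.index(x) - 1) % len(PHASES)]

def should_emit_start(x, running, last_finished):
    """Decide whether the HW timer of phase x should emit eSTARTx.

    running:       name of the phase currently executing, or None
    last_finished: name of the most recently finished phase, or None
    """
    if running == x:
        return False  # x is already executing
    y = prev_phase(x)
    return running == y or last_finished == y

# DO timer expires while RX is still running: emit.
a = should_emit_start("DO", running="RX", last_finished="TX")
# DO timer expires while DO itself is running: do not emit.
b = should_emit_start("DO", running="DO", last_finished="RX")
# DO timer expires while TX runs and DO was the last to finish: do not emit.
c = should_emit_start("DO", running="TX", last_finished="DO")
print(a, b, c)  # True False False
```

The very first cycle after startup, when no phase has finished yet, is not covered by this sketch.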

The following figure shows how timing 6 achieves synchronization quite soon because of the reduced activations.

image

Figure. The new activation algorithm allows a quick recovery of synchronization for timing 6. Also, the measure of duration takes into account the effective time so that the focus is on the long phase only.

The tests

On a dedicated setup

I have tested both the ems and the amc on the lego setup, where I simulated increased execution times every 1 second to generate the problem. Here is the situation with the current and with the new activation algorithm.

not in synch

in synch

Figure. The new activation algorithm restores synchronization on the ems when some nasty bursts of RX-DO-TX much longer than 1 ms happen.

On the robots

Together w/ @martinaxgloria I have tested the ems and mc4plus on iCubGenova11, where obviously there are no time overflows: it works fine.

We have also tested the amc on ergoCub001 and it works fine as well. It also works fine with the third motor controlled over ICC1:3 rather than over CAN1:3.

Mergeability

After the tests we can safely merge this PR and the associated one:

@marcoaccame marcoaccame marked this pull request as draft May 24, 2024 12:00
- a revised activation mode that is robust to get in synch again in presence of long durations
- a different measurement of the RX-DO-TX phases that considers the effective execution time and not also the delay from the target activation of the HW timer
@marcoaccame marcoaccame force-pushed the feat/runner-improved branch from e9785a4 to a6bb37e Compare May 27, 2024 08:16
… amc.

enabled CANflushMODE_DO_phase for ems, mc4plus, mc2plus.
increased application versions for all the above boards
@marcoaccame marcoaccame merged commit dbc3e02 into robotology:devel May 28, 2024