osal_timer_UT periodic failure #335

skliper · 2019-12-31T17:34:19Z

Describe the bug
osal_timer_UT periodically reports a failure on linux for the nominal OS_TimerSet case

To Reproduce

Build and run

make ENABLE_UNIT_TESTS=TRUE SIMULATION=native prep
make
make install
build/exe/cpu1/osal_timer_UT

Expected behavior
Should pass

Code snips
See ut_ostimer_test.c for UT_os_timercallback (the timer callback function)
See ut_ostimer_timerio_test.c line 654 for the failure location

System observed on:

cFS Dev 2 Server
OS: Ubuntu 18.04
Versions: happened to be running the ic-20191230 branches, but not unique (historical issue)

Additional context
Added after the tolerance check in the callback function:

printf("\ndeltaTime = %d\n", deltaTime);
printf("g_toleranceVal = %d\n", g_toleranceVal);
printf("currIntervalTime = %d\n", currIntervalTime);
printf("prevIntervalTime = %d\n", prevIntervalTime);
printf("endTime = %d sec, %d msec\n", endTime.seconds, endTime.microsecs);
printf("currTime = %d sec, %d msec\n", currTime.seconds, currTime.microsecs);

failure result was seen due to unexpected first interval of 1056, where interval was set to 500000:

deltaTime = 0
g_toleranceVal = 25000
currIntervalTime = 0
prevIntervalTime = 0
endTime = 1577812468 sec, 179711 msec
currTime = 1577812468 sec, 179711 msec

deltaTime = 1056
g_toleranceVal = 25000
currIntervalTime = 1056
prevIntervalTime = 0
endTime = 1577812468 sec, 180767 msec
currTime = 1577812468 sec, 179711 msec

deltaTime = 498945
g_toleranceVal = 25000
currIntervalTime = 500001
prevIntervalTime = 1056
endTime = 1577812468 sec, 680768 msec
currTime = 1577812468 sec, 180767 msec

deltaTime = 12
g_toleranceVal = 25000
currIntervalTime = 499989
prevIntervalTime = 500001
endTime = 1577812469 sec, 180757 msec
currTime = 1577812468 sec, 680768 msec

Reporter Info
Jacob Hageman - NASA/GSFC

The text was updated successfully, but these errors were encountered:

skliper · 2019-12-31T17:35:05Z

@jphickey any relation to the vxworks timer startup issue?

skliper · 2019-12-31T17:38:26Z

Actual failure comes from the deltaTime = 498945 the 3rd time through, since the delta isn't checked on the first 2 frames.

skliper · 2019-12-31T17:48:29Z

Now I'm confused. The start time is set to 1000, so in this case the unit under test did what was expected. I ran it a few more times and DID get it to pass, which is likely the actual error.

jphickey · 2019-12-31T17:48:55Z

Can't really add printf statements to the timer callback function, as these can disrupt the timing substantially - so I'm not sure if that observation is really what is happening or if it is just an artifact of doing the printf's.

skliper · 2019-12-31T17:49:32Z

The timer is slow, and there is plenty of time to print prior to next call. They are after getting time.

skliper · 2019-12-31T17:52:06Z

I understand it could effect it, but not worth bothering w/ a fancier test when the delay is .5 sec. It does illustrate an issue in the test, so why the pushback?

skliper · 2019-12-31T17:53:58Z

I'll fix the test for the first issue identified and remove the printf's to retest (since start delay of 1000 is likely impacted by the printfs).

skliper · 2019-12-31T18:10:24Z

Now I'm confused. The start time is set to 1000, so in this case the unit under test did what was expected. I ran it a few more times and DID get it to pass, which is likely the actual error.

Retracting this. When it does pass, the first interval is ~500000 as expected. Just a guess since for all the failure cases the first interval is ~1000 usec, but could it be the callback is inappropriately getting called before the start delay? This doesn't sound like expected behavior, although the ~1000 usec could be a coincidence it's suspect that it matches the start delay.

jphickey · 2019-12-31T18:15:46Z

I think this is coincidence ... The 1000usec initial start delay is pretty tight for a desktop Linux kernel which typically has tick rate of 100Hz, so this is unlikely to be accurate but it doesn't check the first interval anyway. I just ran the test a bunch of times and when it finally failed (after many successful runs) it was on the 10th interval, which according to the check was only 7us after the previous call. I will look closer.

skliper · 2019-12-31T18:19:23Z

could it be related to the following comments?

osal/src/os/shared/osapi-time.c

Lines 358 to 370 in 155e9eb

    
              /* 
        
               * If the timer uses a dedicated time base, then also change that to the same interval. 
        
               * 
        
               * This is not perfect in the sense that the update is not all done atomically.  If this 
        
               * is not the first call to TimerSet, then weirdness can happen: 
        
               *  - a timer tick could occur between setting the callback times above and now 
        
               *  - the call to OS_TimeBaseSet() could fail leaving us in a mixed state 
        
               * 
        
               * However, the notion of multiple TimerSet() calls is a gray area to begin with, since 
        
               * timer ticks can occur at any moment - there is never a guarantee your new TimerSet values 
        
               * will get applied before the old timer expires.  Therefore by definition an application 
        
               * MUST be able to handle a possible "spurious" callback in these circumstances. 
        
               */

Maybe the spurious callback case the first time through, then gets called after the start wait.

skliper · 2019-12-31T18:21:55Z

Edited out Time design ignorance is showing...

Only useful part of this comment was it's the first call to TimerSet.

jphickey · 2019-12-31T18:59:14Z

I think I know what the issue is....

This test is not comparing the elapsed interval to the expected interval, but rather the interval that came before it. So yeah, if the first interval is short (i.e. 1000us), it invalidates the second test which will end up comparing 1000us to 500000us and getting a difference of 499000us.

I'm somewhat surprised this doesn't fail more often than it does.

jphickey · 2019-12-31T19:00:05Z

I'm currently testing a fix in a loop; if this runs repeatedly for an hour or so with no failures than I'll push it up.

On POSIX, call sigtimedwait() on the selected RT signal as part of the set up for the timebase. This ensures that if the signal is already pending, it will be cleared. This also simplifies the timer callback routine in the UT code, cleaning up some unncessary extra logic.

skliper · 2020-01-02T13:19:41Z

Note the real issue is described and fixed in #336. In short - pending signal from previous test could cause an extra "spurious" initial callback at the start, now cleared at setup.

On POSIX, call sigtimedwait() on the selected RT signal as part of the set up for the timebase. This ensures that if the signal is already pending, it will be cleared. This also simplifies the timer callback routine in the UT code, cleaning up some unncessary extra logic.

Fix #294 #335 Reviewd and approved at 2020-01-08 CCB

On POSIX, call sigtimedwait() on the selected RT signal as part of the set up for the timebase. This ensures that if the signal is already pending, it will be cleared. This also simplifies the timer callback routine in the UT code, cleaning up some unncessary extra logic.

Fix #335: Clear pending signal when configuring timer

Fix nasa#335: Shell unsigned pkt length bug

jphickey mentioned this issue Dec 31, 2019

Clear pending signal when configuring timer #336

Merged

skliper added the bug label Jan 2, 2020

skliper added this to the 5.1.0 milestone Jan 2, 2020

skliper mentioned this issue Jan 2, 2020

ostimer OS_TimerSet may fail when testing on a VM #162

Closed

skliper added a commit that referenced this issue Jan 8, 2020

CCB 20200108: Merge #325 #336

092f047

Fix #294 #335 Reviewd and approved at 2020-01-08 CCB

jphickey mentioned this issue Jan 13, 2020

Reevaluate signal handling in POSIX #348

Open

skliper mentioned this issue Jan 14, 2020

Integration Candidate 20200108 #350

Merged

skliper added a commit that referenced this issue Jan 14, 2020

Merge pull request #336 from jphickey/fix-335-timer-ut-failure

6ad6f2f

Fix #335: Clear pending signal when configuring timer

skliper closed this as completed in #350 Jan 17, 2020

lethuillierg mentioned this issue May 3, 2020

#include <sys/signal.h> from os-posix.h causes compiler warnings in Alpine Linux #438

Open

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Fix nasa#335: Shell unsigned pkt length bug

627a969

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Merge pull request nasa#584 from skliper/fix335-shell-unsigned-bug

8ed578e

Fix nasa#335: Shell unsigned pkt length bug

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

osal_timer_UT periodic failure #335

osal_timer_UT periodic failure #335

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

jphickey commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

jphickey commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019 •

edited

Loading

jphickey commented Dec 31, 2019

jphickey commented Dec 31, 2019

skliper commented Jan 2, 2020

osal_timer_UT periodic failure #335

osal_timer_UT periodic failure #335

Comments

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

jphickey commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019

jphickey commented Dec 31, 2019

skliper commented Dec 31, 2019

skliper commented Dec 31, 2019 • edited Loading

jphickey commented Dec 31, 2019

jphickey commented Dec 31, 2019

skliper commented Jan 2, 2020

skliper commented Dec 31, 2019 •

edited

Loading