Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osal_timer_UT periodic failure #335

Closed
skliper opened this issue Dec 31, 2019 · 14 comments · Fixed by #350
Closed

osal_timer_UT periodic failure #335

skliper opened this issue Dec 31, 2019 · 14 comments · Fixed by #350
Labels
Milestone

Comments

@skliper
Copy link
Contributor

skliper commented Dec 31, 2019

Describe the bug
osal_timer_UT periodically reports a failure on linux for the nominal OS_TimerSet case

To Reproduce

  1. Build and run
make ENABLE_UNIT_TESTS=TRUE SIMULATION=native prep
make
make install
build/exe/cpu1/osal_timer_UT

Expected behavior
Should pass

Code snips
See ut_ostimer_test.c for UT_os_timercallback (the timer callback function)
See ut_ostimer_timerio_test.c line 654 for the failure location

System observed on:

  • cFS Dev 2 Server
  • OS: Ubuntu 18.04
  • Versions: happened to be running the ic-20191230 branches, but not unique (historical issue)

Additional context
Added after the tolerance check in the callback function:

printf("\ndeltaTime = %d\n", deltaTime);
printf("g_toleranceVal = %d\n", g_toleranceVal);
printf("currIntervalTime = %d\n", currIntervalTime);
printf("prevIntervalTime = %d\n", prevIntervalTime);
printf("endTime = %d sec, %d msec\n", endTime.seconds, endTime.microsecs);
printf("currTime = %d sec, %d msec\n", currTime.seconds, currTime.microsecs);

failure result was seen due to unexpected first interval of 1056, where interval was set to 500000:

deltaTime = 0
g_toleranceVal = 25000
currIntervalTime = 0
prevIntervalTime = 0
endTime = 1577812468 sec, 179711 msec
currTime = 1577812468 sec, 179711 msec

deltaTime = 1056
g_toleranceVal = 25000
currIntervalTime = 1056
prevIntervalTime = 0
endTime = 1577812468 sec, 180767 msec
currTime = 1577812468 sec, 179711 msec

deltaTime = 498945
g_toleranceVal = 25000
currIntervalTime = 500001
prevIntervalTime = 1056
endTime = 1577812468 sec, 680768 msec
currTime = 1577812468 sec, 180767 msec

deltaTime = 12
g_toleranceVal = 25000
currIntervalTime = 499989
prevIntervalTime = 500001
endTime = 1577812469 sec, 180757 msec
currTime = 1577812468 sec, 680768 msec

Reporter Info
Jacob Hageman - NASA/GSFC

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

@jphickey any relation to the vxworks timer startup issue?

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

Actual failure comes from the deltaTime = 498945 the 3rd time through, since the delta isn't checked on the first 2 frames.

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

Now I'm confused. The start time is set to 1000, so in this case the unit under test did what was expected. I ran it a few more times and DID get it to pass, which is likely the actual error.

@jphickey
Copy link
Contributor

Can't really add printf statements to the timer callback function, as these can disrupt the timing substantially - so I'm not sure if that observation is really what is happening or if it is just an artifact of doing the printf's.

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

The timer is slow, and there is plenty of time to print prior to next call. They are after getting time.

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

I understand it could effect it, but not worth bothering w/ a fancier test when the delay is .5 sec. It does illustrate an issue in the test, so why the pushback?

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

I'll fix the test for the first issue identified and remove the printf's to retest (since start delay of 1000 is likely impacted by the printfs).

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

Now I'm confused. The start time is set to 1000, so in this case the unit under test did what was expected. I ran it a few more times and DID get it to pass, which is likely the actual error.

Retracting this. When it does pass, the first interval is ~500000 as expected. Just a guess since for all the failure cases the first interval is ~1000 usec, but could it be the callback is inappropriately getting called before the start delay? This doesn't sound like expected behavior, although the ~1000 usec could be a coincidence it's suspect that it matches the start delay.

@jphickey
Copy link
Contributor

I think this is coincidence ... The 1000usec initial start delay is pretty tight for a desktop Linux kernel which typically has tick rate of 100Hz, so this is unlikely to be accurate but it doesn't check the first interval anyway. I just ran the test a bunch of times and when it finally failed (after many successful runs) it was on the 10th interval, which according to the check was only 7us after the previous call. I will look closer.

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

could it be related to the following comments?

/*
* If the timer uses a dedicated time base, then also change that to the same interval.
*
* This is not perfect in the sense that the update is not all done atomically. If this
* is not the first call to TimerSet, then weirdness can happen:
* - a timer tick could occur between setting the callback times above and now
* - the call to OS_TimeBaseSet() could fail leaving us in a mixed state
*
* However, the notion of multiple TimerSet() calls is a gray area to begin with, since
* timer ticks can occur at any moment - there is never a guarantee your new TimerSet values
* will get applied before the old timer expires. Therefore by definition an application
* MUST be able to handle a possible "spurious" callback in these circumstances.
*/

Maybe the spurious callback case the first time through, then gets called after the start wait.

@skliper
Copy link
Contributor Author

skliper commented Dec 31, 2019

Edited out Time design ignorance is showing...

Only useful part of this comment was it's the first call to TimerSet.

@jphickey
Copy link
Contributor

I think I know what the issue is....

This test is not comparing the elapsed interval to the expected interval, but rather the interval that came before it. So yeah, if the first interval is short (i.e. 1000us), it invalidates the second test which will end up comparing 1000us to 500000us and getting a difference of 499000us.

I'm somewhat surprised this doesn't fail more often than it does.

@jphickey
Copy link
Contributor

I'm currently testing a fix in a loop; if this runs repeatedly for an hour or so with no failures than I'll push it up.

jphickey added a commit to jphickey/osal that referenced this issue Dec 31, 2019
On POSIX, call sigtimedwait() on the selected RT signal as part of
the set up for the timebase.  This ensures that if the signal is
already pending, it will be cleared.

This also simplifies the timer callback routine in the UT code,
cleaning up some unncessary extra logic.
@skliper
Copy link
Contributor Author

skliper commented Jan 2, 2020

Note the real issue is described and fixed in #336. In short - pending signal from previous test could cause an extra "spurious" initial callback at the start, now cleared at setup.

@skliper skliper added the bug label Jan 2, 2020
@skliper skliper added this to the 5.1.0 milestone Jan 2, 2020
skliper pushed a commit that referenced this issue Jan 8, 2020
On POSIX, call sigtimedwait() on the selected RT signal as part of
the set up for the timebase.  This ensures that if the signal is
already pending, it will be cleared.

This also simplifies the timer callback routine in the UT code,
cleaning up some unncessary extra logic.
skliper added a commit that referenced this issue Jan 8, 2020
Fix #294 #335
Reviewd and approved at 2020-01-08 CCB
skliper pushed a commit that referenced this issue Jan 13, 2020
On POSIX, call sigtimedwait() on the selected RT signal as part of
the set up for the timebase.  This ensures that if the signal is
already pending, it will be cleared.

This also simplifies the timer callback routine in the UT code,
cleaning up some unncessary extra logic.
skliper added a commit that referenced this issue Jan 14, 2020
Fix #335: Clear pending signal when configuring timer
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants