
gstreamer - high IAT(Inter Arrival Time) around ~10ms observed between UDP datagrams of SRT client output #673

Closed
ravtr opened this issue Apr 30, 2019 · 20 comments · Fixed by #678
Labels
[core] Area: Changes in SRT library core

Comments

@ravtr

ravtr commented Apr 30, 2019

Using the pipeline below to convert RTP to SRT:
udpsrc name=srtudpsrc multicast-iface=lo address=<address port= caps="application/x-rtp, media=(string)video, clock-rate=(int)90000, payload=(int)33, encoding-name=(string)MP2T" ! queue name=rtpqueue max-size-buffers=80000 max-size-bytes=100000000 max-size-time=10000000000 min-threshold-time=0 min-threshold-buffers=0 min-threshold-bytes=0 leaky=upstream ! rtpjitterbuffer latency=40 ! queue name=rtpjitterbufferqueue ! rtpmp2tdepay ! queue name=outputqueue ! srtclientsink uri= latency=1000 max-lateness=100000000 key-length=16 passphrase=""

When monitored, the input IAT was seen to be 1 ms, but the output of the SRT client is around 10 ms roughly 10% of the time. The same is observed when packets are captured at the SRT output.

I need some help debugging this issue.
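For context, IAT here means the time between consecutive UDP datagrams arriving on a socket. Below is a minimal sketch (not part of the original report) of one way to measure it at the receiving side on Linux; the port number is a placeholder for wherever the SRT client output is received.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <cstring>

int main()
{
   // Bind to the UDP port where the datagrams under test arrive (placeholder port).
   int s = socket(AF_INET, SOCK_DGRAM, 0);
   sockaddr_in addr;
   std::memset(&addr, 0, sizeof(addr));
   addr.sin_family = AF_INET;
   addr.sin_addr.s_addr = htonl(INADDR_ANY);
   addr.sin_port = htons(5000);
   bind(s, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

   char buf[2048];
   auto prev = std::chrono::steady_clock::now();
   while (recv(s, buf, sizeof(buf), 0) > 0)
   {
      // Inter-arrival time: gap between this datagram and the previous one.
      const auto now = std::chrono::steady_clock::now();
      const auto iat_us = std::chrono::duration_cast<std::chrono::microseconds>(now - prev).count();
      prev = now;
      std::printf("IAT: %lld us\n", static_cast<long long>(iat_us));
   }
   close(s);
   return 0;
}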

@ravtr ravtr changed the title high IAT(Inter Arrival Time) around ~10ms observed between UDP datagrams of SRT client output gstreamer - high IAT(Inter Arrival Time) around ~10ms observed between UDP datagrams of SRT client output Apr 30, 2019
@maxsharabayko
Collaborator

@ravtr This might be related to #637. SRT's inter-packet sending has a minimum waiting interval of 10 ms. So if each packet should be sent at a 1 ms interval, the first one will have a 10 ms inter-packet delay, and the following 9 will be sent as soon as possible to make the interval 1 ms on average.
According to the experiments in #637, this minimum interval can be reduced to 1 ms, as the sleeping accuracy on Linux is good enough. We are looking forward to checking how this will affect the overall performance of SRT.
Furthermore, there is an option to use a busy-waiting loop for delivery to achieve higher accuracy. To check this, delete the definitions of NO_BUSY_WAITING from srt.h and udt.h. I would also recommend using the following code in CUDT::packData(...) to preserve the inter-packet interval adjustment:

//#ifndef NO_BUSY_WAITING
#if 0
	ts_tk = entertime_tk + m_ullInterval_tk;
#else
	if (m_ullTimeDiff_tk >= m_ullInterval_tk)
	{
		ts_tk = entertime_tk;
		m_ullTimeDiff_tk -= m_ullInterval_tk;
	}
	else
	{
		ts_tk = entertime_tk + m_ullInterval_tk - m_ullTimeDiff_tk;
		m_ullTimeDiff_tk = 0;
	}
#endif	
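To illustrate the averaging behaviour described above, here is a small standalone sketch (not SRT code): a pacer that targets a 1 ms inter-packet interval but whose sleep primitive cannot wait less than 10 ms. Roughly every 10th gap comes out at ~10 ms and the rest are near zero, which matches the observed ~10% of 10 ms gaps.

#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
   using clock = std::chrono::steady_clock;
   const auto interval  = std::chrono::milliseconds(1);   // target inter-packet interval
   const auto min_sleep = std::chrono::milliseconds(10);  // minimum waiting interval

   auto next = clock::now();
   auto prev = clock::now();
   for (int i = 0; i < 30; ++i)
   {
      next += interval;                              // scheduled send time of this packet
      if (clock::now() < next)
         std::this_thread::sleep_for(min_sleep);     // cannot sleep less than 10 ms: overshoots
      const auto now = clock::now();                 // the packet would be sent here
      const auto gap = std::chrono::duration_cast<std::chrono::microseconds>(now - prev).count();
      std::printf("packet %2d  gap = %lld us\n", i, static_cast<long long>(gap));
      prev = now;
   }
   return 0;
}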

@maxsharabayko maxsharabayko added the [core] Area: Changes in SRT library core label Apr 30, 2019
@maxsharabayko maxsharabayko added this to the v.1.3.3 milestone Apr 30, 2019
@ravtr
Author

ravtr commented May 1, 2019

@maxlovic Thanks for the quick reply; I will try the suggested changes and share the results.

@ravtr
Author

ravtr commented May 1, 2019

@maxlovic I have disabled NO_BUSY_WAITING in srt.h and udt.h, and I no longer see the 10 ms peaks.
Below are the versions of the modules we are using:
libsrt - 1.3.1 (libsrt1_1.3.1-1_amd64.deb)

Is there any plan to make this change a configurable feature?
Can I make the change and raise a feature request?

@oviano
Contributor

oviano commented May 1, 2019

Would the NO_BUSY_WAITING change described above be appropriate for Windows too? I understand that Windows timers are even worse, so this could help, right?

@maxsharabayko
Collaborator

@ravtr

Is there any plan to make this change a configurable feature?
Can I make the change and raise a feature request?

We are looking into this. Some tests on several platforms still need to be performed.

  1. I think BUSY_WAITING should definitely become a predefined option, visible in CMake. The current 10 ms waiting should be the default, but one should be able to enable busy waiting as a build option.
  2. The 10 ms interval might be reduced to 1 ms for higher accuracy. We will have to check platform support.

@oviano

Would the NO_BUSY_WAITING change described above be appropriate for Windows too? I understand that Windows timers are even worse, so this could help, right?

Yes, Windows timers are far less accurate. But busy waiting will produce some undesirable CPU load, so it can't be the general-case solution. Maybe the fix will be to wait with a timer if the waiting time is longer than 10 ms, and with busy waiting if it is shorter. We will see.

@oviano
Contributor

oviano commented May 2, 2019

So here is a suggested modified version of CTimer::sleepto() which uses the timer for waits >= 10 ms and busy-waits for any smaller amounts.

The * 10000 could eventually be replaced with some sort of platform-specific granularity constant.

Maybe @ravtr you could test this and compare the CPU load against the change you already made that busy-waits in all cases?

void CTimer::sleepto(uint64_t nexttime)
{
   // Use class member such that the method can be interrupted by others
   m_ullSchedTime = nexttime;

   uint64_t t;
   rdtsc(t);

#ifndef NO_BUSY_WAITING
   uint64_t dt = s_ullCPUFrequency * 10000;
   while (t < m_ullSchedTime && m_ullSchedTime - t >= dt)
#else
   while (t < m_ullSchedTime)
#endif
   {
       timeval now;
       timespec timeout;
       gettimeofday(&now, 0);
       if (now.tv_usec < 990000)
       {
           timeout.tv_sec = now.tv_sec;
           timeout.tv_nsec = (now.tv_usec + 10000) * 1000;
       }
       else
       {
           timeout.tv_sec = now.tv_sec + 1;
           timeout.tv_nsec = (now.tv_usec + 10000 - 1000000) * 1000;
       }
       THREAD_PAUSED();
       pthread_mutex_lock(&m_TickLock);
       pthread_cond_timedwait(&m_TickCond, &m_TickLock, &timeout);
       pthread_mutex_unlock(&m_TickLock);
       THREAD_RESUMED();

       rdtsc(t);
   }

#ifndef NO_BUSY_WAITING
   while (t < m_ullSchedTime)
   {
#ifdef IA32
       __asm__ volatile ("pause; rep; nop; nop; nop; nop; nop;");
#elif IA64
       __asm__ volatile ("nop 0; nop 0; nop 0; nop 0; nop 0;");
#elif AMD64
       __asm__ volatile ("nop; nop; nop; nop; nop;");
#endif

       rdtsc(t);
   }
#endif

}

@ravtr
Author

ravtr commented May 8, 2019

@oviano @maxlovic
As suggested by @oviano, the sleep is made a busy wait only below 10 ms, and the NO_BUSY_WAITING definition has been moved to CMakeLists. Please check the pull request yvvarun#1.
You can plan to merge it based on your schedule. Let me know if you have any comments.
Without the changes, CPU usage was around 30%; with the above change it goes up to 70%.

@oviano
Contributor

oviano commented May 8, 2019

The only thing I notice about that PR is that it doesn't include the change suggested by @maxlovic to preserve the inter-packet interval adjustment. Just mentioning it in case that was an oversight.

@ravtr
Author

ravtr commented May 8, 2019

@maxtomilov I can see that #678 has been raised against master.
Currently we are using 1.3.1. Is it possible for you to provide a pull request against 1.3.1, so that we can use your changes?

@oviano
Contributor

oviano commented May 8, 2019

Also, there are these lines in queue.cpp:

#ifdef NO_BUSY_WAITING
m_pTimer->tick();
#endif

But with the above hybrid sleep/busy-wait loop, you still want it to call ->tick() to wake it up, because it may be in the pthread_cond_timedwait portion of the code, i.e. just make it:

m_pTimer->tick();
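For context, tick() just signals the condition variable that sleepto() waits on, which is why it must be called unconditionally here. A minimal sketch of what it does, using the member names from the sleepto() code in this thread (simplified, not a verbatim copy of the SRT source):

void CTimer::tick()
{
   // Wake up a thread blocked in pthread_cond_timedwait() inside sleepto(),
   // so it can re-check m_ullSchedTime instead of sleeping out the full timeout.
   pthread_cond_signal(&m_TickCond);
}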

@maxsharabayko
Collaborator

maxsharabayko commented May 8, 2019

@ravtr @oviano Using busy waiting together with normal waiting is better added as a next step.
First, the waiting interval can be reduced from 10 ms down to the desired time. We are checking the impact of this change.

@oviano
Contributor

oviano commented May 8, 2019

@ravtr @oviano Using busy waiting together with normal waiting is better added as a next step.
First, the waiting interval can be reduced from 10 ms down to the desired time. We are checking the impact of this change.

Makes sense. What would be the desired time, though? It's going to vary per platform. The best I've been able to achieve on Windows is about 1700 us.

@maxsharabayko
Collaborator

@ravtr Version 1.3.1 with changes from PR 678: link. Please check the default build (without busy waiting).

@ravtr
Author

ravtr commented May 10, 2019

@maxlovic thanks

@oviano
Contributor

oviano commented Aug 26, 2019

So, here is the revised sleepto() I am using, having updated my SRT to 1.3.4.

Taking this second step is very important - it reduces idle CPU from around 20% to almost zero on both Windows machines I have tried.

void CTimer::sleepto(uint64_t nexttime)
{
   // Use class member such that the method can be interrupted by others
   m_ullSchedTime = nexttime;

   uint64_t t;
   rdtsc(t);

#if USE_BUSY_WAITING
   uint64_t dt = s_ullCPUFrequency * 10000;
   while (t < m_ullSchedTime && m_ullSchedTime - t >= dt)
#else
   while (t < m_ullSchedTime)
#endif
   {
      const uint64_t wait_us = (m_ullSchedTime - t) / s_ullCPUFrequency;
      // The while loop ensures that (t < m_ullSchedTime).
      // Division by frequency may lose precision.
      if (wait_us == 0)
         break;

      timeval now;
      gettimeofday(&now, 0);
      const uint64_t time_us = now.tv_sec * uint64_t(1000000) + now.tv_usec + wait_us;
      timespec timeout;
      timeout.tv_sec = time_us / 1000000;
      timeout.tv_nsec = (time_us % 1000000) * 1000;

      THREAD_PAUSED();
      pthread_mutex_lock(&m_TickLock);
      pthread_cond_timedwait(&m_TickCond, &m_TickLock, &timeout);
      pthread_mutex_unlock(&m_TickLock);
      THREAD_RESUMED();

      rdtsc(t);
   }

#if USE_BUSY_WAITING
   while (t < m_ullSchedTime)
   {
#ifdef IA32
      __asm__ volatile ("pause; rep; nop; nop; nop; nop; nop;");
#elif IA64
      __asm__ volatile ("nop 0; nop 0; nop 0; nop 0; nop 0;");
#elif AMD64
      __asm__ volatile ("nop; nop; nop; nop; nop;");
#elif defined(_WIN32) && !defined(__MINGW__)
      __nop ();
      __nop ();
      __nop ();
      __nop ();
      __nop ();
#endif

      rdtsc(t);
   }
#endif

}

@maxsharabayko
Collaborator

@oviano
In your approach you first wait on a CV if the waiting time is more than 10 ms, and for the remaining time you wait in a spin loop (busy waiting).

  1. The 10 ms threshold only makes sense for Windows. For Linux and macOS it is better to use 1 ms.
  2. pthread_cond_timedwait will wait at least the specified time, meaning it can take longer, but not less. It can wake up earlier only due to a spurious wake-up.
    At the same time you are trying to wait on the CV for the whole interval:
    wait_us = (m_ullSchedTime - t) / s_ullCPUFrequency;
    So effectively busy waiting will take place only if a spurious wake-up happens within the interval [t - 10 ms; t), where t is the target wake-up time (timeout).

To sum up, in this approach busy waiting is used only for waiting durations below 10 ms, or after a spurious wake-up when the remaining waiting time is less than 10 ms.

What bitrate did you use for testing?

Can you create a PR?

@oviano
Contributor

oviano commented Aug 27, 2019

Yes, sorry, there was a flaw in my edit this morning. It needs to be more like the code below, so that in busy-waiting mode it sleeps only 10 ms per iteration until there is < 10 ms remaining, at which point it does the actual busy waiting in the second loop.

void CTimer::sleepto(uint64_t nexttime)
{
   // Use class member such that the method can be interrupted by others
   m_ullSchedTime = nexttime;

   uint64_t t;
   rdtsc(t);

#if USE_BUSY_WAITING
   const uint64_t wait_us = 10000;
   uint64_t dt = s_ullCPUFrequency * wait_us;
   while (t < m_ullSchedTime && m_ullSchedTime - t >= dt)
   {
#else
   while (t < m_ullSchedTime)
   {
      const uint64_t wait_us = (m_ullSchedTime - t) / s_ullCPUFrequency;
      // The while loop ensures that (t < m_ullSchedTime).
      // Division by frequency may lose precision.
      if (wait_us == 0)
         break;
#endif

      timeval now;
      gettimeofday(&now, 0);
      const uint64_t time_us = now.tv_sec * uint64_t(1000000) + now.tv_usec + wait_us;
      timespec timeout;
      timeout.tv_sec = time_us / 1000000;
      timeout.tv_nsec = (time_us % 1000000) * 1000;

      THREAD_PAUSED();
      pthread_mutex_lock(&m_TickLock);
      pthread_cond_timedwait(&m_TickCond, &m_TickLock, &timeout);
      pthread_mutex_unlock(&m_TickLock);
      THREAD_RESUMED();

      rdtsc(t);
   }

#if USE_BUSY_WAITING
   while (t < m_ullSchedTime)
   {
#ifdef IA32
      __asm__ volatile ("pause; rep; nop; nop; nop; nop; nop;");
#elif IA64
      __asm__ volatile ("nop 0; nop 0; nop 0; nop 0; nop 0;");
#elif AMD64
      __asm__ volatile ("nop; nop; nop; nop; nop;");
#elif defined(_WIN32) && !defined(__MINGW__)
      __nop ();
      __nop ();
      __nop ();
      __nop ();
      __nop ();
#endif

      rdtsc(t);
   }
#endif
}

@maxsharabayko
Collaborator

I would rather wait till m_ullSchedTime - 10 ms then.

@oviano
Contributor

oviano commented Aug 27, 2019

Yes that would be even better, you are right.

@oviano
Contributor

oviano commented Aug 27, 2019

So, a refined version: now it will only do the pthread sleep up until t - 10 ms (or 1 ms for non-Windows).

void CTimer::sleepto(uint64_t nexttime)
{
   // Use class member such that the method can be interrupted by others
   m_ullSchedTime = nexttime;

   uint64_t t;
   rdtsc(t);

#if USE_BUSY_WAITING
#if defined(_WIN32)
   const uint64_t threshold = 10000;
#else
   const uint64_t threshold = 1000;
#endif
#endif

   while (t < m_ullSchedTime)
   {
#if USE_BUSY_WAITING
      uint64_t wait_us = (m_ullSchedTime - t) / s_ullCPUFrequency;
      // Sleep only up to (m_ullSchedTime - threshold); busy-wait the final stretch.
      if (wait_us <= threshold)
         break;
      wait_us -= threshold;
#else
      const uint64_t wait_us = (m_ullSchedTime - t) / s_ullCPUFrequency;
      if (wait_us == 0)
         break;
#endif

      timeval now;
      gettimeofday(&now, 0);
      const uint64_t time_us = now.tv_sec * uint64_t(1000000) + now.tv_usec + wait_us;
      timespec timeout;
      timeout.tv_sec = time_us / 1000000;
      timeout.tv_nsec = (time_us % 1000000) * 1000;

      THREAD_PAUSED();
      pthread_mutex_lock(&m_TickLock);
      pthread_cond_timedwait(&m_TickCond, &m_TickLock, &timeout);
      pthread_mutex_unlock(&m_TickLock);
      THREAD_RESUMED();

      rdtsc(t);
   }

#if USE_BUSY_WAITING
   while (t < m_ullSchedTime)
   {
#ifdef IA32
      __asm__ volatile ("pause; rep; nop; nop; nop; nop; nop;");
#elif IA64
      __asm__ volatile ("nop 0; nop 0; nop 0; nop 0; nop 0;");
#elif AMD64
      __asm__ volatile ("nop; nop; nop; nop; nop;");
#elif defined(_WIN32) && !defined(__MINGW__)
      __nop ();
      __nop ();
      __nop ();
      __nop ();
      __nop ();
#endif

      rdtsc(t);
   }
#endif

}
