Wrong statistics with version 2.3 and threads #116

klaus3000 · 2017-01-05T10:50:30Z

Hi!
We tested todays version from master (commit c02a170).

The problem is, that it does not report all requests, and even less responses. We started with 100q/s, increased to 200q/s, and then switched back to our old DSC version (reported as 2.0.0 rc1).

Requests

Responses

New DSC uses the same config as the old DSC.

When using newest DSC with option -T the count is correct (200q/s)

The text was updated successfully, but these errors were encountered:

jelu · 2017-01-05T10:56:54Z

Thanks for the report.

I have been trying to solve various thread related issues the past weeks but have been unsuccessfully so I will shortly release 2.4.0 which will have threads disabled by default.

Am also setting up continuous testing environment so we can monitor the develop branch and make sure nothing like this happens again.

klaus3000 · 2017-01-05T11:09:32Z

Another interesting thing. The new DSC with -T does count correct, but seems to have light fluctuations whereas the old DSC hat constantly 200 q/s - see below:

klaus3000 · 2017-01-05T11:10:00Z

And the peak of 250q/s in the first sample intervall looks strange too

jelu · 2017-01-05T11:14:55Z

This is probably due to the total reworking of the libpcap parts that was made trying to solve other issues.

Can you look in the XML/DAT files for the packet captured, parsed, dropped for the durations?

For reference what distribution and kernel version are you running?

klaus3000 · 2017-01-05T11:40:47Z

Here are the statistics for 4 consecutive minutes.

There are 200 q/s hitting the servers via 2 links, thus capturing on 2 interfaces. In total there are 400 packets/s or 24000packets/min (sample intervall).

The pkts_captured on eth2 seems to alternate for the value of packets in 1 second.

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="5901"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17700"/>
  <pcap_stat val="filter_received" count="17705"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="6097"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="18299"/>
  <pcap_stat val="filter_received" count="18295"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5998"/>
  <pcap_stat val="filter_received" count="5901"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17698"/>
  <pcap_stat val="filter_received" count="17703"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5995"/>
  <pcap_stat val="filter_received" count="6093"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="18295"/>
  <pcap_stat val="filter_received" count="18290"/>
</ifname>

This are the statistics for the first sample intervall which falsely reports 250q/s

<ifname val="eth1">
  <pcap_stat val="filter_received" count="11205"/>
  <pcap_stat val="pkts_captured" count="7497"/>
  <pcap_stat val="kernel_dropped" count="3609"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="filter_received" count="33609"/>
  <pcap_stat val="pkts_captured" count="22793"/>
  <pcap_stat val="kernel_dropped" count="10815"/>
</ifname>

jelu · 2017-01-05T11:46:28Z

Can you paste 4 captures from old also?

klaus3000 · 2017-01-05T12:04:14Z

Forgot to mention - we are using Ubuntu 14.04 with kernel:
Linux mirror-tst 4.4.0-34-generic #53~14.04.1-Ubuntu SMP Wed Jul 27 16:56:40 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

klaus3000 · 2017-01-05T12:04:41Z

Here are 4 consecutive statistics using the old binary:

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="6000"/>
  <pcap_stat val="filter_received" count="6000"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17999"/>
  <pcap_stat val="filter_received" count="17999"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="5999"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="18000"/>
  <pcap_stat val="filter_received" count="18000"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="5999"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17997"/>
  <pcap_stat val="filter_received" count="17997"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5998"/>
  <pcap_stat val="filter_received" count="5998"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17999"/>
  <pcap_stat val="filter_received" count="17999"/>
</ifname>

jelu · 2017-01-06T08:58:59Z

If you have not already please read my announcement regarding this:
https://lists.dns-oarc.net/pipermail/dsc/2017-January/000361.html

I have setup some test now for dsc, if you have any other ideas of tests please let me know:
https://dev.dns-oarc.net/jenkins/view/dsctest/

jelu · 2017-01-12T07:23:16Z

The testing platform is starting to yield some results, #122 may explain what you are seeing so could you check the start_time/stop_time in the files?

jelu · 2017-01-12T13:46:52Z

The spike at the beginning is because the new code is using pcap_create() and pcap_activate() before the sleeping period where dsc syncs to the interval, that means that packets are buffered while it sleeps. Old code used pcap_open_live() which is a pre libpcap 1.0 function which must work quite differently under the hood.

I will extend pcap-thread with functionality to active the interfaces after the sleep period.

klaus3000 · 2017-01-12T15:25:59Z

Just a question about the sleeping period: if the interval is 1hour - where does it sync to? To the next minute or to the next our? The latter would be bad because in case of a dsc restart we could loose up to 59minutes.

jelu · 2017-01-12T15:30:39Z

If your interval is 1 hour and you start a second after the hour it would wait 59:59, you should see this in your syslog.

klaus3000 · 2017-01-12T19:04:08Z

If i find some time I will write an option to start immediately with a reduced first interval. Regarding start_time/stop_time -> tomorrow.

jelu · 2017-01-12T19:11:10Z

Make it a dsc.conf option

klaus3000 · 2017-01-13T08:12:50Z

Regardin start/stop time:
The 2.3 version we are running threaded does show 1s leap between start und previous stop time once and a while (~ every 80th frame):
1484245200.dscdata.xml: start_time="1484245140" stop_time="1484245200">
1484245260.dscdata.xml: start_time="1484245201" stop_time="1484245260">

1484249940.dscdata.xml: start_time="1484249880" stop_time="1484249940">
1484250000.dscdata.xml: start_time="1484249941" stop_time="1484250000">

1484254620.dscdata.xml: start_time="1484254560" stop_time="1484254620">
1484254680.dscdata.xml: start_time="1484254621" stop_time="1484254680">

1484259420.dscdata.xml: start_time="1484259360" stop_time="1484259420">
1484259480.dscdata.xml: start_time="1484259421" stop_time="1484259480">

The 2.3 version we are running with -T does show 1s leap between start und previous stop time ~ every 6th frame.

The old 2.0 version does not show any gaps.

Shall we also test the current master?

jelu · 2017-01-13T08:25:53Z

You don't have to run master, thanks for the information.

Are any of these 2.3 based on the develop/a special commit?

jelu · 2017-01-13T08:45:38Z

Btw, I am putting in a fix for the burst at startup and will add your request for an option to withgo of the delayed startup.

klaus3000 · 2017-01-13T11:25:01Z

Our 2.3 version is actually commit c02a170

Issue #116: Feature for not waiting on the interval sync

jelu · 2017-01-13T16:02:56Z

Spike at start should be solved now and your feature is added, if you can please test the latest develop branch.

klaus3000 · 2017-01-16T17:31:26Z

testing with or without -T?

jelu · 2017-01-16T17:35:19Z

Threads are default disabled now so -T won't do anything unless you do: configure --enable-threads

jelu · 2017-01-20T12:44:22Z

Were you able to run the develop branch? Do you have any issue to report?

klaus3000 · 2017-01-20T15:56:53Z

Using commit 1569bbf
First test starting synced to the minute with 100q/s:

Seems to count correct know. Minor issue: the first intervall seems to starts capturing with 1s delay.

The respective XML file:

<array name="pcap_stats" dimensions="2" start_time="1484926320" stop_time="1484926380">
  <dimension number="1" type="ifname"/>
  <dimension number="2" type="pcap_stat"/>
  <data>
    <ifname val="eth1">
    </ifname>
    <ifname val="eth2">
      <pcap_stat val="filter_received" count="11995"/>
      <pcap_stat val="pkts_captured" count="11796"/>
    </ifname>
  </data>
</array>

jelu · 2017-01-20T17:58:48Z

Can you paste start_time/stop_time from that file?

klaus3000 · 2017-01-20T19:40:37Z

I pasted it already - see the first line. It looks correct.

klaus3000 · 2017-01-20T19:43:16Z

Maybe the 1 second delay is related to the reason why tcpdump has 1 second delay: http://unix.stackexchange.com/questions/224893/tcpdump-waits-a-second-before-displaying-packets

klaus3000 · 2017-01-20T19:50:47Z

Second test: using "no_wait_interval" and started at 30 seconds after full minute - looks fine.

One thing I noticed: there is no final XML file when dsc is killed although I used "dump_reports_on_exit"

jelu · 2017-01-20T20:29:02Z

How was dsc killed?

jelu · 2017-01-20T20:30:39Z

On Jan 20, 2017 20:40, "Klaus Darilion" <notifications@github.com> wrote: I pasted it already - see the first line. It looks correct. Sorry it didn't show in the email.

jelu · 2017-01-20T20:45:40Z

For the packets missed at the beginning its not related to tcpdump 1 sec. Problem here is that things take time and you can't really measure it and compensate it without adding additional processing time. In your example for the first period would be because now the interval sync is made then the activation of the pcaps and after that pcap-thread needs time setup the loop. This takes time and with a high qps you will initially miss some packets. Btw, did you use the new pcap_buffer_size config option? It will help in a lot of situations since the OS can buffer packets between XML writes and maybe even between activate and pcap-thread loop.

klaus3000 · 2017-01-20T22:07:18Z

a) It was killed with "kill "
b) I used a bad markup for the XML snippet. I guess this was th reason you didn't saw it in the email. I fixed it afterwards.
c) Strange that it is pretty much 1 second (based on the difference between filter_received and pkts_captured. I do not use the pcap_buffer_size config, as we usually do not have "dropped packets" reported.

jelu · 2017-01-20T22:30:13Z

a&c) if you can strace -fF -s 1024 during this with timings options (don't know exactly which) it would help a lot, send the output directly to me.

jelu · 2017-01-23T11:47:51Z

Okay, this is a problem but it is not related to libpcap, it is select():

11:31:50.048214 select(5, [4], NULL, NULL, {1, 0}) = 1 (in [4], left {0, 27591})
11:31:51.021481 write(1, "packet", 6)   = 6
...
11:31:51.024552 write(1, "packet", 6)   = 6
11:31:51.024659 select(5, [4], NULL, NULL, {1, 0}) = 1 (in [4], left {0, 4064})
11:31:52.020876 write(1, "packet", 6)   = 6

As you can see select() waits too long, I am generating 10-20 QPS but it is waiting 0.95 seconds until returning.

I will add an option to set the pcap-thread timeout in dsc.conf which is in milliseconds so you can set it to 10ms and hopefully just miss a packet or two during startup.

Issue #116: Add config option pcap_thread_timeout

jelu · 2017-01-23T16:01:00Z

Please reset develop and use pcap_thread_timeout 10;

…ove debug code

klaus3000 · 2017-01-24T06:45:42Z

Regarding "dump_reports_on_exit": I open a new issue

Regarding the missing packets in the first intervall. Using your suggestions everthing looks fine now. Although I have to admit that I do not understand the consequences of setting the pcap_thread_timeout to 10ms. Especially the docs say "As a workaround, set pcap_thread_timeout to a relevant millisecond timeout with regards to the packets per second received." Would this cause additional CPU load? What this help or make things worse in case of high packet rates? Should the value be higher or lower for high packet rates? What is the default value?

jelu · 2017-01-24T10:15:29Z

I will try and clear up the text in dsc.conf(5) but:

Would this cause additional CPU load?

Not really, lowest you can set it is to 1ms and that is a lot of time in today's servers.

What this help or make things worse in case of high packet rates?

Helps in all situations it seems since select(2) is a bit flaky.

Should the value be higher or lower for high packet rates?

Lower since you want it to react faster.

What is the default value?

1000 ms or 1 s, this is from the old dsc code that pcap-thread is built on.

This has only to do with the timeout within pcap-thread that is used for select(2), if your having a very high QPS it is likely that select(2) will never timeout anyway.

BUT there is something going on within select(2) for the first timeout interval that caused the delay you have been seeing in the graphs, I tried "bootstrapping" select by having <10ms timeouts for the first few cycles but it did not work and I saw the same delay for the first 1s timeout.

…ove debug code

jelu · 2017-01-25T08:28:47Z

How about f1b3ec3, is it more clear?

klaus3000 · 2017-01-25T08:49:39Z

Should the value be higher or lower for high packet rates?

Lower since you want it to react faster.

Wouldn't it be the other way? With high packet rate it does not timeout anyway, so the value does not matter. With low packet rates, dsc "blocks" other processing (e.g dumping a report) until a packet is received or select() times out - so a low value is actually needed for low query rates.

How about f1b3ec3, is it more clear?

Yes.

To me it seems that setting a default value of 100ms would ge good to have report dumping not delayed by 1s in case of low traffic.

jelu · 2017-01-25T10:40:58Z

Wouldn't it be the other way? With high packet rate it does not timeout anyway, so the value does not matter. With low packet rates, dsc "blocks" other processing (e.g dumping a report) until a packet is received or select() times out - so a low value is actually needed for low query rates.

Actually no, select(2) seems to delay even during high packet rate and dsc now uses pcap_thread_set_timedrun_to() so it stops at a absolute time and does not block reports if there are no packets.

To me it seems that setting a default value of 100ms would ge good to have report dumping not delayed by 1s in case of low traffic.

Yeah, looking at the old code (https://github.com/DNS-OARC/dsc/blob/v2.1.1/src/pcap.c#L883) it was 250ms so I agree the default should be changed.

…ove debug code

jelu added bug high labels Jan 5, 2017

jelu added a commit to jelu/dsc that referenced this issue Jan 13, 2017

Issue DNS-OARC#116: Feature for not waiting on the interval sync

fc13d73

jelu added a commit that referenced this issue Jan 13, 2017

Merge pull request #124 from jelu/no-wait

4bb3176

Issue #116: Feature for not waiting on the interval sync

jelu added a commit to jelu/dsc that referenced this issue Jan 23, 2017

Issue DNS-OARC#116: Add config option pcap_thread_timeout

667cc91

jelu added a commit that referenced this issue Jan 23, 2017

Merge pull request #132 from jelu/ptconf

94f680a

Issue #116: Add config option pcap_thread_timeout

jelu added a commit to jelu/dsc that referenced this issue Jan 23, 2017

Fix DNS-OARC#111, fix DNS-OARC#116: Update pcap-thread to v2.0.0, rem…

d088ffa

…ove debug code

jelu added a commit to jelu/dsc that referenced this issue Jan 24, 2017

Fix DNS-OARC#111, fix DNS-OARC#116: Update pcap-thread to v2.0.0, rem…

d5c757d

…ove debug code

jelu added a commit to jelu/dsc that referenced this issue Jan 25, 2017

Issue DNS-OARC#116: Try and make select issue more clear

f1b3ec3

jelu mentioned this issue Jan 25, 2017

Change default timeout to 100ms #139

Closed

jelu added a commit to jelu/dsc that referenced this issue Jan 26, 2017

Fix DNS-OARC#111, fix DNS-OARC#116: Update pcap-thread to v2.0.0, rem…

ee59572

…ove debug code

jelu closed this as completed in b0d2374 Jan 27, 2017

Wrong statistics with version 2.3 and threads #116

Wrong statistics with version 2.3 and threads #116

Comments

klaus3000 commented Jan 5, 2017

jelu commented Jan 5, 2017

klaus3000 commented Jan 5, 2017

klaus3000 commented Jan 5, 2017

jelu commented Jan 5, 2017

klaus3000 commented Jan 5, 2017

jelu commented Jan 5, 2017

klaus3000 commented Jan 5, 2017

klaus3000 commented Jan 5, 2017

jelu commented Jan 6, 2017

jelu commented Jan 12, 2017

jelu commented Jan 12, 2017

klaus3000 commented Jan 12, 2017

jelu commented Jan 12, 2017

klaus3000 commented Jan 12, 2017

jelu commented Jan 12, 2017

klaus3000 commented Jan 13, 2017

jelu commented Jan 13, 2017

jelu commented Jan 13, 2017

klaus3000 commented Jan 13, 2017

jelu commented Jan 13, 2017

klaus3000 commented Jan 16, 2017

jelu commented Jan 16, 2017 via email

jelu commented Jan 20, 2017

klaus3000 commented Jan 20, 2017 • edited Loading

jelu commented Jan 20, 2017 via email

klaus3000 commented Jan 20, 2017

klaus3000 commented Jan 20, 2017

klaus3000 commented Jan 20, 2017

jelu commented Jan 20, 2017 via email

jelu commented Jan 20, 2017 via email

jelu commented Jan 20, 2017 via email

klaus3000 commented Jan 20, 2017

jelu commented Jan 20, 2017 via email

jelu commented Jan 23, 2017

jelu commented Jan 23, 2017

klaus3000 commented Jan 24, 2017

jelu commented Jan 24, 2017 • edited Loading

jelu commented Jan 25, 2017

klaus3000 commented Jan 25, 2017

jelu commented Jan 25, 2017

klaus3000 commented Jan 20, 2017 •

edited

Loading

jelu commented Jan 24, 2017 •

edited

Loading