Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong statistics with version 2.3 and threads #116

Closed
klaus3000 opened this issue Jan 5, 2017 · 40 comments
Closed

Wrong statistics with version 2.3 and threads #116

klaus3000 opened this issue Jan 5, 2017 · 40 comments

Comments

@klaus3000
Copy link

Hi!
We tested todays version from master (commit c02a170).

The problem is, that it does not report all requests, and even less responses. We started with 100q/s, increased to 200q/s, and then switched back to our old DSC version (reported as 2.0.0 rc1).

Requests

dsc-grapher pl

Responses

dsc-grapher-responses pl

New DSC uses the same config as the old DSC.

When using newest DSC with option -T the count is correct (200q/s)

@jelu
Copy link
Member

jelu commented Jan 5, 2017

Thanks for the report.

I have been trying to solve various thread related issues the past weeks but have been unsuccessfully so I will shortly release 2.4.0 which will have threads disabled by default.

Am also setting up continuous testing environment so we can monitor the develop branch and make sure nothing like this happens again.

@klaus3000
Copy link
Author

Another interesting thing. The new DSC with -T does count correct, but seems to have light fluctuations whereas the old DSC hat constantly 200 q/s - see below:

dsc-grapher2 pl

@klaus3000
Copy link
Author

And the peak of 250q/s in the first sample intervall looks strange too

@jelu
Copy link
Member

jelu commented Jan 5, 2017

This is probably due to the total reworking of the libpcap parts that was made trying to solve other issues.

Can you look in the XML/DAT files for the packet captured, parsed, dropped for the durations?

For reference what distribution and kernel version are you running?

@klaus3000
Copy link
Author

Here are the statistics for 4 consecutive minutes.

There are 200 q/s hitting the servers via 2 links, thus capturing on 2 interfaces. In total there are 400 packets/s or 24000packets/min (sample intervall).

The pkts_captured on eth2 seems to alternate for the value of packets in 1 second.

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="5901"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17700"/>
  <pcap_stat val="filter_received" count="17705"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="6097"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="18299"/>
  <pcap_stat val="filter_received" count="18295"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5998"/>
  <pcap_stat val="filter_received" count="5901"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17698"/>
  <pcap_stat val="filter_received" count="17703"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5995"/>
  <pcap_stat val="filter_received" count="6093"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="18295"/>
  <pcap_stat val="filter_received" count="18290"/>
</ifname>

This are the statistics for the first sample intervall which falsely reports 250q/s

<ifname val="eth1">
  <pcap_stat val="filter_received" count="11205"/>
  <pcap_stat val="pkts_captured" count="7497"/>
  <pcap_stat val="kernel_dropped" count="3609"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="filter_received" count="33609"/>
  <pcap_stat val="pkts_captured" count="22793"/>
  <pcap_stat val="kernel_dropped" count="10815"/>
</ifname>

@jelu
Copy link
Member

jelu commented Jan 5, 2017

Can you paste 4 captures from old also?

@klaus3000
Copy link
Author

Forgot to mention - we are using Ubuntu 14.04 with kernel:
Linux mirror-tst 4.4.0-34-generic #53~14.04.1-Ubuntu SMP Wed Jul 27 16:56:40 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

@klaus3000
Copy link
Author

Here are 4 consecutive statistics using the old binary:

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="6000"/>
  <pcap_stat val="filter_received" count="6000"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17999"/>
  <pcap_stat val="filter_received" count="17999"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="5999"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="18000"/>
  <pcap_stat val="filter_received" count="18000"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5999"/>
  <pcap_stat val="filter_received" count="5999"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17997"/>
  <pcap_stat val="filter_received" count="17997"/>
</ifname>

<ifname val="eth1">
  <pcap_stat val="pkts_captured" count="5998"/>
  <pcap_stat val="filter_received" count="5998"/>
</ifname>
<ifname val="eth2">
  <pcap_stat val="pkts_captured" count="17999"/>
  <pcap_stat val="filter_received" count="17999"/>
</ifname>

@jelu
Copy link
Member

jelu commented Jan 6, 2017

If you have not already please read my announcement regarding this:
https://lists.dns-oarc.net/pipermail/dsc/2017-January/000361.html

I have setup some test now for dsc, if you have any other ideas of tests please let me know:
https://dev.dns-oarc.net/jenkins/view/dsctest/

@jelu
Copy link
Member

jelu commented Jan 12, 2017

The testing platform is starting to yield some results, #122 may explain what you are seeing so could you check the start_time/stop_time in the files?

@jelu
Copy link
Member

jelu commented Jan 12, 2017

The spike at the beginning is because the new code is using pcap_create() and pcap_activate() before the sleeping period where dsc syncs to the interval, that means that packets are buffered while it sleeps. Old code used pcap_open_live() which is a pre libpcap 1.0 function which must work quite differently under the hood.

I will extend pcap-thread with functionality to active the interfaces after the sleep period.

@klaus3000
Copy link
Author

Just a question about the sleeping period: if the interval is 1hour - where does it sync to? To the next minute or to the next our? The latter would be bad because in case of a dsc restart we could loose up to 59minutes.

@jelu
Copy link
Member

jelu commented Jan 12, 2017

If your interval is 1 hour and you start a second after the hour it would wait 59:59, you should see this in your syslog.

@klaus3000
Copy link
Author

If i find some time I will write an option to start immediately with a reduced first interval. Regarding start_time/stop_time -> tomorrow.

@jelu
Copy link
Member

jelu commented Jan 12, 2017

Make it a dsc.conf option

@klaus3000
Copy link
Author

Regardin start/stop time:
The 2.3 version we are running threaded does show 1s leap between start und previous stop time once and a while (~ every 80th frame):
1484245200.dscdata.xml: start_time="1484245140" stop_time="1484245200">
1484245260.dscdata.xml: start_time="1484245201" stop_time="1484245260">

1484249940.dscdata.xml: start_time="1484249880" stop_time="1484249940">
1484250000.dscdata.xml: start_time="1484249941" stop_time="1484250000">

1484254620.dscdata.xml: start_time="1484254560" stop_time="1484254620">
1484254680.dscdata.xml: start_time="1484254621" stop_time="1484254680">

1484259420.dscdata.xml: start_time="1484259360" stop_time="1484259420">
1484259480.dscdata.xml: start_time="1484259421" stop_time="1484259480">

The 2.3 version we are running with -T does show 1s leap between start und previous stop time ~ every 6th frame.

The old 2.0 version does not show any gaps.

Shall we also test the current master?

@jelu
Copy link
Member

jelu commented Jan 13, 2017

You don't have to run master, thanks for the information.

Are any of these 2.3 based on the develop/a special commit?

@jelu
Copy link
Member

jelu commented Jan 13, 2017

Btw, I am putting in a fix for the burst at startup and will add your request for an option to withgo of the delayed startup.

@klaus3000
Copy link
Author

Our 2.3 version is actually commit c02a170

jelu added a commit that referenced this issue Jan 13, 2017
Issue #116: Feature for not waiting on the interval sync
@jelu
Copy link
Member

jelu commented Jan 13, 2017

Spike at start should be solved now and your feature is added, if you can please test the latest develop branch.

@klaus3000
Copy link
Author

testing with or without -T?

@jelu
Copy link
Member

jelu commented Jan 16, 2017 via email

@jelu
Copy link
Member

jelu commented Jan 20, 2017

Were you able to run the develop branch? Do you have any issue to report?

@klaus3000
Copy link
Author

klaus3000 commented Jan 20, 2017

Using commit 1569bbf
First test starting synced to the minute with 100q/s:
dsc-grapher
Seems to count correct know. Minor issue: the first intervall seems to starts capturing with 1s delay.

The respective XML file:

<array name="pcap_stats" dimensions="2" start_time="1484926320" stop_time="1484926380">
  <dimension number="1" type="ifname"/>
  <dimension number="2" type="pcap_stat"/>
  <data>
    <ifname val="eth1">
    </ifname>
    <ifname val="eth2">
      <pcap_stat val="filter_received" count="11995"/>
      <pcap_stat val="pkts_captured" count="11796"/>
    </ifname>
  </data>
</array>

@jelu
Copy link
Member

jelu commented Jan 20, 2017 via email

@klaus3000
Copy link
Author

I pasted it already - see the first line. It looks correct.

@klaus3000
Copy link
Author

Maybe the 1 second delay is related to the reason why tcpdump has 1 second delay: http://unix.stackexchange.com/questions/224893/tcpdump-waits-a-second-before-displaying-packets

@klaus3000
Copy link
Author

Second test: using "no_wait_interval" and started at 30 seconds after full minute - looks fine.
dsc-grapher pl

One thing I noticed: there is no final XML file when dsc is killed although I used "dump_reports_on_exit"

@jelu
Copy link
Member

jelu commented Jan 20, 2017 via email

@jelu
Copy link
Member

jelu commented Jan 20, 2017 via email

@jelu
Copy link
Member

jelu commented Jan 20, 2017 via email

@klaus3000
Copy link
Author

a) It was killed with "kill "
b) I used a bad markup for the XML snippet. I guess this was th reason you didn't saw it in the email. I fixed it afterwards.
c) Strange that it is pretty much 1 second (based on the difference between filter_received and pkts_captured. I do not use the pcap_buffer_size config, as we usually do not have "dropped packets" reported.

@jelu
Copy link
Member

jelu commented Jan 20, 2017 via email

@jelu
Copy link
Member

jelu commented Jan 23, 2017

Okay, this is a problem but it is not related to libpcap, it is select():

11:31:50.048214 select(5, [4], NULL, NULL, {1, 0}) = 1 (in [4], left {0, 27591})
11:31:51.021481 write(1, "packet", 6)   = 6
...
11:31:51.024552 write(1, "packet", 6)   = 6
11:31:51.024659 select(5, [4], NULL, NULL, {1, 0}) = 1 (in [4], left {0, 4064})
11:31:52.020876 write(1, "packet", 6)   = 6

As you can see select() waits too long, I am generating 10-20 QPS but it is waiting 0.95 seconds until returning.

I will add an option to set the pcap-thread timeout in dsc.conf which is in milliseconds so you can set it to 10ms and hopefully just miss a packet or two during startup.

jelu added a commit to jelu/dsc that referenced this issue Jan 23, 2017
jelu added a commit that referenced this issue Jan 23, 2017
Issue #116: Add config option pcap_thread_timeout
@jelu
Copy link
Member

jelu commented Jan 23, 2017

Please reset develop and use pcap_thread_timeout 10;

jelu added a commit to jelu/dsc that referenced this issue Jan 23, 2017
@klaus3000
Copy link
Author

Regarding "dump_reports_on_exit": I open a new issue

Regarding the missing packets in the first intervall. Using your suggestions everthing looks fine now. Although I have to admit that I do not understand the consequences of setting the pcap_thread_timeout to 10ms. Especially the docs say "As a workaround, set pcap_thread_timeout to a relevant millisecond timeout with regards to the packets per second received." Would this cause additional CPU load? What this help or make things worse in case of high packet rates? Should the value be higher or lower for high packet rates? What is the default value?

@jelu
Copy link
Member

jelu commented Jan 24, 2017

I will try and clear up the text in dsc.conf(5) but:

Would this cause additional CPU load?

Not really, lowest you can set it is to 1ms and that is a lot of time in today's servers.

What this help or make things worse in case of high packet rates?

Helps in all situations it seems since select(2) is a bit flaky.

Should the value be higher or lower for high packet rates?

Lower since you want it to react faster.

What is the default value?

1000 ms or 1 s, this is from the old dsc code that pcap-thread is built on.

This has only to do with the timeout within pcap-thread that is used for select(2), if your having a very high QPS it is likely that select(2) will never timeout anyway.

BUT there is something going on within select(2) for the first timeout interval that caused the delay you have been seeing in the graphs, I tried "bootstrapping" select by having <10ms timeouts for the first few cycles but it did not work and I saw the same delay for the first 1s timeout.

jelu added a commit to jelu/dsc that referenced this issue Jan 24, 2017
jelu added a commit to jelu/dsc that referenced this issue Jan 25, 2017
@jelu
Copy link
Member

jelu commented Jan 25, 2017

How about f1b3ec3, is it more clear?

@klaus3000
Copy link
Author

Should the value be higher or lower for high packet rates?

Lower since you want it to react faster.

Wouldn't it be the other way? With high packet rate it does not timeout anyway, so the value does not matter. With low packet rates, dsc "blocks" other processing (e.g dumping a report) until a packet is received or select() times out - so a low value is actually needed for low query rates.

How about f1b3ec3, is it more clear?

Yes.

To me it seems that setting a default value of 100ms would ge good to have report dumping not delayed by 1s in case of low traffic.

@jelu
Copy link
Member

jelu commented Jan 25, 2017

Wouldn't it be the other way? With high packet rate it does not timeout anyway, so the value does not matter. With low packet rates, dsc "blocks" other processing (e.g dumping a report) until a packet is received or select() times out - so a low value is actually needed for low query rates.

Actually no, select(2) seems to delay even during high packet rate and dsc now uses pcap_thread_set_timedrun_to() so it stops at a absolute time and does not block reports if there are no packets.

To me it seems that setting a default value of 100ms would ge good to have report dumping not delayed by 1s in case of low traffic.

Yeah, looking at the old code (https://github.com/DNS-OARC/dsc/blob/v2.1.1/src/pcap.c#L883) it was 250ms so I agree the default should be changed.

jelu added a commit to jelu/dsc that referenced this issue Jan 26, 2017
@jelu jelu closed this as completed in b0d2374 Jan 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants