added option -N to allow fping output statistics in the format expected by netdata #105

ktsaou · 2016-10-15T23:52:19Z

Hi,

I am the founder of firehol.org and https://github.com/firehol/netdata (http://my-netdata.io).

This PR adds the option -N to fping, allowing it to output statistics in a netdata-friendly format.

For each host given, netdata will create 3 charts.

A wrapper script (/usr/libexec/netdata/plugins.d/fping.plugin) is required only for passing the valid options to fping. It is shown below and I'll add it to the netdata repo:

#!/usr/bin/env bash

me="${0}"

# the frequency to send info to netdata
# passed by netdata as the first parameter
update_every="${1-1}"

# the netdata configuration directory
# passed by netdata as an environment variable
NETDATA_CONFIG_DIR="${NETDATA_CONFIG_DIR-/etc/netdata}"

# -----------------------------------------------------------------------------
# configuration options
# can be overwritten at /etc/netdata/fping.conf

# the fping binary to use
# we need one that can output netdata friendly info
fping="$(which fping || command -v fping)"

# a space separated list of hosts to fping
hosts=""

# the time in milliseconds (1 sec = 1000 ms)
# to ping the hosts - by default 2 pings per iteration
ping_every="$((update_every * 1000 / 2))"

# how many retries to make if a host does not respond
retries=1

# -----------------------------------------------------------------------------

# load the configuration file
if [ ! -f "${NETDATA_CONFIG_DIR}/fping.conf" ]
then
    echo >&2 "${me}: configuration file '${NETDATA_CONFIG_DIR}/fping.conf' not found - nothing to do."
    echo "DISABLE"
    exit 1
fi

source "${NETDATA_CONFIG_DIR}/fping.conf"

if [ -z "${hosts}" ]
then
    echo >&2 "${me}: no hosts configured in '${NETDATA_CONFIG_DIR}/fping.conf' - nothing to do."
    echo "DISABLE"
    exit 1
fi

if [ -z "${fping}" -o ! -x "${fping}" ]
then
    echo >&2 "${me}: command '${fping}' is not executable - cannot proceed."
    echo "DISABLE"
    exit 1
fi

# the fping options we will use
options=( -N -l -R -Q ${update_every} -p ${ping_every} -r ${retries} ${hosts} )

# execute fping
exec "${fping}" "${options[@]}"

# if we cannot execute fping, stop
echo >&2 "${me}: command '${fping} ${options[@]}' failed to be executed."
echo "DISABLE"
exit 1

cc: netdata/netdata#1122

coveralls · 2016-10-15T23:54:33Z

Coverage decreased (-4.04%) to 76.65% when pulling a2cbca7 on ktsaou:develop into a8861f9 on schweikert:develop.

coveralls · 2016-10-31T21:08:53Z

Coverage decreased (-4.04%) to 76.65% when pulling 2426485 on ktsaou:develop into a8861f9 on schweikert:develop.

coveralls · 2016-10-31T21:42:32Z

Coverage decreased (-3.7%) to 76.962% when pulling 9c9c166 on ktsaou:develop into a8861f9 on schweikert:develop.

ktsaou · 2016-11-01T09:51:30Z

thanks!
This plugin can be seen in action here: http://registry.my-netdata.io/#menu_fping

schweikert · 2016-11-01T10:01:26Z

Very cool, thanks to you! Btw. there is a bit of a conceptual problem with how -Q works: the reporting interval might force a report right after a ping was sent, before there was time to receive the reply.

Also, I am going to improve the code a bit so that the reporting interval is going to be more precise. Currently it does "current_time + report_interval", which might be inaccurate.
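
To illustrate why "current_time + report_interval" can drift, here is a minimal sketch. It is not fping's actual code: the function names and the millisecond clock are assumptions made for illustration. Scheduling from the moment the report actually ran carries any lateness into every later deadline, while advancing the previous deadline by a fixed step does not.

```c
#include <stdint.h>

/* Hypothetical helpers, not taken from fping. All times are in milliseconds. */

/* Drifting: the next deadline is measured from "now", so any lateness in
 * handling the current report is carried into every later deadline. */
int64_t next_report_drifting(int64_t now_ms, int64_t interval_ms)
{
    return now_ms + interval_ms;
}

/* Drift-free: the next deadline is measured from the previous deadline.
 * If the loop fell behind by more than one interval, missed slots are skipped. */
int64_t next_report_fixed(int64_t prev_deadline_ms, int64_t now_ms,
                          int64_t interval_ms)
{
    int64_t next = prev_deadline_ms + interval_ms;
    while (next <= now_ms) {
        next += interval_ms;
    }
    return next;
}
```

With a report due at 1000 ms that is handled 3 ms late, the first variant schedules the next one at 2003 ms and the second at 2000 ms.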

See also: #97

ktsaou · 2016-11-01T10:33:18Z

You are right. This is why I disabled the lost % chart. It had a lot of false positives.

The problem gets even worse when multiple hosts are given. Because the minimum -i is 10 for non-root users, the last hosts are more likely to hit this issue.

It could be better to allow -i 1, but limit (for non-root users) the total packets per second for all hosts to 100. This would at least be fairer for all hosts.

In netdata, I plan to add the following alarms:

a warning if the maximum latency over the last 10 seconds is above 1000
a critical if the latency chart is not collected for 10 seconds (I saw that fping does not print the values if no packet is received).
a warning if the 10 second sum of the quality (% of packets returned) is below 99% and a critical if it is below 90% - however this is not going to work well. The problem is that by default netdata sends 5 pings per second, so 50 pings in 10 seconds. If a single packet straddles a reporting boundary (sent in one interval, received in the next), the alarm would be triggered.
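
The arithmetic behind that last concern can be sketched as follows. The helper name is hypothetical, and the 50-ping window is taken from the figures above.

```c
/* Hypothetical helper: percentage of sent packets that were returned. */
double returned_percent(unsigned sent, unsigned received)
{
    if (sent == 0) {
        return 0.0; /* nothing sent, nothing to rate */
    }
    return 100.0 * (double)received / (double)sent;
}
```

With 50 pings in the window and one reply landing in the next window, `returned_percent(50, 49)` gives 98%, which is already below the 99% warning threshold even though no packet was actually lost.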

schweikert · 2016-11-01T10:46:25Z

I have just committed some changes to improve things on that front. Maybe you could test with the latest code from the develop branch? I changed the following:

More precise interval between reports
Allow -i 1

You could try using for example -p 1007 with -Q 1, so that boundaries are mostly not the same for the loop and for the report.

ktsaou · 2016-11-01T12:26:52Z

sure! You are the best! I'll do it later today.

ktsaou · 2016-11-01T20:24:46Z

ok, I tested it. This is what I see:

1. -i 1 is now accepted even for non-root users and p >= 20 is required. nice 👍
2. -Q now respects the interval specified and always prints something at the given interval.
3. I saw the code about h->discard_next_recv_i. However I can't find a difference:

# ./src/fping -l -Q 1 -p 100 10.11.13.81
[22:06:35]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 65.3/82.7/105
[22:06:36]
10.11.13.81 : xmt/rcv/%loss = 10/9/10%, min/avg/max = 70.5/99.5/137
[22:06:37]
10.11.13.81 : xmt/rcv/%return = 10/11/110%, min/avg/max = 63.3/93.3/146
[22:06:38]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 66.0/76.6/87.5
[22:06:39]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 68.6/79.5/97.2
[22:06:40]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 72.4/79.9/91.5
[22:06:41]
10.11.13.81 : xmt/rcv/%loss = 10/6/40%, min/avg/max = 86.8/276/484
[22:06:42]
10.11.13.81 : xmt/rcv/%return = 10/13/130%, min/avg/max = 153/226/461
[22:06:43]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 161/170/181
[22:06:44]
10.11.13.81 : xmt/rcv/%return = 10/11/110%, min/avg/max = 58.0/101/169
[22:06:45]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 59.0/70.9/82.7
[22:06:46]
10.11.13.81 : xmt/rcv/%loss = 9/9/0%, min/avg/max = 67.2/76.2/104
[22:06:47]
10.11.13.81 : xmt/rcv/%loss = 9/8/11%, min/avg/max = 82.4/111/166
[22:06:48]
10.11.13.81 : xmt/rcv/%return = 10/12/120%, min/avg/max = 171/395/769
[22:06:49]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 168/176/181
[22:06:50]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 167/180/193
[22:06:51]
10.11.13.81 : xmt/rcv/%return = 10/11/110%, min/avg/max = 57.4/116/211
[22:06:52]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 60.5/71.7/93.4
[22:06:53]
10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 64.6/73.6/102
[22:06:54]

ktsaou · 2016-11-01T20:36:20Z

Regarding point 3, I meant I can't find a difference.

In general, discarding a response before its timeout is probably a bad idea: ea37408#diff-2ff0a524bb528090fe44f3dd1af5a11cR1405

What could be better is to discard a response immediately upon reception, before counting it anywhere, but only if the timeout specified with -t has already elapsed.

This means that if -t 500 is given, timings above 500ms should never be reported. Instead, packets arriving after 500ms should simply be ignored (if they are ignored, packet loss will be reported, or should already have been reported, depending on when the request was sent).

Of course, this means that -t 500 is probably a bad default. It should be 5000.

ktsaou · 2016-11-01T22:00:31Z

Consider this:

Packet loss should be reported exclusively on timeout. I mean, a packet sent at time t1 is expected to time out after 500ms, at t1 + 500ms. A packet loss should be reported when that time expires.

So, counting packets sent and packets received to detect a packet loss is not good enough for -Q. fping should keep a simple fifo per host with the expiration time of each packet. If a packet has not arrived by the time given by the first item in the fifo, a lost packet should be accounted for, and if that packet arrives after its timeout, it should be discarded and not taken into account at all.
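
A rough sketch of such a per-host fifo is shown below. It only illustrates the idea and is not fping code: every type, function, and constant name is hypothetical, times are in milliseconds, and it assumes pings are sent in order with a constant timeout, so the oldest entry always expires first.

```c
#include <stdbool.h>
#include <stdint.h>

#define PENDING_MAX 64 /* assumed upper bound on in-flight pings per host */

typedef struct {
    int     seq;         /* sequence number of the ping */
    int64_t deadline_ms; /* send time + timeout (-t) */
    bool    answered;    /* a reply was accepted before the deadline */
} pending_t;

typedef struct {
    pending_t slot[PENDING_MAX];
    int head, count;     /* ring buffer, oldest entry at head */
    int num_recv, num_lost;
} host_fifo_t;

/* Record a ping that was just sent. Returns false if the fifo is full. */
bool fifo_sent(host_fifo_t *f, int seq, int64_t now_ms, int64_t timeout_ms)
{
    if (f->count == PENDING_MAX) {
        return false;
    }
    pending_t *p = &f->slot[(f->head + f->count) % PENDING_MAX];
    p->seq = seq;
    p->deadline_ms = now_ms + timeout_ms;
    p->answered = false;
    f->count++;
    return true;
}

/* Drop expired entries from the front; unanswered ones count as lost. */
void fifo_expire(host_fifo_t *f, int64_t now_ms)
{
    while (f->count > 0 && f->slot[f->head].deadline_ms <= now_ms) {
        if (!f->slot[f->head].answered) {
            f->num_lost++;
        }
        f->head = (f->head + 1) % PENDING_MAX;
        f->count--;
    }
}

/* Handle a reply. Returns true if it was counted, false if it was late,
 * unknown, or a duplicate, in which case it is ignored entirely. */
bool fifo_reply(host_fifo_t *f, int seq, int64_t now_ms)
{
    fifo_expire(f, now_ms);
    for (int i = 0; i < f->count; i++) {
        pending_t *p = &f->slot[(f->head + i) % PENDING_MAX];
        if (p->seq == seq && !p->answered) {
            p->answered = true;
            f->num_recv++;
            return true;
        }
    }
    return false;
}
```

In this sketch a loss is recorded exactly once, when the deadline passes, and a reply that shows up afterwards changes neither the received count nor any timing total.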

What do you think?

ktsaou · 2016-11-01T23:02:14Z

Keep in mind that fping now consumes 100% cpu.
I submitted a fix at PR #106.

schweikert · 2016-11-02T07:24:23Z

I think that -Q was more of a late addition and not the initial concept of how fping was meant to be used. That's why the data structures are the way they are. I think that we can make it work well enough though. Something to consider is that you should use a -p value that is bigger than the timeout, so -p 100 -Q 1 doesn't work well, because it uses an implicit -t 500 (timeout). Can you try again with -p 500 -Q 5, for example? I think that it should provide better data.

ktsaou · 2016-11-02T07:35:26Z

hm... I use it with -l -Q 1 -p 200 -t 5000. For example, I am trying to resolve an issue on Azure where I have sub-second freezes in communication between VMs. That is, communication is frozen for 100ms or 200ms, so very frequent checks and reporting are needed. For sure -Q 5 is way too long...

I'll give a try to the settings you suggest though.

ktsaou · 2016-11-02T21:31:43Z

I think the patch that discards one packet, introduced another problem. Check this:

http://registry.my-netdata.io/#menu_fping

When a packet is discarded, the average reported is totally wrong.

This is probably a better example:

As you can see, the average crosses max. Shown as both above and below max...

ktsaou · 2016-11-02T21:33:28Z

This is even better:

When the received count is one less than the sent count, the average is wrong...

ktsaou · 2016-11-02T21:38:03Z

The problem is that because the received count is one less (artificially discarded), this line:

fping/src/fping.c

Line 1360 in 2cb0860

avg = h->total_time_i / h->num_recv_i;

does the wrong calculation: h->total_time_i includes the discarded packet, but h->num_recv_i does not.

schweikert · 2016-11-03T08:08:05Z

hmm, you are right. it probably can be fixed by not increasing total_time_i in the case the response is discarded
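
The mismatch and the suggested fix can be sketched like this. Only the two field names come from the quoted line; the struct, the function names, and the `discard` flag are hypothetical and stand in for whatever fping actually does.

```c
/* Per-interval counters. The two field names mirror those in the quoted
 * line; everything else in this sketch is hypothetical. */
typedef struct {
    long total_time_i; /* sum of round-trip times in this interval */
    int  num_recv_i;   /* replies counted in this interval */
} interval_stats_t;

/* Problematic variant: a discarded reply still adds its time to the total. */
void on_reply_mismatched(interval_stats_t *s, long rtt, int discard)
{
    s->total_time_i += rtt;
    if (!discard) {
        s->num_recv_i++;
    }
}

/* Suggested fix: a discarded reply touches neither counter. */
void on_reply_consistent(interval_stats_t *s, long rtt, int discard)
{
    if (discard) {
        return;
    }
    s->total_time_i += rtt;
    s->num_recv_i++;
}

/* Average as in the quoted line, guarded against division by zero. */
long interval_avg(const interval_stats_t *s)
{
    return s->num_recv_i ? s->total_time_i / s->num_recv_i : 0;
}
```

With three replies of 100 each and the second one discarded, the first variant reports an average of 150, above the maximum of 100, while the second reports 100.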

schweikert · 2016-11-03T10:35:19Z

I have committed a fix. can you test again?

ktsaou · 2016-11-03T10:39:56Z

installed it at the same url:

http://registry.my-netdata.io/#menu_fping

it seems fixed.

added option -N to allow fping output statistics in the format expected by netdata

a2cbca7

ktsaou mentioned this pull request Oct 15, 2016

Possible to have netdata show host latency with fping netdata/netdata#1122

Closed

ktsaou added 2 commits October 30, 2016 20:05

re-order chart information

eda4c85

added another dimension to track excess received packets

2426485

remove dimension lost in favor of returned

9c9c166

added help info

6470a3d

schweikert merged commit 6470a3d into schweikert:develop Nov 1, 2016

schweikert mentioned this pull request Nov 1, 2016

Incorrect results reported when pinging multiple hosts #97

Closed

ktsaou added a commit to ktsaou/netdata that referenced this pull request Nov 1, 2016

changes after schweikert/fping#105 being merged into fping repo

e556521

ktsaou mentioned this pull request Nov 3, 2016

fping latency and packet graphs are wrong netdata/netdata#1200

Closed

ktsaou mentioned this pull request Jan 23, 2017

External intervals for providing samples netdata/netdata#1612

Closed

schweikert mentioned this pull request Feb 17, 2017

Constant number of probes when using -Q reporting #113

Closed

schweikert mentioned this pull request Jul 26, 2020

refactored event loop, now for each ping create both next-ping+timeout events #193

Merged
