Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added option -N to allow fping output statistics in the format expected by netdata #105

Merged
merged 5 commits into from
Nov 1, 2016

Conversation

ktsaou
Copy link
Contributor

@ktsaou ktsaou commented Oct 15, 2016

Hi,

I am the founder of firehol.org and https://github.com/firehol/netdata (http://my-netdata.io).

This PR adds the option -N to fping to allow it output statistics in a netdata friendly format.

For each host given, netdata will create 3 charts, like that:

image

A wrapper script (/usr/libexec/netdata/plugins.d/fping.plugin) is required only for passing the valid options to fping. It is shown below and I'll add it to the netdata repo:

#!/usr/bin/env bash

me="${0}"

# the frequency to send info to netdata
# passed by netdata as the first parameter
update_every="${1-1}"

# the netdata configuration directory
# passed by netdata as an environment variable
NETDATA_CONFIG_DIR="${NETDATA_CONFIG_DIR-/etc/netdata}"

# -----------------------------------------------------------------------------
# configuration options
# can be overwritten at /etc/netdata/fping.conf

# the fping binary to use
# we need one that can output netdata friendly info
fping="$(which fping || command -v fping)"

# a space separated list of hosts to fping
hosts=""

# the time in milliseconds (1 sec = 1000 ms)
# to ping the hosts - by default 2 pings per iteration
ping_every="$((update_every * 1000 / 2))"

# how many retries to make if a host does not respond
retries=1

# -----------------------------------------------------------------------------

# load the configuration file
if [ ! -f "${NETDATA_CONFIG_DIR}/fping.conf" ]
then
    echo >&2 "${me}: configuration file '${NETDATA_CONFIG_DIR}/fping.conf' not found - nothing to do."
    echo "DISABLE"
    exit 1
fi

source "${NETDATA_CONFIG_DIR}/fping.conf"

if [ -z "${hosts}" ]
then
    echo >&2 "${me}: no hosts configued in '${NETDATA_CONFIG_DIR}/fping.conf' - nothing to do."
    echo "DISABLE"
    exit 1
fi

if [ -z "${fping}" -o ! -x "${fping}" ]
then
    echo >&2 "${me}: command '${fping}' is not executable - cannot proceed."
    echo "DISABLE"
    exit 1
fi

# the fping options we will use
options=( -N -l -R -Q ${update_every} -p ${ping_every} -r ${retries} ${hosts} )

# execute fping
exec "${fping}" "${options[@]}"

# if we cannot execute fping, stop
echo >&2 "${me}: command '${fping} ${options[@]}' failed to be executed."
echo "DISABLE"
exit 1

cc: netdata/netdata#1122

@coveralls
Copy link

Coverage Status

Coverage decreased (-4.04%) to 76.65% when pulling a2cbca7 on ktsaou:develop into a8861f9 on schweikert:develop.

@coveralls
Copy link

Coverage Status

Coverage decreased (-4.04%) to 76.65% when pulling 2426485 on ktsaou:develop into a8861f9 on schweikert:develop.

@coveralls
Copy link

Coverage Status

Coverage decreased (-3.7%) to 76.962% when pulling 9c9c166 on ktsaou:develop into a8861f9 on schweikert:develop.

@schweikert schweikert merged commit 6470a3d into schweikert:develop Nov 1, 2016
@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 1, 2016

thanks!
This plugin can be seen in action here: http://registry.my-netdata.io/#menu_fping

@schweikert
Copy link
Owner

Very cool, thanks to you! Btw. there is a bit of a conceptual problem on how -Q works, because the reporting interval might force a report when a ping was sent and there wasn't time to receive it yet.

Also, I am going to improve the code a bit so that the reporting interval is going to be more precise. Currently it does "current_time + report_interval", which might be inaccurate.

See also: #97

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 1, 2016

You are right. This is why I disabled the lost % chart. It had a lot of false positive.

The problem gets even worst when multiple hosts are given. Because of the minimum -i is 10 for non-root users, the last hosts are more probable to face this issue.

It could be better to allow -i 1, but limit (for non-root users) the total packets per second for all hosts to 100. This would at least be more fare for all hosts.

In netdata, I plan to add the following alarms:

  1. a warning if the max of the last 10 second maximum is above 1000
  2. a critical if the the latency chart is not collected for 10 seconds (I saw that fping does not print the values if no packet is received).
  3. a warning if the 10 second sum of the quality (% of packets returns) is below 99% and a critical if it is below 90% - however this is not going to work well. The problem is that by default netdata sends 5 pings per second, so 50 pings in 10 seconds. If one switches interval (send on one, received on the next), the alarm would be triggered.

@schweikert
Copy link
Owner

I have just committed some changes to improve things on that front. Maybe you could test with the latest code from the develop branch? I changed the following:

  • More precise interval between reports
  • Allow -i 1

You could try using for example -p 1007 with -Q 1, so that boundaries are mostly not the same for the loop and for the report.

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 1, 2016

sure! You are the best! I'll do later today.

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 1, 2016

ok, I tested it. This is what is see:

  1. -i 1 is now accepted even for non-root users and p >= 20 is required. nice 👍

  2. -Q now respects the interval specified and always prints something at the given interval.

  3. I saw the code about h->discard_next_recv_i. However I can't find a difference:

    # ./src/fping -l -Q 1 -p 100 10.11.13.81
    [22:06:35]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 65.3/82.7/105
    [22:06:36]
    10.11.13.81 : xmt/rcv/%loss = 10/9/10%, min/avg/max = 70.5/99.5/137
    [22:06:37]
    10.11.13.81 : xmt/rcv/%return = 10/11/110%, min/avg/max = 63.3/93.3/146
    [22:06:38]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 66.0/76.6/87.5
    [22:06:39]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 68.6/79.5/97.2
    [22:06:40]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 72.4/79.9/91.5
    [22:06:41]
    10.11.13.81 : xmt/rcv/%loss = 10/6/40%, min/avg/max = 86.8/276/484
    [22:06:42]
    10.11.13.81 : xmt/rcv/%return = 10/13/130%, min/avg/max = 153/226/461
    [22:06:43]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 161/170/181
    [22:06:44]
    10.11.13.81 : xmt/rcv/%return = 10/11/110%, min/avg/max = 58.0/101/169
    [22:06:45]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 59.0/70.9/82.7
    [22:06:46]
    10.11.13.81 : xmt/rcv/%loss = 9/9/0%, min/avg/max = 67.2/76.2/104
    [22:06:47]
    10.11.13.81 : xmt/rcv/%loss = 9/8/11%, min/avg/max = 82.4/111/166
    [22:06:48]
    10.11.13.81 : xmt/rcv/%return = 10/12/120%, min/avg/max = 171/395/769
    [22:06:49]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 168/176/181
    [22:06:50]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 167/180/193
    [22:06:51]
    10.11.13.81 : xmt/rcv/%return = 10/11/110%, min/avg/max = 57.4/116/211
    [22:06:52]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 60.5/71.7/93.4
    [22:06:53]
    10.11.13.81 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 64.6/73.6/102
    [22:06:54]

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 1, 2016

Regarding point 3, I meant I can't find a difference.

In general, discarding a response before its timeout, is probably a bad idea: ea37408#diff-2ff0a524bb528090fe44f3dd1af5a11cR1405

What could be better, is to discard a response immediately upon reception and before counting it anywhere, only after the timeout specified with -t has elapsed.

This means that if -t 500 is given, timings above 500ms should never be reported. Instead packets arrived after 500ms should be just ignored (if they are ignored, packet loss will be reported, or should have been reported depending on the timing the request was sent).

Of course, this means that -t 500 is probably a bad default. It should be 5000.

ktsaou added a commit to ktsaou/netdata that referenced this pull request Nov 1, 2016
@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 1, 2016

Consider this:

packet loss should exclusively be reported on timeout. I mean, a packet is sent at time t1 which is expected to timeout after 500ms at t1 + 500ms. A packet loss should be reported when that time expires.

So, counting packets sent and packets received to detect a packet loss is not good enough for -Q. fping should keep a simple fifo per host with the expiration time of each packet. If a packet has not been arrived by the time given by the first item in the fifo, a lost packet should be accounted and if that packet arrives after its timeout, it should be discarded and not taken into account at all.

What do you think?

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 1, 2016

Keep in mind that fping now consumes 100% cpu.
I submitted a fix at PR #106.

@schweikert
Copy link
Owner

I think that -Q was more of a late addition and not the initial concept of how fping was meant to be used. That's why the data structures are the way they are. I think that we can make it work well enough though. Something to consider is that you should use a -p value that is bigger than the timeout, so -p 100 -Q 1 doesn't work well, because it uses an implicit -t 500 (timeout). Can you try again with -p 500 -Q 5, for example? I think that it should provide better data.

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 2, 2016

hm... I use it with -l -Q 1 -p 200 -t 5000. For example, I trying the resolve an issue on azure where I have sub-second freezes on communication between VMs. That is, communication is frozen for 100ms or 200ms, so very frequent checks and reporting is needed. For sure -Q 5 is way too long...

I'll give a try to the settings you suggest though.

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 2, 2016

I think the patch that discards one packet, introduced another problem. Check this:

http://registry.my-netdata.io/#menu_fping

image

When a packet is discarded, the average reported is totally wrong.

This is probably a better example:

image

As you can see, the average crosses max. Shown as both above and below max...

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 2, 2016

This is even better:

image

When the received count is one less the sent one, the average is wrong...

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 2, 2016

The problem is that because the received count is one less (artificially discarded), this line:

fping/src/fping.c

Line 1360 in 2cb0860

avg = h->total_time_i / h->num_recv_i;

does the wrong calculation: h->total_time_i includes the discarded packet, but h->num_recv_i does not.

@schweikert
Copy link
Owner

hmm, you are right. it probably can be fixed by not increasing total_time_i in the case the response is discarded

@schweikert
Copy link
Owner

I have committed a fix. can you test again?

@ktsaou
Copy link
Contributor Author

ktsaou commented Nov 3, 2016

installed it at the same url:

http://registry.my-netdata.io/#menu_fping

it seems fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants