Low/No Disk space caused the Agent to crash and not recover. #2223

Merged
merged 2 commits into from
Jan 28, 2016
215 changes: 116 additions & 99 deletions checks.d/network.py
@@ -7,7 +7,10 @@
# project
from checks import AgentCheck
from utils.platform import Platform
from utils.subprocess_output import get_subprocess_output
from utils.subprocess_output import (
get_subprocess_output,
SubprocessOutputEmptyError,
)

BSD_TCP_METRICS = [
(re.compile("^\s*(\d+) data packets \(\d+ bytes\) retransmitted\s*$"), 'system.net.tcp.retrans_packs'),
@@ -176,6 +179,8 @@ def _check_linux(self, instance):
metrics = self._parse_linux_cx_state(lines[2:], self.TCP_STATES['netstat'], 5)
for metric, value in metrics.iteritems():
self.gauge(metric, value)
except SubprocessOutputEmptyError:
self.log.exception("Error collecting connection stats.")
Contributor

We should log the exception here as well, so we don't silence other important exceptions.

Member Author

Log or raise? We're already logging it with log.exception(). If we raise, we'd be unable to collect stats such as the network stats below, which remain collectible because procfs is a pseudo-filesystem and reading it doesn't require calling any subcommand.

Not sure what you mean here.

Contributor

Oops, sorry, I meant to do except Exception as e: and then log the exception message e. If you end up doing the custom Exception class below, it'll look different, obviously.

Member Author

log.exception() already adds the exception info (type and stack trace) to the logging statement.

I have added the custom exception for future use, but I don't think it's necessary to go over all try...except blocks that already handle Exception and add yet another block that does nothing other than log the exception. That sounds like overkill, but that's just me, coming from a C/C++ background.

Contributor

Yeah, it's not necessary. I just want to make sure we aren't silencing errors we don't expect. For that reason, best practice is to avoid a blanket except Exception: wherever possible.

Could we do an except SubprocessOutputEmptyError: here instead?
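
The pattern this thread converges on can be sketched roughly as follows. This is a minimal illustration, not the agent's code: `run_command` and `collect` are hypothetical stand-ins for `get_subprocess_output` and the check method, and only the exception name and the `log.exception()` call are taken from the diff.

```python
import logging

log = logging.getLogger("network_check_sketch")


class SubprocessOutputEmptyError(Exception):
    """Same name as the class added in utils/subprocess_output.py."""


def run_command(output):
    # Hypothetical stand-in for get_subprocess_output: raises when the
    # command produced no output, otherwise returns that output.
    if not output:
        raise SubprocessOutputEmptyError("expected output but had none")
    return output


def collect(output):
    """Return the output lines, or None when the command yielded nothing."""
    try:
        return run_command(output).splitlines()
    except SubprocessOutputEmptyError:
        # log.exception() logs at ERROR level and appends the traceback,
        # so the exception type and message are not lost.
        log.exception("Error collecting connection stats.")
        return None
```

Because only the expected exception is caught, anything unexpected (for example an AttributeError raised while parsing) still propagates instead of being silenced.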


proc = open('/proc/net/dev', 'r')
try:
@@ -289,111 +294,123 @@ def _check_bsd(self, instance):
if Platform.is_freebsd():
netstat_flags.append('-W')

output, _, _ = get_subprocess_output(["netstat"] + netstat_flags, self.log)
lines = output.splitlines()
# Name Mtu Network Address Ipkts Ierrs Ibytes Opkts Oerrs Obytes Coll
# lo0 16384 <Link#1> 318258 0 428252203 318258 0 428252203 0
# lo0 16384 localhost fe80:1::1 318258 - 428252203 318258 - 428252203 -
# lo0 16384 127 localhost 318258 - 428252203 318258 - 428252203 -
# lo0 16384 localhost ::1 318258 - 428252203 318258 - 428252203 -
# gif0* 1280 <Link#2> 0 0 0 0 0 0 0
# stf0* 1280 <Link#3> 0 0 0 0 0 0 0
# en0 1500 <Link#4> 04:0c:ce:db:4e:fa 20801309 0 13835457425 15149389 0 11508790198 0
# en0 1500 seneca.loca fe80:4::60c:ceff: 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 2001:470:1f 2001:470:1f07:11d 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 2001:470:1f 2001:470:1f07:11d 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 192.168.1 192.168.1.63 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 2001:470:1f 2001:470:1f07:11d 20801309 - 13835457425 15149389 - 11508790198 -
# p2p0 2304 <Link#5> 06:0c:ce:db:4e:fa 0 0 0 0 0 0 0
# ham0 1404 <Link#6> 7a:79:05:4d:bf:f5 30100 0 6815204 18742 0 8494811 0
# ham0 1404 5 5.77.191.245 30100 - 6815204 18742 - 8494811 -
# ham0 1404 seneca.loca fe80:6::7879:5ff: 30100 - 6815204 18742 - 8494811 -
# ham0 1404 2620:9b::54 2620:9b::54d:bff5 30100 - 6815204 18742 - 8494811 -

headers = lines[0].split()

# Given the irregular structure of the table above, better to parse from the end of each line
# Verify headers first
# -7 -6 -5 -4 -3 -2 -1
for h in ("Ipkts", "Ierrs", "Ibytes", "Opkts", "Oerrs", "Obytes", "Coll"):
if h not in headers:
self.logger.error("%s not found in %s; cannot parse" % (h, headers))
return False

current = None
for l in lines[1:]:
# Another header row, abort now, this is IPv6 land
if "Name" in l:
break

x = l.split()
if len(x) == 0:
break

iface = x[0]
if iface.endswith("*"):
iface = iface[:-1]
if iface == current:
# skip multiple lines of same interface
continue
else:
current = iface

# Filter inactive interfaces
if self._parse_value(x[-5]) or self._parse_value(x[-2]):
iface = current
metrics = {
'bytes_rcvd': self._parse_value(x[-5]),
'bytes_sent': self._parse_value(x[-2]),
'packets_in.count': self._parse_value(x[-7]),
'packets_in.error': self._parse_value(x[-6]),
'packets_out.count': self._parse_value(x[-4]),
'packets_out.error':self._parse_value(x[-3]),
}
self._submit_devicemetrics(iface, metrics)
try:
output, _, _ = get_subprocess_output(["netstat"] + netstat_flags, self.log)
lines = output.splitlines()
# Name Mtu Network Address Ipkts Ierrs Ibytes Opkts Oerrs Obytes Coll
# lo0 16384 <Link#1> 318258 0 428252203 318258 0 428252203 0
# lo0 16384 localhost fe80:1::1 318258 - 428252203 318258 - 428252203 -
# lo0 16384 127 localhost 318258 - 428252203 318258 - 428252203 -
# lo0 16384 localhost ::1 318258 - 428252203 318258 - 428252203 -
# gif0* 1280 <Link#2> 0 0 0 0 0 0 0
# stf0* 1280 <Link#3> 0 0 0 0 0 0 0
# en0 1500 <Link#4> 04:0c:ce:db:4e:fa 20801309 0 13835457425 15149389 0 11508790198 0
# en0 1500 seneca.loca fe80:4::60c:ceff: 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 2001:470:1f 2001:470:1f07:11d 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 2001:470:1f 2001:470:1f07:11d 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 192.168.1 192.168.1.63 20801309 - 13835457425 15149389 - 11508790198 -
# en0 1500 2001:470:1f 2001:470:1f07:11d 20801309 - 13835457425 15149389 - 11508790198 -
# p2p0 2304 <Link#5> 06:0c:ce:db:4e:fa 0 0 0 0 0 0 0
# ham0 1404 <Link#6> 7a:79:05:4d:bf:f5 30100 0 6815204 18742 0 8494811 0
# ham0 1404 5 5.77.191.245 30100 - 6815204 18742 - 8494811 -
# ham0 1404 seneca.loca fe80:6::7879:5ff: 30100 - 6815204 18742 - 8494811 -
# ham0 1404 2620:9b::54 2620:9b::54d:bff5 30100 - 6815204 18742 - 8494811 -

headers = lines[0].split()

# Given the irregular structure of the table above, better to parse from the end of each line
# Verify headers first
# -7 -6 -5 -4 -3 -2 -1
for h in ("Ipkts", "Ierrs", "Ibytes", "Opkts", "Oerrs", "Obytes", "Coll"):
if h not in headers:
self.logger.error("%s not found in %s; cannot parse" % (h, headers))
return False

current = None
for l in lines[1:]:
# Another header row, abort now, this is IPv6 land
if "Name" in l:
break

x = l.split()
if len(x) == 0:
break

iface = x[0]
if iface.endswith("*"):
iface = iface[:-1]
if iface == current:
# skip multiple lines of same interface
continue
else:
current = iface

# Filter inactive interfaces
if self._parse_value(x[-5]) or self._parse_value(x[-2]):
iface = current
metrics = {
'bytes_rcvd': self._parse_value(x[-5]),
'bytes_sent': self._parse_value(x[-2]),
'packets_in.count': self._parse_value(x[-7]),
'packets_in.error': self._parse_value(x[-6]),
'packets_out.count': self._parse_value(x[-4]),
'packets_out.error':self._parse_value(x[-3]),
}
self._submit_devicemetrics(iface, metrics)
except SubprocessOutputEmptyError:
self.log.exception("Error collecting connection stats.")


netstat, _, _ = get_subprocess_output(["netstat", "-s", "-p" "tcp"], self.log)
#3651535 packets sent
# 972097 data packets (615753248 bytes)
# 5009 data packets (2832232 bytes) retransmitted
# 0 resends initiated by MTU discovery
# 2086952 ack-only packets (471 delayed)
# 0 URG only packets
# 0 window probe packets
# 310851 window update packets
# 336829 control packets
# 0 data packets sent after flow control
# 3058232 checksummed in software
# 3058232 segments (571218834 bytes) over IPv4
# 0 segments (0 bytes) over IPv6
#4807551 packets received
# 1143534 acks (for 616095538 bytes)
# 165400 duplicate acks
# ...

self._submit_regexed_values(netstat, BSD_TCP_METRICS)
try:
netstat, _, _ = get_subprocess_output(["netstat", "-s", "-p" "tcp"], self.log)
#3651535 packets sent
# 972097 data packets (615753248 bytes)
# 5009 data packets (2832232 bytes) retransmitted
# 0 resends initiated by MTU discovery
# 2086952 ack-only packets (471 delayed)
# 0 URG only packets
# 0 window probe packets
# 310851 window update packets
# 336829 control packets
# 0 data packets sent after flow control
# 3058232 checksummed in software
# 3058232 segments (571218834 bytes) over IPv4
# 0 segments (0 bytes) over IPv6
#4807551 packets received
# 1143534 acks (for 616095538 bytes)
# 165400 duplicate acks
# ...

self._submit_regexed_values(netstat, BSD_TCP_METRICS)
except SubprocessOutputEmptyError:
self.log.exception("Error collecting TCP stats.")


def _check_solaris(self, instance):
# Can't get bytes sent and received via netstat
# Default to kstat -p link:0:
netstat, _, _ = get_subprocess_output(["kstat", "-p", "link:0:"], self.log)
metrics_by_interface = self._parse_solaris_netstat(netstat)
for interface, metrics in metrics_by_interface.iteritems():
self._submit_devicemetrics(interface, metrics)

netstat, _, _ = get_subprocess_output(["netstat", "-s", "-P" "tcp"], self.log)
# TCP: tcpRtoAlgorithm= 4 tcpRtoMin = 200
# tcpRtoMax = 60000 tcpMaxConn = -1
# tcpActiveOpens = 57 tcpPassiveOpens = 50
# tcpAttemptFails = 1 tcpEstabResets = 0
# tcpCurrEstab = 0 tcpOutSegs = 254
# tcpOutDataSegs = 995 tcpOutDataBytes =1216733
# tcpRetransSegs = 0 tcpRetransBytes = 0
# tcpOutAck = 185 tcpOutAckDelayed = 4
# ...
self._submit_regexed_values(netstat, SOLARIS_TCP_METRICS)
try:
netstat, _, _ = get_subprocess_output(["kstat", "-p", "link:0:"], self.log)
metrics_by_interface = self._parse_solaris_netstat(netstat)
for interface, metrics in metrics_by_interface.iteritems():
self._submit_devicemetrics(interface, metrics)
except SubprocessOutputEmptyError:
self.log.exception("Error collecting kstat stats.")

try:
netstat, _, _ = get_subprocess_output(["netstat", "-s", "-P" "tcp"], self.log)
# TCP: tcpRtoAlgorithm= 4 tcpRtoMin = 200
# tcpRtoMax = 60000 tcpMaxConn = -1
# tcpActiveOpens = 57 tcpPassiveOpens = 50
# tcpAttemptFails = 1 tcpEstabResets = 0
# tcpCurrEstab = 0 tcpOutSegs = 254
# tcpOutDataSegs = 995 tcpOutDataBytes =1216733
# tcpRetransSegs = 0 tcpRetransBytes = 0
# tcpOutAck = 185 tcpOutAckDelayed = 4
# ...
self._submit_regexed_values(netstat, SOLARIS_TCP_METRICS)
except SubprocessOutputEmptyError:
self.log.exception("Error collecting TCP stats.")


def _parse_solaris_netstat(self, netstat_output):
27 changes: 15 additions & 12 deletions config.py
@@ -25,7 +25,11 @@
from util import get_os, yLoader
from utils.platform import Platform
from utils.proxy import get_proxy
from utils.subprocess_output import get_subprocess_output
from utils.subprocess_output import (
get_subprocess_output,
SubprocessOutputEmptyError,
)


# CONSTANTS
AGENT_VERSION = "5.7.0"
@@ -570,17 +574,16 @@ def get_system_stats():

platf = sys.platform

if Platform.is_linux(platf):
output, _, _ = get_subprocess_output(['grep', 'model name', '/proc/cpuinfo'], log)
systemStats['cpuCores'] = len(output.splitlines())

if Platform.is_darwin(platf):
output, _, _ = get_subprocess_output(['sysctl', 'hw.ncpu'], log)
systemStats['cpuCores'] = int(output.split(': ')[1])

if Platform.is_freebsd(platf):
output, _, _ = get_subprocess_output(['sysctl', 'hw.ncpu'], log)
systemStats['cpuCores'] = int(output.split(': ')[1])
try:
if Platform.is_linux(platf):
output, _, _ = get_subprocess_output(['grep', 'model name', '/proc/cpuinfo'], log)
systemStats['cpuCores'] = len(output.splitlines())

if Platform.is_darwin(platf) or Platform.is_freebsd(platf):
output, _, _ = get_subprocess_output(['sysctl', 'hw.ncpu'], log)
systemStats['cpuCores'] = int(output.split(': ')[1])
Member

Not your code, but it seems like this could be written as:

if Platform.is_linux(platf):
    ...
elif Platform.is_darwin(platf) or Platform.is_freebsd(platf):
    ...

except SubprocessOutputEmptyError as e:
log.warning("unable to retrieve number of cpuCores. Failed with error %s", e)

if Platform.is_linux(platf):
systemStats['nixV'] = platform.dist()
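
The reviewer's if/elif suggestion in this hunk can be sketched in isolation. The helper below is hypothetical: `count_cpu_cores` and its `platform_name` argument are simplified stand-ins for the `Platform.is_*` checks, and the command output is passed in rather than produced by a subprocess.

```python
def count_cpu_cores(platform_name, output):
    """Hypothetical helper: derive the core count from command output."""
    if platform_name == "linux":
        # `grep 'model name' /proc/cpuinfo` prints one line per core.
        return len(output.splitlines())
    elif platform_name in ("darwin", "freebsd"):
        # `sysctl hw.ncpu` prints a line such as "hw.ncpu: 8".
        return int(output.split(': ')[1])
    # Other platforms are not handled in this sketch.
    return None
```

Since the branches are mutually exclusive, elif makes that explicit and skips the second platform test once the first has matched.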
Expand Down
8 changes: 7 additions & 1 deletion utils/subprocess_output.py
@@ -10,9 +10,11 @@

log = logging.getLogger(__name__)

class SubprocessOutputEmptyError(Exception):
pass

# FIXME: python 2.7 has a far better way to do this
def get_subprocess_output(command, log, shell=False, stdin=None):
def get_subprocess_output(command, log, shell=False, stdin=None, output_expected=True):
"""
Run the given subprocess command and return its output. Raise an Exception
if an error occurs.
@@ -37,6 +39,10 @@ def get_subprocess_output(command, log, shell=False, stdin=None):

stdout_f.seek(0)
output = stdout_f.read()

if output_expected and output is None:
raise SubprocessOutputEmptyError("get_subprocess_output expected output but had none.")

return (output, err, proc.returncode)
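
One point worth checking in this hunk: `read()` on a Python file object returns an empty string (or empty bytes) at end of file rather than `None`, so a guard written as `output is None` may never fire. A hedged sketch of a truthiness-based check follows; `check_output_present` is a hypothetical helper, not part of the diff, and the empty temporary file stands in for a command that wrote nothing.

```python
import tempfile


class SubprocessOutputEmptyError(Exception):
    pass


def check_output_present(output, output_expected=True):
    # Hypothetical helper: `not output` is true for None, "" and b"".
    if output_expected and not output:
        raise SubprocessOutputEmptyError(
            "get_subprocess_output expected output but had none.")
    return output


# An empty temporary file stands in for a command that wrote nothing.
with tempfile.TemporaryFile() as stdout_f:
    stdout_f.seek(0)
    empty_output = stdout_f.read()  # empty bytes, not None
```

With this form, callers that pass `output_expected=False` still get the empty output back unchanged.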

