Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial stab at a new aggregate output check. #52

Closed

Conversation

icebourg
Copy link

Pull Request Checklist

Is this in reference to an existing issue?
This is related to issue #19, check-aggregate.rb does not work with newer Sensu servers to show collected output. That script looks like it has grown quite a lot over time and I don't use most of those features, so I wanted to get something out that supports output and some minimum number of default options. To that end, I've created this as a new check so that it won't break anybody that is currently relying on the features in the current check.

General

  • Update Changelog following the conventions laid out on Our CHANGELOG Guidelines

  • Update README with any necessary configuration snippets

  • [N/A] Binstubs are created if needed

  • RuboCop passes

  • Existing tests pass

New Plugins

  • Tests

  • Add the plugin to the README

  • Does it have a complete header as outlined here

Purpose

Very similar to check-aggregate.rb but it actually works with newer versions of the Sensu API to support output. It does not support all of the same features of check-aggregate.rb mostly because I don't use all of those features and some of the check-aggregate.rb code depends on API functionality that no longer exists. (such as the summarize feature) I'm not opposed to adding those features to this check, and have tried to leave room to enable that to be simple, but I have not taken that on in this PR.

Known Compatibility Issues

Requires Sensu 0.24 or greater

@icebourg
Copy link
Author

icebourg commented Jan 17, 2018

Here's an example of it running against my staging sensu:

./check-aggregate-output.rb -a http://staging-sensu:4567 -c web_health --critical_count 3 -A 180 --message "Web Server Health Check"
CheckAggregateOutput OK: Web Server Health Check

@majormoses
Copy link
Member

Thanks for talking through some of the stuff in slack and submitting this. I will take a closer look shortly.

Copy link
Member

@majormoses majormoses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks good, I do have some suggestions some in the form of preventing bugs others are things that we can do to enhance the check.

description: 'Collects all non-ok outputs',
default: true

option :warning,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I realize this is following the existing pattern it might be a good time to make some would be breaking changes and make this a bit better. Rather than having four options I believe we can reduce this by having --warning, --critical, and --alert-on which we can default to whatever the community thinks makes the most sense. This reduces the required knobs to get the check working.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through this again and maybe it does make sense as is so that you can monitor both. For example you might want to alert on one handler definition for SLA threshold but another for do early detection. This really is expanding on the differences between severity and urgency. I do think that it would still be better served as 2 seperate check definitions as it provides more control and clarity to operators.

short: '-C PERCENT',
long: '--critical PERCENT',
description: 'PERCENT critical before critical (can be change with --ignore-severity)',
proc: proc(&:to_i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment regarding floats vs integers.

short: '-W PERCENT',
long: '--warning PERCENT',
description: 'PERCENT warning before warning (can be change with --ignore-severity)',
proc: proc(&:to_i)
Copy link
Member

@majormoses majormoses Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be floats rather than integers so that you can report and alert on SLA figures which are usually represented in the number of 9's. 99% of 1,000,000 events is not the same as 99.99% of the same number of events.

description: 'PERCENT warning before warning (can be change with --ignore-severity)',
proc: proc(&:to_i)

option :warning_count,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I realize this is following the existing pattern it might be a good time to make some would be breaking changes and make this a bit better. Rather than having four options I believe we can reduce this by having --warning, --critical, and something like --alert-on to specify count or percentages. I would suggest default of percentages as we can make better sane defaults for small and larger setups. This reduces the required knobs to get the check working.

description: 'PERCENT critical before critical (can be change with --ignore-severity)',
proc: proc(&:to_i)

option :critical_count,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about combining options.

verify_ssl: verify_mode)
JSON.parse(request.get)
rescue Errno::ECONNREFUSED
warning 'Connection refused'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to change anything here but in general it is better to return a status and a message from a function rather than executing immediately as it makes it harder to write tests for small units of code.

end

def output_results
severities = %w(critical warning unknown)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might make more sense to leverage something like this for future proofing new severities, unlike the previous one we would want to reject ok. While this is unlikely to occur in the short term as sensu has adhered to the check spec of nagios. With sensu 2.x on the horizon I believe they will likely enhance upon the nagios check spec at some point in the future.


def aggregate_results
results = api_request("/aggregates/#{config[:check]}?max_age=#{config[:age]}")['results']
warning "No aggregates found in last #{config[:age]} seconds" if %w(ok warning critical unknown).all? { |x| results[x].zero? }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might make more sense to leverage something like this for future proofing new severities. While this is unlikely to occur in the short term as sensu has adhered to the check spec of nagios with sensu 2.x on the horizon it's possible they might want to enhance on top of the nagios check spec.

@majormoses
Copy link
Member

closing due to lack of response, to get this through you can either re-open the PR or open a new one.

@majormoses majormoses closed this Sep 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants