Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for maximum size for Output of checks #5233

Merged
merged 14 commits into from
Jun 26, 2019

Conversation

pierresouchay
Copy link
Contributor

This PR will allow administrator to limit the size of Output produced by healthchecks,
when set at agent level, it will be the maximum output for all healthchecks produced
by agents.

It will also allow specific healthchecks to go below the limit set by administrator,
OutputMaxSize if set in check specification will now allow to go beyond agent level
setting, but will allow to reduce it further.

We had the following configuration at agent level

  • check_output_max_size (default: 4k) : max number for all checks of the agent
  • check.OutputMaxSize: (default: 4k)

@orarnon
Copy link

orarnon commented Jan 17, 2019

I can say that we have suffered from big health-checks. We had to use the "args" argument and wrap the health-check with a script. A more native way of reducing the output would have made a big difference while waiting for developers to change or add a param to their endpoint.

This PR will allow administrator to limit the size of Output produced by healthchecks,
when set at agent level, it will be the maximum output for all healthchecks produced
by agents.

It will also allow specific healthchecks to go below the limit set by administrator,
OutputMaxSize if set in check specification will now allow to go beyond agent level
setting, but will allow to reduce it further.

We had the following configuration  at agent level
 * check_output_max_size (default: 4k) : max number for all checks of the agent
 * check.OutputMaxSize: (default: 4k)
@banks banks requested a review from a team May 14, 2019 15:36
Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Pierre,

Thanks for the PR, I have some comments inline.

Additionally, the website's docs should also be updated to reflect these changes.

You can add the CLI and config docs to: website/source/docs/agent/options.html.md, and the check docs to: website/source/api/agent/check.html.md.

agent/checks/check_test.go Outdated Show resolved Hide resolved
agent/checks/check_test.go Outdated Show resolved Hide resolved
agent/agent.go Outdated Show resolved Hide resolved
agent/checks/check.go Outdated Show resolved Hide resolved
if maxOutputSize == 0 {
maxOutputSize = 4096
}
if chkType.OutputMaxSize > 0 && maxOutputSize > chkType.OutputMaxSize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mention in your description, this will only override the agent check if the check's OutputMaxSize is less than the agent's.

Could you expand more on why we shouldn't allow the check's max to override the agent's max?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddygv I wanted to be as conservative as possible (meaning, that for instance, if someone is storing those results in a DB and assumed size will never be more than 4k, we won't break this assumption), but allowing large infrastructure to reduce the size of their checks since it might be hugely impacting when checks are varying a lot (for instance, the normal result if checks if 'OK', while the broken checks returns large stack traces), ask @orarnon for instance who had this issue (we did had it as well on very large clusters in the past)

Increasing this value is trivial and could be done in another patch if someone is resquesting it, but if you really want, we can remove this limit in this PR as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierresouchay Thanks. Can you please expand on the need for the check-specific max output size? Would adding a max for the agent alone be sufficient?

The implementation would be simpler if we only set this max once. Additionally, there is a UX mismatch where someone can discard output at the agent level, but at the check level the max size can only be brought down to 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the reason for that is serialization of payload: if the user specify 0, it means no value has been specified in payload (for instance for a user not setting OutputMaxSize in the check definition and thus preserving backward compatibility.
As you pointed out in a previous comment, if the user completely want to disable checks output, it can be done by disabling completely check output.

agent/config/builder.go Outdated Show resolved Hide resolved
* Renamed `checks.BufSize` to `checks.DefaultBufSize`
* Use the constant where appropriate
* Better error message when `check_output_max_size` has wrong value
@pierresouchay
Copy link
Contributor Author

@freddygv all DONE

@pierresouchay
Copy link
Contributor Author

@freddygv the license/cla does not pass while I agreed for a while (before this PR) the license. Is there is a way to pass this test (that used to pass) ? I have the same issue in #4548, #5455 and #4576

@freddygv
Copy link
Contributor

freddygv commented Jun 3, 2019

@pierresouchay should be fixed now, let us know if it happens again.

Will send back some more comments about this PR tomorrow.

Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierresouchay Thanks for the updates. Just added some more comments inline.

Also, I am not familiar with the output of gRPC health checks, but I'm seeing that here the resulting err.Error() isn't being truncated based on the check output max size. It seems like there should also be a truncation here since the error output could be large.

website/source/docs/agent/options.html.md Outdated Show resolved Hide resolved
@@ -117,6 +117,14 @@ will exit with an error at startup.
[go-sockaddr](https://godoc.org/github.com/hashicorp/go-sockaddr/template)
template

* <a name="check_output_max_size"></a><a href="#check_output_max_size">`-check_output_max_size</a> -
Override the default limit of 4k for maximum size of checks, this is a value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this description could be simplified further to something like:

The maximum output size for health checks. Limiting check output size reduces pressure on Consul servers when many checks are producing large outputs. Defaults to 4096, and must be at least 1. To disable check output capture use discard_check_output.

@@ -117,6 +117,14 @@ will exit with an error at startup.
[go-sockaddr](https://godoc.org/github.com/hashicorp/go-sockaddr/template)
template

* <a name="check_output_max_size"></a><a href="#check_output_max_size">`-check_output_max_size</a> -
Override the default limit of 4k for maximum size of checks, this is a value
between 1 and 4096, setting it to 0 is not allowed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing in the code a constraint that restricts max size to be <= 4096. This PR only checks the lower bound, which I think should be OK.

website/source/api/agent/check.html.md Outdated Show resolved Hide resolved
agent/config/flags.go Outdated Show resolved Hide resolved
@@ -2239,6 +2243,13 @@ func (a *Agent) addCheck(check *structs.HealthCheck, chkType *structs.CheckType,

// Check if already registered
if chkType != nil {
maxOutputSize := a.config.CheckOutputMaxSize
if maxOutputSize == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be unnecessary. How could a.config.CheckOutputMaxSize end up as 0? The config builder has a default value of 4096 and returns an error if the input is < 1.

if maxOutputSize == 0 {
maxOutputSize = 4096
}
if chkType.OutputMaxSize > 0 && maxOutputSize > chkType.OutputMaxSize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierresouchay Thanks. Can you please expand on the need for the check-specific max output size? Would adding a max for the agent alone be sufficient?

The implementation would be simpler if we only set this max once. Additionally, there is a UX mismatch where someone can discard output at the agent level, but at the check level the max size can only be brought down to 1.

agent/agent_endpoint.go Outdated Show resolved Hide resolved
pierresouchay and others added 3 commits June 4, 2019 18:05
suggestion from @freddygv

Co-Authored-By: Freddy <freddygv@users.noreply.github.com>
suggestion from @freddygv

Co-Authored-By: Freddy <freddygv@users.noreply.github.com>
@hanshasselberg
Copy link
Member

@pierresouchay I just looked at your PR again and I think there are some comments left.

pierresouchay and others added 2 commits June 24, 2019 17:38
@freddygv
Copy link
Contributor

@pierresouchay Thanks for addressing those comments, this is almost ready. The last portion is to update a test around the TTL check update, so that CI can pass.

The truncation behavior was moved from AgentUpdateCheck to updateTTLCheck but the test for AgentUpdateCheck still verifies that behavior:
https://travis-ci.org/hashicorp/consul/jobs/549970434

So the sub-test TestAgent_UpdateCheck/log_output_limit can be removed and that behavior should be tested here instead.

Check that OutputMaxSize is properly taken into account.

BUGFIX: properly update local state on update of TTL
@pierresouchay
Copy link
Contributor Author

pierresouchay commented Jun 25, 2019

@freddygv I did find actually a bug doing the feature and realizing that the agent state was not updated (I found the bug while looking at weird results for TestAgent_UpdateCheck/log_output_limit ), thus I kept the existing tests and added one.

I actually wonder if it might linked to #4460

@freddygv freddygv merged commit 0e907f5 into hashicorp:master Jun 26, 2019
@freddygv
Copy link
Contributor

Looks good, thanks for working through those changes!

Regarding the issue you referenced about syncChanges, it does seem like they're right, but I'd have to dig deeper to make sure it can be removed from the API endpoints.

@pierresouchay
Copy link
Contributor Author

@freddygv thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants