Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple "Error 13" message improvement #666

Merged
merged 78 commits into from
Apr 1, 2021
Merged

Conversation

eric846
Copy link
Contributor

@eric846 eric846 commented Mar 29, 2021

When there is a URI exception or failing failure predicate, log troubleshooting tips and return the tips in the Nighthawk Service error description.

Add a default failure predicate for benchmark.stream_resets > 0, since stream resets usually indicate a misconfigured protocol.

NOTE: If you expect stream resets in a benchmark session, run with --failure-predicate benchmark.stream_resets:1000000000 to avoid early termination.

Related to #181 and #613.

eric846 added 30 commits June 1, 2020 17:23
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…double

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…variable_setter

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…ent.Output turns out not to include the status

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…plugin names, log thresholds only once per session

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…lure predicate for stream_resets

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
@eric846 eric846 added the waiting-for-review A PR waiting for a review. label Mar 29, 2021
oschaaf
oschaaf previously approved these changes Mar 29, 2021
Copy link
Member

@oschaaf oschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM

@dubious90
Copy link
Contributor

@wjuan-AFK Please review and assign back to me once done

@dubious90 dubious90 requested a review from wjuan-AFK March 29, 2021 16:00
wjuan-AFK
wjuan-AFK previously approved these changes Mar 30, 2021
Copy link
Contributor

@wjuan-AFK wjuan-AFK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if (counters.find("sequencer.failed_terminations") == counters.end()) {
return true;
} else {
ENVOY_LOG(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(optional) How do you feel about splitting this into two log statements - one error log "Terminated early because of a failure predicate" and a second log (info/debug/verbose/something) with all of these extra details?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

response.mutable_error_detail()->set_message(
"Unknown failure. See Nighthawk Service logs. Make sure the URI is well formed and the DNS "
"name resolves (if applicable). Check the output for problematic counter values. The "
"default Nighthawk failure predicates report failure if (1) Nighthawk could not connect to "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there was a lot of churn in this investigation, so please feel free to push back.

Were we unable to identify where the predicates are resulting in failures? This feels like it'd be much better if we could instead log at the point of a predicate failing what predicate was failing and what its value was set to. Is that not possible?

Copy link
Contributor Author

@eric846 eric846 Mar 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Termination predicates are based on counters which don't seem to store their own names. See termination_predicate_impl.h. We could log the first time each counter was incremented in benchmark_client_impl.cc, without knowing whether it will trigger a failure predicate. Logging in either place could impact performance unless we add logic to bypass logging on subsequent times. The user would have to check the log for the detailed info since it would take more code to propagate the information to the point where we return from the service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is ENVOY_LOG_ONCE these days... but thinking about this, maybe if we'd allow wiring in some sort of callback when constructing predicates, which would fire when one indicates termination, that could be helpful here. There is one gotcha to that, which is that one part of the chain might conditionally override the return value of another. This is hypothetical with the current predicate implementations, but conceptually it doesn't feel right to me. So thinking out loud some more, maybe this callback should be specific to a new top level predicate implementation, which never indicates termination all by itself - it relies on the next links in the chain to figure that out -- but is just there to serialize what happened in the chain when termination commenced, for the purpose of communicating that upwards. It think it would be useful to be able to communicate this in a structured way, especially in the gRPC flows.
TLDR; the current state seems like a good enhancement to me, perhaps punt improving predicates to do a better job at facilitating UX to an issue and discuss further out of the context of this issue if need be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree that we can go forward without fixing this in this PR. Creating an issue for improving termination predicates (with logging or something) sounds like a good idea to me.

"Unknown failure. See Nighthawk Service logs. Make sure the URI is well formed and the DNS "
"name resolves (if applicable). Check the output for problematic counter values. The "
"default Nighthawk failure predicates report failure if (1) Nighthawk could not connect to "
"the target (see 'benchmark.pool_connection_failure' counter; check the address and port "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of whether or not we make the above change, can we create a TODO/issue a bug for moving this to a documentation location for common trouble areas (similar to our internal playbooks)? This whole log feels like it really belongs in a markdown document rather than in a log statement.

(Afterwards, we can instead recommend that people reference our troubleshooting document, instead of including everything in this log statement)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #668.

eric846 added 3 commits March 30, 2021 18:06
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
@eric846 eric846 dismissed stale reviews from wjuan-AFK and oschaaf via 5520728 April 1, 2021 15:13
@dubious90 dubious90 merged commit 8c41c54 into envoyproxy:main Apr 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-review A PR waiting for a review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants