Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

health check: structured active healthcheck logging #3176

Merged
merged 24 commits into from
Jul 5, 2018
Merged

health check: structured active healthcheck logging #3176

merged 24 commits into from
Jul 5, 2018

Conversation

jsedgwick
Copy link

Description:
As discussed in #2028

The idea is to transition to proto logging when real sinks are available, but right now I stop short and dump the proto to a file in JSON form. Compare with the manual JSON of outlier detection logging, which could now be migrated to this approach.

The only TODO is docs. Where should this be documented considering there are proto changes as well?

Risk Level: Low / Medium

Testing:
Added new tests and updated existing

Docs Changes:
TBD pending discussion of approach. Where should I document this logging?

Release Notes:
Will add w/ docs.

Fixes #2028

James Sedgwick added 5 commits April 23, 2018 12:06
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@mattklein123
Copy link
Member

@junr03 please take first pass.

// [#protodoc-title: Health check logging]
// [#proto-status: draft]

message ActiveHealthCheckEvent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #2028 you mentioned shared pieces of this proto for logging active health checking events and a future outlier detection proto-ized version of the logs. However, the proto you have laid out is specific to active hc, and is located next to the hc proto instead of inside /cluster. Just want to hear your thoughts on that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I didn't know when/if we'd get around to converting the outlier logging to proto, and I liked the idea of colocating the associated component's proto w/ the logging proto, so I just kept it simple for now. I left the proto as draft status so we can always refactor in the future.


/**
* Log an unhealthy host ejection event.
* @param host supplies the host that generated the event.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs for additional params

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


/**
* Log a healthy host addition event.
* @param host supplies the host that generated the event.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs for additional params

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

HealthCheckEventLoggerSharedPtr event_logger;
if (!hc_config.event_log_path().empty()) {
event_logger =
std::make_shared<HealthCheckEventLoggerImpl>(log_manager, hc_config.event_log_path());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions about the event logger's shared_ptr:

  1. It seems that each health checker is the sole owner of the event logger, as each health checker construction is preceded by a event logger construction. Is the reason to make is a shared ptr because below we pass the event logger to a factory context and that context can share the event logger to more owners? Ah I see, the ownership is shared between the health checker and all the active sessions. Why wouldn't the healthchecker own the event logger and give a reference to the active sessions? Or have the active session write to the parent_'s event_logger_?

  2. Why do we return the event logger's shared ptr from the context, and pass it into constructors by reference?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gosh, I can't remember, I kind of just fixed a couple signatures and then propagated necessary changes until everything compiled. But you're right that it all feels weird. I'll refactor so it makes more sense. I think I only had shared_ptrs in the first place because that's how the outlier loggers were structured..

message ActiveHealthCheckEvent {
string host_address = 1 [(validate.rules).string.min_bytes = 1];
string cluster_name = 2 [(validate.rules).string.min_bytes = 1];

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we log the type of health checking that caused the event?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, yup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding. Can we also thread the type of healthchecker (grpc, http, redis, etc.) that is causing the ejection into the log?

James Sedgwick added 3 commits April 26, 2018 11:33
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

@junr03 updated PTAL

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

merged master, @junr03 PTAL before the merge conflicts pile up?

Copy link
Member

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last comment from me. Sorry for the lag here.

}
}

enum HealthCheckFailureType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this. I was also wondering about adding the type of health check that failed (redis, http, tcp).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Right. Sure, will do.

James Sedgwick added 3 commits May 8, 2018 18:33
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

PTAL @junr03

@jsedgwick
Copy link
Author

PTAL @danielhochman

Copy link
Member

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed all my comments. Thanks

@danielhochman
Copy link
Contributor

needs docs. is there still some question about where to put them? i would think a section in the arch overview health checking docs (similar to outlier detection logging). and a reference in common_messages.rst

i would also add a TODO or open an issue to convert outlier detection to use the same logging mechanism.

James Sedgwick added 4 commits May 11, 2018 08:29
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

PTAL @junr03 @danielhochman

@jsedgwick
Copy link
Author

also opened #3405

@mattklein123 mattklein123 self-assigned this May 16, 2018
Copy link
Member

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small nit in the docs, otherwise @mattklein123 please take a look

--------------------------

A per-healthchecker log of ejection and addition events can optionally be produced by Envoy by
specifying a log file path in `the HealthCheckConfig <envoy_api_field_core.HealthCheck.event_log_path>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use snake_case for field names

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool stuff. Code structure LGTM. I have some API comments/questions but hopefully should be pretty quick to change.


// [#protodoc-title: Health check logging]
// :ref:`Health check logging <arch_overview_health_check_logging>`.
// [#proto-status: draft]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of draft status, let's just move this into the v2alpha namespace to clear indicate that the log structure might change in the future. Up to you if you want to do envoy.api.v2alpha.core or some other location that makes more sense.

}

message HealthCheckEjectUnhealthy {
// The type of failure that caused this ejection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: end sentences with full stops (same in other places).

HealthCheckFailureType failure_type = 1;

// The timeout after which health checks fail for hosts in this cluster
google.protobuf.Duration timeout = 2 [(validate.rules).duration.required = true];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little strange to include timeout when we also log Active/Passive failures. Additionally, we aren't logging all the other types of timeouts that have been added. Unless we are going to log times specific to this event, I would probably drop timeout/unhealthy_threshold from this log since it can be intuited from the config.

// The number of healthy health checks required before a host is marked
// healthy. Note that during startup, only a single successful health check is
// required to mark a host healthy.
google.protobuf.UInt32Value healthy_threshold = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for similar reasons as above I would drop this.


// Whether this addition is the result of the first ever health check on a host, in which case
// the above healthy_threshold is bypassed and the host is immediately added.
bool first_check = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me if you want to keep this since it is actually situationally dependent, but I could go either way.

@@ -87,7 +87,8 @@ void MessageUtil::loadFromFile(const std::string& path, Protobuf::Message& messa
}

std::string MessageUtil::getJsonStringFromMessage(const Protobuf::Message& message,
const bool pretty_print) {
const bool pretty_print,
const bool always_print_primitive_fields) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

humorously, I was just looking at adding this in a different change. Cool!

@@ -213,7 +213,8 @@ class MessageUtil {
* @return std::string of formatted JSON object.
*/
static std::string getJsonStringFromMessage(const Protobuf::Message& message,
bool pretty_print = false);
bool pretty_print = false,
bool always_print_primitive_fields = false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: update doc comment

@mattklein123
Copy link
Member

Oh, please add a release note also. Thank you!

@mattklein123
Copy link
Member

@jsedgwick see #3494. Let's put the health check access log proto somewhere in the "output" folder (or whatever we end up calling it when that PR is merged). Thank you! cc @htuch

James Sedgwick added 2 commits May 31, 2018 10:51
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

PTAL @mattklein123 I think I addressed all your comments + moved format proto to data/core/v2alpha

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change LGTM, I have a few nits on the proto, and a question on proto structure for @htuch and @mrice32, though since this is v2alpha we can always change it later so IMO it's not a big deal either way right now.

@@ -176,6 +176,9 @@ message HealthCheck {
//
// The default value for "healthy edge interval" is the same as the default interval.
google.protobuf.Duration healthy_edge_interval = 16;

// Specifies the path to the health check event log.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we cross link to relevant docs here about the event log, and also specify that if empty, no event log will be written?

// :ref:`Health check logging <arch_overview_health_check_logging>`.

message HealthCheckEvent {
HealthCheckerType health_checker_type = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add enum validation here


message HealthCheckEjectUnhealthy {
// The type of failure that caused this ejection.
HealthCheckFailureType failure_type = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum validation


message HealthCheckEvent {
HealthCheckerType health_checker_type = 1;
string host_address = 2 [(validate.rules).string.min_bytes = 1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htuch @mrice32 I haven't been fully tracking the convo in #3478. What are we thinking here for this type of data? Should we be using the full address type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, full address preferable. The only situation it might not be the preferred choice is if we don't have the possibility of there being a port encoded (i.e. not :80), it's definitely TCP and it can't be a pipe..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing from #3478... what happens when we have hosts that appear at multiple levels and priorities? Do we have unique HC events, do we have locality/priority information associated with them?


option (gogoproto.equal_all) = true;

// [#protodoc-title: Health check logging events]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is logging, doesn't it belong in envoy.data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in envoy.data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I was focusing on the core :)

enum HealthCheckFailureType {
ACTIVE = 0;
PASSIVE = 1;
NETWORK = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also consider the discussion in #3478 (comment)

@stale
Copy link

stale bot commented Jun 18, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 18, 2018
@stale
Copy link

stale bot commented Jun 25, 2018

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Jun 25, 2018
James Sedgwick added 3 commits June 29, 2018 11:35
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

jsedgwick commented Jun 30, 2018

Sorry folks, this flew off my radar - I updated per above comments with the exception of
#3176 (comment)
since I couldn't tell if there was a conclusion and whether the enum I introduced is inappropriate. As Matt pointed out its v2alpha so probably doesn't have to settles now.

PTAL @htuch @mattklein123 @danielhochman. Thanks!

Edit: Not sure how to reopen...

@junr03 junr03 reopened this Jun 30, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jun 30, 2018
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks. Excited to get this in. Just a few small nits and we can ship.

@@ -185,6 +185,10 @@ message HealthCheck {
//
// The default value for "healthy edge interval" is the same as the default interval.
google.protobuf.Duration healthy_edge_interval = 16;

// Specifies the path to the :ref:`healthy check event log <arch_overview_health_check_logging>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/healthy/health

@@ -185,6 +185,10 @@ message HealthCheck {
//
// The default value for "healthy edge interval" is the same as the default interval.
google.protobuf.Duration healthy_edge_interval = 16;

// Specifies the path to the :ref:`healthy check event log <arch_overview_health_check_logging>`.
// health check event log. If empty, no event log will be written.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove duplicate "health check event log."

HealthCheckEventLoggerImpl(AccessLog::AccessLogManager& log_manager, const std::string& file_name)
: file_(log_manager.createAccessLog(file_name)) {}

virtual ~HealthCheckEventLoggerImpl() {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: delete

James Sedgwick added 2 commits July 5, 2018 10:36
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

(still working on the DCO issue)

@jsedgwick
Copy link
Author

oops, will fix docs

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit a5da078 into envoyproxy:master Jul 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants