-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
health check: structured active healthcheck logging #3176
health check: structured active healthcheck logging #3176
Conversation
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@junr03 please take first pass. |
// [#protodoc-title: Health check logging] | ||
// [#proto-status: draft] | ||
|
||
message ActiveHealthCheckEvent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #2028 you mentioned shared pieces of this proto for logging active health checking events and a future outlier detection proto-ized version of the logs. However, the proto you have laid out is specific to active hc, and is located next to the hc proto instead of inside /cluster. Just want to hear your thoughts on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I didn't know when/if we'd get around to converting the outlier logging to proto, and I liked the idea of colocating the associated component's proto w/ the logging proto, so I just kept it simple for now. I left the proto as draft status so we can always refactor in the future.
|
||
/** | ||
* Log an unhealthy host ejection event. | ||
* @param host supplies the host that generated the event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs for additional params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
/** | ||
* Log a healthy host addition event. | ||
* @param host supplies the host that generated the event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs for additional params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
HealthCheckEventLoggerSharedPtr event_logger; | ||
if (!hc_config.event_log_path().empty()) { | ||
event_logger = | ||
std::make_shared<HealthCheckEventLoggerImpl>(log_manager, hc_config.event_log_path()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions about the event logger's shared_ptr:
-
It seems that each health checker is the sole owner of the event logger, as each health checker construction is preceded by a event logger construction. Is the reason to make is a shared ptr because below we pass the event logger to a factory context and that context can share the event logger to more owners?Ah I see, the ownership is shared between the health checker and all the active sessions. Why wouldn't the healthchecker own the event logger and give a reference to the active sessions? Or have the active session write to theparent_
'sevent_logger_
? -
Why do we return the event logger's shared ptr from the context, and pass it into constructors by reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gosh, I can't remember, I kind of just fixed a couple signatures and then propagated necessary changes until everything compiled. But you're right that it all feels weird. I'll refactor so it makes more sense. I think I only had shared_ptrs in the first place because that's how the outlier loggers were structured..
message ActiveHealthCheckEvent { | ||
string host_address = 1 [(validate.rules).string.min_bytes = 1]; | ||
string cluster_name = 2 [(validate.rules).string.min_bytes = 1]; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we log the type of health checking that caused the event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, yup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding. Can we also thread the type of healthchecker (grpc, http, redis, etc.) that is causing the ejection into the log?
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@junr03 updated PTAL |
merged master, @junr03 PTAL before the merge conflicts pile up? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one last comment from me. Sorry for the lag here.
} | ||
} | ||
|
||
enum HealthCheckFailureType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this. I was also wondering about adding the type of health check that failed (redis, http, tcp).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Right. Sure, will do.
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
PTAL @junr03 |
PTAL @danielhochman |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed all my comments. Thanks
needs docs. is there still some question about where to put them? i would think a section in the arch overview health checking docs (similar to outlier detection logging). and a reference in common_messages.rst i would also add a TODO or open an issue to convert outlier detection to use the same logging mechanism. |
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
PTAL @junr03 @danielhochman |
also opened #3405 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one small nit in the docs, otherwise @mattklein123 please take a look
-------------------------- | ||
|
||
A per-healthchecker log of ejection and addition events can optionally be produced by Envoy by | ||
specifying a log file path in `the HealthCheckConfig <envoy_api_field_core.HealthCheck.event_log_path>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use snake_case for field names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool stuff. Code structure LGTM. I have some API comments/questions but hopefully should be pretty quick to change.
|
||
// [#protodoc-title: Health check logging] | ||
// :ref:`Health check logging <arch_overview_health_check_logging>`. | ||
// [#proto-status: draft] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of draft status, let's just move this into the v2alpha namespace to clear indicate that the log structure might change in the future. Up to you if you want to do envoy.api.v2alpha.core
or some other location that makes more sense.
} | ||
|
||
message HealthCheckEjectUnhealthy { | ||
// The type of failure that caused this ejection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: end sentences with full stops (same in other places).
HealthCheckFailureType failure_type = 1; | ||
|
||
// The timeout after which health checks fail for hosts in this cluster | ||
google.protobuf.Duration timeout = 2 [(validate.rules).duration.required = true]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little strange to include timeout when we also log Active/Passive failures. Additionally, we aren't logging all the other types of timeouts that have been added. Unless we are going to log times specific to this event, I would probably drop timeout/unhealthy_threshold from this log since it can be intuited from the config.
// The number of healthy health checks required before a host is marked | ||
// healthy. Note that during startup, only a single successful health check is | ||
// required to mark a host healthy. | ||
google.protobuf.UInt32Value healthy_threshold = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for similar reasons as above I would drop this.
|
||
// Whether this addition is the result of the first ever health check on a host, in which case | ||
// the above healthy_threshold is bypassed and the host is immediately added. | ||
bool first_check = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me if you want to keep this since it is actually situationally dependent, but I could go either way.
@@ -87,7 +87,8 @@ void MessageUtil::loadFromFile(const std::string& path, Protobuf::Message& messa | |||
} | |||
|
|||
std::string MessageUtil::getJsonStringFromMessage(const Protobuf::Message& message, | |||
const bool pretty_print) { | |||
const bool pretty_print, | |||
const bool always_print_primitive_fields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
humorously, I was just looking at adding this in a different change. Cool!
@@ -213,7 +213,8 @@ class MessageUtil { | |||
* @return std::string of formatted JSON object. | |||
*/ | |||
static std::string getJsonStringFromMessage(const Protobuf::Message& message, | |||
bool pretty_print = false); | |||
bool pretty_print = false, | |||
bool always_print_primitive_fields = false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: update doc comment
Oh, please add a release note also. Thank you! |
@jsedgwick see #3494. Let's put the health check access log proto somewhere in the "output" folder (or whatever we end up calling it when that PR is merged). Thank you! cc @htuch |
PTAL @mattklein123 I think I addressed all your comments + moved format proto to data/core/v2alpha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -176,6 +176,9 @@ message HealthCheck { | |||
// | |||
// The default value for "healthy edge interval" is the same as the default interval. | |||
google.protobuf.Duration healthy_edge_interval = 16; | |||
|
|||
// Specifies the path to the health check event log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we cross link to relevant docs here about the event log, and also specify that if empty, no event log will be written?
// :ref:`Health check logging <arch_overview_health_check_logging>`. | ||
|
||
message HealthCheckEvent { | ||
HealthCheckerType health_checker_type = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add enum validation here
|
||
message HealthCheckEjectUnhealthy { | ||
// The type of failure that caused this ejection. | ||
HealthCheckFailureType failure_type = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enum validation
|
||
message HealthCheckEvent { | ||
HealthCheckerType health_checker_type = 1; | ||
string host_address = 2 [(validate.rules).string.min_bytes = 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, full address preferable. The only situation it might not be the preferred choice is if we don't have the possibility of there being a port encoded (i.e. not :80
), it's definitely TCP and it can't be a pipe..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing from #3478... what happens when we have hosts that appear at multiple levels and priorities? Do we have unique HC events, do we have locality/priority information associated with them?
|
||
option (gogoproto.equal_all) = true; | ||
|
||
// [#protodoc-title: Health check logging events] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is logging, doesn't it belong in envoy.data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is in envoy.data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I was focusing on the core
:)
enum HealthCheckFailureType { | ||
ACTIVE = 0; | ||
PASSIVE = 1; | ||
NETWORK = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also consider the discussion in #3478 (comment)
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Sorry folks, this flew off my radar - I updated per above comments with the exception of PTAL @htuch @mattklein123 @danielhochman. Thanks! Edit: Not sure how to reopen... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks. Excited to get this in. Just a few small nits and we can ship.
@@ -185,6 +185,10 @@ message HealthCheck { | |||
// | |||
// The default value for "healthy edge interval" is the same as the default interval. | |||
google.protobuf.Duration healthy_edge_interval = 16; | |||
|
|||
// Specifies the path to the :ref:`healthy check event log <arch_overview_health_check_logging>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/healthy/health
@@ -185,6 +185,10 @@ message HealthCheck { | |||
// | |||
// The default value for "healthy edge interval" is the same as the default interval. | |||
google.protobuf.Duration healthy_edge_interval = 16; | |||
|
|||
// Specifies the path to the :ref:`healthy check event log <arch_overview_health_check_logging>`. | |||
// health check event log. If empty, no event log will be written. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate "health check event log."
HealthCheckEventLoggerImpl(AccessLog::AccessLogManager& log_manager, const std::string& file_name) | ||
: file_(log_manager.createAccessLog(file_name)) {} | ||
|
||
virtual ~HealthCheckEventLoggerImpl() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: delete
(still working on the DCO issue) |
oops, will fix docs |
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Description:
As discussed in #2028
The idea is to transition to proto logging when real sinks are available, but right now I stop short and dump the proto to a file in JSON form. Compare with the manual JSON of outlier detection logging, which could now be migrated to this approach.
The only TODO is docs. Where should this be documented considering there are proto changes as well?
Risk Level: Low / Medium
Testing:
Added new tests and updated existing
Docs Changes:
TBD pending discussion of approach. Where should I document this logging?
Release Notes:
Will add w/ docs.
Fixes #2028