health check: structured active healthcheck logging #3176

jsedgwick · 2018-04-23T22:33:18Z

Description:
As discussed in #2028

The idea is to transition to proto logging when real sinks are available, but right now I stop short and dump the proto to a file in JSON form. Compare with the manual JSON of outlier detection logging, which could now be migrated to this approach.

The only TODO is docs. Where should this be documented considering there are proto changes as well?

Risk Level: Low / Medium

Testing:
Added new tests and updated existing

Docs Changes:
TBD pending discussion of approach. Where should I document this logging?

Release Notes:
Will add w/ docs.

Fixes #2028

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

mattklein123 · 2018-04-23T22:35:20Z

@junr03 please take first pass.

junr03 · 2018-04-24T17:45:35Z

api/envoy/api/v2/core/health_check_logging.proto

+// [#protodoc-title: Health check logging]
+// [#proto-status: draft]
+
+message ActiveHealthCheckEvent {


In #2028 you mentioned shared pieces of this proto for logging active health checking events and a future proto-ized version of the outlier detection logs. However, the proto you have laid out is specific to active hc, and is located next to the hc proto instead of inside /cluster. Just want to hear your thoughts on that.

Yeah I didn't know when/if we'd get around to converting the outlier logging to proto, and I liked the idea of colocating the associated component's proto w/ the logging proto, so I just kept it simple for now. I left the proto as draft status so we can always refactor in the future.

junr03 · 2018-04-24T17:58:45Z

include/envoy/upstream/health_checker.h

+
+  /**
+   * Log an unhealthy host ejection event.
+   * @param host supplies the host that generated the event.


docs for additional params

junr03 · 2018-04-24T17:58:57Z

include/envoy/upstream/health_checker.h

+
+  /**
+   * Log a healthy host addition event.
+   * @param host supplies the host that generated the event.


docs for additional params

junr03 · 2018-04-24T18:42:48Z

source/common/upstream/health_checker_impl.cc

+  HealthCheckEventLoggerSharedPtr event_logger;
+  if (!hc_config.event_log_path().empty()) {
+    event_logger =
+        std::make_shared<HealthCheckEventLoggerImpl>(log_manager, hc_config.event_log_path());


Two questions about the event logger's shared_ptr:

It seems that each health checker is the sole owner of the event logger, as each health checker construction is preceded by an event logger construction. Is the reason to make it a shared_ptr that below we pass the event logger to a factory context, and that context can share the event logger with more owners? Ah I see, the ownership is shared between the health checker and all the active sessions. Why wouldn't the health checker own the event logger and give a reference to the active sessions? Or have the active session write to the parent_'s event_logger_?

Why do we return the event logger's shared ptr from the context, and pass it into constructors by reference?

Gosh, I can't remember, I kind of just fixed a couple signatures and then propagated necessary changes until everything compiled. But you're right that it all feels weird. I'll refactor so it makes more sense. I think I only had shared_ptrs in the first place because that's how the outlier loggers were structured..
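
For illustration, the single-owner alternative raised in this thread might look roughly like the sketch below. This is not the code from the PR: `EventLogger`, `HealthChecker`, `ActiveSession`, and every method name are hypothetical stand-ins, and the logger records lines in memory rather than writing to a file. The checker owns the logger through a `std::unique_ptr`, and each session reaches it through a reference to its parent.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the event logger. It records lines in memory
// instead of writing to a file so the sketch stays self-contained.
class EventLogger {
public:
  void log(const std::string& line) { lines_.push_back(line); }
  const std::vector<std::string>& lines() const { return lines_; }

private:
  std::vector<std::string> lines_;
};

class HealthChecker;

// Hypothetical per-host session. It holds a reference to its parent and
// never owns the logger itself.
class ActiveSession {
public:
  ActiveSession(HealthChecker& parent, std::string host)
      : parent_(parent), host_(std::move(host)) {
    assert(!host_.empty()); // A session always checks a named host.
  }
  void onFailure();

private:
  HealthChecker& parent_;
  std::string host_;
};

// The health checker is the sole owner of the logger. A null logger
// stands for "no event log path configured".
class HealthChecker {
public:
  explicit HealthChecker(std::unique_ptr<EventLogger> logger)
      : event_logger_(std::move(logger)) {}

  ActiveSession& addSession(const std::string& host) {
    sessions_.push_back(std::make_unique<ActiveSession>(*this, host));
    return *sessions_.back();
  }

  void logEjection(const std::string& host) {
    if (event_logger_ != nullptr) {
      event_logger_->log("eject " + host);
    }
  }

  const EventLogger* eventLogger() const { return event_logger_.get(); }

private:
  std::unique_ptr<EventLogger> event_logger_;
  // Sessions are owned by the checker and declared after the logger, so
  // they are destroyed before it.
  std::vector<std::unique_ptr<ActiveSession>> sessions_;
};

void ActiveSession::onFailure() { parent_.logEjection(host_); }
```

With this shape, nothing outside the checker can extend the logger's lifetime, which is the property the reviewer is asking about. A real implementation would also have to decide whether the checker may be moved, since sessions keep a reference to it.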

junr03 · 2018-04-24T18:55:44Z

api/envoy/api/v2/core/health_check_logging.proto

+message ActiveHealthCheckEvent {
+  string host_address = 1 [(validate.rules).string.min_bytes = 1];
+  string cluster_name = 2 [(validate.rules).string.min_bytes = 1];
+


can we log the type of health checking that caused the event?

thanks for adding. Can we also thread the type of healthchecker (grpc, http, redis, etc.) that is causing the ejection into the log?

jsedgwick · 2018-04-26T19:24:38Z

@junr03 updated PTAL

jsedgwick · 2018-05-01T17:28:23Z

merged master, @junr03 PTAL before the merge conflicts pile up?

junr03

one last comment from me. Sorry for the lag here.

junr03 · 2018-05-03T14:41:13Z

api/envoy/api/v2/core/health_check_logging.proto

+  }
+}
+
+enum HealthCheckFailureType {


thanks for adding this. I was also wondering about adding the type of health check that failed (redis, http, tcp).

Oh! Right. Sure, will do.
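
As a rough picture of what the reviewer is asking for, the sketch below carries both the failure type and the checker type on one event. The failure values match the enum in the proto under review; the checker values are an assumption drawn from the kinds named in these comments (http, tcp, grpc, redis), and `EjectEvent`, `checkerName`, `failureName`, and `describe` are hypothetical names used only here.

```cpp
#include <string>

// Failure types as they appear in the proto under review.
enum class HealthCheckFailureType { Active, Passive, Network };

// Hypothetical checker types, taken from the kinds named in the review
// comments. The values in the final proto may differ.
enum class HealthCheckerType { Http, Tcp, Grpc, Redis };

// Hypothetical ejection event carrying both pieces of information.
struct EjectEvent {
  HealthCheckerType checker_type;
  HealthCheckFailureType failure_type;
  std::string host_address;
};

const char* checkerName(HealthCheckerType type) {
  switch (type) {
  case HealthCheckerType::Http:
    return "HTTP";
  case HealthCheckerType::Tcp:
    return "TCP";
  case HealthCheckerType::Grpc:
    return "GRPC";
  case HealthCheckerType::Redis:
    return "REDIS";
  }
  return "UNKNOWN";
}

const char* failureName(HealthCheckFailureType type) {
  switch (type) {
  case HealthCheckFailureType::Active:
    return "ACTIVE";
  case HealthCheckFailureType::Passive:
    return "PASSIVE";
  case HealthCheckFailureType::Network:
    return "NETWORK";
  }
  return "UNKNOWN";
}

// Renders one human-readable line per event.
std::string describe(const EjectEvent& event) {
  return std::string(checkerName(event.checker_type)) + " checker ejected " +
         event.host_address + " (" + failureName(event.failure_type) + " failure)";
}
```

Keeping the two enums separate lets a log consumer tell how a check failed apart from which kind of checker observed it.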

jsedgwick · 2018-05-09T01:48:06Z

PTAL @junr03

jsedgwick · 2018-05-10T14:53:19Z

PTAL @danielhochman

junr03

addressed all my comments. Thanks

danielhochman · 2018-05-10T20:36:20Z

needs docs. is there still some question about where to put them? i would think a section in the arch overview health checking docs (similar to outlier detection logging). and a reference in common_messages.rst

i would also add a TODO or open an issue to convert outlier detection to use the same logging mechanism.

jsedgwick · 2018-05-16T19:07:30Z

PTAL @junr03 @danielhochman

jsedgwick · 2018-05-16T19:10:00Z

also opened #3405

junr03

one small nit in the docs, otherwise @mattklein123 please take a look

junr03 · 2018-05-22T17:03:18Z

docs/root/intro/arch_overview/health_checking.rst

+--------------------------
+
+A per-healthchecker log of ejection and addition events can optionally be produced by Envoy by
+specifying a log file path in `the HealthCheckConfig <envoy_api_field_core.HealthCheck.event_log_path>`.


nit: use snake_case for field names

mattklein123

Cool stuff. Code structure LGTM. I have some API comments/questions but hopefully should be pretty quick to change.

mattklein123 · 2018-05-22T18:27:11Z

api/envoy/api/v2/core/health_check_logging.proto

+
+// [#protodoc-title: Health check logging]
+// :ref:`Health check logging <arch_overview_health_check_logging>`.
+// [#proto-status: draft]


Instead of draft status, let's just move this into the v2alpha namespace to clearly indicate that the log structure might change in the future. Up to you if you want to do envoy.api.v2alpha.core or some other location that makes more sense.

mattklein123 · 2018-05-22T19:27:07Z

api/envoy/api/v2/core/health_check_logging.proto

+}
+
+message HealthCheckEjectUnhealthy {
+  // The type of failure that caused this ejection


nit: end sentences with full stops (same in other places).

mattklein123 · 2018-05-22T19:31:05Z

api/envoy/api/v2/core/health_check_logging.proto

+  HealthCheckFailureType failure_type = 1;
+
+  // The timeout after which health checks fail for hosts in this cluster
+  google.protobuf.Duration timeout = 2 [(validate.rules).duration.required = true];


It's a little strange to include timeout when we also log Active/Passive failures. Additionally, we aren't logging all the other types of timeouts that have been added. Unless we are going to log times specific to this event, I would probably drop timeout/unhealthy_threshold from this log since it can be intuited from the config.

mattklein123 · 2018-05-22T19:31:16Z

api/envoy/api/v2/core/health_check_logging.proto

+  // The number of healthy health checks required before a host is marked
+  // healthy. Note that during startup, only a single successful health check is
+  // required to mark a host healthy.
+  google.protobuf.UInt32Value healthy_threshold = 1;


for similar reasons as above I would drop this.

mattklein123 · 2018-05-22T19:31:43Z

api/envoy/api/v2/core/health_check_logging.proto

+
+  // Whether this addition is the result of the first ever health check on a host, in which case
+  // the above healthy_threshold is bypassed and the host is immediately added.
+  bool first_check = 2;


Fine with me if you want to keep this since it is actually situationally dependent, but I could go either way.

mattklein123 · 2018-05-22T19:32:52Z

source/common/protobuf/utility.cc

@@ -87,7 +87,8 @@ void MessageUtil::loadFromFile(const std::string& path, Protobuf::Message& messa
 }

 std::string MessageUtil::getJsonStringFromMessage(const Protobuf::Message& message,
-                                                  const bool pretty_print) {
+                                                  const bool pretty_print,
+                                                  const bool always_print_primitive_fields) {


humorously, I was just looking at adding this in a different change. Cool!
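
Some background on why the new flag matters for an event log: proto3 JSON output normally leaves out primitive fields that hold their default value, so a `false` boolean would simply be absent from the line. The sketch below imitates that behavior by hand. It does not call the protobuf library, and `AddHealthyEvent` and `toJson` are hypothetical names loosely modeled on the `first_check` field discussed in this review.

```cpp
#include <string>

// Hypothetical event with one primitive field.
struct AddHealthyEvent {
  bool first_check = false;
};

// Hand-written stand-in for proto3 JSON printing: a primitive field that
// holds its default value is skipped unless the caller asks for it.
std::string toJson(const AddHealthyEvent& event, bool always_print_primitive_fields) {
  if (!event.first_check && !always_print_primitive_fields) {
    return "{}";
  }
  return std::string("{\"first_check\":") + (event.first_check ? "true" : "false") + "}";
}
```

For a log that is parsed by other tools, always printing the field keeps every line the same shape, which is presumably why the parameter was threaded through here.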

mattklein123 · 2018-05-22T19:33:08Z

source/common/protobuf/utility.h

@@ -213,7 +213,8 @@ class MessageUtil {
   * @return std::string of formatted JSON object.
   */
  static std::string getJsonStringFromMessage(const Protobuf::Message& message,
-                                              bool pretty_print = false);
+                                              bool pretty_print = false,
+                                              bool always_print_primitive_fields = false);


nit: update doc comment

mattklein123 · 2018-05-22T19:43:29Z

Oh, please add a release note also. Thank you!

mattklein123 · 2018-05-28T22:59:41Z

@jsedgwick see #3494. Let's put the health check access log proto somewhere in the "output" folder (or whatever we end up calling it when that PR is merged). Thank you! cc @htuch

jsedgwick · 2018-06-01T00:25:27Z

PTAL @mattklein123 I think I addressed all your comments + moved format proto to data/core/v2alpha

mattklein123

Change LGTM, I have a few nits on the proto, and a question on proto structure for @htuch and @mrice32, though since this is v2alpha we can always change it later so IMO it's not a big deal either way right now.

mattklein123 · 2018-06-01T21:43:06Z

api/envoy/api/v2/core/health_check.proto

@@ -176,6 +176,9 @@ message HealthCheck {
  //
  // The default value for "healthy edge interval" is the same as the default interval.
  google.protobuf.Duration healthy_edge_interval = 16;
+
+  // Specifies the path to the health check event log.


Can we cross link to relevant docs here about the event log, and also specify that if empty, no event log will be written?

mattklein123 · 2018-06-01T21:44:38Z

api/envoy/data/core/v2alpha/health_check_event.proto

+// :ref:`Health check logging <arch_overview_health_check_logging>`.
+
+message HealthCheckEvent {
+  HealthCheckerType health_checker_type = 1;


can we add enum validation here

mattklein123 · 2018-06-01T21:44:51Z

api/envoy/data/core/v2alpha/health_check_event.proto

+
+message HealthCheckEjectUnhealthy {
+  // The type of failure that caused this ejection.
+  HealthCheckFailureType failure_type = 1;


enum validation

mattklein123 · 2018-06-01T21:46:17Z

api/envoy/data/core/v2alpha/health_check_event.proto

+
+message HealthCheckEvent {
+  HealthCheckerType health_checker_type = 1;
+  string host_address = 2 [(validate.rules).string.min_bytes = 1];


@htuch @mrice32 I haven't been fully tracking the convo in #3478. What are we thinking here for this type of data? Should we be using the full address type?

Yes, full address is preferable. The only situation in which it might not be the preferred choice is when there is no possibility of a port being encoded (i.e. not :80), it's definitely TCP, and it can't be a pipe.

One thing from #3478... what happens when we have hosts that appear at multiple levels and priorities? Do we have unique HC events, do we have locality/priority information associated with them?

htuch · 2018-06-04T19:28:47Z

api/envoy/data/core/v2alpha/health_check_event.proto

+
+option (gogoproto.equal_all) = true;
+
+// [#protodoc-title: Health check logging events]


If this is logging, doesn't it belong in envoy.data?

It is in envoy.data?

My bad, I was focusing on the core :)

htuch · 2018-06-04T19:36:28Z

api/envoy/data/core/v2alpha/health_check_event.proto

+enum HealthCheckFailureType {
+  ACTIVE = 0;
+  PASSIVE = 1;
+  NETWORK = 2;


Please also consider the discussion in #3478 (comment)

stale · 2018-06-18T23:10:59Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

stale · 2018-06-25T23:27:51Z

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

jsedgwick · 2018-06-30T00:02:28Z

Sorry folks, this flew off my radar - I updated per above comments with the exception of
#3176 (comment)
since I couldn't tell if there was a conclusion and whether the enum I introduced is inappropriate. As Matt pointed out, it's v2alpha, so this probably doesn't have to be settled now.

PTAL @htuch @mattklein123 @danielhochman. Thanks!

Edit: Not sure how to reopen...

mattklein123

LGTM, thanks. Excited to get this in. Just a few small nits and we can ship.

mattklein123 · 2018-07-03T18:22:48Z

api/envoy/api/v2/core/health_check.proto

@@ -185,6 +185,10 @@ message HealthCheck {
  //
  // The default value for "healthy edge interval" is the same as the default interval.
  google.protobuf.Duration healthy_edge_interval = 16;
+
+  // Specifies the path to the :ref:`healthy check event log <arch_overview_health_check_logging>`.


s/healthy/health

mattklein123 · 2018-07-03T18:22:59Z

api/envoy/api/v2/core/health_check.proto

@@ -185,6 +185,10 @@ message HealthCheck {
  //
  // The default value for "healthy edge interval" is the same as the default interval.
  google.protobuf.Duration healthy_edge_interval = 16;
+
+  // Specifies the path to the :ref:`healthy check event log <arch_overview_health_check_logging>`.
+  // health check event log. If empty, no event log will be written.


Remove duplicate "health check event log."

mattklein123 · 2018-07-03T18:28:45Z

source/common/upstream/health_checker_base_impl.h

+  HealthCheckEventLoggerImpl(AccessLog::AccessLogManager& log_manager, const std::string& file_name)
+      : file_(log_manager.createAccessLog(file_name)) {}
+
+  virtual ~HealthCheckEventLoggerImpl() {}


nit: delete
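
Reading this nit as a request to drop the explicit empty destructor, the sketch below shows why the line is redundant when the interface already declares a virtual destructor. `Logger`, `FileLogger`, and `Tracker` are hypothetical names, not the classes in this PR.

```cpp
#include <memory>

// Hypothetical interface with a virtual destructor, standing in for the
// logger interface in this PR.
class Logger {
public:
  virtual ~Logger() = default;
  virtual void log() = 0;
};

// Helper member whose destructor records that it ran.
struct Tracker {
  explicit Tracker(bool& flag) : flag_(flag) {}
  ~Tracker() { flag_ = true; }
  bool& flag_;
};

// No destructor is declared here. The implicit one is already virtual
// because the base class destructor is virtual.
class FileLogger : public Logger {
public:
  explicit FileLogger(bool& destroyed) : tracker_(destroyed) {}
  void log() override {}

private:
  Tracker tracker_;
};
```

Destroying a `FileLogger` through a `std::unique_ptr<Logger>` still runs the derived destructor and its members' destructors, so the hand-written `virtual ~...() {}` adds nothing.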

jsedgwick · 2018-07-05T18:00:58Z

(still working on the DCO issue)

jsedgwick · 2018-07-05T18:02:45Z

oops, will fix docs

mattklein123

Thanks!

James Sedgwick added 5 commits April 23, 2018 12:06

health check event logging

fea8182

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

Merge branch 'master' of https://github.com/envoyproxy/envoy into healthcheck_logging

5531843

fix redis

662d705

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

another refactor

05b9194

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

format

6799f87

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

mattklein123 assigned junr03 Apr 23, 2018

junr03 reviewed Apr 24, 2018

James Sedgwick added 3 commits April 26, 2018 11:33

Merge branch 'master' of https://github.com/envoyproxy/envoy into healthcheck_logging

22af255

address comments

79cf7cc

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

format

9f85fdf

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

merge

6c75857

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

junr03 reviewed May 3, 2018

James Sedgwick added 3 commits May 8, 2018 18:33

add health checker types

099f8ec

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

Merge branch 'master' of https://github.com/envoyproxy/envoy into healthcheck_logging

a5e2ab1

fix after merge

1c0dc24

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

jsedgwick mentioned this pull request May 10, 2018

Ability to format the outlier event log entry #3335

Closed

junr03 reviewed May 10, 2018

James Sedgwick added 4 commits May 11, 2018 08:29

Merge branch 'master' of https://github.com/envoyproxy/envoy into healthcheck_logging

826c99d

merge and add docs

421f3d9

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

tweak

0294538

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

format

92b8dbe

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

jsedgwick mentioned this pull request May 16, 2018

Convert outlier event logging to proto #3405

Closed

mattklein123 self-assigned this May 16, 2018

junr03 reviewed May 22, 2018

mattklein123 reviewed May 22, 2018

James Sedgwick added 2 commits May 31, 2018 10:51

Merge branch 'master' of https://github.com/envoyproxy/envoy into healthcheck_logging

197db40

update

9eadc6c

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

mattklein123 reviewed Jun 1, 2018

htuch reviewed Jun 4, 2018

stale bot added the stale label Jun 18, 2018

stale bot closed this Jun 25, 2018

James Sedgwick added 3 commits June 29, 2018 11:35

typo

7720da7

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

merge

7d4d018

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

update

d47bbaa

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

junr03 reopened this Jun 30, 2018

stale bot removed the stale label Jun 30, 2018

mattklein123 reviewed Jul 3, 2018

James Sedgwick added 2 commits July 5, 2018 10:36

Merge branch 'master' of https://github.com/envoyproxy/envoy into healthcheck_logging

bc0e04b

nits

100164e

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

docs build

2f27dd3

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>

mattklein123 approved these changes Jul 5, 2018

mattklein123 merged commit a5da078 into envoyproxy:master Jul 5, 2018

nezdolik mentioned this pull request Jan 7, 2019

upstream: Outlier ejection proto logging #5517

Merged

