Add a failure threshold argument (#11)
* Initial pass of percentage of failures to cause the healthcheck to fail

When using the healthcheck for Kubernetes liveness checks, we don't want to kill
off a connector when only a few tasks fail.

* Change name of value, include it in result

* Add in option to change which containers are considered for failure

* Fix argument name

* Fix tests
samrees authored Oct 16, 2020
1 parent 1ba68da commit 5eef1b0
Showing 17 changed files with 107 additions and 18 deletions.
32 changes: 27 additions & 5 deletions README.md
@@ -10,7 +10,7 @@ A simple healthcheck wrapper to monitor Kafka Connect.
<img src="https://i.imgur.com/veSZDFf.png"/>
</p>

Kafka Connect Healthcheck is a server that wraps the Kafka Connect API and provides a singular API endpoint to determine the health of a Kafka Connect instance. This can be used to alert or take action on unhealthy connectors and tasks.

This can be used in numerous ways. It can sit as a standalone service for monitoring purposes, it can be used as a sidecar container to mark Kafka Connect workers as unhealthy in Kubernetes, or it can be used to provide logs of when connectors/tasks failed and reasons for their failures.

@@ -38,7 +38,7 @@ kafka-connect-healthcheck
The server will now be running on [localhost:18083][localhost].

### Docker
The `kafka-connect-healthcheck` image can be found on Docker Hub.

You can pull down the latest image by running:

@@ -55,7 +55,7 @@ docker run --rm -it -p 18083:18083 devshawn/kafka-connect-healthcheck
The server will now be running on [localhost:18083][localhost].

## Configuration
Kafka Connect Healthcheck can be configured via command-line arguments or by environment variables.

#### Port
The port for the `kafka-connect-healthcheck` API.
@@ -86,8 +86,18 @@ The worker ID to monitor (usually the IP address of the connect worker). If none

**Note**: It is highly recommended to run an instance of the healthcheck for each worker if you're planning to restart containers based on the health.

#### Considered Containers
A comma-separated list of the Kafka Connect container types to consider in the healthcheck calculation.

| Usage | Value |
|-----------------------|---------------------------------------------|
| Environment Variable | `HEALTHCHECK_CONSIDERED_CONTAINERS` |
| Command-Line Argument | `--considered-containers` |
| Default Value | `CONNECTOR,TASK` |
| Valid Values | `CONNECTOR`, `TASK` |

#### Unhealthy States
A comma-separated list of connector and task states to be marked as unhealthy.

| Usage | Value |
|-----------------------|---------------------------------------------|
@@ -96,7 +106,19 @@ A comma-separated list of connector and tasks states to be marked as unhealthy.
| Default Value | `FAILED` |
| Valid Values | `FAILED`, `PAUSED`, `UNASSIGNED`, `RUNNING` |

**Note**: It's recommended to keep this defaulted to `FAILED`, but paused connectors or tasks can be marked as unhealthy by passing `FAILED,PAUSED`.

#### Failure Threshold Percentage
A number between 1 and 100. If set, the healthcheck fails only when the percentage of failed connectors and tasks (as selected by Considered Containers) exceeds this value.

| Usage | Value |
|-----------------------|---------------------------------------------|
| Environment Variable | `HEALTHCHECK_FAILURE_THRESHOLD_PERCENTAGE` |
| Command-Line Argument | `--failure-threshold-percentage` |
| Default Value | `0` |
| Valid Values | 1 to 100 |

By default, **any** failures will cause the healthcheck to fail.
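
As a rough illustration of the rule described above, the following sketch converts the percentage to a fraction and compares it with the failure rate. The function name `is_healthy` and its parameters are hypothetical and not part of the project; only the comparison itself mirrors the change to `health.py` in this commit.

```python
def is_healthy(failure_count: int, container_count: int, threshold_percentage: int = 0) -> bool:
    """Return True while the failure rate stays at or below the threshold."""
    # The percentage is converted to a fraction, as in health.py.
    threshold = threshold_percentage * 0.01
    # With nothing to monitor, the rate is taken as 0.0 and the check passes.
    failure_rate = failure_count / container_count if container_count > 0 else 0.0
    return failure_rate <= threshold


# Default threshold of 0: one failure out of four is already unhealthy.
print(is_healthy(1, 4))
# Threshold of 25: one failure out of four sits exactly at the threshold, so it passes.
print(is_healthy(1, 4, 25))
# Two failures out of four (50%) exceed the 25% threshold.
print(is_healthy(2, 4, 25))
```

Note that the comparison is inclusive: a failure rate equal to the threshold still counts as healthy.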

#### Log Level
The level of logs to be shown by the application.
30 changes: 28 additions & 2 deletions kafka_connect_healthcheck/health.py
@@ -22,10 +22,12 @@

 class Health:

-    def __init__(self, connect_url, worker_id, unhealthy_states, auth):
+    def __init__(self, connect_url, worker_id, unhealthy_states, auth, failure_threshold_percentage, considered_containers):
         self.connect_url = connect_url
         self.worker_id = worker_id
         self.unhealthy_states = [x.upper().strip() for x in unhealthy_states]
+        self.failure_threshold = failure_threshold_percentage * .01
+        self.considered_containers = [x.lower().strip() for x in considered_containers]
         self.kwargs = {}
         if auth and ":" in auth:
             self.kwargs["auth"] = tuple(auth.split(":"))
@@ -37,7 +39,31 @@ def get_health_result(self):
             connector_names = self.get_connector_names()
             connector_statuses = self.get_connectors_health(connector_names)
             self.handle_healthcheck(connector_statuses, health_result)
-            health_result["healthy"] = len(health_result["failures"]) == 0
+
+            connector_count = len(connector_names)
+            task_count = sum(len(c["tasks"]) for c in connector_statuses)
+
+            container_count = 0
+            if "connector" in self.considered_containers:
+                container_count += connector_count
+            if "task" in self.considered_containers:
+                container_count += task_count
+
+            failure_count = len([f for f in health_result["failures"] if f["type"] in self.considered_containers])
+
+            # guards against division by zero. if we have no connectors or tasks we are deciding to pass
+            if container_count > 0:
+                health_result["failure_rate"] = failure_count/container_count
+            else:
+                health_result["failure_rate"] = 0.0
+
+            health_result["failure_threshold"] = self.failure_threshold
+            health_result["healthy"] = health_result["failure_rate"] <= health_result["failure_threshold"]
+
+            # broker errors override any failure calculation
+            if any([f for f in health_result["failures"] if f["type"] == "broker"]):
+                health_result["healthy"] = False
+
         except Exception as ex:
             logging.error("Error while attempting to calculate health result. Assuming unhealthy. Error: {}".format(ex))
             logging.error(ex)
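
The calculation in this hunk can be exercised in isolation. The sketch below is a standalone approximation, not the project's code: `summarize_health` and its parameters are hypothetical names, and the failure entries are reduced to a `type` key, which is the only field the calculation reads.

```python
def summarize_health(failures, connector_count, task_count,
                     considered_containers=("connector", "task"),
                     failure_threshold=0.0):
    """Approximate the failure-rate logic from Health.get_health_result."""
    # Only the container types that were asked for contribute to the denominator.
    container_count = 0
    if "connector" in considered_containers:
        container_count += connector_count
    if "task" in considered_containers:
        container_count += task_count

    # Failures of other types (for example "broker") are not counted here.
    failure_count = sum(1 for f in failures if f["type"] in considered_containers)

    # Guard against division by zero: no containers means a rate of 0.0.
    failure_rate = failure_count / container_count if container_count > 0 else 0.0

    healthy = failure_rate <= failure_threshold
    # A broker failure overrides the rate-based result.
    if any(f["type"] == "broker" for f in failures):
        healthy = False

    return {"failure_rate": failure_rate,
            "failure_threshold": failure_threshold,
            "healthy": healthy}


# One connector with two tasks, one of which failed: 1 failure over 3 containers.
print(summarize_health([{"type": "task"}], connector_count=1, task_count=2))
# Counting tasks only changes the denominator to 2.
print(summarize_health([{"type": "task"}], 1, 2, considered_containers=("task",)))
# A broker failure leaves the rate at 0.0 but still marks the result unhealthy.
print(summarize_health([{"type": "broker"}], 1, 2))
```

Restricting the considered containers shrinks both the numerator and the denominator, so the same failure can yield a different rate depending on the setting.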
2 changes: 1 addition & 1 deletion kafka_connect_healthcheck/main.py
@@ -37,7 +37,7 @@ def main():

     server_class = HTTPServer
     health_object = health.Health(args.connect_url, args.connect_worker_id, args.unhealthy_states.split(","),
-                                  args.basic_auth)
+                                  args.basic_auth, args.failure_threshold_percentage, args.considered_containers.split(","))
     handler = partial(RequestHandler, health_object)
     httpd = server_class(("0.0.0.0", args.healthcheck_port), handler)
     logging.info("Healthcheck server started at: http://localhost:{}".format(args.healthcheck_port))
15 changes: 15 additions & 0 deletions kafka_connect_healthcheck/parser.py
@@ -52,6 +52,21 @@ def get_parser():
                         help="A comma separated lists of connector and task states to be marked as unhealthy. Default: FAILED."
                         )

+    parser.add_argument("--considered-containers",
+                        default=os.environ.get("HEALTHCHECK_CONSIDERED_CONTAINERS", "CONNECTOR,TASK").upper(),
+                        dest="considered_containers",
+                        nargs="?",
+                        help="A comma separated lists of container types to consider for failure calculations. Default: CONNECTOR,TASK."
+                        )
+
+    parser.add_argument("--failure-threshold-percentage",
+                        default=os.environ.get("HEALTHCHECK_FAILURE_THRESHOLD_PERCENTAGE", 0),
+                        dest="failure_threshold_percentage",
+                        type=int,
+                        nargs="?",
+                        help="A number between 1 and 100. If set, this is the percentage of connectors that must fail for the healthcheck to fail."
+                        )
+
     parser.add_argument("--basic-auth",
                         default=os.environ.get("HEALTHCHECK_BASIC_AUTH", ""),
                         dest="basic_auth",
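
The two new arguments take their defaults from environment variables. A minimal sketch of that pattern follows; `build_parser` and its `env` parameter are hypothetical and exist only so the example can run without touching the real environment, and `nargs="?"` is left out for brevity. It relies on documented `argparse` behaviour: a string default is converted with `type=` as if it had been given on the command line.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults come from an environment-like mapping."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--failure-threshold-percentage",
                        default=env.get("HEALTHCHECK_FAILURE_THRESHOLD_PERCENTAGE", 0),
                        dest="failure_threshold_percentage",
                        type=int)
    parser.add_argument("--considered-containers",
                        default=env.get("HEALTHCHECK_CONSIDERED_CONTAINERS", "CONNECTOR,TASK").upper(),
                        dest="considered_containers")
    return parser


# The environment supplies the string "25"; argparse converts it to an int.
env_args = build_parser({"HEALTHCHECK_FAILURE_THRESHOLD_PERCENTAGE": "25"}).parse_args([])
print(env_args.failure_threshold_percentage, env_args.considered_containers)

# A command-line value takes precedence over the environment default.
cli_args = build_parser({}).parse_args(["--considered-containers", "task"])
print(cli_args.considered_containers.split(","))
```

Only the environment default is upper-cased here; a value passed on the command line keeps its case, which is harmless in this commit because `Health` lower-cases each entry before comparing.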
4 changes: 3 additions & 1 deletion tests/data/expected/1-healthy.json
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
}
4 changes: 3 additions & 1 deletion tests/data/expected/10-unhealthy-multiple-connectors.json
@@ -26,5 +26,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.3,
"failure_threshold": 0.0,
"healthy": false
}
}
4 changes: 3 additions & 1 deletion tests/data/expected/11-healthy-no-connectors.json
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
}
2 changes: 2 additions & 0 deletions tests/data/expected/12-unhealthy-task-with-trace.json
@@ -12,5 +12,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.3333333333333333,
"failure_threshold": 0.0,
"healthy": false
}
2 changes: 2 additions & 0 deletions tests/data/expected/13-unhealthy-broker-connection.json
@@ -8,5 +8,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": false
}
2 changes: 2 additions & 0 deletions tests/data/expected/14-basic-auth.json
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
4 changes: 3 additions & 1 deletion tests/data/expected/2-unhealthy.json
@@ -18,5 +18,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 1.0,
"failure_threshold": 0.0,
"healthy": false
}
}
4 changes: 3 additions & 1 deletion tests/data/expected/4-healthy-worker-id-correct.json
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
}
4 changes: 3 additions & 1 deletion tests/data/expected/5-healthy-worker-id-unused.json
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
}
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
}
@@ -12,5 +12,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.5,
"failure_threshold": 0.0,
"healthy": false
}
}
4 changes: 3 additions & 1 deletion tests/data/expected/8-healthy-multiple-tasks.json
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
}
4 changes: 3 additions & 1 deletion tests/data/expected/9-healthy-multiple-connectors.json
@@ -3,5 +3,7 @@
"failure_states": [
"FAILED"
],
"failure_rate": 0.0,
"failure_threshold": 0.0,
"healthy": true
}
}
