@v0idpwn v0idpwn commented Jul 1, 2025

Introduces `Supavisor.Health`, which provides a function that runs health checks.

Added two checks:

  • Acceptable ERPC latencies: fails if every `:erpc` request from a node to the other nodes either takes longer than 500ms or fails, i.e. the node has high latency to all other nodes. Skipped in clusters of 1 or 2 nodes.
  • Database reachable: fails if a simple query cannot be run against the database.
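
The ERPC latency rule above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the module name `HealthSketch`, the function names, the 5 second call timeout, and the choice of `:erlang.node/0` as the remote call are all assumptions. Only the 500ms threshold and the "skip in 1 or 2 node clusters" rule come from the description.

```elixir
defmodule HealthSketch do
  # Hypothetical module for illustration; not Supavisor's implementation.

  # Threshold taken from the PR description.
  @latency_threshold_ms 500

  # Times one :erpc round trip to `node`.
  # Returns {:ok, milliseconds} or {:error, reason}.
  def measure(node, timeout_ms \\ 5_000) do
    {micros, result} =
      :timer.tc(fn ->
        try do
          # Assumed probe: any cheap remote call would do.
          :erpc.call(node, :erlang, :node, [], timeout_ms)
          :ok
        catch
          kind, reason -> {:error, {kind, reason}}
        end
      end)

    case result do
      :ok -> {:ok, div(micros, 1000)}
      error -> error
    end
  end

  # `peers` are the other nodes. Fewer than 2 peers means a 1 or 2 node
  # cluster, where the check is skipped. `measure_fun` is injectable so the
  # decision logic can be exercised without a real cluster.
  def check_erpc(peers \\ Node.list(), measure_fun \\ &measure/1) do
    if length(peers) < 2 do
      :ok
    else
      results = Enum.map(peers, measure_fun)

      # Healthy as long as at least one peer answered within the threshold.
      if Enum.any?(results, &acceptable?/1) do
        :ok
      else
        {:error, :high_erpc_latency}
      end
    end
  end

  defp acceptable?({:ok, ms}), do: ms <= @latency_threshold_ms
  defp acceptable?(_), do: false
end
```

The sketch measures peers one at a time for simplicity; a real check would likely probe them concurrently so that one slow peer does not delay the whole health check.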

The health check endpoint calls this function and returns 503 if any health check is failing. If the condition persists for some time, the infrastructure should restart the instance.
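
As an illustration of how an endpoint might turn check results into a status code, here is a small sketch. The module name `HealthEndpointSketch`, the keyword-list shape of the checks, and the 200 status on success are assumptions; the PR description only states that 503 is returned when checks fail.

```elixir
defmodule HealthEndpointSketch do
  # Hypothetical module for illustration; not Supavisor's implementation.

  # `checks` is a keyword list of name => zero-arity function.
  # Each function is assumed to return :ok on success and anything else on failure.
  def run(checks) do
    failed = for {name, fun} <- checks, fun.() != :ok, do: name

    if failed == [], do: :ok, else: {:error, failed}
  end

  # 503 on failure comes from the PR description; 200 on success is assumed.
  def http_status(:ok), do: 200
  def http_status({:error, _failed}), do: 503
end
```

A controller could then call something like `HealthEndpointSketch.run(erpc: &check_erpc/0, database: &check_database/0)`, where both check functions are placeholders, and respond with `http_status/1` of the result.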

@v0idpwn v0idpwn requested a review from a team as a code owner July 1, 2025 20:13
@chasers

chasers commented Jul 1, 2025

Otherwise looks great

@abc3

abc3 commented Jul 2, 2025

🔥

@v0idpwn v0idpwn merged commit 89db1b2 into main Jul 2, 2025
19 of 22 checks passed
@v0idpwn v0idpwn deleted the feat/health-check branch July 2, 2025 15:27
@v0idpwn v0idpwn mentioned this pull request Jul 28, 2025
v0idpwn added a commit that referenced this pull request Jul 29, 2025
### Features
- **Authentication cleartext password support** - Added support for
cleartext password authentication method (#707)
- **Runtime-configurable connection retries** - Support for runtime
configuration of connection retries and infinite retries (#705)
- **Enhanced health checks** - Check database and eRPC capabilities
during health check operations (#691)
- **More consistency with Postgres on auth errors** - Improves the
errors surfaced by some client libraries (#711)

### Performance Improvements

- **Optimized ranch usage** - Supavisor now uses a constant number of
ranch instances for improved performance and resource management when
hosting a large number of pools (#706)

### Monitoring

- **New OS memory metrics** - gives a more accurate picture of memory
usage (#704)
- **Add a promex plugin for cluster metrics** - for tracking latency and
connection status (#690)
- **Client connection lifetime metrics** - adds a metric for how long
each client connection stays connected (#688)
- **Process monitoring** - Logs when processes have large heaps or long
message queues (#689)

### Bug Fixes

- **Client handler query cancellation** - Fixed handling of
`:cancel_query` when state is `:idle` (#692)

### Migration Notes

- Instances running a small number of pools may see an increase in
memory usage. This can be mitigated by changing the ranch shard count
or the acceptor count.
- If any of the newly used ports are already taken, the defaults may
need to be changed.
- Review monitoring dashboards and include the new metrics.

3 participants