Create a Citus healthcheck function/view for basic sanity checks #4276

Open
thanodnl opened this issue Oct 28, 2020 · 1 comment

@thanodnl
Member

As a distributed database, Citus relies on its catalog tables being consistent with the shards that exist on the workers. This is guaranteed by executing anything that modifies them in distributed transactions.

Every once in a while we run into clusters where the metadata is not consistent with the physical data on the workers. A health check function that lists everything that is not in a consistent state would be very useful during these debugging sessions: it would give a quick overview of every shard, or any other distributed object, that is inconsistent on the workers compared to the coordinator metadata.

A function like this could also be sampled by automation to build a view over time, which would make it much easier to determine when the cluster first entered an inconsistent state.

The exact output of a function/view for this needs to be discussed and can take multiple forms. Ideally it would be as versatile as pg_stat_activity is for tracking the state of backends. It could output all tracked objects per worker. This will turn into a big view. The benefit is that we can use SQL to quickly filter and analyse. Unfortunately, it might be a function that generates a lot of network traffic, not all of which is required for the final result. I don't think we can easily prune that down based on applied filters.

An example output could be:

| object type | fully qualified name | worker | status |
|---|---|---|---|
| shard | public.mytable_100200 | worker 1 | available |
| shard | public.mytable_100201 | worker 2 | available |
| shard | public.mytable_100202 | worker 3 | missing |
| type | public.mytype | worker 1 | consistent |
| type | public.mytype | worker 2 | inconsistent |
| type | public.mytype | worker 3 | missing |

Every tracked object type needs to provide a way to verify its state on every worker. We can start with shards and gradually extend coverage to every supported object.
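
A minimal sketch of what such a per-object check could look like for shards, written in Python purely as a model (an actual implementation would presumably live inside Citus). `HealthRow`, `check_shards`, and the dictionary inputs are hypothetical names for illustration; how the expected and actual shard lists are collected is left open here.

```python
from typing import Dict, List, NamedTuple, Set


class HealthRow(NamedTuple):
    """One row of the proposed health check output (hypothetical shape)."""
    object_type: str
    name: str
    worker: str
    status: str


def check_shards(
    expected: Dict[str, Set[str]],
    actual: Dict[str, Set[str]],
) -> List[HealthRow]:
    """Compare shards the coordinator metadata expects with shards found.

    expected: worker name -> shard names according to the coordinator.
    actual:   worker name -> shard names that really exist on that worker.
    """
    rows = []
    for worker in sorted(expected):
        present = actual.get(worker, set())
        for shard in sorted(expected[worker]):
            status = "available" if shard in present else "missing"
            rows.append(HealthRow("shard", shard, worker, status))
    return rows
```

Other object types (for example, types) could supply their own check function that returns rows of the same shape, so the results can be concatenated into one view.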

The view above could easily be grouped by worker and status to quickly gauge the consistency per worker. It could also be filtered to only show inconsistent items etc.
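
To illustrate the grouping and filtering, here is a small Python model over the example rows. In practice this would be a SQL `GROUP BY` / `WHERE` on the view; the function names and the choice of which statuses count as healthy are assumptions made for this sketch.

```python
from collections import Counter

# Example rows: (object type, fully qualified name, worker, status)
rows = [
    ("shard", "public.mytable_100200", "worker 1", "available"),
    ("shard", "public.mytable_100201", "worker 2", "available"),
    ("shard", "public.mytable_100202", "worker 3", "missing"),
    ("type", "public.mytype", "worker 1", "consistent"),
    ("type", "public.mytype", "worker 2", "inconsistent"),
    ("type", "public.mytype", "worker 3", "missing"),
]


def count_by_worker_and_status(rows):
    """Count rows per (worker, status) pair, like GROUP BY worker, status."""
    return Counter((worker, status) for _, _, worker, status in rows)


def only_problems(rows, healthy=("available", "consistent")):
    """Keep only rows whose status is not in the assumed healthy set."""
    return [row for row in rows if row[3] not in healthy]


counts = count_by_worker_and_status(rows)
problems = only_problems(rows)
```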

We would want to standardize on a limited set (enum?) of statuses for easy use in monitoring.
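
As a rough illustration of such a closed set, the statuses from the example output could be modeled as an enum. This is a Python sketch only; the name `ObjectStatus`, the exact members, and the notion of a "healthy" subset are assumptions that still need to be discussed.

```python
from enum import Enum


class ObjectStatus(str, Enum):
    """Closed set of statuses, taken from the example output (hypothetical)."""
    AVAILABLE = "available"
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"
    MISSING = "missing"


# Assumption: these two statuses mean nothing needs attention.
HEALTHY = frozenset({ObjectStatus.AVAILABLE, ObjectStatus.CONSISTENT})


def is_healthy(status: str) -> bool:
    """Parse a raw status string; unknown values raise ValueError."""
    return ObjectStatus(status) in HEALTHY
```

With a fixed set, monitoring can alert on any status outside the healthy subset, and an unexpected value fails loudly instead of being silently ignored.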

@thanodnl thanodnl changed the title Create a Citus healthcheck function for basic sanity checks Create a Citus healthcheck function/view for basic sanity checks Oct 28, 2020
@SaitTalhaNisanci
Contributor

It could make sense to also check whether every node in the cluster can connect to every other node, for each user/certificate connection. If not, this could be a problem for repartition joins etc., where a worker node needs to connect to another worker node.
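
A sketch of that all-pairs check, modeled in Python: the probe callback `can_connect(source, target, user)` is a hypothetical stand-in for whatever mechanism would actually attempt the connection from one node to another.

```python
from itertools import permutations
from typing import Callable, Iterable, List, Tuple


def find_broken_links(
    nodes: Iterable[str],
    users: Iterable[str],
    can_connect: Callable[[str, str, str], bool],
) -> List[Tuple[str, str, str]]:
    """Return every (source, target, user) combination that fails to connect.

    Each ordered pair of distinct nodes is probed, because a connection
    may work in one direction and not the other.
    """
    node_list = list(nodes)
    broken = []
    for user in users:
        for source, target in permutations(node_list, 2):
            if not can_connect(source, target, user):
                broken.append((source, target, user))
    return broken
```

Note that this probes n * (n - 1) links per user, so on large clusters it may be worth sampling or running it less often than the shard checks.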
