How Can Cortex Handle Data Gaps in Network Partition Scenarios in an HA Cluster of Prometheuses? #5633

aleskxyz · 2023-11-06T09:08:29Z

Hi,

Cortex elects a leader from the HA cluster of Prometheus replicas and accepts samples only from it. Imagine a network partition in which each Prometheus can scrape only some of the instances. With the current Cortex design, only samples from the elected Prometheus are written to long-term storage and samples from the other Prometheuses are discarded, which leaves gaps for series that only the standby Prometheus could scrape.

Does Cortex have a solution for this, or can it handle this situation the way Thanos does, by deduplicating data at query time?

Thanks.

alanprot · 2023-11-06T16:58:22Z

Maintainer

Why do you have samples that are scraped only by the standby Prometheus? I think the idea is to have two Prometheus instances scraping exactly the same metrics. The reason we accept only one replica is that the TSDB will reject duplicate samples (samples with the same timestamp for the same series but different values).

1 reply

aleskxyz Nov 6, 2023
Author

Thanks for your reply!
As I mentioned above, we may see this inconsistency during a network partition.
Imagine we have two Prometheus instances in two different racks, both scraping all instances.
When the connection between the two racks is disrupted, the active Prometheus cannot scrape targets in the other rack, but that rack's local Prometheus is still working.
Thanks
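
To make the scenario concrete, here is a minimal toy model in Python. It is not Cortex code: the replica names, target names, and the `accepted_targets` helper are all hypothetical, and it only assumes that samples from the elected replica are accepted while the other replica's samples are dropped.

```python
# Toy model of the rack-partition scenario; not Cortex code.
# All names below (replicas, targets, helper) are hypothetical.

def accepted_targets(elected, reachable):
    """Targets whose samples reach storage when only the elected
    replica's samples are accepted."""
    return set(reachable[elected])

# During the partition, each Prometheus can only scrape its own rack.
reachable = {
    "prom-rack-a": {"app-a1", "app-a2"},
    "prom-rack-b": {"app-b1", "app-b2"},
}
all_targets = set().union(*reachable.values())

# Write-time election: only the elected replica's view is stored.
stored = accepted_targets("prom-rack-a", reachable)
gaps = all_targets - stored

# Query-time merge (the Thanos-style approach asked about): the union
# of both replicas' views covers every target.
merged = reachable["prom-rack-a"] | reachable["prom-rack-b"]

print(sorted(gaps))           # ['app-b1', 'app-b2']
print(merged == all_targets)  # True
```

In this model the gap is exactly the set of targets that only the standby replica can reach.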

alanprot · 2023-11-06T19:08:20Z

Maintainer

So the problem is the "failover time"? The default value is 15 seconds, and it is configurable: ha_tracker_update_timeout
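
As a rough illustration of timeout-based failover, the sketch below models a tracker that keeps accepting the elected replica and switches to another replica only after the elected one has been silent for a configured period. The `ToyHATracker` class is hypothetical and simplified, not Cortex's implementation. The 15-second value is taken from the reply above, and the sketch makes no claim about how Cortex's own timeout options map onto it.

```python
# Simplified, hypothetical model of timeout-based replica failover.
# This is a sketch for illustration, not Cortex's HA tracker.

class ToyHATracker:
    def __init__(self, failover_timeout_s):
        self.failover_timeout_s = failover_timeout_s
        self.elected = None     # replica currently accepted
        self.last_seen = None   # time of the last accepted sample

    def accept(self, replica, now_s):
        """Return True if a sample from `replica` at `now_s` is accepted."""
        if self.elected is None:
            # The first replica to send samples becomes the elected one.
            self.elected, self.last_seen = replica, now_s
            return True
        if replica == self.elected:
            self.last_seen = now_s
            return True
        # Another replica takes over only after the elected replica
        # has been silent for at least the failover timeout.
        if now_s - self.last_seen >= self.failover_timeout_s:
            self.elected, self.last_seen = replica, now_s
            return True
        return False


tracker = ToyHATracker(failover_timeout_s=15)
results = [
    tracker.accept("a", 0),   # True: "a" is elected
    tracker.accept("b", 5),   # False: "a" was seen 5s ago
    tracker.accept("a", 10),  # True: refreshes last_seen
    tracker.accept("b", 20),  # False: only 10s of silence from "a"
    tracker.accept("b", 30),  # True: 20s of silence, "b" takes over
]
print(results)  # [True, False, True, False, True]
```

In this toy model, an elected replica that keeps sending samples, even for only a subset of its targets, never times out, so no failover happens.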

0 replies
