[Thanos Receive] --receive.replication-factor=2 leads to remote write unavailability #7274

pvlltvk · 2024-04-11T11:13:17Z

Hi everyone!
I have a question about the --receive.replication-factor parameter in Thanos Receive. I deployed a setup with separate Routing and Ingesting receivers, with 4 replicas of each.
Config of the Routing receivers:

  - args:
    - receive
    - --log.level=debug
    - --log.format=json
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --label=receive_replica="$(NAME)"
    - --label=receive="true"
    - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
    - --receive.replication-factor=2
    - --receive.hashrings-algorithm=ketama

Config of the Ingesting receivers:

  - args:
    - receive
    - --log.level=info
    - --log.format=json
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --tsdb.path=/var/thanos/receive
    - --label=receive_replica="$(NAME)"
    - --label=receive="true"
    - --tsdb.retention=8h
    - --receive.local-endpoint=$(NAME).test-receive-receive-headless.$(NAMESPACE).svc.cluster.local:10901
    - --tsdb.min-block-duration=2h
    - --tsdb.max-block-duration=2h
    - --tsdb.out-of-order.time-window=2h

hashrings.json:

    [
      {
        "endpoints": [
          {
            "address": "test-receive-receive-0.test-receive-receive-headless.sre.svc.cluster.local:10901"
          },
          {
            "address": "test-receive-receive-1.test-receive-receive-headless.sre.svc.cluster.local:10901"
          },
          {
            "address": "test-receive-receive-2.test-receive-receive-headless.sre.svc.cluster.local:10901"
          },
          {
            "address": "test-receive-receive-3.test-receive-receive-headless.sre.svc.cluster.local:10901"
          }
        ],
        "hashring": "test1",
        "tenants": [
          "test1"
        ]
      },
      {
        "endpoints": [
          {
            "address": "test-receive-receive-0.test-receive-receive-headless.sre.svc.cluster.local:10901"
          },
          {
            "address": "test-receive-receive-1.test-receive-receive-headless.sre.svc.cluster.local:10901"
          },
          {
            "address": "test-receive-receive-2.test-receive-receive-headless.sre.svc.cluster.local:10901"
          },
          {
            "address": "test-receive-receive-3.test-receive-receive-headless.sre.svc.cluster.local:10901"
          }
        ],
        "hashring": "test2",
        "tenants": [
          "test2"
        ]
      }
    ]

When I set --receive.replication-factor=2, every time I delete any Ingesting receiver pod I get a lot of "backing off forward request for endpoint" errors, and the rate of successful remote writes to this Thanos Receive instance drops to 0 until the pod is ready again.
This behaviour confuses me a bit because as far as I understand from the docs:

If any time-series in a write request received by a Thanos receiver is not successfully written to at least (REPLICATION_FACTOR + 1)/2 nodes, the receiver responds with an error
a Routing receiver should still write requests successfully, since the 3 other Ingesting receivers are still alive.
Could someone clarify this for me?

I saw a discussion on the same topic, but it seems there is no answer yet:
#5108


MichaHoffmann (Maintainer) · 2024-04-11T11:23:45Z

Hey,

Quorum in the code is currently computed here: https://github.com/thanos-io/thanos/blob/8227108dba098a6cf4aa7c00c13ed1ae42c2d088/pkg/receive/handler.go#L990C1-L994C1, as (rf/2 + 1), which is 2 for rf=2, so rf=2 cannot tolerate one node going away.
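
To make the arithmetic concrete, here is a minimal Python sketch of the two formulas mentioned in this thread. It is not the Thanos code itself: the function names are made up for illustration, and reading the docs formula (REPLICATION_FACTOR + 1)/2 as integer division is an assumption.

```python
def quorum_in_code(rf: int) -> int:
    """Quorum as described in the reply above: rf/2 + 1 (integer division)."""
    return rf // 2 + 1


def quorum_in_docs(rf: int) -> int:
    """Quorum as quoted from the docs: (REPLICATION_FACTOR + 1)/2.

    Assumption: the division is rounded down, as integer division would do.
    """
    return (rf + 1) // 2


def tolerated_failures(rf: int) -> int:
    """Replicas of one series that may fail while its write still reaches quorum."""
    return rf - quorum_in_code(rf)


if __name__ == "__main__":
    print("rf  docs  code  tolerated_failures")
    for rf in (1, 2, 3, 5):
        print(rf, quorum_in_docs(rf), quorum_in_code(rf), tolerated_failures(rf))
```

Under that reading, the two formulas agree for odd replication factors and differ for even ones, which would explain why rf=2 behaves like rf=1 in terms of tolerated failures.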

6 replies

pvlltvk Apr 11, 2024
Author

@MichaHoffmann
Thank you for the prompt response!
However, I still don't get it, sorry. So the quorum for rf=2 is also 2, but I still have 3 pods available. Could you clarify this?

MichaHoffmann Apr 11, 2024
Maintainer

Ah, if you send a write, it gets chopped up and fanned out to the other nodes if it's big enough (for every series, we hash the series and route it to the node that owns that hash according to the hashring). So any request that contains enough series will probably touch the node that is going away, and that particular series will fail to reach quorum.
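
The effect described above can be illustrated with a small simulation. This is only a sketch under loud assumptions: placement uses a plain modulo ring over a SHA-256 hash rather than the real ketama algorithm, replicas go to consecutive nodes, and all names (`replicas`, `request_succeeds`, the demo series) are hypothetical.

```python
import hashlib


def quorum(rf: int) -> int:
    # Quorum formula quoted earlier in the thread: rf/2 + 1.
    return rf // 2 + 1


def replicas(series: str, nodes: int, rf: int) -> list[int]:
    # Simplified placement (assumption): hash the series, pick a first node by
    # modulo, then take rf consecutive nodes. Thanos itself uses ketama.
    digest = hashlib.sha256(series.encode()).digest()
    first = int.from_bytes(digest[:8], "big") % nodes
    return [(first + i) % nodes for i in range(rf)]


def request_succeeds(series_names: list[str], nodes: int, rf: int, down: set[int]) -> bool:
    # The whole request fails as soon as one series cannot reach quorum.
    for name in series_names:
        written = sum(1 for node in replicas(name, nodes, rf) if node not in down)
        if written < quorum(rf):
            return False
    return True


# A hypothetical remote-write request carrying 100 distinct series.
request = [f'demo_metric_{i}{{job="demo"}}' for i in range(100)]

if __name__ == "__main__":
    print("rf=2, all nodes up:  ", request_succeeds(request, nodes=4, rf=2, down=set()))
    print("rf=2, node 3 down:   ", request_succeeds(request, nodes=4, rf=2, down={3}))
    print("rf=3, node 3 down:   ", request_succeeds(request, nodes=4, rf=3, down={3}))
```

In this model, with 4 nodes and rf=2, roughly half of all series have a replica on any given node, so a request with many series almost certainly contains at least one series that cannot reach a quorum of 2 while that node is down. With rf=3 every series loses at most one of its three replicas and still reaches quorum.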

Answer selected by pvlltvk

pvlltvk Apr 11, 2024
Author

Thanks for the clarification!

pvlltvk Apr 11, 2024
Author

One last question regarding the hashring, if you don't mind.
I get the idea and how it is implemented, but having 3 replicas of each series (if we increase rf to 3) seems too expensive to me.
I think it would be nice if we could configure the hashring and the write quorum separately: 3 nodes in a group for each hashring section, quorum = 2, but only 2 replicas. Does that make sense in the current hashring implementation? What do you think?

MichaHoffmann Apr 11, 2024
Maintainer

Personally I think that with RF 2, we should tolerate the loss of one node. @verejoel talked about that at ThanosCon too.

evilr00t Jun 6, 2024

We use RF=2, and so far it has made our deployment much more stable. We use spot instances for the EC2 nodes in our EKS cluster, and during disruptions our ruler pods no longer go crazy the way they did with RF=1.

With RF=2 it looks like there is replication, but it does not guarantee that all metrics / streams are replicated. Only some of the metrics are available if a node/pod goes down, which can lead to confusion within teams (what is going on with my service/data?), and it might also trigger the Ruler and its rules... To guarantee that your setup can survive a node/pod going down without losing any metrics, you'll have to use RF=3.

I've also noticed that thanos-receive-controller should be configured carefully: options like --allow-dynamic-scaling or --allow-only-ready-replicas might stop the distributor (during spot interruptions) if you're using Ketama and the number of nodes/pods is lower than your RF.
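
As a rough guard against that last situation, one could check a generated hashrings file before relying on it. The sketch below assumes the file has the same shape as the hashrings.json posted at the top of this thread; the function name and the shortened endpoint addresses are made up for illustration, and this is not part of Thanos or thanos-receive-controller.

```python
import json


def undersized_hashrings(hashrings_json: str, replication_factor: int) -> list[str]:
    """Return the names of hashrings that have fewer endpoints than the replication factor."""
    rings = json.loads(hashrings_json)
    return [
        ring.get("hashring", "<unnamed>")
        for ring in rings
        if len(ring.get("endpoints", [])) < replication_factor
    ]


# Hypothetical file contents: "test2" has lost an endpoint, e.g. during a spot interruption.
example = json.dumps([
    {
        "hashring": "test1",
        "tenants": ["test1"],
        "endpoints": [{"address": "receive-0:10901"}, {"address": "receive-1:10901"}],
    },
    {
        "hashring": "test2",
        "tenants": ["test2"],
        "endpoints": [{"address": "receive-0:10901"}],
    },
])

if __name__ == "__main__":
    print(undersized_hashrings(example, replication_factor=2))  # prints ['test2']
```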

Screenshot from ThanosCon: (image not preserved)
