Thanos receive component recovered from panic #6047
We are seeing the same issue on Thanos v0.29.0. Thanos Receive seems to restart out of nowhere. When the TSDB is loading, we also get the same stack trace. |
Looks like the panic is happening here: https://github.com/thanos-io/thanos/blob/main/pkg/store/tsdb.go#L104. I wonder if the |
Hi @fpetkovski |
As @fpetkovski said, according to the stack trace it seems like |
@matej-g Thanks for taking care of this. |
I've raised #6067 which I believe should fix this.
Hi @fpetkovski
Prometheus version: v2.40.5
The Prometheus external labels are the same in every instance.
k get po -n thanos
NAME READY STATUS RESTARTS AGE
thanos-compact-0 1/1 Running 0 100m
thanos-query-7bd9bfb7cf-c5s2l 1/1 Running 0 102m
thanos-query-7bd9bfb7cf-svjlp 1/1 Running 0 97m
thanos-query-7bd9bfb7cf-z284k 1/1 Running 0 102m
thanos-receive-0 1/1 Running 0 90m
thanos-receive-1 0/1 Running 6 (4m51s ago) 96m
thanos-receive-2 1/1 Running 0 102m
thanos-rule-0 1/1 Running 0 101m
thanos-rule-1 1/1 Running 0 96m
thanos-store-0 1/1 Running 0 99m
thanos-store-1 1/1 Running 0 96m

apiVersion: v1
kind: ConfigMap
metadata:
name: thanos-receive-hashrings
namespace: thanos
data:
thanos-receive-hashrings.json: |
[
{
"hashring": "soft-tenants",
"endpoints":
[
"thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901",
"thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901",
"thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901"
]
}
]
|
I don't think out-of-order samples should lead to restarts. Can you post the logs of:
|
Hi @fpetkovski, I ran the command of
|
Could you also post the output of the describe command? |
Sure. However, I had already increased the CPU/memory requests earlier. After the previous error I updated the CPU limit from 8 to 16, as well as the memory limit from 48 to 56.
|
@fpetkovski Not OP, but we are facing the same problem (the panic due to a nil store LabelSet), and it doesn't go away even with the changes in #6067. This is happening in one particular environment (which also happens to have the data folder in a PV, like the original poster). We are on 0.30.2 with the changes in #6067 cherry-picked. I'm trying to isolate a minimal reproducible example, but wanted to check with you in case you can think of a reason why this bug would still occur. |
You might be hitting #6190 |
@fpetkovski The problem seems to have been fixed with the changes from #6203. FWIW, I was able to reliably reproduce the panic (and confirm the fix with #6203) by simulating a slow filesystem using FUSE (nothing more than the FS shown in this blog post https://www.stavros.io/posts/python-fuse-filesystem/ with some time.sleep(1) added to key functions; a sketch follows below). My initial experiments were with some data from the original Ingestor pod that had crashed, but I just confirmed that this panic occurs even when starting out with an empty data folder (on the slow FS). Just FYI, in case it helps in testing/debugging other problems.
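A minimal sketch of such a slow passthrough filesystem, assuming the fusepy library (`pip install fusepy`) used in the blog post above. The `SlowFS` name, the one-second `DELAY`, and the choice of which operations to delay are illustrative assumptions, not the exact script I used:

```python
#!/usr/bin/env python3
"""slowfs.py: FUSE passthrough filesystem with artificial latency.

Mirrors a backing directory at a mount point, sleeping DELAY seconds
in key I/O operations to simulate a slow disk (hypothetical sketch).
"""
import os
import sys
import time

from fuse import FUSE, Operations  # pip install fusepy

DELAY = 1  # seconds of latency per delayed operation (assumption; tune as needed)


class SlowFS(Operations):
    def __init__(self, root):
        self.root = root

    def _full(self, path):
        # Map a path inside the mount onto the backing directory.
        return os.path.join(self.root, path.lstrip("/"))

    def getattr(self, path, fh=None):
        st = os.lstat(self._full(path))
        return {key: getattr(st, key) for key in (
            "st_atime", "st_ctime", "st_gid", "st_mode",
            "st_mtime", "st_nlink", "st_size", "st_uid")}

    def readdir(self, path, fh):
        yield "."
        yield ".."
        yield from os.listdir(self._full(path))

    # Delayed operations: every open/read/write/fsync pays the penalty.
    def open(self, path, flags):
        time.sleep(DELAY)
        return os.open(self._full(path), flags)

    def create(self, path, mode, fi=None):
        time.sleep(DELAY)
        return os.open(self._full(path), os.O_WRONLY | os.O_CREAT, mode)

    def read(self, path, size, offset, fh):
        time.sleep(DELAY)
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, size)

    def write(self, path, data, offset, fh):
        time.sleep(DELAY)
        os.lseek(fh, offset, os.SEEK_SET)
        return os.write(fh, data)

    def fsync(self, path, datasync, fh):
        time.sleep(DELAY)
        return os.fsync(fh)

    def truncate(self, path, length, fh=None):
        with open(self._full(path), "r+b") as f:
            f.truncate(length)

    def release(self, path, fh):
        return os.close(fh)

    # Directory/file management so the TSDB can create and rotate blocks.
    def mkdir(self, path, mode):
        return os.mkdir(self._full(path), mode)

    def rmdir(self, path):
        return os.rmdir(self._full(path))

    def unlink(self, path):
        return os.unlink(self._full(path))

    def rename(self, old, new):
        return os.rename(self._full(old), self._full(new))


if __name__ == "__main__":
    # usage: python3 slowfs.py <backing-dir> <mount-point>
    FUSE(SlowFS(sys.argv[1]), sys.argv[2], nothreads=True, foreground=True)
```

Mount it with e.g. `python3 slowfs.py /data/backing /data/receive` (hypothetical paths) and point Receive's --tsdb.path at the mount point; the added latency stretches the TSDB-loading window enough to hit the panic reliably. |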
Hi Thanos developers
I got an error message from a pod of the receive component. We adopted the receive component to query Prometheus metrics across multiple Kubernetes clusters; so far we send metrics from about 30 clusters' Prometheus instances to Thanos Receive. At first, all three receive replicas restarted several times, which I guessed was due to hitting resource limits, so I scaled up the CPU and memory requests, and for a while things were back to fine (no further restarts). But after a few hours, one of the receive pods got the following error. Has anyone met the same issue? Thanks.
Prometheus version: v2.40.5
Thanos version: thanosio/thanos:v0.30.1
Receive component arguments: