Checkpoint replaying very very slow #5569
Comments
I have the same issue on v0.26.0 and v0.27.0 (and possibly v0.24.0; I lost the logs). I have a large dataset that averages 48,130 samples per second, and I need to keep 7 days of it for fast queries.
Yeah, this is increasingly becoming an issue for us as well: replays on some replicas for some tenants can take tens of minutes. I guess it has been like this for at least a few versions, so my assumption is that it is not related to any recent change, if it is even related to Thanos changes at all and not just to users handling more data. Either way, this requires more investigation and ideas on how to alleviate it. I was also looking at the experimental snapshot-on-shutdown feature in TSDB (https://ganeshvernekar.com/blog/prometheus-tsdb-snapshot-on-shutdown/) and trying it in receive, but I first need to take a better look at this feature.
Memory snapshotting seems to be a TSDB option. Maybe it's just a matter of setting this flag when starting receiver TSDBs?
prometheus/prometheus#10973 This commit has already been merged into the Prometheus main branch. I think it is worth trying to see whether it helps.
I will close this issue since the improvements from upstream Prometheus are already included. |
versions
OS : RHEL8
Architecture
hashing (2 routers and 3 receivers with replica=3)
Command line:
/usr/bin/thanos receive --http-address 0.0.0.0:19904 --grpc-address 0.0.0.0:19903 --remote-write.address 0.0.0.0:19291 --label=thanos_replica="p1thanosprod02 " --tsdb.path=/projet/data/thanos/receive --tsdb.retention=1d --objstore.config-file=/etc/thanos/objstore.yaml --receive.default-tenant-id=prod
Object Storage Provider: S3
What happened:
After upgrading to 0.27, each receiver restart takes more than 5 minutes per tenant.
What you expected to happen:
Fast replaying
Full logs to relevant components:
This is due to checkpoint replay (see checkpoint_replay_duration=5m5.947265781s)