receive: high cpu when upgrading from 0.12.2 with old data #2793
Comments
Could you take a CPU profile of this?
Please find 30s CPU profiles attached; they were taken at almost the same time:
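For reference, here is a minimal sketch of how such a 30s profile can be captured, assuming the pprof handlers are exposed on the receiver's HTTP/metrics port (10902 by default); the hostname is a placeholder:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// seconds=30 asks the pprof handler to sample CPU usage for 30 seconds.
	resp, err := http.Get("http://thanos-receive:10902/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Save the raw profile; it can be inspected later with `go tool pprof cpu.pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```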
Thanks, it looks like GC is the current bottleneck. Diff of 0.12.2 vs 0.13.0:
We can reproduce this on our side and are looking into it more (:
I think it's multiTSDB, so #2012
It's not particularly a v0.13.0 problem, but this code certainly does not help: https://github.com/prometheus/prometheus/blob/b788986717e1597452ca25e5219510bb787165c7/tsdb/db.go#L812 The TSDB we use forces GC multiple times. That might make sense for a single TSDB, but not really for multiple ones.
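To illustrate the concern (a sketch only, not Thanos code): if every TSDB instance forces a GC cycle after truncating its head, as the linked Prometheus code does, then a process hosting N tenant TSDBs pays for N forced GC cycles per truncation interval.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// tsdbInstance is a stand-in for one per-tenant TSDB.
type tsdbInstance struct{ tenant string }

// truncateHead mimics the per-TSDB head truncation that ends with a forced GC
// in the linked upstream code.
func (t *tsdbInstance) truncateHead() {
	// ... drop old in-memory chunks for this tenant ...
	runtime.GC() // forced once per instance: cheap for one TSDB, costly for many
}

func main() {
	tenants := []*tsdbInstance{{"tenant-a"}, {"tenant-b"}, {"tenant-c"}}
	start := time.Now()
	for _, t := range tenants {
		t.truncateHead()
	}
	fmt.Printf("%d forced GC cycles took %s\n", len(tenants), time.Since(start))
}
```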
It turns out this high CPU usage happens when I start thanos-receive v0.13.0 on a non-empty TSDB folder from v0.12.2 (which has many subfolders like 01EC0YX4APXB1MGA39DWFTS96C, etc.).
And then the log is full of:
Also #2823
This looks like something we experienced as well, though as I recall we wiped the dir back then. It's a migration issue from non-multiTSDB to multiTSDB. This can be easily reproduced in a unit test by porting old receive data. Let's fix it; it might be annoying for Receive users who want to upgrade to 0.13+.
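For anyone hitting this, a rough sketch of the layout mismatch (illustrative only, not the actual fix): before 0.13, block ULID directories and the WAL sat directly under --tsdb.path, while multiTSDB expects them under a per-tenant subdirectory. Detecting the old flat layout could look roughly like this:

```go
// Illustrative sketch only, not the actual Thanos fix: detect a pre-0.13
// "flat" receive data dir, where block ULIDs and the WAL live directly under
// --tsdb.path instead of under a per-tenant subdirectory.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/oklog/ulid/v2"
)

// looksLikeFlatLayout reports whether dataDir has a top-level WAL or block
// directories, which a 0.13 multiTSDB receiver would not expect to find there.
func looksLikeFlatLayout(dataDir string) (bool, error) {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return false, err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		if e.Name() == "wal" {
			return true, nil
		}
		// Block dirs are named by ULID, e.g. 01EC0YX4APXB1MGA39DWFTS96C.
		if _, err := ulid.Parse(e.Name()); err == nil {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// The path is a placeholder for whatever --tsdb.path points at.
	flat, err := looksLikeFlatLayout("/var/thanos/receive")
	if err != nil {
		log.Fatal(err)
	}
	if flat {
		fmt.Println("old single-TSDB layout detected; blocks/WAL need to move under a tenant directory")
	}
}
```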
I think for now you can let it run with the higher CPU long enough for the WAL from 0.12 to be gone, then wipe those dirs. In the meantime we will look into a fix. Still, help wanted (:
Those look like the migration didn't work correctly. The dirs look like block directories. Just to make sure though: you are not running a multi-tenant setup, right?
Yes, we're not running multi-tenancy, and these are just block dirs.
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Still, the migration path has to be fixed, in theory.
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Thanos, Prometheus and Golang version used:
thanos, version 0.13.0 (branch: HEAD, revision: adf6fac)
build user: circleci@b8cd18f8b553
build date: 20200622-10:04:50
go version: go1.14.2
Object Storage Provider:
GCP
What happened:
We're using 3 thanos-receive v0.12.2 pods running with --receive.replication-factor=3. At 3:30 I restarted one pod (green line) as v0.13.0, and its CPU usage doubled:
Memory usage is 5-10% higher, which is fine.
Here is another graph, from the node where the pod has been running:
What you expected to happen:
A statistically negligible change in resource usage between v0.12.2 and v0.13.0 for receive, as with other Thanos components.
How to reproduce it (as minimally and precisely as possible):
We run receive with the following args:
Full logs to relevant components:
These are new events that were not present in v0.12.2. They are also written every ~15s, compared to every ~15min in Prometheus v2.19.0 with almost the same TSDB settings:
Anything else we need to know: