Store Gateway: Unexpected postings length #6402
Comments
Closing, as we found that the root cause was #6303. We later deployed another image with an older Thanos version that doesn't include that PR. When the store gateway then tried to read data from the cache, it failed because it could not understand the compression scheme used there.
We think the same issue might happen during the rollout of this change as well: if the cached data is encoded with streamed snappy, an older store gateway version will fail to decode it.
One idea I have so far to improve the rollout: use different cache keys for snappy and streamed snappy encoding. During the rollout, store gateways using streamed snappy encoding will get cache misses and fetch data from S3, while older store gateway versions can still use the existing cache keys. Eventually all cache keys will use the new format as old entries expire via the cache TTL. The downside is that it might consume more items/memory in our cache, but it is the most seamless approach.
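A minimal sketch of the separate-cache-key idea, assuming a hypothetical key layout. The `Encoding` type, the `P:` prefix, and `postingsCacheKey` are illustrative names, not the actual Thanos cache key format. The legacy encoding keeps its unsuffixed key so that older store gateways keep hitting their existing entries, while the new encoding gets its own key namespace.

```go
package main

import "fmt"

// Encoding identifies how a cached postings payload was compressed.
// These names are illustrative, not Thanos identifiers.
type Encoding string

const (
	EncodingSnappy         Encoding = "snappy"
	EncodingStreamedSnappy Encoding = "streamed-snappy"
)

// postingsCacheKey builds a cache key for one label pair in one block.
// The key layout is hypothetical. The legacy encoding keeps the plain key,
// and any other encoding is appended as a suffix, so the two versions never
// read each other's entries.
func postingsCacheKey(blockID, name, value string, enc Encoding) string {
	base := fmt.Sprintf("P:%s:%s=%s", blockID, name, value)
	if enc == EncodingSnappy {
		return base
	}
	return base + ":" + string(enc)
}

func main() {
	block := "01H14YNTKA70WYGBBZ7ZD1SFQM"
	fmt.Println(postingsCacheKey(block, "job", "api", EncodingSnappy))
	fmt.Println(postingsCacheKey(block, "job", "api", EncodingStreamedSnappy))
}
```

With this layout, a new store gateway simply misses on its first lookup and repopulates the cache under the suffixed key, and the old entries age out through the TTL.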
We had a similar case when adding native histograms to the query frontend. Maybe errors from cache retrieval should lead to invalidating the key?
In this case, if the cached content is encoded using streamed snappy, it is valid and we shouldn't invalidate it. An old version of the store gateway could instead ignore the decoding error and fetch the data from S3 without setting any cache entries. But I feel that might have some edge cases as well, so using different cache keys should be easier. WDYT? Btw, I am not aware of a way to invalidate a key in a remote cache. Do you mean we set the key to a predefined value that represents invalid data?
Hm I see, so an old version of store-gw can't read new cached postings right? |
@fpetkovski Yeah... So it might be a problem during rollout, and we currently return an error if decoding fails.
Closing this one, as I think we can fix this by using a different cache key.
Thanos, Prometheus and Golang version used:
Latest version of Thanos
What happened:
This error was logged by the Cortex store gateway component. The Cortex store gateway is basically a wrapper around the Thanos store gateway.
The error message from the Thanos part was:
fetch series for block 01H14YNTKA70WYGBBZ7ZD1SFQM: expanded matching posting: get postings: decode postings: unexpected postings length, should be 6741151740 bytes for 1685287935 postings, got 2431 bytes
Many blocks threw almost the same error when decoding postings fetched from the cache, with the same expected number of postings and length.
I checked the code and found that the error comes from here: https://github.com/thanos-io/thanos/blob/main/pkg/store/bucket.go#L2429
The weird thing is that 1685287935 postings is a strange number, and none of our blocks have that many series. The number of postings is actually the first uint32 of the data fetched from the cache, so I think something might have gone wrong in the caching layer. We are using memcached. This error happened once, and we have not been able to reproduce it since.
What you expected to happen:
No such issue.