store (s3, gcs): invalid memory address or nil pointer dereference #335
A new development in my case is the following:
I deleted the block in question from S3 and the process continued and eventually finished successfully. However, after the compactor finished, I tried a particularly heavy dashboard again and it failed with the same error. I was finally able to render the dashboard after 4 restarts (each hitting the same error) and 4 refreshes. I guess there is some kind of cache that helped me display it. |
Thanks for these details! This sounds like we are hitting S3 not being strongly consistent. We (the store or the compactor) think that the block is uploaded, but in fact it is only partially visible in the bucket (e.g. a partial index file). That's my theory so far, not confirmed yet. I have not experienced any of this with GCS. |
Hey @Bplotka. I work with @mihailgmihaylov and I'm in the loop with our Thanos experiments. I just wanted to mention the following regarding S3 consistency, as per the official AWS documentation:
Does the Thanos sidecar do overwrites, or does it always create new objects? How about the compactor? |
So yeah, I was assessing this some time ago: #298. Basically, all our objects are 100% immutable. Even the compactor, when compacting/downsampling etc., adds new blocks and removes the old ones. |
Oops, I just found out that the shipper (the sidecar logic for uploading blocks) actually does an existence check before upload... and that puts us into the caveat territory, I guess :/ This might mean that the meta.json upload is not consistent. It does not immediately explain the index reader nil pointers, but we might indeed be in a weird situation. Let me think about how to mitigate this and/or whether there is anything else. |
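For context, the problematic pattern is roughly "check whether the object already exists, then upload it if it does not", which is exactly the read-after-write caveat AWS documents for a HEAD/GET issued before the first PUT of a key. A minimal sketch of that pattern, using a deliberately simplified, hypothetical bucket interface rather than the actual Thanos objstore API:

```go
package shipper

import (
	"context"
	"fmt"
)

// Bucket is a hypothetical, minimal object-store client used only to
// illustrate the access pattern; the real Thanos objstore interface differs.
type Bucket interface {
	Exists(ctx context.Context, name string) (bool, error)
	Upload(ctx context.Context, name string, data []byte) error
}

// shipMeta uploads a block's meta.json only if it is not already present.
// On S3, issuing a HEAD/GET for a key before its first PUT removes the
// read-after-write guarantee for that key, so readers may later see the
// object as missing (or partially visible) even after a successful upload.
func shipMeta(ctx context.Context, bkt Bucket, blockID string, meta []byte) error {
	name := blockID + "/meta.json"
	ok, err := bkt.Exists(ctx, name) // the check-before-upload that hits the caveat
	if err != nil {
		return fmt.Errorf("stat %s: %w", name, err)
	}
	if ok {
		return nil // assume it was already shipped
	}
	return bkt.Upload(ctx, name, meta)
}
```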
We can replace this. But my question related to the caveat is the following: what about |
My guess would be that it's consistent, as the full path (actually the object's key) is different. |
Ok, will try to fix this finding for now. |
Another store segfault found by
Looks really similar. |
Additional info: The above exception happened on GCS. |
Hello, I've just hit the same issue (I think), my env is:
This seems to always be preceded by messages complaining about missing blocks:
|
Interesting. This might really be connected to the issue fixed here: #389. We forgot to properly check errors on upload of the chunks and index file, so blocks were sometimes corrupted without notice (for example, on a network blip). This might be it. ): |
We can follow up on the store handling for such blocks, but we definitely need better handling of corrupted, partial uploads. Related issue: #377. IMO, to be unblocked here, please remove the broken blocks (without missing any bits) and run the newest code on the sidecars. I am looking now at how to ensure the store gateway handles these gracefully. |
That would explain why it started failing a while after I upgraded from rc1 to thanos:master-2018-06-15-5b66b72. I will attempt to repair it later today. Thanks for the hint. |
Why? The issue was there all along, until today's fix ): |
It seems to be resolved for now, but I had to delete a few blocks from the store. I've upgraded all deployments to the newest master tag as you suggested. If it happens again I'll let you know. (This project is great btw, it works much better than using the remote storage adapter.) |
awesome! Closing this for now, feel free to ping me here if you can repro |
How can one identify or find these "broken blocks", and what is the sign that a block is broken and should be deleted? We saw this today and, based on the previous messages, our search turned up this:
I assume
Update: apparently that meta.json does exist. Not sure why Thanos is unable to find it, since GCS has strong consistency guarantees? |
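To make "find the broken blocks" a little more concrete: one heuristic is to list the objects under each block ID in the bucket and flag blocks that lack the pieces a complete upload should have (meta.json, index, at least one chunks/ segment). A rough sketch under those assumptions, again with a made-up lister interface rather than the real Thanos objstore API, and not an official repair procedure:

```go
package inspect

import (
	"context"
	"strings"
)

// Lister is a hypothetical client that can enumerate object names in a bucket.
type Lister interface {
	List(ctx context.Context, prefix string) ([]string, error)
}

// brokenBlocks returns block IDs that are missing meta.json, index, or any
// chunk segment. Flagged blocks are candidates for closer inspection (and
// possibly deletion), not automatically safe to remove.
func brokenBlocks(ctx context.Context, l Lister) ([]string, error) {
	objects, err := l.List(ctx, "")
	if err != nil {
		return nil, err
	}
	type parts struct{ meta, index, chunks bool }
	blocks := map[string]*parts{}
	for _, name := range objects {
		segs := strings.SplitN(name, "/", 2)
		if len(segs) != 2 {
			continue // not inside a block directory
		}
		p := blocks[segs[0]]
		if p == nil {
			p = &parts{}
			blocks[segs[0]] = p
		}
		switch {
		case segs[1] == "meta.json":
			p.meta = true
		case segs[1] == "index":
			p.index = true
		case strings.HasPrefix(segs[1], "chunks/"):
			p.chunks = true
		}
	}
	var broken []string
	for id, p := range blocks {
		if !p.meta || !p.index || !p.chunks {
			broken = append(broken, id)
		}
	}
	return broken, nil
}
```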
The whole |
Good question. Not sure. In fact, I can repro this panic only at the very beginning, right after a restart of the store gateway, when I query it immediately from thanos-query. I don't think it is because of a partial upload, so reopening for further investigation. After the restart it works perfectly fine for the same query. I think there is some race condition between the store gateway being ready and thanos-query using this store. |
To follow up on @tonglil's comment, we deployed the and now the
We did not delete the directory yet, but instead ran the I'm going to remove the "bad" directory today and see if that helps. To give you a bit more context, we're running our Prometheus with 24h retention, and most of our Grafana queries hit the |
|
I finally removed the directory and it helped with the Thanks for providing a very useful tool! |
I've been hitting this issue time and time again. I can reliably reproduce it with a new instance and S3 (DO Spaces) as the storage backend. It takes a few days for the issue to appear, but this time there is no error message about a broken block. It just crashes when a query for a time frame outside of the Prometheus 24h retention hits the store. Exception:
|
@bwplotka I was getting panics pretty consistently with But whatever you did, it looks like it improved things greatly; I haven't been able to reproduce the issue with |
I've got another error:
|
@xjewer Hmm, your issue is in a different place, which might help to debug this, maybe... What object storage are you using? @BenoitKnecht can you check the logs? The request should now fail with sufficient info about what went wrong for those queries that were triggering the panic. |
@bwplotka I did check the logs, but that's the thing: it's not even failing anymore, the request succeeds, so there's nothing in the logs. Incidentally, Thanos store is now using way less RAM than on I'm going to try and downgrade to |
No, nothing changed, so it must be lower traffic on your side (: |
OK, so I'm not able to reproduce the issue on |
me as well, I don't see the initial lazyPostings error so far |
I can confirm: |
That seems to have resolved the issue for me as well. It's been up for 3 days so far with no crashes. |
@bwplotka can those changes be merged into master so I can build against the latest version? It seems to have fixed the issue for now. We can reopen this if it occurs again. |
Debugging #335. Potentially broken preloadPosting is missing some postings because of bad partitioning. Unit tests are there, but we might not cover all edge cases. If we still see panics with this check, it can only mean races. However, -race is not showing anything interesting, even with extended tests. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
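For readers following along: the suspicion in that commit is that the range partitioner used when preloading postings can, in some edge case, produce merged fetch ranges that do not fully cover every requested posting range, leaving a posting list unread (and therefore nil when dereferenced). Below is a minimal sketch of that kind of partitioning plus the defensive coverage check the commit describes; the real Thanos partitioner differs in its details:

```go
package partition

// part is a contiguous byte range [start, end) fetched in one object-store request.
type part struct{ start, end uint64 }

// partition merges sorted, non-overlapping ranges that are closer together
// than maxGap into single fetches to reduce request count.
func partition(starts, ends []uint64, maxGap uint64) []part {
	var parts []part
	for i := 0; i < len(starts); i++ {
		p := part{start: starts[i], end: ends[i]}
		for i+1 < len(starts) && starts[i+1]-p.end <= maxGap {
			i++
			p.end = ends[i]
		}
		parts = append(parts, p)
	}
	return parts
}

// covered is the sanity check: every requested range must sit fully inside
// some part. If it does not, the corresponding postings are never loaded and
// a later dereference panics with a nil pointer, as seen in this issue.
func covered(starts, ends []uint64, parts []part) bool {
	for i := range starts {
		ok := false
		for _, p := range parts {
			if p.start <= starts[i] && ends[i] <= p.end {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}
```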
Following up on this, running this image: However, the logs are not showing the messages that you added in #627. A different panic is now appearing:
|
It happens here every hour or two, but no one wants to upload a core dump that is around 200GB (at least in our case, since we have a bucket which is >1TB). And I haven't had enough time to dig into it and research the warts of Go's memory handling code myself. |
Yeah... so, as we suspected previously, it has to do with OOM. I wonder what we can do to improve this case. I think the acceptance criterion for this issue is then to add a means of controlling memory consumption, so that we fail the query request rather than kill the box. This seems like a first solution: #798, i.e. similar control flags to what Prometheus has. |
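To sketch what "fail the query instead of killing the box" could look like in practice: a per-request byte budget that is charged as postings and chunks are fetched, returning an error once the limit is exceeded, similar in spirit to Prometheus' query limits. This is illustrative only; the type and wiring below are made up and are not the actual #798 implementation:

```go
package limits

import (
	"errors"
	"sync/atomic"
)

// ErrBudgetExceeded is returned to the caller instead of letting the process
// grow until the kernel OOM-kills it.
var ErrBudgetExceeded = errors.New("query touched more bytes than the configured limit")

// byteBudget tracks how many bytes a single query has pulled from the bucket.
type byteBudget struct {
	limit int64
	used  int64
}

func newByteBudget(limit int64) *byteBudget { return &byteBudget{limit: limit} }

// add records n more bytes for this query and fails once the budget is spent.
// Safe for concurrent use by the goroutines fetching chunks in parallel.
func (b *byteBudget) add(n int64) error {
	if atomic.AddInt64(&b.used, n) > b.limit {
		return ErrBudgetExceeded
	}
	return nil
}
```

Each postings or chunk fetch would call add() with the size it is about to read, so an oversized query gets a clean error response while the store gateway itself stays up.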
Okay, we got a bit more RAM and it is still reproducible, even though there is a lot of free RAM :( The original PR that I made is still very valid, so that random queries wouldn't be able to kill Thanos Store, but the problem lies somewhere else. I will be able to load up a debugger now since there is lots of RAM; the core dump files are huge and, unfortunately, it's impossible to upload such huge files. |
Yes, we checked Golang memory management and the behaviour you mentioned (nil on lack of memory) should not happen. We recently found a race condition here: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1627 The access is now inside the lock, but it used to be outside the lock, and we fixed that recently here: 1b6f6da#diff-a75f50a9f5bf5b21a862e4e7c6bd1576 Can you check master to see if this is still reproducible? It's highly plausible that we fixed this. |
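To illustrate the class of bug being described (shared state read outside the mutex that is supposed to guard it), here is a stripped-down example. It is not the actual bucket.go code, only the shape of the race and of the fix:

```go
package store

import "sync"

// chunkReaders guards a map of per-block readers with a mutex.
type chunkReaders struct {
	mtx     sync.Mutex
	readers map[string][]byte
}

// racyGet reads the map before taking the lock; a concurrent writer can leave
// the caller with a nil or partially initialized value. -race may or may not
// catch it, depending on timing -- the kind of bug that only shows under load.
func (c *chunkReaders) racyGet(key string) []byte {
	r := c.readers[key] // BUG: access outside the critical section
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return r
}

// safeGet performs the same lookup inside the critical section, which is
// essentially what the linked fix does: move the access under the mutex.
func (c *chunkReaders) safeGet(key string) []byte {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return c.readers[key]
}
```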
Will do and update this ticket. Thank you a lot! 👍 Awesome work |
I updated our cluster components to Thanos 0.3.2 recently. While testing access to old data (only available via store/bucket access), I started to get similar panics:
|
Since I migrated the data from a 3-node Prometheus cluster to S3 buckets with the Thanos sidecar and started running the Thanos store node against that data, a strange issue has occurred:
panic: runtime error: invalid memory address or nil pointer dereference
after a query for a long period of time.
Since Prometheus was initially using the default min/max-block-duration options, the data I migrated was compacted (to level 5 mostly), so in order to migrate it I manually changed the meta.json files to:
This migrated the data successfully, but may be part of the issue.
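For readers unfamiliar with the block layout: the compaction level mentioned above lives in each block's meta.json. As a rough illustration of where that field sits (the struct below is a trimmed-down guess at the relevant fields of the TSDB/Thanos metadata, not the reporter's exact edit, which is omitted above):

```go
package meta

import (
	"encoding/json"
	"os"
)

// blockMeta mirrors only the parts of a block's meta.json relevant here;
// it is an illustrative, reduced struct, not the full Thanos metadata type.
type blockMeta struct {
	ULID       string `json:"ulid"`
	MinTime    int64  `json:"minTime"`
	MaxTime    int64  `json:"maxTime"`
	Compaction struct {
		Level   int      `json:"level"`
		Sources []string `json:"sources"`
	} `json:"compaction"`
	Version int `json:"version"`
}

// readMeta loads <blockDir>/meta.json so its compaction level can be inspected
// before deciding how (or whether) to hand-edit it for a migration.
func readMeta(blockDir string) (*blockMeta, error) {
	b, err := os.ReadFile(blockDir + "/meta.json")
	if err != nil {
		return nil, err
	}
	var m blockMeta
	if err := json.Unmarshal(b, &m); err != nil {
		return nil, err
	}
	return &m, nil
}
```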
When I executed this query:
For a period of 8w, I got this error and the thanos store pod got restarted:
After a second execution, however, the same query returned a satisfying result without any further panic errors.
I managed to find this pattern (however strange it may seem).
When I start to build a graph from a small period of time and use the "+" (Grow the time range) button to rapidly extend the time period, I get a nil pointer error even for small periods like 1d. If I run the query directly for a large period of time (12w or more), I receive a prompt reply without an error.