does not start up after corrupted meta.json file #4058
Comments
I did a PR that should address this issue. @fabxc and @gouthamve will advise soon.
I ran into this today, had a single Prometheus server with the exact same behavior as above.
I moved the dir out of the data directory and Prometheus restarted happily.
I also faced this problem today.
Do you by any chance use NFS? There was some discussion in the past: NFS sometimes behaves strangely, and I don't think there was anything we could do to prevent this, so NFS is considered unsupported.
@krasi-georgiev I'm working with @bmihaescu, so I can comment on this. The error is from a kops-deployed cluster running on AWS using EBS.
That would be hard to troubleshoot. I would need specific steps to replicate it. How often does it happen, and can you replicate it with the latest release?
@Vlaaaaaaad, @bmihaescu are you sure you have enough free space? (Suggested by Brian on IRC, so worth checking.)
@krasi-georgiev oh, that is likely to be the issue. We do have some free space, but not much. Is there some documentation on how much space should be free (a certain value, a percentage)? This is happening on two older clusters (k8s 1.10.3 and 1.9.6), with prometheus-operator v0.17.0 and Prometheus v2.2.1, so the issue might be fixed in newer versions. Tagging @markmunozoz too.
I am not 100% sure, but logically I would say at least 5 times your biggest block. By the way, there are plans to add storage-based retention, which should help use cases where storage is limited: prometheus-junkyard/tsdb#343
Does anyone want to add anything else before we mark this as resolved? @haraldschilly did you find the cause in your case?
This is implemented as part of the tsdb CLI scan tool, which is still in review.
I am running into the same issue on Prometheus 2.4.3 with Vagrant. When I suspend my machine, VirtualBox seems to crash; after the crash I reboot the machine, and usually one, but sometimes up to 90%, of my blocks end up corrupted.
I am not seeing this in production yet, I guess simply because my machines rarely ever crash.
I double-checked the code again, and the only way I could see this happening is when using NFS or another non-POSIX filesystem. @slomo can you replicate this every time?
Are you maybe mounting a dir from the host to use as the data dir?
@krasi-georgiev I'm the one who originally reported this. In case it helps, this did happen on a GCE PD disk, mounted via
Yeah, a GCE PD disk is OK.
Well, I don't remember seeing any logs specific to that with useful info. It usually happens when there is an OOM event and the kernel kills the Prometheus job, or when the whole VM is shut down. I think the main underlying reason is that ext4 isn't 100% atomic. This makes me think I should try using ZFS or Btrfs.
It is ext4 inside a VirtualBox VM. I would say it happens on every VirtualBox crash; I'll try to reproduce it.
Steps to reproduce would really help so I can try to replicate as well. Thanks!
@slomo any luck with steps to replicate this?
Well, in my setup (which contains a lot of Consul SD hosts) I can reproduce it by resetting the VirtualBox VM. I tried to create a smaller setup with just a few static node_exporters that are queried, and I can't trigger the corruption anymore.
So you think it is related to the SD being used?
@krasi-georgiev I think it would be jumping to conclusions a bit fast to say that SD is at fault, but it definitely requires a certain complexity to occur. I have 6 jobs with a total of ca. 400 targets; all targets are added using service discovery with Consul. @haraldschilly could you roughly describe your setup? Do you use service discovery, and how many hosts/applications do you monitor?
@slomo thanks for the update. Any chance you could ping me on IRC to speed this up?
@slomo maybe the main difference is the load. Try it with a higher load: 400 static targets.
We are hitting the same issue on a single Prometheus instance (version 2.6.0 with local storage), running inside a Docker container. So far it has happened twice, out of ~50 deployed instances:
It's not directly connected with the container restart, as in the majority of cases it starts without any issues. It's also not a matter of insufficient disk space. As discussed with @gouthamve, we are planning to mitigate this by introducing a check for an empty meta.json.
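A minimal sketch of what such a check could look like, assuming it simply scans the data directory for zero-length meta.json files and moves the affected block directories aside; the program and its behaviour here are illustrative, not the actual mitigation:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dataDir := os.Args[1] // Prometheus data directory, e.g. /prometheus

	// Every block directory contains a meta.json; a zero-length one is the
	// corruption described in this issue.
	metas, err := filepath.Glob(filepath.Join(dataDir, "*", "meta.json"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, metaPath := range metas {
		fi, err := os.Stat(metaPath)
		if err != nil || fi.Size() > 0 {
			continue
		}
		// Move the whole block directory aside so Prometheus can start.
		blockDir := filepath.Dir(metaPath)
		if err := os.Rename(blockDir, blockDir+".broken"); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```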
Are we not creating the meta.json atomically? |
@mmencner would you mind trying the latest release, as there have been a lot of changes to fix such issues since 2.6? I would also need the full logs to start some useful troubleshooting. @brian-brazil I just had a quick look and the write does indeed happen atomically. My guess is that something goes wrong during compaction when creating a new block.
I can see no way that code can produce a zero-length file. It'd have to be the kernel saying it's successfully written and closed, but then not having space for it.
Yes, I suspect something similar, especially in the case of resetting the VM.
@brian-brazil @krasi-georgiev We are facing the same issue. Sometimes lots of meta.json files are zero-sized. We run Prometheus on a local ext4 FS. Looking at the
Strangely enough, all the meta.json files have the same modification time and zero size:
There are no errors in the Prometheus log...
Hm, I would expect the kernel to handle the file sync, so I don't think this is the culprit. How long does it take to replicate? Can you ping me (@krasi-georgiev) on #prometheus-dev? I will try to replicate and find the culprit, as this has been a pending issue for a while now.
Close definitely doesn't guarantee sync. Also, if a node crashes before the kernel flushes its write-back cache, then we can end up with a file with no contents, despite a successful write/close/rename.
Not sure. Happens sporadically. What I can say is that we've seen it only after a node crashed. Everything is fine during normal operation.
@pborzenkov Yeah, maybe you are right. I just checked the code, and fsync is called for the other block write operations. I will open a PR with the fsync for meta.json.
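For illustration, a crash-safe write of meta.json roughly follows the pattern below: write to a temporary file, fsync it, rename it over meta.json, then fsync the parent directory so the rename itself is durable. This is a sketch under those assumptions, not the actual change; names are illustrative:

```go
package block

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// writeMetaFile writes meta to dir/meta.json so that the result is both
// atomic (via rename) and durable (via fsync of the file and the directory).
func writeMetaFile(dir string, meta interface{}) error {
	path := filepath.Join(dir, "meta.json")
	tmp := path + ".tmp"

	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	enc := json.NewEncoder(f)
	enc.SetIndent("", "\t")
	if err := enc.Encode(meta); err != nil {
		f.Close()
		return err
	}
	// Without this fsync, a crash can leave a zero-length meta.json even
	// though write, close and rename all reported success.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, path); err != nil {
		return err
	}
	// Fsync the parent directory so the rename survives a crash as well.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}
```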
@krasi-georgiev I'll be happy to test (though definitely not in production :)), but crash-related bugs are notoriously hard to reproduce. I tried to check the bug using ALICE (http://research.cs.wisc.edu/adsl/Software/alice/doc/adsl-doc.html), which has greatly helped me in the past, and this is what I got. Here is the write part of the test (tsdb.WriteMetaFile just calls tsdb.writeMetaFile):
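A minimal sketch of such a workload, assuming tsdb.WriteMetaFile is a small test-only exported wrapper around the unexported writeMetaFile as described above; the ULID and time values are just example data:

```go
package main

import (
	"log"
	"os"

	"github.com/oklog/ulid"
	"github.com/prometheus/tsdb"
)

func main() {
	dir := os.Args[1] // workload directory provided by the ALICE harness

	meta := &tsdb.BlockMeta{
		// Reusing the block ULID from this issue purely as example data.
		ULID:    ulid.MustParse("01CAF1K5SQZT4HBQE9P6W7J56E"),
		MinTime: 0,
		MaxTime: 7200000,
	}
	// WriteMetaFile is assumed to be a test-only exported wrapper around
	// the unexported tsdb.writeMetaFile, as described in the comment above.
	if err := tsdb.WriteMetaFile(dir, meta); err != nil {
		log.Fatal(err)
	}
}
```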
And here is the checker:
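A minimal sketch of what the checker might look like: after each crash state simulated by ALICE, meta.json must either not exist yet or be complete, valid JSON. The structure below is illustrative:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

func main() {
	dir := os.Args[1] // crash-state directory reconstructed by ALICE

	path := filepath.Join(dir, "meta.json")
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		// The file not existing at all is an acceptable crash state.
		return
	}
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// If the file exists, it must be complete, valid JSON; a zero-length
	// meta.json (as in this issue) fails here.
	var meta map[string]interface{}
	if err := json.NewDecoder(f).Decode(&meta); err != nil {
		log.Fatalf("meta.json is corrupted: %v", err)
	}
}
```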
This is what I got with unmodified tsdb:
And this is what I got after adding the fsync:
While this is definitely not proof that the bug is indeed fixed, the tool has a great track record and usually finds real problems.
Wow, that is amazing. Thanks for spending the time. I will open a PR soon.
I just opened a PR that should close this issue; it will go into the next release. Feel free to reopen if you still experience the same issue after that.
This ticket is a follow-up of #2805 (there are similar comments at the bottom, added after it was closed).
What did you do?
Ran Prometheus in a Kubernetes cluster, on a GCE PD disk.
What did you see instead? Under which circumstances?
It crashed upon start, logfile:
The point here is that the meta.json file has a size of zero.

Manual resolution
I've deleted the directory 01CAF1K5SQZT4HBQE9P6W7J56E with the problematic meta.json file in it, and now it starts up fine again.

Environment
System information:
Linux 4.10.0-40-generic x86_64
Prometheus version:
("official" docker build)
Expected behavior
What I would wish is that Prometheus starts up and doesn't CrashLoop. It should either recover on its own or, for example, move the broken block's directory to [directoryname].broken/?