does not start up after corrupted meta.json file #4058

Closed
haraldschilly opened this issue Apr 7, 2018 · 37 comments · Fixed by prometheus-junkyard/tsdb#573 · May be fixed by prometheus-junkyard/tsdb#320

Comments

@haraldschilly

haraldschilly commented Apr 7, 2018

This ticket is a follow-up to #2805 (there are similar comments at the bottom, posted after it was closed).

What did you do?

Ran Prometheus in a Kubernetes cluster, with data on a GCE PD disk.

What did you see instead? Under which circumstances?

It crashed on startup. Log file:

level=error ts=2018-04-07T04:28:53.784390578Z caller=main.go:582
err="Opening storage failed unexpected end of JSON input"
level=info ts=2018-04-07T04:28:53.784418708Z caller=main.go:584
msg="See you next time!"

The point here is that the meta.json file has a size of zero:

> ls -l /data/*/meta.json
[...]
-rw-rw-r--    1 1000     2000           283 Apr  7 03:05 01CAF10VV5FNDJ6PG84E6RSEV3/meta.json
-rw-rw-r--    1 1000     2000             0 Apr  7 03:15 01CAF1K5SQZT4HBQE9P6W7J56E/meta.json

Manual resolution

I've deleted the directory 01CAF1K5SQZT4HBQE9P6W7J56E containing the problematic meta.json file, and now it starts up fine again.

Environment

  • System information:

    Linux 4.10.0-40-generic x86_64

  • Prometheus version:

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10

("official" docker build)

  • Logs:
level=error ts=2018-04-07T08:51:28.789897822Z caller=main.go:582 err="Opening storage failed unexpected end of JSON input"

Expected behavior

What I would wish for is that Prometheus starts up and doesn't crash-loop. It should either:

  • ignore that directory, noting in the log that its meta.json is faulty
  • maybe move it to [directoryname].broken/? (see the sketch after this list)
  • reconstruct the meta.json file from the data
  • delete the problematic directory (a bit harsh; ignoring might be better)
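
For illustration, here is a minimal sketch of what the "ignore / move to .broken" option could look like: a hypothetical startup helper, not Prometheus code, and the data directory path is only an assumption. It renames any block directory whose meta.json exists but is empty to <dir>.broken so the server can start and the block can be inspected later.

package main

import (
        "fmt"
        "io/ioutil"
        "os"
        "path/filepath"
)

// quarantineBrokenBlocks renames any block directory whose meta.json is
// present but zero bytes to <dir>.broken, so Prometheus can start and the
// block can be inspected later. Hypothetical helper, not Prometheus code.
func quarantineBrokenBlocks(dataDir string) error {
        entries, err := ioutil.ReadDir(dataDir)
        if err != nil {
                return err
        }
        for _, e := range entries {
                if !e.IsDir() {
                        continue
                }
                blockDir := filepath.Join(dataDir, e.Name())
                fi, err := os.Stat(filepath.Join(blockDir, "meta.json"))
                // Only touch directories that contain a zero-byte meta.json;
                // this skips directories like wal that have no meta.json.
                if err == nil && fi.Size() == 0 {
                        broken := blockDir + ".broken"
                        fmt.Printf("quarantining %s -> %s\n", blockDir, broken)
                        if err := os.Rename(blockDir, broken); err != nil {
                                return err
                        }
                }
        }
        return nil
}

func main() {
        // "/data" is an assumed data directory; adjust as needed.
        if err := quarantineBrokenBlocks("/data"); err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
        }
}
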
@krasi-georgiev
Contributor

krasi-georgiev commented Apr 7, 2018

I opened a PR that should address this issue; @fabxc and @gouthamve will advise soon.
prometheus-junkyard/tsdb#283

@TimSimmons

I ran into this today: a single Prometheus server with the exact same behavior as above.

$ ls -l 01CB7X7ZX2E1C6R1SQHVTVVX9J
total 180008
drwxr-xr-x 2 prometheus prometheus      4096 Apr 16 19:00 chunks
-rw-r--r-- 1 prometheus prometheus 184323907 Apr 16 19:00 index
-rw-r--r-- 1 prometheus prometheus         0 Apr 16 19:00 meta.json
-rw-r--r-- 1 prometheus prometheus         0 Apr 16 19:00 tombstones

I moved the dir out of the data directory and Prometheus restarted happily.

@bmihaescu

I also hit this problem today.
Finding and deleting the folder with the empty meta.json solved the issue.

root@ip-xxx-xxx-xxx-xxx:/var/lib/kubelet/pods/ce0fba7f-c649-11e8-128370c697fc/volumes/kubernetes.io~aws-ebs/pvc-a7cf0ff9-11e8-acf2-020ae50968fe/prometheus-db# find ./ -name meta.json -size 0
./01CRRG0QYZS8GJQGS96QZYR5TJ/meta.json

@krasi-georgiev
Contributor

krasi-georgiev commented Oct 2, 2018

Do you by any chance use NFS?

There was some discussion about this in the past: NFS sometimes behaves oddly, and I don't think there was anything we could do to prevent it, so NFS is considered unsupported.

@Vlaaaaaaad

@krasi-georgiev: I work with @bmihaescu, so I can comment on this. The error is from a kops-deployed cluster running on AWS using EBS.

@krasi-georgiev
Contributor

krasi-georgiev commented Oct 8, 2018

That would be hard to troubleshoot; I would need specific steps to replicate it.

How often does it happen, and can you replicate it with the latest release?

@krasi-georgiev
Contributor

@Vlaaaaaaad, @bmihaescu, are you sure you have enough free space? (Suggested by Brian on IRC, so worth checking.)
During compaction Prometheus needs some extra space to temporarily hold the new block while the source blocks are being joined.

@Vlaaaaaaad

@krasi-georgiev oh, that is likely the issue. We do have some free space, but not much. Is there any documentation on how much space should be free (a certain value, a percentage)?
@bmihaescu will have more details, as he was the one who did the investigation and fix.

This is happening on two older clusters (k8s 1.10.3 and 1.9.6), with prometheus-operator v0.17.0 and Prometheus v2.2.1, so the issue might be fixed in newer versions.

Tagging @markmunozoz too.

@krasi-georgiev
Contributor

I am not 100% sure, but logically I would say at least 5 times your biggest block; for example, with a 20 GiB largest block that would mean keeping roughly 100 GiB free.
Maybe a bit less will do, but storage is not expensive these days, so it's better to be on the safe side.

By the way, there are plans to add storage-based retention, which should help use cases where storage is limited. prometheus-junkyard/tsdb#343

@krasi-georgiev
Contributor

Does anyone want to add anything else before we mark this as resolved?

@haraldschilly did you find the cause for your case?

delete the problematic directory (a bit harsh; ignoring might be better)

This is implemented as part of the tsdb CLI scan tool, which is still in review.

@slomo

slomo commented Dec 10, 2018

I am running into the same issue on Prometheus 2.4.3 with Vagrant. When I suspend my machine, VirtualBox seems to crash; after the crash I reboot the machine, and usually one, but sometimes up to 90%, of my meta.json files are empty.

[vagrant@prometheus ~]$ sudo find /var/lib/prometheus  -iname meta.json -size 0b | wc
    345     345   19665
[vagrant@prometheus ~]$ sudo find /var/lib/prometheus  -iname meta.json | wc
    357     357   20349

I am not seeing this in production yet, I guess simply because those machines rarely ever crash.
I will try to upgrade to the latest version, but I am now a bit worried that I might also lose data in production during a power outage.

@krasi-georgiev
Contributor

I double-checked the code again, and the only way I can see this happening is when using NFS or another non-POSIX filesystem.

@slomo can you replicate this every time?

@krasi-georgiev
Contributor

Are you maybe mounting a directory from the host to use as the data directory?

@haraldschilly
Author

haraldschilly commented Dec 10, 2018

@krasi-georgiev I'm the one who originally reported this. In case it helps, this happened on a GCE PD disk, mounted via /dev/sda1 on /prometheus type ext4 (rw,relatime,data=ordered) and /dev/sdd on /data type ext4 (rw,relatime,discard,data=ordered) in a Docker container.

@krasi-georgiev
Contributor

krasi-georgiev commented Dec 10, 2018

Yeah, a GCE PD disk is fine.
It would help to go through the logs for any errors or warnings before the crash.

@haraldschilly
Author

Well, I don't remember seeing any log entries about this with useful info. It usually happens when there is an OOM event and the kernel kills the Prometheus process, or when the whole VM is shut down. I think the main underlying reason is that ext4 isn't 100% atomic. This makes me think I should try using ZFS or Btrfs.

@slomo

slomo commented Dec 10, 2018

It is ext4 inside a VirtualBox VM. I would say it happens on every VirtualBox crash; I'll try to reproduce it.

@krasi-georgiev
Contributor

Steps to reproduce would really help, so that I can try to replicate it as well. Thanks!

@krasi-georgiev
Contributor

@slomo any luck with steps to replicate this?
If possible, please ping me on IRC (@krasi-georgiev on #prometheus-dev) to speed up the troubleshooting.

@slomo

slomo commented Dec 12, 2018

Well, in my setup (which contains a lot of Consul SD hosts) I can reproduce it by resetting the VirtualBox VM. I tried to create a smaller setup with just a few static node_exporters being scraped, and I can't trigger the corruption anymore.

@krasi-georgiev
Contributor

So you think it is related to the SD being used?

@slomo

slomo commented Dec 13, 2018

@krasi-georgiev I think it would be jumping to conclusions a bit fast to say that SD is at fault, but it definitely requires a certain complexity to occur. I have 6 jobs with a total of ca. 400 targets; all targets are added using service discovery with Consul.

@haraldschilly Could you roughly describe your setup? Do you use service discovery, and how many hosts/applications do you monitor?

@krasi-georgiev
Contributor

@slomo thanks for the update. Any chance you could ping me on IRC to speed this up?

@krasi-georgiev
Contributor

@slomo maybe the main difference is the load. Try it with a higher load: 400 static targets.

@mmencner

We are hitting the same issue on a single Prometheus instance (version 2.6.0 with local storage) running inside a Docker container. So far it has happened twice, out of ~50 deployed instances:

  • The first time, only the last block had an empty meta.json
  • The second time, the last two blocks had an empty meta.json

It's not directly connected to container restarts, as in the majority of cases Prometheus starts without any issues. It's also not a matter of insufficient disk space.

As discussed with @gouthamve, we are planning to mitigate this by introducing a check for an empty meta.json and deleting directories where this condition is true, but it's definitely not a perfect solution.
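
A rough sketch of that mitigation (the delete-on-empty behaviour described above; the data path is an assumption and this is not an official tool):

package main

import (
        "fmt"
        "os"
        "path/filepath"
)

// Pre-start check as described above: delete any block directory whose
// meta.json exists but is empty. Hypothetical sketch; "/prometheus-db"
// is an assumed data directory path.
func main() {
        matches, err := filepath.Glob("/prometheus-db/*/meta.json")
        if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
        }
        for _, m := range matches {
                fi, err := os.Stat(m)
                if err == nil && fi.Size() == 0 {
                        dir := filepath.Dir(m)
                        fmt.Println("removing block with empty meta.json:", dir)
                        if err := os.RemoveAll(dir); err != nil {
                                fmt.Fprintln(os.Stderr, err)
                                os.Exit(1)
                        }
                }
        }
}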

@brian-brazil
Contributor

Are we not creating the meta.json atomically?

@krasi-georgiev
Contributor

@mmencner would you mind trying it with the latest release, as there have been a lot of changes to fix such issues since 2.6? We would also need the full logs to start some useful troubleshooting.

@brian-brazil I just had a quick look, and the write does indeed happen atomically. My guess is that something goes wrong during compaction when creating a new block.

@brian-brazil
Contributor

brian-brazil commented Feb 28, 2019

I can see no way that code can produce a zero-length file. It'd have to be the kernel saying it's successfully written and closed, but then not having space for it.

@krasi-georgiev
Contributor

I can see no way that code can produce a zero-length file. It'd have to be the kernel saying it's successfully written and closed, but then not having space for it.

Yes, I suspect something similar, especially in the case of resetting the VM.

@pborzenkov
Contributor

@brian-brazil @krasi-georgiev We are facing the same issue: sometimes lots of meta.json files are zero-sized. We run Prometheus on a local ext4 filesystem.

Looking at writeMetaFile, shouldn't the file contents themselves be synced before close?

@pborzenkov
Contributor

Strangely enough, all the meta.json files have the same modification time and zero size:

[root@xxx]# ls -la /var/lib/prometheus/data/*/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5NZP8MVJXC9P2TQ9QJB1FEV/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5PM9ED3MV7HJSVG5ES6W6KR/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5Q8WM6TR1AH1STJXNRWA55R/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5QXFSYARB5E8M7T4K8MWX8E/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5RJ2ZMZZENCT48KD2KJ0B85/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5S6P5D1WV8G6FKE5RHHEETA/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5SV9B4H5A5BPZ6AYMVD2KX3/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5TFWGXREGZJ4EJ51EREXH9D/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5V4FPQCHJ2VTMZECGNBCM0T/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5VS2WEDDKTHV2WEBG1BT5MB/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5WDP25JN7T2QX2AM3WPK3V3/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5X29FTXJ6C0Z9P1R2GN9PY3/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5XPWDRF57JT697BZN5KDY14/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5YBF6R8F6ZFG17TT2JQ91H9/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5Z02CF9RG1WYPAHMARNHJWG/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5ZMNJ7CG1XZTT295HH68NCB/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D6098R1BQQ0N04N9FARC4ZWJ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D60XW6GTZCDAYWDC2PXYBMXP/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D61JF921X3HHNESWTM1A5HF7/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D6272H50VCEDRHJ53NBXXTAQ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D62VNYAA6TA5MX11GE3NQPWJ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D63G8XJ907DQK9MC5PAE8CJP/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D644W0QY5AYHW9Z943Y4PFEQ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D64SF6PQ5WAPMGN4V7BRV4BQ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D65E2CP1N2C6WTPBPCXW0N70/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D662NSQ94HJZ05XHHZ75E683/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D66Q8T3C8YV3PSKZWW6572B4/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67BVVRJXG5KJY01RBFBZ1W2/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67BVWYTJRQ4KVNB5GSECZ43/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67JQK0EEPFJ50EAF9900P1Z/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67SKA8EGHH9FAHVRWQVP1Y4/meta.json

There are no errors in the Prometheus log...

@krasi-georgiev
Contributor

Hmm, I would expect the kernel to handle the file sync, so I don't think this is the culprit.

How long does it take to replicate?

Can you ping me on #prometheus-dev (@krasi-georgiev)? I will try to replicate and find the culprit, as this has been a pending issue for a while now.

@pborzenkov
Contributor

Hmm, I would expect the kernel to handle the file sync, so I don't think this is the culprit.

Close definitely doesn't guarantee a sync. Also, if a node crashes before the kernel flushes its write-back cache, we can end up with a file with no contents despite a successful write/close/rename.

How long does it take to replicate?

Not sure; it happens sporadically. What I can say is that we've only seen it after a node crashed. Everything is fine during normal operation.

@krasi-georgiev
Contributor

@pborzenkov yeah, maybe you are right; I just checked the code, and Fsync is called for the other block write operations.
https://github.com/prometheus/tsdb/blob/7757fe6f21d4d9f21bf7a1da92fe573b7d4318c3/compact.go#L629

I will open a PR with the fsync for writeMetaFile and will attach a binary to it if you want to test it.
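
For context, the crash-safe pattern being discussed is: write to a temporary file, fsync the file, rename it over the target, then fsync the parent directory. A minimal sketch of that pattern (a hypothetical helper, not the actual tsdb writeMetaFile):

package main

import (
        "os"
        "path/filepath"
)

// writeFileAtomic sketches the write-tmp/fsync/rename/fsync-dir pattern
// discussed above; it is not the actual tsdb writeMetaFile implementation.
func writeFileAtomic(path string, data []byte) error {
        tmp := path + ".tmp"

        f, err := os.Create(tmp)
        if err != nil {
                return err
        }
        if _, err := f.Write(data); err != nil {
                f.Close()
                return err
        }
        // Flush the file contents to stable storage before the rename;
        // without this, a crash can leave a zero-length meta.json behind.
        if err := f.Sync(); err != nil {
                f.Close()
                return err
        }
        if err := f.Close(); err != nil {
                return err
        }
        // Atomically replace the old file.
        if err := os.Rename(tmp, path); err != nil {
                return err
        }
        // Sync the parent directory so the rename itself is durable.
        dir, err := os.Open(filepath.Dir(path))
        if err != nil {
                return err
        }
        defer dir.Close()
        return dir.Sync()
}

func main() {
        if err := writeFileAtomic("meta.json", []byte("{}\n")); err != nil {
                panic(err)
        }
}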

@pborzenkov
Contributor

pborzenkov commented Mar 26, 2019

@krasi-georgiev I'll be happy to test (though definitely not in production :)), but crash-related bugs are notoriously hard to reproduce. I tried to check the bug using ALICE (http://research.cs.wisc.edu/adsl/Software/alice/doc/adsl-doc.html), which has greatly helped me in the past, and this is what I got:

Here is the write part of the test (tsdb.WriteMetaFile just calls tsdb.writeMetaFile):

package main

import (
        "math/rand"
        "os"
        "time"

        "github.com/oklog/ulid"
        "github.com/prometheus/tsdb"
)

func main() {
        now := time.Now()

        err := tsdb.WriteMetaFile(".", &tsdb.BlockMeta{
                ULID:    ulid.MustNew(ulid.Now(), rand.New(rand.NewSource(123))),
                MinTime: now.Add(-2*time.Hour).Unix() * 1000,
                MaxTime: now.Unix() * 1000,
                Version: 1,
        })
        if err != nil {
                os.Exit(1)
        }
        os.Exit(0)
}

And here the checker:

package main

import (
        "os"

        "github.com/prometheus/tsdb"
)

func main() {
        _, err := tsdb.ReadMetaFile(os.Args[1])
        if err != nil && !os.IsNotExist(err) {
                os.Exit(1)
        }
        os.Exit(0)
}

This is what I got with unmodified tsdb:

❯ alice-check --traces_dir=../traces_dir --checker=../../read-test/read-test --debug_level 1
-------------------------------------------------------------------------------
ALICE tool version 0.0.1. Please go through the documentation, particularly the
listed caveats and limitations, before deriving any inferences from this tool.
-------------------------------------------------------------------------------
Parsing traces to determine logical operations ...
Logical operations:
0       creat("meta.json.tmp", parent=107377493, mode='0666', inode=107377690)
1       append("meta.json.tmp", offset=0, count=159, inode=107377690)
2       rename(dest="meta.json", source="meta.json.tmp", source_hardlinks=1, source_parent=107377493, dest_size=0, dest_hardlinks=0, source_size=159, dest_parent=107377493, source_inode=107377690, dest_inode=False)
3       fsync(".", size=60, inode=107377493)
-------------------------------------
Finding vulnerabilities...
(Dynamic vulnerability) Ordering: Operation 1 needs to be persisted before 2
(Static vulnerability) Ordering: Operation /usr/lib/golang/src/syscall/asm_linux_amd64.s:27[syscall.Syscall] needed before /usr/lib/golang/src/syscall/asm_linux_amd64.s:53[syscall.Syscall6]
Done finding vulnerabilities.

And this is what I got after adding an f.Sync() call before f.Close() in writeMetaFile:

❯ alice-check --traces_dir=../traces_dir --checker=../../read-test/read-test --debug_level 1
-------------------------------------------------------------------------------
ALICE tool version 0.0.1. Please go through the documentation, particularly the
listed caveats and limitations, before deriving any inferences from this tool.
-------------------------------------------------------------------------------
Parsing traces to determine logical operations ...
Logical operations:
0       creat("meta.json.tmp", parent=107379907, mode='0666', inode=107378562)
1       append("meta.json.tmp", offset=0, count=159, inode=107378562)
2       fsync("meta.json.tmp", size=159, inode=107378562)
3       rename(dest="meta.json", source="meta.json.tmp", source_hardlinks=1, source_parent=107379907, dest_size=0, dest_hardlinks=0, source_size=159, dest_parent=107379907, source_inode=107378562, dest_inode=False)
4       fsync(".", size=60, inode=107379907)
-------------------------------------
Finding vulnerabilities...
Done finding vulnerabilities.

While this is definitely not proof that the bug is fixed, the tool has a great track record and usually finds real problems.

@krasi-georgiev
Contributor

Wow, that is amazing. Thanks for spending the time on this. I will open a PR soon.

@krasi-georgiev
Contributor

I just opened a PR that should close this issue; it will go in the next release. Feel free to reopen if you still experience the same issue after that.
