downsample/compact: Compactor is not able to compact downsampled blocks. #281

Closed
bwplotka opened this issue Apr 11, 2018 · 12 comments
@bwplotka
Member

Compaction of all my downsampled blocks fails with this error:

level=error name=thanos-compactor ts=2018-04-11T15:20:38.938782011Z caller=compact.go:163 msg="compaction failed" err="compact blocks [/compactor-data/compact/300000@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor=\"thanos-coremon\",replica=\"thanos-coremon-1\"}/01CAG1M6HZPC2FH1TC7G0S8BN8 /compactor-data/compact/300000@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor=\"thanos-coremon\",replica=\"thanos-coremon-1\"}/01CAG396J6WA6XHX7FK6PME88T /compactor-data/compact/300000@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor=\"thanos-coremon\",replica=\"thanos-coremon-1\"}/01CAG4SVRYR3130MGTFQXHMWWE /compactor-data/compact/300000@{cluster=\"alpha-eu-1\",environment=\"alpha\",monitor=\"thanos-coremon\",replica=\"thanos-coremon-1\"}/01CAG0Y40ERDHMAKTA591B5Y4V]: write compaction: chunk 8 not found: invalid encoding \"<unknown>\""
@bwplotka
Member Author

bwplotka commented Apr 11, 2018

I think the cause is as "easy" as TSDB compaction having no idea how to read those blocks, because we have no control over the BlockReader that Compact uses (we cannot inject our own):

func (c *LeveledCompactor) Compact(dest string, dirs ...string) (uid ulid.ULID, err error) {
	var (
		blocks []BlockReader
		bs     []*Block
		metas  []*BlockMeta
		uids   []string
	)

	for _, d := range dirs {
		b, err := OpenBlock(d, c.chunkPool)
		if err != nil {
			return uid, err
		}
		defer b.Close()

		meta, err := readMetaFile(d)
		if err != nil {
			return uid, err
		}

		metas = append(metas, meta)
		blocks = append(blocks, b)
		bs = append(bs, b)
		uids = append(uids, meta.ULID.String())
	}

	entropy := rand.New(rand.NewSource(time.Now().UnixNano()))
	uid = ulid.MustNew(ulid.Now(), entropy)

	meta := compactBlockMetas(uid, metas...)
	err = c.write(dest, meta, blocks...)
	if err == nil {
		level.Info(c.logger).Log(
			"msg", "compact blocks",
			"count", len(blocks),
			"mint", meta.MinTime,
			"maxt", meta.MaxTime,
			"ulid", meta.ULID,
			"sources", fmt.Sprintf("%v", uids),
		)
		return uid, nil
	}

I feel like it is just a matter of adding an API to TSDB that allows opening special block readers, and then passing in our newMemBlock reader, but it's too late in the day to be sure about anything ^^
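To make the idea of an injectable reader more concrete, here is a minimal sketch of one possible extension point. The quoted `Compact` already hands `c.chunkPool` to `OpenBlock`, so a pool that recognizes one extra encoding is one place where knowledge of downsampled chunks could be plugged in. Every type and constant below (`Encoding`, `EncXOR`, `EncAggr`, `Pool`, `basePool`, `aggrPool`) is a hypothetical stand-in written for illustration, not the TSDB or Thanos API, and this is not necessarily how the issue was fixed.

```go
package main

import "fmt"

// Encoding is a hypothetical one-byte tag identifying how a chunk is encoded.
type Encoding uint8

const (
	EncXOR  Encoding = 1    // stand-in for an encoding the base pool understands
	EncAggr Encoding = 0xff // assumed tag for downsampled (aggregated) chunks
)

// Chunk is a minimal stand-in for a decoded chunk.
type Chunk struct {
	Enc  Encoding
	Data []byte
}

// Pool turns an encoding tag plus raw bytes into a Chunk.
type Pool interface {
	Get(e Encoding, b []byte) (*Chunk, error)
}

// basePool only knows the raw encoding and rejects everything else,
// which models the "invalid encoding" failure from the log above.
type basePool struct{}

func (basePool) Get(e Encoding, b []byte) (*Chunk, error) {
	if e == EncXOR {
		return &Chunk{Enc: e, Data: b}, nil
	}
	return nil, fmt.Errorf("invalid encoding %q", "<unknown>")
}

// aggrPool handles the aggregated encoding itself and delegates
// every other encoding to the wrapped pool.
type aggrPool struct{ next Pool }

func (p aggrPool) Get(e Encoding, b []byte) (*Chunk, error) {
	if e == EncAggr {
		return &Chunk{Enc: e, Data: b}, nil
	}
	return p.next.Get(e, b)
}

func main() {
	raw := []byte{0x01, 0x02}

	// The base pool cannot decode an aggregated chunk.
	_, err := basePool{}.Get(EncAggr, raw)
	fmt.Println("base pool error:", err)

	// The wrapping pool accepts it and still serves raw chunks.
	c, err := aggrPool{next: basePool{}}.Get(EncAggr, raw)
	fmt.Println("aggr pool:", c.Enc, err)
}
```

The wrapper keeps the base pool untouched, so raw blocks keep working while the extra encoding is handled in one place.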

Any thoughts @fabxc?

@bwplotka
Member Author

bwplotka commented Apr 12, 2018

The compactor fetches chunks by reference and needs to know their encoding, which is unknown to TSDB:

func (s *Reader) Chunk(ref uint64) (chunkenc.Chunk, error) {
	var (
		seq = int(ref >> 32)
		off = int((ref << 32) >> 32)
	)
	if seq >= len(s.bs) {
		return nil, errors.Errorf("reference sequence %d out of range", seq)
	}
	b := s.bs[seq]

	if off >= b.Len() {
		return nil, errors.Errorf("offset %d beyond data size %d", off, b.Len())
	}
	// With the minimum chunk length this should never cause us reading
	// over the end of the slice.
	r := b.Range(off, off+binary.MaxVarintLen32)

	l, n := binary.Uvarint(r)
	if n < 0 {
		return nil, fmt.Errorf("reading chunk length failed")
	}
	r = b.Range(off+n, off+n+int(l))

	return s.pool.Get(chunkenc.Encoding(r[0]), r[1:1+l])
}

@bwplotka
Member Author

#285 Fixes it.

@Allex1
Contributor

Allex1 commented Jun 19, 2019

Hi,
I still get these errors with Thanos 0.5.0 on S3, on a couple of compactor processes:

level=error ts=2019-06-14T18:33:54.886404704Z caller=main.go:181 msg="running command failed" err="error executing compaction: first pass of downsampling failed: downsample block 01D5CAVA4S0S9TF9V9WF2JVRNA to window 300000: get chunk 304942679029: invalid encoding \"<unknown>\""

@cassandracomar

I'm also seeing this on 0.7.0. @bwplotka any ideas why it might still be happening?

@bwplotka
Member Author

Seeing what exactly? What are the versions of all the components?

@cassandracomar

All thanos components are on 0.7.0.

level=info ts=2019-09-26T22:48:30.642775211Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-09-26T22:48:32.846934222Z caller=compact.go:236 msg="retention policy of raw samples is enabled" duration=720h0m0s
level=info ts=2019-09-26T22:48:32.846983189Z caller=compact.go:239 msg="retention policy of 5 min aggregated samples is enabled" duration=8760h0m0s
level=info ts=2019-09-26T22:48:32.846995692Z caller=compact.go:242 msg="retention policy of 1 hour aggregated samples is enabled" duration=87600h0m0s
level=info ts=2019-09-26T22:48:32.847009602Z caller=compact.go:326 msg="starting compact node"
level=info ts=2019-09-26T22:48:32.847018949Z caller=prober.go:114 msg="changing probe status" status=ready
level=info ts=2019-09-26T22:48:32.847093161Z caller=main.go:363 msg="listening for metrics" address=0.0.0.0:10902
level=info ts=2019-09-26T22:48:32.847163577Z caller=prober.go:143 msg="changing probe status" status=healthy
level=info ts=2019-09-26T22:48:32.864676793Z caller=compact.go:1023 msg="start sync of metas"
level=info ts=2019-09-26T22:48:33.532383667Z caller=compact.go:1029 msg="start of GC"
level=info ts=2019-09-26T22:48:33.534161354Z caller=compact.go:1035 msg="start of compaction"
level=info ts=2019-09-26T22:48:33.737835252Z caller=compact.go:249 msg="compaction iterations done"
level=info ts=2019-09-26T22:48:33.737868383Z caller=compact.go:256 msg="start first pass of downsampling"
level=info ts=2019-09-26T22:55:02.63697195Z caller=downsample.go:243 msg="downloaded block" id=01CV08RB890A58KEZ2NFTPFW46 duration=6m26.317719535s
level=info ts=2019-09-26T22:55:28.943321054Z caller=streamed_block_writer.go:219 msg="finalized downsampled block" mint=1540231200000 maxt=1540814400000 ulid=01DNQWJMAJ2DDHV81N089A841Z resolution=300000
level=warn ts=2019-09-26T22:55:28.964872518Z caller=prober.go:154 msg="changing probe status" status=unhealthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01CV08RB890A58KEZ2NFTPFW46 to window 300000: get chunk 93, series 1681895: invalid encoding \"<unknown>\""
level=error ts=2019-09-26T22:55:28.967755041Z caller=main.go:213 msg="running command failed" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01CV08RB890A58KEZ2NFTPFW46 to window 300000: get chunk 93, series 1681895: invalid encoding \"<unknown>\""

@krasi-georgiev
Contributor

@cassandracomar can you provide the steps so I can try to replicate this?
If you can't replicate it but still have the block causing this error, it would also help if you could send it to me privately so I can try to find what is causing the error.

@cassandracomar

We just used prometheus-operator 0.31.1 to deploy prometheus 2.10.0 with the thanos sidecar at v0.6.0, then deployed the other thanos components with this chart (https://github.com/kiwigrid/helm-charts/tree/master/charts/prometheus-thanos) at v0.6.0 on a brand new cluster and hit the same errors. I can send the block it's complaining about or I can send a script to replicate our setup on a fresh cluster, whatever is easier.

@krasi-georgiev
Contributor

'v0.6.0' - did you mean v0.7.0? I haven't looked at the changelogs, but I assume this should be fixed in 0.7.0, so we should try to replicate it with the 0.7 version, right?

@cassandracomar

I can try again with 0.7.0. Just got a report this morning that it happened with a fresh deploy of 0.6.0.

@krasi-georgiev
Contributor

I see, yeah, let's try to replicate with 0.7.0.

I think it would be easier for me to replicate with just the blocks, so once you get this error, ping me on Slack (thanos-dev) to send me those blocks and I will try to find the culprit.
