Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store: Prevent EOF errors in Azure objstore #1469

Merged
merged 1 commit into from
Sep 20, 2019

Conversation

wbh1
Copy link
Contributor

@wbh1 wbh1 commented Aug 28, 2019

closes #1466

Changes

Make Azure objstore's GetRange smarter by recognizing when the requested size will be greater than the remaining length of the file to be downloaded. This prevents EOF errors described in #1466

Verification

Verified compaction, initial download, and querying the store node all work (tested against Azure Blob with ~400GB of data)

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm,

Thanks for digging this issue out, but I think this is something different.

Wonder if this is touching this problem: #1331

or Azure being eventual consistent?

pkg/objstore/azure/azure.go Outdated Show resolved Hide resolved
if length > 0 {
// If a length is specified and it won't go past the end of the file,
// then set it as the size
if length > 0 && length <= props.ContentLength()-offset {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look right.

AFAIK there is no known case in immutable system that Thanos would ask for something that does not exists unless TSDB block is malformed (malformed index, index pointing to not existsing chunks). Is that the case?

Even in this case this change would delegate error to Thanos which would see wrong result from Azure anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have run bucket verify over all of my buckets and it reported no errors. That should catch most index issues, right?

These are the logs I'm producing:

level=debug ts=2019-08-28T12:48:46.576445Z caller=azure.go:198 msg="set size to length" contentlength=38275251 size=18794 length=18794 offset=38272364 name=01DKBY3R34HCNHK0Q190CZD8NN/chunks/000001
level=debug ts=2019-08-28T12:48:46.576436Z caller=azure.go:198 msg="set size to length" contentlength=163910930 size=23419 length=23419 offset=163903417 name=01DKBZZ6YZCW1NSM6WMCZZJ37J/chunks/000001
level=error ts=2019-08-28T12:48:46.609712Z caller=bucket.go:608 msg="error preloading samples" err="read range for 0: get range reader: cannot download blob, address: https://XXXXXXX.blob.core.windows.net/thanos-snap/01DKBZZ6YZCW1NSM6WMCZZJ37J/chunks/000001: unexpected EOF"
level=error ts=2019-08-28T12:48:46.657179Z caller=bucket.go:608 msg="error preloading samples" err="read range for 0: get range reader: cannot download blob, address: https://XXXXXXX.blob.core.windows.net/thanos-snap/01DKBY3R34HCNHK0Q190CZD8NN/chunks/000001: unexpected EOF"
level=debug ts=2019-08-28T12:48:46.65731Z caller=bucket.go:806 msg="stats query processed" stats="&{blocksQueried:4 postingsTouched:0 postingsTouchedSizeSum:0 postingsToFetch:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:0 seriesTouchedSizeSum:0 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:118210554 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:2701}" err=null

When reading 01DKBZZ6YZCW1NSM6WMCZZJ37J/chunks/000001, for example, the Blob content length is 163,910,930 and the offset is 163,903,417, so there should only be 7,513 bytes left to read but azure.go requests from [163,903,417:(163,903,417+23,419)] which then results in an EOF.

The source of the call is chunkr.preload(samplesLimiter) in the blockSeries function called by func (s *BucketStore) Series in bucket.go.

I'll look more into how the partitioning logic is working in func (r *bucketChunkReader) preload(samplesLimiter *Limiter) today.

Copy link
Contributor Author

@wbh1 wbh1 Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Minio also returns an EOF rather than trying to smartly just send back data up to the end of the file, which makes me think we should handle this in Thanos rather than client lib.

if size > 0 && err == io.ErrUnexpectedEOF {
    // If an EOF happens after reading some but not
    // all the bytes ReadFull returns ErrUnexpectedEOF
    err = io.EOF
}

None of the function calls that I traced my way through seemed to care about any stats from meta.json or anything else when making start/end points for chunks. So, I'm not sure how the other objstores don't run into this issue, too... I would think it's a malformed TSDB block except for the fact that when I run with my changes then I'm actually able to see the data in Thanos Query.

Also, interestingly, I am now able to observe this on other metrics (not just the one that is last) e.g. up where Thanos still displays data because it is able to load 1 or more blocks, but several others fail resulting in gaps sometimes. Maybe other users haven't observed this because errors from blockSeries aren't being logged? (I had to add my own logging for it to find the EOF issue)

Example screenshots:
[v0.6.1]
Screen Shot 2019-08-28 at 10 15 30 AM

[with my changes]
Screen Shot 2019-08-28 at 10 16 35 AM

Thoughts @bwplotka ?

[EDIT]
Thanks to @povilasv, who just fixed the thing I mentioned about run.Group not returning errors from blockSeries in #1468 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @bwplotka, just pinging you for another look.

btw, congrats on the Redhat move! 😄

Copy link

@devalexx devalexx Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka I can confirm that it works for me too

@wbh1 wbh1 marked this pull request as ready for review September 17, 2019 17:20
@wbh1 wbh1 force-pushed the fix-azure-EOF branch 2 times, most recently from 4fdf55d to 3ea2b05 Compare September 17, 2019 20:26
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. It might look indeed like some "over" fetch that instead of returning whatever you have actually returned EOF and nothing (?).

As per https://github.com/thanos-io/thanos/pull/1469/files#r318607547 I might think we actually rather check before the call and trim it to file size we touch.. It's interesting why we try to fetch +23,419 when only 7,513 are left?

Did you maybe get into what's missing here? I think we now exact ranges we ask for (thanks to index), so this is quite unlikely to happen ):

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think you are right. We can easily overfetch for chunks:

parts := r.block.partitioner.Partition(len(offsets), func(i int) (start, end uint64) {
			return uint64(offsets[i]), uint64(offsets[i]) + maxChunkSize
		})

Nicely spotted @wbh1!

It would be nice to cap it to file size, but our obj storage does interface not allow to do that. ): The unexpectedEOF logic might work but worried that we will have it in truly "failed scenarios" like connection being dropped suddenly. In this case, we will also hit this case which is now a fail not "an ok" think. Let's think how we can fix this tomorrow.

It's interesting only Azure is hitting this - since Azure gives length it might be indeed a good fix here (:

So I'm happy to merge this (:

@bwplotka bwplotka merged commit 973876f into thanos-io:master Sep 20, 2019
@wbh1 wbh1 deleted the fix-azure-EOF branch September 20, 2019 12:24
GiedriusS pushed a commit that referenced this pull request Oct 28, 2019
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

store: Azure hits EOF when reading last metric in a chunk
3 participants