
S3/GCS: Upload/Delete inconsistency (missing chunk file) #1331

Closed
0robustus1 opened this issue Jul 16, 2019 · 27 comments · Fixed by #1525

Comments

@0robustus1
Contributor

Thanos, Prometheus and Golang version used

  • thanos: improbable/thanos:v0.4.0
    • go: 1.12.4
  • prometheus: quay.io/prometheus/prometheus:v2.7.2
    • go: 1.11.5

What happened

We use an on-premise S3 store (configured in Thanos via type: S3 in the object store
config) that experienced availability issues for a 3 - 15 hour period (3 hours
of frequent connection issues, 15 hours of less frequent connection issues).
Multiple Thanos components (shipper/sidecar, compactor) experienced timeouts
while awaiting responses from the S3 service.

This resulted in the shipper/sidecar not writing complete blocks.
We observed the following types of partially written blocks:

  • block directory present, meta.json present, index and chunks directory missing.
  • block directory present, meta.json present, index present, chunks directory missing.
  • block directory present, meta.json present, index present, chunks directory present, but not all chunk files being present.

Those partially written (one might call them corrupted) blocks caused subsequent issues:

  • store components failed queries over any timespan that contained a partially written block
  • compactor component crashed with a non-zero exit code (log messages before the crash below)

What you expected to happen

  • sidecars do not give up on completing partially written blocks
  • compactors don't crash on partially written blocks (either skipping them or removing them outright)
  • stores don't fail queries on partially written blocks (e.g. by ignoring those blocks)

How to reproduce it (as minimally and precisely as possible):

We have been running into issues trying to build a minimal reproducible scenario.
It would seem that it should be enough to have blocks matching the criteria mentioned above:

  • block directory present, meta.json present, index and chunks directory missing.
  • block directory present, meta.json present, index present, chunks directory present, but not all chunk files being present.

When trying this, however, we ran into the situation that these blocks were not
picked up by the compaction plan (see here).

It seems, however, that once such a block is considered in the compaction plan and GatherIndexIssueStats is executed, the compactor will fail if the index file is not present. If chunks are missing, it will fail later in the Prometheus TSDB compaction code.
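
For anyone trying to reproduce this, here is a minimal sketch (not the reproduction we actually ran, and using a simplified stand-in for the objstore bucket interface) of how one could simulate the "meta.json present, chunks missing" state by deleting the chunk objects of an otherwise healthy uploaded block:

package repro

import (
	"context"
	"path"
)

// Bucket is a simplified stand-in for the objstore bucket interface used here:
// Iter calls f with the name of every object under dir, Delete removes one object.
type Bucket interface {
	Iter(ctx context.Context, dir string, f func(name string) error) error
	Delete(ctx context.Context, name string) error
}

// DropChunks deletes every object under <blockULID>/chunks/ while leaving
// meta.json and index in place, producing the partially written state
// described in the list above.
func DropChunks(ctx context.Context, bkt Bucket, blockULID string) error {
	var chunkObjs []string
	if err := bkt.Iter(ctx, path.Join(blockULID, "chunks"), func(name string) error {
		chunkObjs = append(chunkObjs, name)
		return nil
	}); err != nil {
		return err
	}
	for _, name := range chunkObjs {
		if err := bkt.Delete(ctx, name); err != nil {
			return err
		}
	}
	return nil
}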

Partial logs to relevant components

block directory present, meta.json present, index present, chunks directory present, but not all chunk files being present.

level=error
ts=2019-07-03T13:02:34.85314565Z caller=main.go:182
msg="running command failed"
err="error executing compaction: compaction failed: compaction failed for group 0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}: compact blocks [/var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DE9Q2EWM7Y5JZPB7V4MC4BBC /var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DE9XY64NVKB7Y62B9DH5A0AP /var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DEA4SXCNKCZ06C83Q45N6C2C /var/thanos/store/compact/0@{prometheus=\"XXX-prometheus-name-XXX\",prometheus_replica=\"prometheus-thanos-system-1\"}/01DEABNMMQFGXKXZMTKJF9T42Z]: write compaction: chunk 8 not found: reference sequence 0 out of range"

Currently this is the only log we have. Should we run into the issue again, I'll make sure to attach more logs from the other components and cases.

@0robustus1
Contributor Author

While investigating this we added the following detection mechanisms to a fork used for debugging: 0robustus1/thanos@orig-v0.4.0...0robustus1:partially-written-blocks

@0robustus1 changed the title from "Partially written blockscause components to misbehave and crash" to "Partially written blocks cause components to misbehave and crash" on Jul 16, 2019
@GiedriusS
Member

While investigating this we added the following detection mechanisms to a fork used for debugging: 0robustus1/thanos@orig-v0.4.0...0robustus1:partially-written-blocks

That's awesome! I wanted to do something like that myself a few months ago: my initial idea was to only check that the files in chunks/ are in ascending order and that no file is missing, but this is even better! Care to turn it into a pull request?

Either way, yes, we can definitely improve the error checking around this. This can happen because most remote object storage providers use multi-part uploads, and uploading an individual file can sometimes fail.

However, the newest versions of Thanos Compact are supposed to delete corrupt blocks that are older than a certain age (#1053). Have you tried that mechanism, and does it work?
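
For reference, here is a minimal sketch of that ascending-order check, run against a locally synced block directory (a hypothetical helper; it only assumes the TSDB convention of 1-based, zero-padded chunk file names such as chunks/000001):

package verify

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
)

// CheckChunksSequence verifies that the chunks/ directory of a block contains
// the files 000001, 000002, ... with no gaps in the sequence.
func CheckChunksSequence(blockDir string) error {
	entries, err := os.ReadDir(filepath.Join(blockDir, "chunks"))
	if err != nil {
		return fmt.Errorf("read chunks dir: %w", err)
	}

	seqs := make([]int, 0, len(entries))
	for _, e := range entries {
		n, err := strconv.Atoi(e.Name())
		if err != nil {
			return fmt.Errorf("unexpected chunk file name %q", e.Name())
		}
		seqs = append(seqs, n)
	}
	sort.Ints(seqs)

	// Chunk files are numbered consecutively starting at 1; any gap means a
	// chunk file is missing from the block.
	for i, n := range seqs {
		if n != i+1 {
			return fmt.Errorf("missing chunk file %06d in %s", i+1, blockDir)
		}
	}
	return nil
}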

@0robustus1
Contributor Author

I'll prepare a pull request.

Regarding thanos-compact: I think this wasn't applicable in our scenario, as "our" malformed blocks had the meta.json file present, and the mentioned PR only removes blocks that do not have a meta.json file, correct?

@bwplotka
Member

block directory present, meta.json present, index and chunks directory missing.
block directory present, meta.json present, index present, chunks directory missing.
block directory present, meta.json present, index present, chunks directory present, but not all chunk files being present.

How is that possible? We write meta.json at the end 🤔 What kind of unavailability did you see? On the read path or the write path for S3?

@bwplotka
Member

Also, doing fixes against 0.4.0 is risky, as the codebase has improved since that version.

@0robustus1
Contributor Author

During that incident we saw failures on both writes and reads (for example, both "exists" and "upload" operations).

I agree it's probably not best to add this as a fix for 0.4.0. I'll adjust the PR so that it targets current master rather than 0.4.0.

0robustus1 added a commit to 0robustus1/thanos that referenced this issue Jul 19, 2019
see thanos-io#1331

This enables us to identify partially written blocks that
observe these issues:

* missing index
* chunks referenced in index that aren't present in the chunks/ directory of the
  block
* chunks in chunks/ subdirectory of the block that aren't referenced from the
  index (this should not be possible)
@bwplotka
Member

bwplotka commented Aug 5, 2019

Potentially related to #1335

@bwplotka
Member

bwplotka commented Aug 5, 2019

During that incident we saw failures on both writes and reads (for example, both "exists" and "upload" operations).

This is rather unlikely, as we fail the upload and clean everything up on any upload failure. Since we upload meta.json at the end, if we spot a block without a meta.json that is older than X, we remove it, assuming it is malformed.

We use an on-premise S3 store

Is it strongly consistent? E.g. if you see a missing item (e.g. index only), is it only missing temporarily? Or are writes perhaps asynchronous?

@0robustus1
Contributor Author

Let me answer the questions posted in #1370 here:

Thanks for that. However, before getting into the full block scan / detection you propose, it would be good to fully understand what's wrong with your object storage. Even on full or partial unavailability, the worst that can happen is that an upload fails mid-way. Since meta.json is uploaded at the end, there is no possibility that you can have a missing index but an uploaded meta.json. Can we first try to figure out why that happens?

  • Is Prometheus filesystem unstable (e.g NFS?) in your case?

The Prometheus filesystem is a regular ext4 filesystem on a mounted PersistentVolume (in Kubernetes) originating from vSphere.
As such it is quite unlikely to be the cause of this issue (although I cannot prove that without the Prometheus data from the original incident).

  • Is somehow S3 object storage eventually consistent and there was some super long delay?

The S3 object storage is a Dell ECS cluster addressed via the S3 protocol. According to the documentation it has strong consistency. In any case, such a delay would have had to be longer than 48 hours (which was the retention of the Prometheus at the time), because the block was still partially written even days later.

AFAIK writes via the minio client Go library should be synchronous.

Regarding your questions posted above:

During that incident we saw failures on both writes and reads (for example, both "exists" and "upload" operations).

This is rather unlikely, as we fail the upload and clean everything up on any upload failure. Since we upload meta.json at the end, if we spot a block without a meta.json that is older than X, we remove it, assuming it is malformed.

It does sound rather unlikely, but that is what the following query was reporting:

increase(thanos_objstore_bucket_operation_failures_total{pod=~"$pod"}[1h])
/
increase(thanos_objstore_bucket_operations_total{pod=~"$pod"}[1h])

We use an on-premise S3 store

Is it strongly consistent? E.g. if you see a missing item (e.g. index only), is it only missing temporarily? Or are writes perhaps asynchronous?

Dell ECS claims the following: "Multi-site read/writes with strong consistency simplifies application development." (here). And the missing item wasn't temporary; it never appeared. The retention of the Prometheus was set to 48h, which suggests it would have had about 46 hours to upload the block successfully.

@infa-ddeore

We recently upgraded the Thanos components to 0.5.0; the compactor failed multiple times with the same error mentioned in this issue: write compaction: chunk 8 not found: reference sequence 0 out of range.

The chunks directory was absent. I can't reproduce this issue, but I'm mentioning it here because my issue could be similar.

aws s3 ls --recursive s3://xxxx-thanos-xxxx/01DGVP20F4Q19DPCQJ1YAPXZE2
2019-08-08 14:58:35   11304700 01DGVP20F4Q19DPCQJ1YAPXZE2/index
2019-08-08 14:58:38        605 01DGVP20F4Q19DPCQJ1YAPXZE2/meta.json

It's an AWS S3 bucket; there were 5-6 folders that I had to delete to recover the compactor pod.

@bwplotka
Member

bwplotka commented Aug 9, 2019

Cool.

So technically it should be impossible for the chunk folder upload to fail while meta.json is uploaded. This means that what you observe is a severe bug.

We can check our minio client and double-check our side. But is there any failure log on the sidecar, exactly for this block? If not, it clearly means that there is some bug in the minio client or the S3 server, and we need to track it down: either of them returns success for a failed operation, or it is not synchronous (we think it is synchronous but it is actually async, etc.).

For sure it does not happen on GCE.

@0robustus1, I think I like your #1370 here - we can learn more in those cases. The sidecar could possibly scan the supposedly uploaded block to double-check that all files are there. With this unknown on the interface, even that would not be enough: a file might be there but only partially uploaded. But it sounds really bad that we have to do that at all, given the success response.

AC:

  • Success from the objstore interface API has to mean a successful upload (!)
  • On the compactor side, we report partial uploads that sneaked through because of bugs like this with a better error.

@infa-ddeore you are right: write compaction: chunk 8 not found: reference sequence 0 out of range means that chunk files are missing. I think a better error would be nice no matter what, so I will try to review #1370.
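
As an illustration of the first AC point, here is a sketch (not the actual Thanos code; the Bucket interface below is a simplified stand-in for the objstore interface) of an upload helper that only reports success once the stored object has been confirmed to exist with the expected size:

package objstoreverify

import (
	"bytes"
	"context"
	"fmt"
	"io"
)

// Bucket is a simplified stand-in for the objstore bucket interface.
type Bucket interface {
	Upload(ctx context.Context, name string, r io.Reader) error
	ObjectSize(ctx context.Context, name string) (uint64, error)
}

// UploadAndVerify uploads an object and then checks its stored size against
// what was sent, so a nil return really means "the object is there and complete".
func UploadAndVerify(ctx context.Context, bkt Bucket, name string, data []byte) error {
	if err := bkt.Upload(ctx, name, bytes.NewReader(data)); err != nil {
		return fmt.Errorf("upload %s: %w", name, err)
	}
	size, err := bkt.ObjectSize(ctx, name)
	if err != nil {
		return fmt.Errorf("stat %s after upload: %w", name, err)
	}
	if size != uint64(len(data)) {
		return fmt.Errorf("upload %s: stored %d bytes, expected %d", name, size, len(data))
	}
	return nil
}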

@infa-ddeore

Here is the compactor error log and the AWS S3 folder contents; 01DGVP20F4Q19DPCQJ1YAPXZE2 didn't have a chunks folder:

level=debug ts=2019-08-08T10:08:24.009985183Z caller=compact.go:824 compactionGroup="0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}" msg="downloaded and verified blocks" blocks="[/var/thanos/store/compact/0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}/01DGV8AHZ48H4DV2N4FY2524QM /var/thanos/store/compact/0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}/01DGVF6973RT49QR0Y9X2YZ8ZD /var/thanos/store/compact/0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}/01DGVP20F4Q19DPCQJ1YAPXZE2]" duration=1.093229103s

level=error ts=2019-08-08T10:08:24.056159144Z caller=main.go:182 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}: compact blocks [/var/thanos/store/compact/0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}/01DGV8AHZ48H4DV2N4FY2524QM /var/thanos/store/compact/0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}/01DGVF6973RT49QR0Y9X2YZ8ZD /var/thanos/store/compact/0@{cluster_name=\"xxxxxxxxxx-yyyyyyy-zzzzzzz\",prometheus=\"lamm/monitoring-prometheus-prometheus\",prometheus_replica=\"prometheus-monitoringaddon-prometheus-prometheus-0\"}/01DGVP20F4Q19DPCQJ1YAPXZE2]: write compaction: chunk 8 not found: reference sequence 0 out of range"
$ aws s3 ls --recursive s3://xxxx-thanos-xxxxxx/01DGVP20F4Q19DPCQJ1YAPXZE2
2019-08-08 14:58:35   11304700 01DGVP20F4Q19DPCQJ1YAPXZE2/index
2019-08-08 14:58:38        605 01DGVP20F4Q19DPCQJ1YAPXZE2/meta.json

@bwplotka
Member

bwplotka commented Aug 9, 2019

Well, we are more interested in the sidecar error log for uploading this block (if any), TBH (:

@infa-ddeore

Sidecar logs are gone :-( I'll capture them next time if the issue occurs again.

@bwplotka
Member

bwplotka commented Aug 9, 2019

In fact, it could be a failed compactor upload as well -> but those use the same code path, so a failed upload log from compact would do as well. Since you don't have any, this confirms our theory mentioned here: #1331 (comment)

@bwplotka changed the title from "Partially written blocks cause components to misbehave and crash" to "S3: Upload of chunk fails despite of success response on objstore.Upload interface" on Aug 9, 2019
@caarlos0

@infa-ddeore did you find any workaround to make compaction work again?

@infa-ddeore

@infa-ddeore did you find any workaround to make compaction work again?

The compactor worked after deleting the blocks from S3. The same issue didn't appear again, so I couldn't gather more data.

Here is a script you can use to check for blocks with a missing chunks folder:

bucket=thanos_bucket_name
# List every object in the bucket, excluding the debug/ prefix.
data=$(aws s3 ls --recursive "s3://$bucket/" | grep -v debug)

# For each top-level block ULID, print a delete command if no chunks/ object exists.
# Note: the rm commands are only echoed (dry run); run them yourself after review.
for pattern in $(echo "$data" | awk '{print $NF}' | awk -F/ '{print $1}' | sort | uniq); do
    if ! grep "$pattern/chunks" <<< "$data" >/dev/null; then
        echo aws s3 rm --recursive "s3://$bucket/$pattern/"
    fi
done

@caarlos0

Got it. Did you delete only the blocks it complained about, or everything?

@infa-ddeore

The compactor gave the error mentioned in my comment above. I deleted that particular block, which had the chunks folder missing; the compactor then stopped again, complaining about another block, so I ran the bash script given above to find all blocks with a missing chunks folder and deleted them in a single shot!

So to answer your question, I deleted only the blocks with a missing chunks folder, not everything. This will leave missing metrics in the graph, since it is data deletion, but I didn't have any choice.

You can run thanos bucket inspect before deleting the blocks; that will show how many metrics will be deleted with each block.

@caarlos0

Cool, yeah, I just deleted the block with the missing chunks and it seems to be running now

thanks :)

@infa-ddeore

infa-ddeore commented Aug 12, 2019

Cool, yeah, I just deleted the block with the missing chunks and it seems to be running now

thanks :)

If you can provide more details about what was missing in the bad block folder, along with the sidecar logs that show information about that block, that would help us gather more information regarding this issue.

@caarlos0

I don't have the logs anymore, but it was missing the entire chunks folder :(

@PsychoSid

It won't be much help, but for the record: I experienced this exact issue on one of our setups with 0.6.0 and recovered in the same way.

@infa-ddeore

infa-ddeore commented Aug 29, 2019

I got some more S3 folders with missing chunks (one is missing the index as well).

The sidecar logs don't show any errors:

$ k -n monitoring logs prometheus-monitoring-prometheus-prometheus-0 thanos-sidecar | grep -E '01DJPFF3NKPJSTXRTVQ0HSSJJR|01DJR6CXNGS8ZE39ZJBHCE2TAY|01DJRV03DKQXCFKBN6H6K16S1X|01DK1DN5D9VD2TSV7QBWY4YDHE'
level=info ts=2019-08-29T00:33:17.660202594Z caller=shipper.go:350 msg="upload new block" id=01DJPFF3NKPJSTXRTVQ0HSSJJR
level=info ts=2019-08-29T00:33:33.290836054Z caller=shipper.go:350 msg="upload new block" id=01DJR6CXNGS8ZE39ZJBHCE2TAY
level=info ts=2019-08-29T00:33:39.554012246Z caller=shipper.go:350 msg="upload new block" id=01DJRV03DKQXCFKBN6H6K16S1X
level=info ts=2019-08-29T00:35:06.380745185Z caller=shipper.go:350 msg="upload new block" id=01DK1DN5D9VD2TSV7QBWY4YDHE

S3 is missing chunks, and one block is missing the index:

$ for f in 01DJPFF3NKPJSTXRTVQ0HSSJJR 01DJR6CXNGS8ZE39ZJBHCE2TAY 01DJRV03DKQXCFKBN6H6K16S1X 01DK1DN5D9VD2TSV7QBWY4YDHE; do echo --$f---; aws s3 ls s3://thanos-bucket/$f/; done
--01DJPFF3NKPJSTXRTVQ0HSSJJR---
2019-08-29 06:03:20        604 meta.json
--01DJR6CXNGS8ZE39ZJBHCE2TAY---
2019-08-29 06:03:35   13833572 index
2019-08-29 06:03:36        604 meta.json
--01DJRV03DKQXCFKBN6H6K16S1X---
2019-08-29 06:03:42   13796100 index
2019-08-29 06:03:42        604 meta.json
--01DK1DN5D9VD2TSV7QBWY4YDHE---
2019-08-29 06:05:08   13816104 index
2019-08-29 06:05:26        604 meta.json

@bwplotka do you want me to check anything more?

One more observation: the pod restart time matches the S3 folder's timestamp; it looks like something gets messed up during the restart.

prometheus-monitoring-prometheus-prometheus-0       4/4     Running   0          151m

@bwplotka changed the title from "S3: Upload of chunk fails despite of success response on objstore.Upload interface" to "S3/GCS: Handling inconsistency (missing chunk file)" on Sep 16, 2019
@bwplotka
Member

Additional observations (maybe an unrelated issue) suggest temporary inconsistency (cc @SuperQ): https://cloud-native.slack.com/archives/CK5RSSC10/p1568459542001900

$ gsutil ls -la gs://gitlab-gprd-prometheus/01DKEQC65JRR5M2CMZYZ1J1HGZ/chunks/000001
 536870377  2019-09-03T17:02:50Z  gs://gitlab-gprd-prometheus/01DKEQC65JRR5M2CMZYZ1J1HGZ/chunks/000001#1567530170397076  metageneration=1
TOTAL: 1 objects, 536870377 bytes (512 MiB)
$ gsutil ls -l gs://gitlab-gprd-prometheus/01DKEQC65JRR5M2CMZYZ1J1HGZ/chunks/000001
CommandException: One or more URLs matched no objects.

@bwplotka changed the title from "S3/GCS: Handling inconsistency (missing chunk file)" to "S3/GCS: Upload inconsistency (missing chunk file)" on Sep 16, 2019
@bwplotka
Member

I think I found the bug guys.

The problem is most likely here:

// DeleteDir removes all objects prefixed with dir from the bucket.

Thanos is resilient to partial uploads in most cases. This relies on the small meta.json file: if it's present and the block is older than X minutes, we consider the block ready to be used. The delay is there for eventually consistent buckets. If there is no meta.json after X minutes, we assume it is a partial upload and the compactor removes the block.

Now, this works well because we always upload meta.json at the end. However, we don't do deletions in the proper order. In the linked code we delete in lexicographical order, which means chunks go first, then index, then meta.json. If we restart the compactor or sidecar in the middle of this, we end up with a choked compactor.

Fixing this now.
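
For illustration only, here is a sketch of the ordering idea (not the actual #1525 change; the Bucket interface below is a simplified stand-in for the objstore interface): delete meta.json first, so that an interrupted deletion leaves a block without meta.json, which the compactor already treats as a partial upload and cleans up later.

package blockdelete

import (
	"context"
	"path"
)

// Bucket is a simplified stand-in for the objstore bucket interface used here:
// Iter calls f with the name of every object under dir, Delete removes one object.
type Bucket interface {
	Iter(ctx context.Context, dir string, f func(name string) error) error
	Delete(ctx context.Context, name string) error
}

// DeleteBlock removes meta.json first and only then the remaining objects.
// If the process dies halfway through, the leftover block has no meta.json,
// so it looks like a partial upload instead of a block with meta.json but
// missing chunks, which is what chokes the compactor.
func DeleteBlock(ctx context.Context, bkt Bucket, blockDir string) error {
	if err := bkt.Delete(ctx, path.Join(blockDir, "meta.json")); err != nil {
		return err
	}

	// Collect the remaining object names, then delete them one by one.
	var names []string
	if err := bkt.Iter(ctx, blockDir, func(name string) error {
		names = append(names, name)
		return nil
	}); err != nil {
		return err
	}
	for _, name := range names {
		if path.Base(name) == "meta.json" {
			continue // already removed above
		}
		if err := bkt.Delete(ctx, name); err != nil {
			return err
		}
	}
	return nil
}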

This also means that NONE of the blocks that were blocking the compactor were important, i.e. removing them should drop no metrics overall. Let me know if that makes sense (:

And thanks for all the reports that helped us identify the problem - especially #1331 (comment)

bwplotka added a commit that referenced this issue Sep 16, 2019
Fixes #1331

Problem that we are fixing is explained in the linked issue.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Sep 16, 2019
Fixes #1331

Problem that we are fixing is explained in the linked issue.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@bwplotka
Member

Fix: #1525

bwplotka added a commit that referenced this issue Sep 16, 2019
Fixes #1331

Problem that we are fixing is explained in the linked issue.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Sep 16, 2019
…1525)

Fixes #1331

Problem that we are fixing is explained in the linked issue.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@bwplotka changed the title from "S3/GCS: Upload inconsistency (missing chunk file)" to "S3/GCS: Upload/Delete inconsistency (missing chunk file)" on Sep 17, 2019
wbh1 pushed a commit to wbh1/thanos that referenced this issue Sep 17, 2019
…hanos-io#1525)

Fixes thanos-io#1331

Problem that we are fixing is explained in the linked issue.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
brancz pushed a commit to brancz/objstore that referenced this issue Jan 28, 2022
…(#1525)

Fixes thanos-io/thanos#1331

Problem that we are fixing is explained in the linked issue.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>