
objstore: Azure: unusually high number of GetBlobProperties calls with ClientOtherError/404 responses #6412

Open
thewisenerd opened this issue Jun 4, 2023 · 10 comments


@thewisenerd

Thanos, Prometheus and Golang version used: thanos=0.26.0, quay.io images

Object Storage Provider: Azure

What happened:
An ever-increasing number of GetBlobProperties calls month over month, with most of them (95%+) resulting in ClientOtherError/404 responses.

What you expected to happen:
A far lower number of GetBlobProperties calls.

How to reproduce it (as minimally and precisely as possible):

  • configure objstore/Azure on a storage account with "Enable Hierarchical namespace" enabled
  • wait for the compactor to delete block directories after compaction
  • wait for the compactor's next run (and every subsequent run) to report these directories under found partially uploaded block and deleted aborted partial upload

Full logs to relevant components:

level=info ts=2023-06-04T14:03:08.553692356Z caller=clean.go:49 msg="found partially uploaded block; marking for deletion" block=01GXX4BC14YSF36NA9G2210XAE
level=info ts=2023-06-04T14:03:08.658578215Z caller=clean.go:59 msg="deleted aborted partial upload" block=01GXX4BC14YSF36NA9G2210XAE thresholdAge=48h0m0s

Anything else we need to know:

Notes from our internal investigation:

  • if “Enable Hierarchical namespace” is enabled, the block directory does not get removed completely
  • the directories {ulid}/chunks/ and {ulid}/ remain even after deletion of all files within {ulid}/
  • these end up in BestEffortCleanAbortedPartialUploads on the next run due to a missing {ulid}/meta.json
  • BestEffortCleanAbortedPartialUploads is unable to delete these directories, since deleteDirRec invokes (b *Bucket) Iter and does not attempt to delete the directory entry itself (see the sketch after this list)
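For illustration, here is a minimal Go sketch of the deletion pattern described in the last bullet, written against the public objstore.Bucket interface. It is not the actual deleteDirRec code and the function name is made up; the point is that every object under the prefix gets a Delete call, but the directory entries never do, which is harmless on a flat namespace but leaves {ulid}/ and {ulid}/chunks/ behind when Hierarchical namespace is enabled.

```go
package example

import (
	"context"
	"strings"

	"github.com/thanos-io/objstore"
)

// deleteDirSketch mirrors the shape of an Iter-based recursive delete:
// every object listed under dir is deleted, but no Delete is ever issued
// for the directory entry itself. On a flat-namespace account that is
// enough, because "directories" are just name prefixes; with Hierarchical
// namespace enabled they are real entries and therefore survive.
func deleteDirSketch(ctx context.Context, bkt objstore.Bucket, dir string) error {
	return bkt.Iter(ctx, dir, func(name string) error {
		if strings.HasSuffix(name, objstore.DirDelim) {
			// Recurse into the sub-"directory"; note there is no
			// bkt.Delete(ctx, name) for the directory entry.
			return deleteDirSketch(ctx, bkt, name)
		}
		return bkt.Delete(ctx, name)
	})
}
```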
@thewisenerd
Author

I realize the Thanos version is quite old, and that the objstore module split and SDK upgrade (0.29.0+) have happened since; however, please do not ask me to upgrade to 0.29.0 to check whether that fixes the issue, as we are not in a position to do that currently.

I can attempt to set up Thanos locally and see if I can reproduce the issue on 0.29.0+, but no guarantees on when I can get back with the results.

@ahurtaud
Contributor

ahurtaud commented Jun 5, 2023

Hello, we are not using "hierarchical namespace" here, and we are on the latest Thanos version.
The block ULIDs get removed properly by the compactor; however, I think the 404 is the only way the compactor can check whether deletion-mark.json exists for each block ULID.

We are not considering this an error. Also, this issue should be moved to https://github.com/thanos-io/objstore
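As a rough illustration of the point above (names are illustrative, this is not the actual fetcher code): the only way for a client to learn whether a marker object exists is to ask for it, and on Azure an existence check translates to a GetBlobProperties (HEAD) request that returns 404 for every block without a deletion-mark.json.

```go
package example

import (
	"context"
	"path"

	"github.com/thanos-io/objstore"
)

// hasDeletionMark sketches the per-block marker check: Exists issues a HEAD
// (GetBlobProperties on Azure) for {ulid}/deletion-mark.json. For every block
// that has no marker the provider answers 404, which the client maps to
// (false, nil) but the storage account still counts as a ClientOtherError.
func hasDeletionMark(ctx context.Context, bkt objstore.BucketReader, blockULID string) (bool, error) {
	return bkt.Exists(ctx, path.Join(blockULID, "deletion-mark.json"))
}
```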

@bck01215

We also get these errors, at an apparently higher rate. We are on the latest Thanos version, 0.32.5.

These errors have been around since before I came on board, but to my knowledge they never caused a noticeable impact.

@thewisenerd
Author

@bck01215 any comment on whether the Azure storage account has "Hierarchical namespace" enabled?

@bck01215

@bck01215 any comment on whether the Azure storage account has "Hierarchical namespace" enabled?

That is a difference. I do not have that enabled.

@Tiduster

Hi all,

@bck01215: we have the same issue, with currently 3.8M+ calls per month on Azure Storage.
This is costing us almost 1k€/month just in "GetBlobProperties" API calls.

"never caused a noticeable impact to my knowledge"

Can you look at your storage cost to see if you have the same issue?

Best regards,

@bck01215

@bck01215: we have the same issue, with currently 3.8M+ calls per month on Azure Storage. This is costing us almost 1k€/month just in "GetBlobProperties" API calls.

"never caused a noticeable impact to my knowledge"

Can you look at your storage cost to see if you have the same issue?

@Tiduster Unfortunately, I don't have access to our billing info. We're in the process of migrating to an on-prem S3 server. Before that, I tried increasing the timeouts in the HTTP configs, but that did not resolve the issue.

@bck01215

bck01215 commented Jan 23, 2024

After reaching out to the billing team, I confirmed that the failed requests are affecting our billing. I was also able to confirm the source: block.BaseFetcher (caller=fetcher.go:487). This only seems to be occurring from the store (every 3 minutes) and the compactor.

After turning on verbose logging, it looks like all of these requests are for deletion-mark.json. This seems to confirm what @ahurtaud is saying. The downside seems to be a huge uptick in costs due to failed requests. I'm unsure why my rate of failures would be so much higher than his, however; perhaps he scheduled his compactor to run less frequently?
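To put the observed request volume into perspective, here is a hedged back-of-the-envelope helper. The block count in the comment is invented, and real numbers also include meta.json probes, other components, and retries.

```go
package example

import "time"

// expected404sPerMonth estimates the 404 volume from a single component,
// assuming one deletion-mark.json existence check per block per sync and one
// 404 per block that has no marker. Treat it as a lower bound: meta.json
// probes, other components, and retries add to it.
func expected404sPerMonth(blocksWithoutMarker int, syncInterval time.Duration) int {
	syncsPerMonth := int((30 * 24 * time.Hour) / syncInterval)
	return blocksWithoutMarker * syncsPerMonth
}

// Example: with a hypothetical 250 marker-less blocks and the 3-minute store
// sync mentioned above, 250 * 14400 = 3.6M 404 responses per month from that
// component alone.
```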

@6fears7

6fears7 commented Jan 23, 2024

It seems like #2565 explored the 404s related to deletion-mark.json as well, with users pointing to how Azure's SDK handles the error notification internally; however, that addresses the symptom rather than the underlying issue of very high GetBlobProperties volume for deletion checks.

@Tiduster

Thank you very much, @bck01215, for verifying your cost figures. We have experienced an exponential cost increase over the past few months on our end.

Here's what we've done:

  • We followed this tutorial to enhance the performance of the compactor: https://thanos.io/tip/operating/compactor-backlog.md/
  • We noticed that if the compactor lags, it retries a significant number of queries in the storage account, which substantially increases the overall cost.
  • We conducted a thorough purge of the folder and discovered orphan chunks within the storage account that did not comply with our retention policy.
  • We also upgraded our stack to the latest Thanos version; previously, we were using version 0.29.
  • After the cleanup and the increase in the compactor's computing capacity, we observed a significant decrease in cost (and API calls). We went from spending 25€ per day to just 0.8€.
  • We are now monitoring several metrics to determine whether the cost will escalate again, as we are attempting to extend our retention duration (a query sketch follows below).
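For the metric-watching part, a minimal sketch of a Prometheus query over the objstore bucket metrics (the Prometheus address is a placeholder, and the metric name should be verified against what your deployment actually exposes):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Failed object-storage operations per operation type over the last hour,
	// using the failure counter exposed by the objstore instrumented bucket.
	query := `sum by (operation) (rate(thanos_objstore_bucket_operation_failures_total[1h]))`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```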
