Retry on network failures (e.g. uploads) #318
Comments
Especially funny is that a single error during syncing metas causes the compactor to retry the WHOLE sync.
I can see from @TimSimmons' logs that it happens quite often. We should retry just the problematic thing.
This happens both for downloads and uploads (of compacted blocks) for me. I also see the same timeouts when uploading from the sidecars, so I think this issue applies to all components that communicate with the block store. With a large enough number of Thanos sidecars this issue can be quite bad: once you fall behind, the number of files uploaded/downloaded gets large, which means a higher chance of hitting the issue, which may put you even further behind, and so on.
Yup, exactly.
Add backoff retry for a single object storage request, except the Range and Iter methods. The error handler splits errors into net/http and others, and retries the request against the object storage for the former. Fixes thanos-io#318
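For context, a minimal sketch of a backoff retry that only re-issues requests on net/http timeouts (the helper name, attempt count, and backoff values are illustrative assumptions, not the actual patch):

```go
package objstore

import (
	"context"
	"net"
	"time"
)

// retryOnNetError runs op and retries it with exponential backoff, but only
// when the returned error looks like a transient network timeout. Any other
// error is returned immediately. Helper name, attempt count, and backoff
// values are illustrative assumptions, not the actual Thanos patch.
func retryOnNetError(ctx context.Context, op func(ctx context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxAttempts = 5

	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(ctx); err == nil {
			return nil
		}
		// Only network-level timeouts are considered retriable here.
		netErr, ok := err.(net.Error)
		if !ok || !netErr.Timeout() {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return err
}
```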
OK, this is interesting, as the S3 client really does have retries: https://sourcegraph.com/github.com/minio/minio-go@master/-/blob/api.go#L524:17 Maybe it's worth reaching out to them?
We double-checked, and retries are already implemented in the minio and GCS clients. For each client we need to double-check and add them if missing (per client).
@bwplotka this still seems to happen in v0.3.1. The behavior I see is that the timeout occurs (not exactly sure whether the retry is triggered within minio or not), but the compactor exits and restarts. I'd assume that on restart it cleans the compaction directory and effectively starts from 0 again.
@bwplotka we are observing Compactor is running without Setup:
Logs
thanos-compactor-1552487171-vwdk5
thanos-compactor-1552487171-8t5ff
thanos-compactor-1552487171-qg22k
thanos-compactor-1552487171-p8xvm
thanos-compactor-1552487171-gxbx5
thanos-compactor-1552487171-w4xdp
So this is essentially connected to the minio library. If you are getting timeouts, it seems like we should look at the reasons why. Are blocks too big? Is there any way we can adjust the minio library (https://github.com/minio/minio-go) to improve that? Retry is already in place; minio should handle retries. But if you are getting timeouts even for the retries... Not sure if masking your issue with another retry is a good solution here (:
One way is to actually grab a single bigger block that fails constantly and try uploading it on your own using
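For anyone wanting to try that manual upload, a rough sketch using the minio-go v6 client (endpoint, credentials, bucket name, and the local block path are placeholders, not a Thanos command):

```go
package main

import (
	"log"

	minio "github.com/minio/minio-go/v6"
)

func main() {
	// Placeholder endpoint and credentials - replace with your own.
	client, err := minio.New("s3.example.com", "ACCESS_KEY", "SECRET_KEY", true)
	if err != nil {
		log.Fatal(err)
	}

	// Upload one big file from the failing block directly, to see whether the
	// raw client hits the same timeout that Thanos reports. The block ID and
	// local path below are placeholders.
	n, err := client.FPutObject(
		"thanos-bucket",
		"01D84YH6M4MPG0JZ6M9C411B72/chunks/000001",
		"/var/thanos/compact/01D84YH6M4MPG0JZ6M9C411B72/chunks/000001",
		minio.PutObjectOptions{},
	)
	if err != nil {
		log.Fatalf("upload failed: %v", err)
	}
	log.Printf("uploaded %d bytes", n)
}
```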
I'll give that a try. I believe some of the directories from the compaction are large, > 100 GB. I'll have to do some digging.
@bwplotka in our case these are all different blocks each time. It does succeed, but at times after a handful of cronjob restarts caused by
See this issue here, but what's the point of retrying if the underlying client provider lib retries for us?
The only problem is when the library we use has this logic broken; in that case I think we should propagate the issue to them. Double retrying is not a solution.
Oh, sorry, hadn't seen this since it was closed. Could we rename the title to be a bit more generic, because this affects not only the compactor but the sidecar as well? :P Yes, I agree that this should be delegated to the underlying libraries that we use, but perhaps we could think of some even smarter solution, like double-checking what (if any) files were uploaded to remote storage and retrying only those files if they are still present on disk.
I would follow up on every issue with the underlying provider and make them better. If we are really hit by this we can still evaluate that bit, but in a perfect world (the open source world) we should not do it unless the provider states that. E.g. how can we even tell if the error is retriable? It does not make sense to always retry (500, 403, 404, etc.).
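To illustrate the retriability point, a hedged sketch of such a classification check (purely hypothetical; neither Thanos nor the provider libraries are claimed to implement it this way):

```go
package objstore

import (
	"net"
	"net/http"
)

// isRetriable reports whether a failed object storage request is worth
// retrying. Network timeouts and 5xx responses usually are; client errors
// such as 403 or 404 are not, since retrying them only masks a real problem.
// The exact status-code split is an illustrative assumption.
func isRetriable(err error, statusCode int) bool {
	if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
		return true
	}
	switch statusCode {
	case http.StatusInternalServerError, http.StatusBadGateway,
		http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	case http.StatusForbidden, http.StatusNotFound:
		return false
	}
	return false
}
```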
Related: #934
Just rolled back to the build with v0.20, will see how it works. For some reason I see uploaded blocks in a corrupted state (missing files; could be the index file or chunk files). Example of one block with the index file absent:
Example of logs from radosgw: [11/Apr/2019:01:03:40 +0000] "HEAD /thanos/01D84YH6M4MPG0JZ6M9C411B72/meta.json HTTP/1.1" 404 0 - Minio (linux; amd64) minio-go/v6.0.16 thanos-sidecar/0.3.2 (go1.12)
Also seeing a similar issue with the sidecar and compact, release v0.3.2, S3 provider.
Same issue with the RC release and a 400+ GB block upload. Compactor fails with "net/http: timeout await
Guys, can you make sure to mention:
Otherwise it is not very helpful ): Ideally we would like to focus on each provider separately.
This is also happening constantly to me, on S3, with version 0.3.1. I've spent some time today debugging the issue and I believe it might have been caused by #323: likely the 15-second timeout set in that PR is not enough for large blocks. I'm testing a custom version in which I've increased the timeout to 2 minutes (🤷♂️ 😄) and so far I haven't seen any issues in a couple of hours, where it used to fail every 5-10 minutes. I'll leave a few compacting processes running over the night and will report back tomorrow with the results.
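The change being tested presumably boils down to raising the response header timeout on the HTTP transport handed to the S3 client; a sketch under that assumption (how it is wired into the Thanos S3 provider is not shown, and the 2-minute value is just this experiment, not a recommendation):

```go
package s3

import (
	"net/http"
	"time"
)

// newTransport builds an HTTP transport with a longer ResponseHeaderTimeout
// than the 15s mentioned above, so large block uploads get more headroom
// before the request is aborted. How this is wired into the S3 client is an
// assumption for illustration.
func newTransport() *http.Transport {
	return &http.Transport{
		Proxy:                 http.ProxyFromEnvironment,
		IdleConnTimeout:       90 * time.Second,
		ResponseHeaderTimeout: 2 * time.Minute, // was 15s
	}
}
```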
All the processes are still working correctly after 12 hours.
I haven't experienced a single error in the last 4 days. @bwplotka I'd be happy to contribute a patch for the
The headers are the first thing sent. Admittedly the 10s we currently use is a bit low, but if you don't get them within two whole minutes I'd say you can safely assume they won't come later.
I wouldn't mind trying your approach, as I'm facing exactly the same issue. Could you share your changes? Thanks.
I see the S3 header timeout issue with 0.3.2 and 0.4.0. One thing I noticed when I was trying 0.4.0 is that when the timeout happens, the Thanos compactor process exits and needs to be restarted. With 0.3.2 it does not exit, and just loops and tries again. Is this an expected change in behaviour in 0.4.0?
#1094, which was merged before the 0.5.0 release, doesn't seem to fix it for us.
@Allex1 hi, would you like to try with the v0.10 RC?
@daixiang0 I haven't seen this error since upgrading to v0.8.1.
@bwplotka it seems we can close it safely.
Not critical, since the compactor just restarted and continued just fine, but it can be annoying.