tq: teach Batch() API to retry itself after io.EOF's #2516
Conversation
I think we should limit the number of retries to
I think all idempotent requests should be repeated. AFAIK pretty much all LFS requests are idempotent, but I don't have enough knowledge about the Git LFS server API to know for sure.
Agreed, use consistent rules to retry requests (such as
Unfortunately I cannot verify this fix as I always run into #2439 😢
@larsxschneider @technoweenie I just updated this PR with some new changes, and I would love to get both of your feedback on the changes. I think that it is ready to merge. Here's what changed:
c.Do(lfsapi.WithRetries(req, n))
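
For readers following along, here is a minimal, self-contained sketch of how a `WithRetries`/`Retries` pair could carry the count on the request itself via its context. The lowercase names, the context key, and the example URL below are illustrative assumptions, not the actual `lfsapi` implementation.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// contextKey and retriesKey are hypothetical; the real lfsapi package may
// store the value differently.
type contextKey string

const retriesKey contextKey = "lfsapi.retries"

// withRetries annotates req with the number of attempts the client may make.
func withRetries(req *http.Request, n int) *http.Request {
	return req.WithContext(context.WithValue(req.Context(), retriesKey, n))
}

// retries reads the annotation back; 0 means the caller never opted in and
// the client should make a single attempt, as before.
func retries(req *http.Request) int {
	if n, ok := req.Context().Value(retriesKey).(int); ok {
		return n
	}
	return 0
}

func main() {
	req, _ := http.NewRequest("POST", "https://example.com/objects/batch", nil)
	fmt.Println(retries(req))                 // 0: caller never opted in
	fmt.Println(retries(withRetries(req, 8))) // 8: client may re-issue the request
}
```

The point of the context-based shape is that callers who never call the helper see exactly the old single-attempt behavior.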
FYI @pluehne is testing this right now 👍
We merged this PR into master and ran tests on Linux and macOS. On both systems we saw the following problem:
}

	c.traceResponse(req, tracedReq, nil)
}
I think this is where the potentially nil `res` can come from. If enough retries fail, `res` is never set.
Is this the right place to put the retry logic though? How would it interfere with the transfer adapter retries?
> Is this the right place to put the retry logic though? How would it interfere with the transfer adapter retries?
Good question -- there shouldn't be any interference as-is, since clients of the `lfsapi` package have to explicitly opt in to more than one request per `Do()` call. Since `tq` manages its own retries, omitting `WithRetries()` is sufficient to keep behavior the same.
lfsapi/client.go (Outdated)

var res *http.Response

for i := 0; i < retries; i++ {
`retries` might be a bit misleading here. The first "retry" is the actual request, no? Therefore `retries` should always be greater than `0`, no?
Good point -- I like the name `retries`, but I think the current interpretation of it is a little bit off. Let's treat `Retries` as the number of additional requests to make for a failed request/response cycle, and instead calculate `retries` (lowercase `r`) as:
retries := tools.MaxInt(0, Retries(req)) + 1
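
As an illustration of that interpretation, here is a small, self-contained sketch of such a loop (the helper names are hypothetical, not the actual `lfsapi` code). It also shows the earlier point about `res`: if every attempt fails, no response is ever assigned and the caller gets back `nil`.

```go
package main

import (
	"fmt"
	"net/http"
)

// retriesFor mirrors the proposed calculation: "Retries" counts additional
// attempts, so the total number of attempts is always at least one.
func retriesFor(configured int) int {
	if configured < 0 {
		configured = 0
	}
	return configured + 1
}

// doWithRetries sketches the loop under that interpretation. If every
// attempt fails, res is never set and the caller receives nil -- the hazard
// flagged above about tracing a potentially nil response.
func doWithRetries(c *http.Client, req *http.Request, configured int) (*http.Response, error) {
	var (
		res *http.Response
		err error
	)
	attempts := retriesFor(configured)
	for i := 0; i < attempts; i++ {
		if res, err = c.Do(req); err == nil {
			return res, nil
		}
	}
	return nil, err
}

func main() {
	// A request to an unroutable port, purely to exercise the failure path.
	req, _ := http.NewRequest("GET", "http://127.0.0.1:1/", nil)
	if _, err := doWithRetries(http.DefaultClient, req, 2); err != nil {
		fmt.Println("all attempts failed:", err)
	}
}
```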
@@ -57,7 +58,7 @@ func (c *tqClient) Batch(remote string, bReq *batchRequest) (*BatchResponse, err
 	tracerx.Printf("api: batch %d files", len(bReq.Objects))

 	req = c.LogRequest(req, "lfs.batch")
-	res, err := c.DoWithAuth(remote, req)
+	res, err := c.DoWithAuth(remote, lfsapi.WithRetries(req, c.MaxRetries))
Something is wrong here. `c.MaxRetries` seems to be always `0`.
@ttaylorr Any idea about this? `c.MaxRetries` is still `0` for me with the latest change, although I have configured `lfs.transfer.maxretries=10`. Any idea why?
@ttaylorr @technoweenie Any update here? Can I help with something?
@larsxschneider I think I found the spot where
Here are some builds that include 7722c2f for you to try: git-lfs-darwin-386-2.3.0-pre.tar.gz
@@ -57,7 +58,7 @@ func (c *tqClient) Batch(remote string, bReq *batchRequest) (*BatchResponse, err
 	tracerx.Printf("api: batch %d files", len(bReq.Objects))
@larsxschneider I am unable to reproduce this issue locally:
~/D/repo (master!) $ GIT_TRACE=1 git lfs push --all origin master
# ...
trace git-lfs: tq: sending batch of size 100
trace git-lfs: api: batch 100 files (8 retries)
Are you sure that you don't have any extra `lfs.transfer.maxretries` entries lying around anywhere?
You are right. I see `trace git-lfs: api: batch 1 files (retries 10)` 👍
@@ -40,6 +40,9 @@ func (m *Manifest) ConcurrentTransfers() int {
 }

 func (m *Manifest) batchClient() *tqClient {
 	if r := m.MaxRetries(); r > 0 {
 		m.tqClient.MaxRetries = r
 	}
I am curious: why don't we do...
m.tqClient.MaxRetries = m.MaxRetries()
... here?
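
For what it's worth, one possible reading of the guard, sketched below under the assumption that `MaxRetries()` returns `0` when `lfs.transfer.maxretries` is unset: the `r > 0` check keeps the client's existing default instead of clobbering it with `0`. All names in the snippet are stand-ins, not the actual `tq` code.

```go
package main

import "fmt"

// configuredRetries stands in for Manifest.MaxRetries(); returning 0 models
// an unset (or invalid) lfs.transfer.maxretries value. Hypothetical.
func configuredRetries() int { return 0 }

func main() {
	const defaultMaxRetries = 1 // assumed client default: one attempt

	// (a) Unconditional assignment: an unset config value overwrites the
	// default with 0, which would have to be clamped back up later.
	unconditional := configuredRetries()

	// (b) Guarded assignment, as in the diff above: the default survives
	// unless the user configured a positive value.
	guarded := defaultMaxRetries
	if r := configuredRetries(); r > 0 {
		guarded = r
	}

	fmt.Println("unconditional:", unconditional, "guarded:", guarded) // 0 vs. 1
}
```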
The retry logic looks good to me. However, I haven't tested this in production yet.
@ttaylorr @technoweenie Woohoo! I wasn't able to recreate the issue with this 🎉 👍
This pull request resolves an issue pointed out by @larsxschneider and @terrorobe in #2314, wherein a mysterious `io.EOF` error would appear after issuing a batch API request. An EOF can occur given the following conditions:

To address this, we treat `io.EOF`'s as unexpected, and retry the request until it is successful, or until a non-`io.EOF` error is returned.

I'd like to discuss a few other potential improvements:

- `net/http.Hijacker`, a type that will allow us access to the TCP connection underlying a given `net/http.(*Request)` instance, is not implemented by `net/http/httptest`. I see a few alternatives: we could implement a custom listener that resolves this by embedding the `net.Conn` in the request's context. We could also write our own server that does implement `http.Hijacker`.

I would love recommendations on the above.
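
To make the custom-listener alternative concrete, here is a rough sketch (all names hypothetical, not a proposal for the final test) of a listener wrapper that remembers accepted connections, so a test can slam one shut and make the client observe the kind of abrupt EOF this pull request retries.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
)

// trackingListener wraps a net.Listener and remembers every connection it
// accepts, so a test can close them out from under the HTTP server.
type trackingListener struct {
	net.Listener

	mu    sync.Mutex
	conns []net.Conn
}

func (l *trackingListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err == nil {
		l.mu.Lock()
		l.conns = append(l.conns, c)
		l.mu.Unlock()
	}
	return c, err
}

// CloseAll abruptly closes every tracked connection.
func (l *trackingListener) CloseAll() {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, c := range l.conns {
		c.Close()
	}
}

func main() {
	var tl *trackingListener

	// The handler never writes a response; instead it drops the underlying
	// TCP connection, so the client reads an unexpected EOF.
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tl.CloseAll()
	}))
	tl = &trackingListener{Listener: srv.Listener}
	srv.Listener = tl
	srv.Start()
	defer srv.Close()

	_, err := http.Get(srv.URL)
	fmt.Println("client saw:", err) // typically reports an (unexpected) EOF
}
```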
Closes: #2314.
/cc @git-lfs/core
/cc @technoweenie @larsxschneider @terrorobe for specific thoughts