
tq: teach Batch() API to retry itself after io.EOF's #2516

Merged: 15 commits from batch-retry-eof into master on Sep 13, 2017

Conversation

@ttaylorr (Contributor)

This pull request resolves an issue pointed out by @larsxschneider and @terrorobe in #2314, wherein a mysterious io.EOF error would appear after issuing a batch API request.

An EOF can occur given the following conditions:

  1. LFS opens an HTTP connection to an LFS-compliant API server, with keep-alive.
  2. The LFS server, or a proxy fronting connections to it, times out the idle connection between batch requests; the next request on that connection then fails with an io.EOF.

To address this, we treat io.EOF as unexpected and retry the request until it succeeds, or until a non-io.EOF error is returned.
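In rough terms, the approach is the loop sketched below (shown here with an upper bound on attempts, which is one of the open questions listed next; sendOnce, maxRetries, and the package name are illustrative, not the actual implementation):

// Sketch only: retry a batch round-trip when the keep-alive connection was
// closed by the server, which surfaces to the client as io.EOF.
package lfsretry

import (
	"errors"
	"io"
	"net/http"
)

// doWithEOFRetry calls sendOnce, which performs a single round-trip and must
// build a fresh *http.Request each time so that any body can be re-sent.
func doWithEOFRetry(sendOnce func() (*http.Response, error), maxRetries int) (*http.Response, error) {
	var lastErr error
	for i := 0; i <= maxRetries; i++ {
		res, err := sendOnce()
		if err == nil {
			return res, nil
		}
		// Only a dropped keep-alive connection (io.EOF, possibly wrapped by
		// net/http) is retried; any other error is returned immediately.
		if !errors.Is(err, io.EOF) {
			return nil, err
		}
		lastErr = err
	}
	return nil, lastErr
}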

I'd like to discuss a few other potential improvements:

  • Adding an integration test for the io.EOF behavior: while researching this, I found net/http.Hijacker, a type that gives us access to the TCP connection underlying a given net/http.(*Request) instance, but this is not implemented by net/http/httptest. I see a few alternatives: we could implement a custom listener that resolves this by embedding the net.Conn in the request's context, or we could write our own server that does implement http.Hijacker (a sketch of the connection-hijacking idea follows this list).
  • Should there be a limit to the number of times we retry batch requests? If so, should this be configurable, or a constant?
  • Should we retry all requests, or just batch API ones?
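On the first point: httptest.ResponseRecorder does not support hijacking, but a handler running under a real net/http/httptest.(*Server) does, since the server's ResponseWriter implements net/http.Hijacker. A sketch of forcing the EOF that way (illustrative only):

// Sketch: a test server whose handler hijacks and closes the underlying TCP
// connection, so the client's use of the keep-alive connection fails with
// io.EOF.
package lfsretry

import (
	"net/http"
	"net/http/httptest"
)

func newDroppingServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, _, err := hj.Hijack()
		if err != nil {
			return
		}
		// Closing without writing a response drops the connection under
		// the client mid-request.
		conn.Close()
	}))
}

A client reusing its connection against such a server would then observe the io.EOF described above.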

I would love recommendations on the above.

Closes: #2314.


/cc @git-lfs/core
/cc @technoweenie @larsxschneider @terrorobe for specific thoughts

@ttaylorr added this to the v2.3.0 milestone on Aug 18, 2017
@larsxschneider (Member)

> Should there be a limit to the number of times we retry batch requests? If so, should this be configurable, or a constant?

I think we should limit the number of retries to lfs.transfer.maxretries. That's what I would expect as a user, and it would simplify the configuration. I wouldn't perform an unlimited number of retries, because that could make the application appear to hang forever.

> Should we retry all requests, or just batch API ones?

I think all idempotent requests should be repeated. AFAIK pretty much all LFS requests are idempotent, but I don't have enough knowledge about the Git LFS server API to know for sure.

@technoweenie (Contributor)

Agreed, use consistent rules to retry requests (such as lfs.transfer.maxretries), and retry any idempotent requests. I think the various transfer adapters use idempotent HTTP requests, but that may not necessarily always be the case.

@larsxschneider (Member)

Unfortunately I cannot verify this fix as I always run into #2439 😢

@ttaylorr (Contributor, Author)

@larsxschneider @technoweenie I just updated this PR with some new changes, and I would love feedback from both of you. I think it is ready to merge. Here's what changed:

  1. Request retrying has moved to the lfsapi.(*Client) type and works on all requests: NTLM, authenticated, with redirects, etc. It is opt-in via:

     c.Do(lfsapi.WithRetries(req, n))

     ... where the WithRetries helper annotates the request's Context() with a unique key whose corresponding value is the number of retries to perform if the request fails (sketched below, after this list).
  2. Taught the tq.(*Client) to retry batch API requests the same number of times as lfs.transfer.maxretries.
  3. More closely simulated an io.EOF error with a net/http/httptest.(*Server) whose underlying connections can be hijacked (in the net/http.Hijacker sense) for a given http.(*Request).
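The sketch referenced in (1): a context-based WithRetries/Retries pair can look roughly like this (names and details are illustrative, not the exact lfsapi implementation):

// Sketch of a context-based retry annotation in the spirit of
// lfsapi.WithRetries / lfsapi.Retries; details are illustrative.
package lfsretry

import (
	"context"
	"net/http"
)

// retriesKey is an unexported type so the context key cannot collide with
// keys defined by other packages.
type retriesKey struct{}

// WithRetries annotates req's Context() with the number of retries to
// perform if the request fails.
func WithRetries(req *http.Request, n int) *http.Request {
	ctx := context.WithValue(req.Context(), retriesKey{}, n)
	return req.WithContext(ctx)
}

// Retries returns the annotated retry count, or 0 if none was set.
func Retries(req *http.Request) int {
	if n, ok := req.Context().Value(retriesKey{}).(int); ok {
		return n
	}
	return 0
}

Callers that never call WithRetries still get exactly one attempt per Do() call, which is what keeps existing behavior unchanged.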

@larsxschneider (Member)

FYI @pluehne is testing this right now 👍

@larsxschneider (Member)

We merged this PR into master and ran tests on Linux and macOS. On both systems we saw the following problem:

trace git-lfs: fetch some/file/in/lfs.large [4a3b7cbd5c9d9c9317e50f0369e05a6f71e7971be332542bee990a2dcf892e86]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12fc220]

goroutine 66 [running]:
github.com/git-lfs/git-lfs/lfsapi.(*Client).doWithRedirects(0xc420138200, 0xc4204c28d0, 0xc42000ab00, 0x0, 0x0, 0x0, 0xc420490030, 0x14d731d, 0x10691a2)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/client.go:158 +0x200
github.com/git-lfs/git-lfs/lfsapi.(*Client).Do(0xc420138200, 0xc42000ab00, 0x0, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/client.go:77 +0xfd
github.com/git-lfs/git-lfs/lfsapi.(*Client).doWithCreds(0xc420138200, 0xc42000ab00, 0x16ef180, 0xc420192450, 0x0, 0x0, 0x14d731d, 0x4, 0xc42000ab00, 0xc42039c370, ...)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/auth.go:74 +0x71
github.com/git-lfs/git-lfs/lfsapi.(*Client).DoWithAuth(0xc420138200, 0x14d7c30, 0x6, 0xc42000ab00, 0xc42000aa00, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/auth.go:44 +0x182
github.com/git-lfs/git-lfs/tq.(*tqClient).Batch(0xc420192580, 0x14d7c30, 0x6, 0xc4204bc500, 0x1, 0xc4201e7c20, 0x64)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/api.go:61 +0x375
github.com/git-lfs/git-lfs/tq.Batch(0xc4201a61e0, 0x1, 0x14d7c30, 0x6, 0xc420076a80, 0x64, 0x64, 0x0, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/api.go:37 +0x15d
github.com/git-lfs/git-lfs/tq.(*TransferQueue).enqueueAndCollectRetriesFor(0xc420138300, 0xc4201f8000, 0x64, 0x64, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/transfer_queue.go:426 +0x14a4
github.com/git-lfs/git-lfs/tq.(*TransferQueue).collectBatches.func1(0xc42016e460, 0xc420138300, 0xc4201f6000, 0xc42001c0c0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/transfer_queue.go:348 +0x54
created by github.com/git-lfs/git-lfs/tq.(*TransferQueue).collectBatches
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/transfer_queue.go:354 +0x36f

lfsapi/client.go

}

c.traceResponse(req, tracedReq, nil)
}

Contributor


I think this is where the potentially nil res can come from. If it fails retries enough times, res is never set.

Is this the right place to put the retry logic though? How would it interfere with the transfer adapter retries?
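In miniature, the hazard looks like this (illustrative sketch, not the actual client code):

// Sketch of the nil res hazard: if every attempt fails, res is never
// assigned and any later dereference panics.
package lfsretry

import "net/http"

func doWithRetries(send func() (*http.Response, error), retries int) (int, error) {
	var res *http.Response
	var err error

	for i := 0; i < retries; i++ {
		if res, err = send(); err == nil {
			break
		}
	}

	// BUG (the point above): when every attempt fails, res is still nil
	// here, so this dereference panics with a nil pointer error. It needs a
	// nil check, or an early "return 0, err" once the attempts are exhausted.
	return res.StatusCode, err
}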

@ttaylorr (Contributor, Author)

> Is this the right place to put the retry logic though? How would it interfere with the transfer adapter retries?

Good question -- there shouldn't be any interference as-is, since clients of the lfsapi package have to explicitly opt in to more than one request per Do() call. Since tq manages its own retries, omitting WithRetries() is sufficient to keep behavior the same.

lfsapi/client.go Outdated

var res *http.Response

for i := 0; i < retries; i++ {
Member

retries might be a bit misleading here. The first "retry" is the actual request, no? Therefore retries should always be greater than 0, no?

@ttaylorr (Contributor, Author)

Good point -- I like the name retries, but I think the current interpretation of it is a little bit off. Let's treat 'Retries' as the number of additional requests to make for a failed request/response cycle, and instead calculate retries (lowercase r) as:

retries := tools.MaxInt(0, Retries(req)) + 1
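In sketch form (send stands in for one request/response cycle, maxInt for tools.MaxInt; illustrative, not the final code):

// Sketch of the adjusted semantics: the annotated value counts extra
// attempts beyond the first, so at least one attempt is always made, and the
// error (not a nil response) is returned once attempts are exhausted.
package lfsretry

import "net/http"

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func doWithAdjustedRetries(send func() (*http.Response, error), annotated int) (*http.Response, error) {
	retries := maxInt(0, annotated) + 1 // the request itself, plus its retries

	for i := 0; i < retries; i++ {
		res, err := send()
		if err == nil {
			return res, nil
		}
		if i == retries-1 {
			return nil, err // exhausted: surface the error instead of a nil res
		}
	}
	return nil, nil // unreachable: retries is always >= 1
}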

tq/api.go

@@ -57,7 +58,7 @@ func (c *tqClient) Batch(remote string, bReq *batchRequest) (*BatchResponse, err
 	tracerx.Printf("api: batch %d files", len(bReq.Objects))
 
 	req = c.LogRequest(req, "lfs.batch")
-	res, err := c.DoWithAuth(remote, req)
+	res, err := c.DoWithAuth(remote, lfsapi.WithRetries(req, c.MaxRetries))
Member

Something is wrong here: c.MaxRetries always seems to be 0.

Member

@ttaylorr Any idea about this? c.MaxRetries is still 0 for me with the latest change, although I have configured lfs.transfer.maxretries=10.

@larsxschneider (Member)

@ttaylorr @technoweenie Any update here? Can I help with something?

@ttaylorr (Contributor, Author)

@larsxschneider I think I found the spot where r.MaxRetries wasn't being carried through in 7722c2f. It's unfortunate that there are a few different layers where r.MaxRetries means something, so we have to take extra care to make sure that that value gets propagated correctly.

Here are some builds that include 7722c2f for you to try:

git-lfs-darwin-386-2.3.0-pre.tar.gz
git-lfs-darwin-amd64-2.3.0-pre.tar.gz
git-lfs-freebsd-386-2.3.0-pre.tar.gz
git-lfs-freebsd-amd64-2.3.0-pre.tar.gz
git-lfs-linux-386-2.3.0-pre.tar.gz
git-lfs-linux-amd64-2.3.0-pre.tar.gz
git-lfs-windows-386-2.3.0-pre.zip
git-lfs-windows-amd64-2.3.0-pre.zip

tq/api.go

@@ -57,7 +58,7 @@ func (c *tqClient) Batch(remote string, bReq *batchRequest) (*BatchResponse, err
 	tracerx.Printf("api: batch %d files", len(bReq.Objects))
Member

@ttaylorr I've changed that line to:

 tracerx.Printf("api: batch %d files (retries %d)", len(bReq.Objects), c.MaxRetries)

... and I still get 0 retries, even with 7722c2f!

@ttaylorr (Contributor, Author)

@larsxschneider I am unable to reproduce this issue locally:

~/D/repo (master!) $ GIT_TRACE=1 git lfs push --all origin master
# ...
trace git-lfs: tq: sending batch of size 100
trace git-lfs: api: batch 100 files (8 retries)

Are you sure that you don't have any extra lfs.transfer.maxretries entries lying around anywhere?
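One way to check every value Git sees for that key, and where each one comes from:

git config --show-origin --get-all lfs.transfer.maxretries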

Member

You are right. I see trace git-lfs: api: batch 1 files (retries 10) 👍

@@ -40,6 +40,9 @@ func (m *Manifest) ConcurrentTransfers() int {
 }
 
 func (m *Manifest) batchClient() *tqClient {
+	if r := m.MaxRetries(); r > 0 {
+		m.tqClient.MaxRetries = r
+	}
Member

I am curious: why don't we do...

m.tqClient.MaxRetries = m.MaxRetries()

... here?

@larsxschneider (Member) left a comment

The retry logic looks good to me. However, I haven't tested this in production yet.

@ttaylorr merged commit a0b01f8 into master on Sep 13, 2017
@ttaylorr deleted the batch-retry-eof branch on September 13, 2017 at 21:27
@larsxschneider (Member)

@ttaylorr @technoweenie Woohoo! I wasn't able to recreate the issue with this 🎉 👍
