
tq: teach Batch() API to retry itself after io.EOF's #2516

Merged: 15 commits from batch-retry-eof into master on Sep 13, 2017

Conversation

@ttaylorr (Contributor)

This pull request resolves an issue pointed out by @larsxschneider and @terrorobe in #2314, wherein a mysterious io.EOF error would appear after issuing a batch API request.

An EOF can occur given the following conditions:

  1. LFS opens an HTTP connection to an LFS-compliant API server, with keep-alive.
  2. The LFS server, or a proxy fronting connections to it, times out the idle connection between batch requests; the next request on that connection then fails with an io.EOF.

To address this, we treat io.EOF as unexpected and retry the request until it succeeds, or until a non-io.EOF error is returned.
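In rough terms, the approach is the loop sketched below (shown here with an upper bound on attempts, which is one of the open questions listed next; sendOnce, maxRetries, and the package name are illustrative, not the actual implementation):

// Sketch only: retry a batch round-trip when the keep-alive connection was
// closed by the server, which surfaces to the client as io.EOF.
package lfsretry

import (
	"errors"
	"io"
	"net/http"
)

// doWithEOFRetry calls sendOnce, which performs a single round-trip and must
// build a fresh *http.Request each time so that any body can be re-sent.
func doWithEOFRetry(sendOnce func() (*http.Response, error), maxRetries int) (*http.Response, error) {
	var lastErr error
	for i := 0; i <= maxRetries; i++ {
		res, err := sendOnce()
		if err == nil {
			return res, nil
		}
		// Only a dropped keep-alive connection (io.EOF, possibly wrapped by
		// net/http) is retried; any other error is returned immediately.
		if !errors.Is(err, io.EOF) {
			return nil, err
		}
		lastErr = err
	}
	return nil, lastErr
}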

I'd like to discuss a few other potential improvements:

  • Adding an integration test for the io.EOF behavior: while researching this, I found net/http.Hijacker, a type that gives us access to the TCP connection underlying a given net/http.(*Request) instance, but this is not implemented by net/http/httptest. I see a few alternatives: we could implement a custom listener that resolves this by embedding the net.Conn in the request's context, or we could write our own server that does implement http.Hijacker (a sketch of the connection-hijacking idea follows this list).
  • Should there be a limit to the number of times we retry batch requests? If so, should this be configurable, or a constant?
  • Should we retry all requests, or just batch API ones?
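On the first point: httptest.ResponseRecorder does not support hijacking, but a handler running under a real net/http/httptest.(*Server) does, since the server's ResponseWriter implements net/http.Hijacker. A sketch of forcing the EOF that way (illustrative only):

// Sketch: a test server whose handler hijacks and closes the underlying TCP
// connection, so the client's use of the keep-alive connection fails with
// io.EOF.
package lfsretry

import (
	"net/http"
	"net/http/httptest"
)

func newDroppingServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, _, err := hj.Hijack()
		if err != nil {
			return
		}
		// Closing without writing a response drops the connection under
		// the client mid-request.
		conn.Close()
	}))
}

A client reusing its connection against such a server would then observe the io.EOF described above.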

I would love recommendations on the above.

Closes: #2314.


/cc @git-lfs/core
/cc @technoweenie @larsxschneider @terrorobe for specific thoughts

@ttaylorr added this to the v2.3.0 milestone on Aug 18, 2017
@larsxschneider (Member)

> Should there be a limit to the number of times we retry batch requests? If so, should this be configurable, or a constant?

I think we should limit the number of retries to lfs.transfer.maxretries. That's what I would expect as a user, and it would simplify the configuration. I wouldn't perform an unlimited number of retries, because that could make the application appear to hang forever.

> Should we retry all requests, or just batch API ones?

I think all idempotent requests should be repeated. AFAIK pretty much all LFS requests are idempotent, but I don't have enough knowledge about the Git LFS server API to know for sure.

@technoweenie (Contributor)

Agreed, use consistent rules to retry requests (such as lfs.transfer.maxretries), and retry any idempotent requests. I think the various transfer adapters use idempotent HTTP requests, but that may not necessarily always be the case.

@larsxschneider (Member)

Unfortunately I cannot verify this fix as I always run into #2439 😢

@ttaylorr (Contributor, Author)

@larsxschneider @technoweenie I just updated this PR with some new changes, and I would love feedback from both of you. I think it is ready to merge. Here's what changed:

  1. Request retrying has moved to the lfsapi.(*Client) type and works on all requests: NTLM, authenticated, with redirects, etc. It is opt-in via:

     c.Do(lfsapi.WithRetries(req, n))

     ... where the WithRetries helper annotates the request's Context() with a unique key whose corresponding value is the number of retries to perform if the request fails (sketched below, after this list).
  2. Taught the tq.(*Client) to retry batch API requests the same number of times as lfs.transfer.maxretries.
  3. More closely simulated an io.EOF error with a net/http/httptest.(*Server) whose underlying connections can be hijacked (in the net/http.Hijacker sense) for a given http.(*Request).
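The sketch referenced in (1): a context-based WithRetries/Retries pair can look roughly like this (names and details are illustrative, not the exact lfsapi implementation):

// Sketch of a context-based retry annotation in the spirit of
// lfsapi.WithRetries / lfsapi.Retries; details are illustrative.
package lfsretry

import (
	"context"
	"net/http"
)

// retriesKey is an unexported type so the context key cannot collide with
// keys defined by other packages.
type retriesKey struct{}

// WithRetries annotates req's Context() with the number of retries to
// perform if the request fails.
func WithRetries(req *http.Request, n int) *http.Request {
	ctx := context.WithValue(req.Context(), retriesKey{}, n)
	return req.WithContext(ctx)
}

// Retries returns the annotated retry count, or 0 if none was set.
func Retries(req *http.Request) int {
	if n, ok := req.Context().Value(retriesKey{}).(int); ok {
		return n
	}
	return 0
}

Callers that never call WithRetries still get exactly one attempt per Do() call, which is what keeps existing behavior unchanged.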

@larsxschneider (Member)

FYI @pluehne is testing this right now 👍

@larsxschneider (Member)

We merged this PR into master and ran tests on Linux and macOS. On both systems we saw the following problem:

trace git-lfs: fetch some/file/in/lfs.large [4a3b7cbd5c9d9c9317e50f0369e05a6f71e7971be332542bee990a2dcf892e86]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12fc220]

goroutine 66 [running]:
github.com/git-lfs/git-lfs/lfsapi.(*Client).doWithRedirects(0xc420138200, 0xc4204c28d0, 0xc42000ab00, 0x0, 0x0, 0x0, 0xc420490030, 0x14d731d, 0x10691a2)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/client.go:158 +0x200
github.com/git-lfs/git-lfs/lfsapi.(*Client).Do(0xc420138200, 0xc42000ab00, 0x0, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/client.go:77 +0xfd
github.com/git-lfs/git-lfs/lfsapi.(*Client).doWithCreds(0xc420138200, 0xc42000ab00, 0x16ef180, 0xc420192450, 0x0, 0x0, 0x14d731d, 0x4, 0xc42000ab00, 0xc42039c370, ...)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/auth.go:74 +0x71
github.com/git-lfs/git-lfs/lfsapi.(*Client).DoWithAuth(0xc420138200, 0x14d7c30, 0x6, 0xc42000ab00, 0xc42000aa00, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/lfsapi/auth.go:44 +0x182
github.com/git-lfs/git-lfs/tq.(*tqClient).Batch(0xc420192580, 0x14d7c30, 0x6, 0xc4204bc500, 0x1, 0xc4201e7c20, 0x64)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/api.go:61 +0x375
github.com/git-lfs/git-lfs/tq.Batch(0xc4201a61e0, 0x1, 0x14d7c30, 0x6, 0xc420076a80, 0x64, 0x64, 0x0, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/api.go:37 +0x15d
github.com/git-lfs/git-lfs/tq.(*TransferQueue).enqueueAndCollectRetriesFor(0xc420138300, 0xc4201f8000, 0x64, 0x64, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/transfer_queue.go:426 +0x14a4
github.com/git-lfs/git-lfs/tq.(*TransferQueue).collectBatches.func1(0xc42016e460, 0xc420138300, 0xc4201f6000, 0xc42001c0c0)
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/transfer_queue.go:348 +0x54
created by github.com/git-lfs/git-lfs/tq.(*TransferQueue).collectBatches
	/Users/lars/Code/go-Workspace/src/github.com/git-lfs/git-lfs/tq/transfer_queue.go:354 +0x36f

lfsapi/client.go

}

c.traceResponse(req, tracedReq, nil)
}

Contributor


I think this is where the potentially nil res can come from. If it fails retries enough times, res is never set.

Is this the right place to put the retry logic though? How would it interfere with the transfer adapter retries?
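In miniature, the hazard looks like this (illustrative sketch, not the actual client code):

// Sketch of the nil res hazard: if every attempt fails, res is never
// assigned and any later dereference panics.
package lfsretry

import "net/http"

func doWithRetries(send func() (*http.Response, error), retries int) (int, error) {
	var res *http.Response
	var err error

	for i := 0; i < retries; i++ {
		if res, err = send(); err == nil {
			break
		}
	}

	// BUG (the point above): when every attempt fails, res is still nil
	// here, so this dereference panics with a nil pointer error. It needs a
	// nil check, or an early "return 0, err" once the attempts are exhausted.
	return res.StatusCode, err
}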

@ttaylorr (Contributor, Author)

> Is this the right place to put the retry logic though? How would it interfere with the transfer adapter retries?

Good question -- there shouldn't be any interference as-is, since clients of the lfsapi package have to explicitly opt in to more than one request per Do() call. Since tq manages its own retries, omitting WithRetries() is sufficient to keep behavior the same.

lfsapi/client.go Outdated

var res *http.Response

for i := 0; i < retries; i++ {
Member

retries might be a bit misleading here. The first "retry" is the actual request, no? Therefore retries should always be greater than 0, no?

@ttaylorr (Contributor, Author)

Good point -- I like the name retries, but I think the current interpretation of it is a little bit off. Let's treat 'Retries' as the number of additional requests to make for a failed request/response cycle, and instead calculate retries (lowercase r) as:

retries := tools.MaxInt(0, Retries(req)) + 1
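In sketch form (send stands in for one request/response cycle, maxInt for tools.MaxInt; illustrative, not the final code):

// Sketch of the adjusted semantics: the annotated value counts extra
// attempts beyond the first, so at least one attempt is always made, and the
// error (not a nil response) is returned once attempts are exhausted.
package lfsretry

import "net/http"

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func doWithAdjustedRetries(send func() (*http.Response, error), annotated int) (*http.Response, error) {
	retries := maxInt(0, annotated) + 1 // the request itself, plus its retries

	for i := 0; i < retries; i++ {
		res, err := send()
		if err == nil {
			return res, nil
		}
		if i == retries-1 {
			return nil, err // exhausted: surface the error instead of a nil res
		}
	}
	return nil, nil // unreachable: retries is always >= 1
}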

tq/api.go

@@ -57,7 +58,7 @@ func (c *tqClient) Batch(remote string, bReq *batchRequest) (*BatchResponse, err
 	tracerx.Printf("api: batch %d files", len(bReq.Objects))
 
 	req = c.LogRequest(req, "lfs.batch")
-	res, err := c.DoWithAuth(remote, req)
+	res, err := c.DoWithAuth(remote, lfsapi.WithRetries(req, c.MaxRetries))
Member

Something is wrong here: c.MaxRetries always seems to be 0.

Member

@ttaylorr Any idea about this? c.MaxRetries is still 0 for me with the latest change, although I have configured lfs.transfer.maxretries=10.

@larsxschneider (Member)

@ttaylorr @technoweenie Any update here? Can I help with something?

@ttaylorr (Contributor, Author)

@larsxschneider I think I found the spot where r.MaxRetries wasn't being carried through in 7722c2f. It's unfortunate that there are a few different layers where r.MaxRetries means something, so we have to take extra care to make sure that that value gets propagated correctly.

Here are some builds that include 7722c2f for you to try:

git-lfs-darwin-386-2.3.0-pre.tar.gz
git-lfs-darwin-amd64-2.3.0-pre.tar.gz
git-lfs-freebsd-386-2.3.0-pre.tar.gz
git-lfs-freebsd-amd64-2.3.0-pre.tar.gz
git-lfs-linux-386-2.3.0-pre.tar.gz
git-lfs-linux-amd64-2.3.0-pre.tar.gz
git-lfs-windows-386-2.3.0-pre.zip
git-lfs-windows-amd64-2.3.0-pre.zip

tq/api.go

@@ -57,7 +58,7 @@ func (c *tqClient) Batch(remote string, bReq *batchRequest) (*BatchResponse, err
 	tracerx.Printf("api: batch %d files", len(bReq.Objects))
Member

@ttaylorr I've changed that line to:

 tracerx.Printf("api: batch %d files (retries %d)", len(bReq.Objects), c.MaxRetries)

... and I still get 0 retries, even with 7722c2f!

@ttaylorr (Contributor, Author)

@larsxschneider I am unable to reproduce this issue locally:

~/D/repo (master!) $ GIT_TRACE=1 git lfs push --all origin master
# ...
trace git-lfs: tq: sending batch of size 100
trace git-lfs: api: batch 100 files (8 retries)

Are you sure that you don't have any extra lfs.transfer.maxretries entries lying around anywhere?
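One way to check every value Git sees for that key, and where each one comes from:

git config --show-origin --get-all lfs.transfer.maxretries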

Member

You are right. I see trace git-lfs: api: batch 1 files (retries 10) 👍

@@ -40,6 +40,9 @@ func (m *Manifest) ConcurrentTransfers() int {
 }
 
 func (m *Manifest) batchClient() *tqClient {
+	if r := m.MaxRetries(); r > 0 {
+		m.tqClient.MaxRetries = r
+	}
Member

I am curious: why don't we do...

m.tqClient.MaxRetries = m.MaxRetries()

... here?

@larsxschneider (Member) left a comment

The retry logic looks good to me. However, I haven't tested this in production yet.

@ttaylorr merged commit a0b01f8 into master on Sep 13, 2017
@ttaylorr deleted the batch-retry-eof branch on September 13, 2017 at 21:27
@larsxschneider (Member)

@ttaylorr @technoweenie Woohoo! I wasn't able to recreate the issue with this 🎉 👍
