Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluesky publish sometimes fails blob upload #1670

Closed
JoelOtter opened this issue Feb 21, 2024 · 26 comments · Fixed by snarfed/lexrpc#10
Closed

Bluesky publish sometimes fails blob upload #1670

JoelOtter opened this issue Feb 21, 2024 · 26 comments · Fixed by snarfed/lexrpc#10
Labels

Comments

@JoelOtter
Copy link
Contributor

Need to look into this properly but putting it here so it's noted.

D 2024-02-21 20:47:28.922451+00:00 [com.atproto.repo.uploadBlob](http://com.atproto.repo.uploadblob/): {} <[urllib3.response.HTTPResponse](http://urllib3.response.httpresponse/) object at 0x3e169542add0>
D 2024-02-21 20:47:28.923977+00:00 Running requests.<function post at 0x3e16aa2cafc0> https://bsky.social/xrpc/com.atproto.repo.uploadBlob <[urllib3.response.HTTPResponse](http://urllib3.response.httpresponse/) object at 0x3e169542add0>  {'User-Agent': 'Bridgy (https://brid.gy/about)', 'Content-Type': 'image/jpeg', 'Authorization': '...'}
D 2024-02-21 20:47:29.168423+00:00 Got: {"error":"InvalidRequest","message":"A request body is expected but none was provided"}
W 2024-02-21 20:47:29.168739+00:00 Error 400, response body: '{"error":"InvalidRequest","message":"A request body is expected but none was provided"}'
I 2024-02-21 20:47:29.236916+00:00 Returning 400: {'error': 'Error: {"error":"InvalidRequest","message":"A request body is expected but none was provided"} 400 Client Error: Bad Request for url: https://bsky.social/xrpc/com.atproto.repo.uploadBlob'} {'response': <Response 212 bytes [200 OK]>}
E 2024-02-21 20:47:29.237247+00:00 <class '[werkzeug.exceptions.BadRequest](http://werkzeug.exceptions.badrequest/)'>: 400 Bad Request: {'error': 'Error: {"error":"InvalidRequest","message":"A request body is expected but none was provided"} 400 Client Error: Bad Request for url: https://bsky.social/xrpc/com.atproto.repo.uploadBlob'}

Getting A request body is expected but none was provided back from Bluesky which is pretty odd. This seems quite random, a retry will pretty much always fix it.

@snarfed
Copy link
Owner

snarfed commented Feb 21, 2024

Ugh, transient failures are the worst. Thanks for reporting and starting the sleuthing!

@JoelOtter
Copy link
Contributor Author

So the reason a retry always fixes it is because this only happens when it has to re-fetch the token using the app password. Haven't found the root-cause in the code but hopefully it won't be too hard to track down.

@JoelOtter
Copy link
Contributor Author

Hmm. So looking at these logs and reading I'm wondering if there's actually a race condition between doing the reauth (as it's a callback...?) and trying to do the upload. Though, it does appear like there might be a retry happening too, I can't find where that happens in the code. Any pointers? My Python's got rusty again 😆

@snarfed
Copy link
Owner

snarfed commented Apr 25, 2024

Hmm! Thanks for the sleuthing. There shouldn't be any concurrency here, each HTTP request is single threaded and uses its own lexrpc client. You're right about the automatic refresh: when a call fails auth, the lexrpc client refreshes the token, then reruns it. There's probably a bug with that and calls like uploadBlob with binary payloads.

@JoelOtter
Copy link
Contributor Author

Yeah so after some digging I think the issue lies within lexrpc, or at least the fact we're passing the binary blob as a stream. I don't know exactly how these streams are implemented in Bridgy (or Python, honestly) but if they're anything like how they work in other languages, the stream is being fully consumed on the first request, and then when it makes the retry it's doing so with an empty stream.

A quick fix here might be to change it so we don't use a streaming request when fetching the image's binary data. This would mean keeping the whole image in memory, which we may not want to do, though then again perhaps we're doing that anyway? I'm unsure.

@snarfed
Copy link
Owner

snarfed commented Apr 29, 2024

Yeah so after some digging I think the issue lies within lexrpc, or at least the fact we're passing the binary blob as a stream...the stream is being fully consumed on the first request, and then when it makes the retry it's doing so with an empty stream.

That sounds exactly right. If so, the fix should be straightforward. I'll look.

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

I'm trying to reproduce this in snarfed/granary#721, but I haven't managed to quite yet. That test passes. 😕

snarfed added a commit to snarfed/lexrpc that referenced this issue Apr 30, 2024
@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

If your hunch is right, I think snarfed/lexrpc@407734c might fix it, but I'm still struggling to come up with a test case that reproduces the failure, in either granary or lexrpc.

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

Aha, figured it out, I needed to actually read from the input stream in the test. Did that in snarfed/granary@4cd86e6 and now the test is failing as expected.

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

OK @JoelOtter I've deployed the lexrpc fix, feel free to retry. 🤞 And thanks again for the sleuthing!

@JoelOtter
Copy link
Contributor Author

JoelOtter commented Apr 30, 2024

Unfortunately I think this has broken normal upload: https://brid.gy/log?module=default&start_time=1714512353&key=agdicmlkLWd5cloLEg1QdWJsaXNoZWRQYWdlIjNodHRwczovL3d3dy5qb2Vsb3R0ZXIuY29tL25vdGVzLzIwMjQvMDQvMzAtZnJpZW5kcy8MCxIHUHVibGlzaBiAgISU0rqRCQw

Retry doesn't work - I can observe this behaviour using the preview form too!

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

Ugh, I should have tested myself. Looking.

@snarfed snarfed reopened this Apr 30, 2024
@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

For now, I've rolled back to the last version.

@JoelOtter
Copy link
Contributor Author

Yep, working as before :)

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

Weird though, I don't get the failure. It complains that something isn't seekable, but the only place we added a seek is after an auth failure and refreshSession call, which didn't happen here...?

https://github.com/snarfed/lexrpc/blob/d254aa0c263aab13c66c5d8c243a765cacd0f087/lexrpc/client.py#L216

@JoelOtter
Copy link
Contributor Author

Huh, yeah, that is weird.

@JoelOtter
Copy link
Contributor Author

Does the underlying implementation of the streaming query use seek and that doesn't work with BufferedRandom (which is always used regardless here)?

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

Ah, no, BufferedRandom itself checks that the stream you give it is seekable itself. Argh, that defeats the purpose of the fix. OK so we need a different way to buffer it and reset the stream.

https://github.com/python/cpython/blob/7fabcc727dee52a3e0dfe4f903ad414e93cf2dc9/Lib/_pyio.py#L1402

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

🤷

@snarfed
Copy link
Owner

snarfed commented Apr 30, 2024

Maybe we just cut out the fancy streaming and always read it into memory?

snarfed added a commit to snarfed/lexrpc that referenced this issue May 1, 2024
… the input stream"

This reverts commit 395555a.

this turned out to be broken. will fix. cc snarfed/bridgy#1670
@JoelOtter
Copy link
Contributor Author

I think if you’re buffering the stream and resetting it that is by necessity reading the whole thing into memory anyway right?

The only alternative to downloading the whole thing - which I would prefer not to do as it is a potential abuse vector - would be to refactor the lexrpc code such that it takes a function (or perhaps some special Python thing I’m not aware of similar to ‘__getattr’) which would allow this to effectively re-request the stream. That or we invert it such that the retry is user driven but that’s a big ol refactor and loses some neat functionality.

@snarfed
Copy link
Owner

snarfed commented May 1, 2024

I think if you’re buffering the stream and resetting it that is by necessity reading the whole thing into memory anyway right?

When we have to refresh a token, yes, but not otherwise. I don't have a good sense of how often a publish will need to refresh the token, and therefore buffer the whole file in memory.

Fortunately if images average 1-5MB, that's only ~1-2% of our memory budget, and Bluesky publishes with images are rare, and very obscure and unlikely as an abuse vector, so I'm not worried. Buffering in memory is fine, I'll just do that.

@JoelOtter
Copy link
Contributor Author

Fair enough. :) I'd still suggest maybe setting some sensible limit like 25MB based on the Content-Length when it fetches the image from the source but before it streams in the rest of the chunks - it's trivial to generate a many-GB PNG, and this recent AWS stuff has me on edge, and I'd feel awful if you got hit with an enormous bill!

snarfed added a commit to snarfed/lexrpc that referenced this issue May 2, 2024
@snarfed
Copy link
Owner

snarfed commented May 2, 2024

One difficulty with using Content-Length is that it sometimes doesn't exist, eg Flickr doesn't serve it at all. No joke. 😠 #944

Fortunately the failure mode here in the too-big case is just that an instance runs out of memory, drains its in progress requests, and then restarts. Not a big deal.

@snarfed
Copy link
Owner

snarfed commented May 2, 2024

OK I've deployed the fix that just always reads into memory. Once mor time with feeling, feel free to try again!

@JoelOtter
Copy link
Contributor Author

Not seen it fail since that change so I think we can call this resolved. :) Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants