-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bluesky publish sometimes fails blob upload #1670
Comments
Ugh, transient failures are the worst. Thanks for reporting and starting the sleuthing! |
So the reason a retry always fixes it is because this only happens when it has to re-fetch the token using the app password. Haven't found the root-cause in the code but hopefully it won't be too hard to track down. |
Hmm. So looking at these logs and reading I'm wondering if there's actually a race condition between doing the reauth (as it's a callback...?) and trying to do the upload. Though, it does appear like there might be a retry happening too, I can't find where that happens in the code. Any pointers? My Python's got rusty again 😆 |
Hmm! Thanks for the sleuthing. There shouldn't be any concurrency here, each HTTP request is single threaded and uses its own lexrpc client. You're right about the automatic refresh: when a call fails auth, the lexrpc client refreshes the token, then reruns it. There's probably a bug with that and calls like |
Yeah so after some digging I think the issue lies within lexrpc, or at least the fact we're passing the binary blob as a stream. I don't know exactly how these streams are implemented in Bridgy (or Python, honestly) but if they're anything like how they work in other languages, the stream is being fully consumed on the first request, and then when it makes the retry it's doing so with an empty stream. A quick fix here might be to change it so we don't use a streaming request when fetching the image's binary data. This would mean keeping the whole image in memory, which we may not want to do, though then again perhaps we're doing that anyway? I'm unsure. |
That sounds exactly right. If so, the fix should be straightforward. I'll look. |
I'm trying to reproduce this in snarfed/granary#721, but I haven't managed to quite yet. That test passes. 😕 |
…reset the input stream for snarfed/bridgy#1670 maybe?
If your hunch is right, I think snarfed/lexrpc@407734c might fix it, but I'm still struggling to come up with a test case that reproduces the failure, in either granary or lexrpc. |
Aha, figured it out, I needed to actually read from the input stream in the test. Did that in snarfed/granary@4cd86e6 and now the test is failing as expected. |
…ut stream for snarfed/bridgy#1670 maybe?
OK @JoelOtter I've deployed the lexrpc fix, feel free to retry. 🤞 And thanks again for the sleuthing! |
Unfortunately I think this has broken normal upload: https://brid.gy/log?module=default&start_time=1714512353&key=agdicmlkLWd5cloLEg1QdWJsaXNoZWRQYWdlIjNodHRwczovL3d3dy5qb2Vsb3R0ZXIuY29tL25vdGVzLzIwMjQvMDQvMzAtZnJpZW5kcy8MCxIHUHVibGlzaBiAgISU0rqRCQw Retry doesn't work - I can observe this behaviour using the preview form too! |
Ugh, I should have tested myself. Looking. |
For now, I've rolled back to the last version. |
Yep, working as before :) |
Weird though, I don't get the failure. It complains that something isn't seekable, but the only place we added a seek is after an auth failure and |
Huh, yeah, that is weird. |
Does the underlying implementation of the streaming query use seek and that doesn't work with BufferedRandom (which is always used regardless here)? |
Ah, no, BufferedRandom itself checks that the stream you give it is seekable itself. Argh, that defeats the purpose of the fix. OK so we need a different way to buffer it and reset the stream. |
🤷 |
Maybe we just cut out the fancy streaming and always read it into memory? |
… the input stream" This reverts commit 395555a. this turned out to be broken. will fix. cc snarfed/bridgy#1670
I think if you’re buffering the stream and resetting it that is by necessity reading the whole thing into memory anyway right? The only alternative to downloading the whole thing - which I would prefer not to do as it is a potential abuse vector - would be to refactor the lexrpc code such that it takes a function (or perhaps some special Python thing I’m not aware of similar to ‘__getattr’) which would allow this to effectively re-request the stream. That or we invert it such that the retry is user driven but that’s a big ol refactor and loses some neat functionality. |
When we have to refresh a token, yes, but not otherwise. I don't have a good sense of how often a publish will need to refresh the token, and therefore buffer the whole file in memory. Fortunately if images average 1-5MB, that's only ~1-2% of our memory budget, and Bluesky publishes with images are rare, and very obscure and unlikely as an abuse vector, so I'm not worried. Buffering in memory is fine, I'll just do that. |
Fair enough. :) I'd still suggest maybe setting some sensible limit like 25MB based on the Content-Length when it fetches the image from the source but before it streams in the rest of the chunks - it's trivial to generate a many-GB PNG, and this recent AWS stuff has me on edge, and I'd feel awful if you got hit with an enormous bill! |
attempt 2 at fixing snarfed/bridgy#1670
One difficulty with using Fortunately the failure mode here in the too-big case is just that an instance runs out of memory, drains its in progress requests, and then restarts. Not a big deal. |
OK I've deployed the fix that just always reads into memory. Once mor time with feeling, feel free to try again! |
Not seen it fail since that change so I think we can call this resolved. :) Thank you! |
Need to look into this properly but putting it here so it's noted.
Getting
A request body is expected but none was provided
back from Bluesky which is pretty odd. This seems quite random, a retry will pretty much always fix it.The text was updated successfully, but these errors were encountered: