-
-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to publish package (500 Internal Server Error) #95
Comments
@Diaoul, @ymartin59, any thoughts on this? EDIT: So apparently, the actual error in the application log was this:
Based on the above the actual cause of the (500) Internal Server Error was an unhandled exception for |
This leaves the question of the original issue of the uploaded file from the stream not being present at the expected path during publishing. From examining the server code my understanding of the publishing process is as follows:
Now the error seen: |
I suspect that there may have been some sort of corruption when the first package from lidarr v11 was uploaded. The admin interface reports the following: However, when we look at the webpage it shows the following: This version should not be visible since it is disabled in the admin interface. The application icon in the webpage was also not visible (before v12 was uploaded), which suggests that the code which runs when the new Version object is present was not initially run. |
I was publishing some other packages and I see the same corruption happening with prowlarr v6. The admin interface reports the following: However, the webpage at Note the missing icon as I indicated above with lidarr v11 when no newer version is present. I believe this happens because this code is not being executed: Lines 260 to 263 in 76951e0
|
Yes there must've been a failure at a time that did create the version but not the appropriate directories, causing subsequent ones to fail. |
Thanks for your feedback. Before deleting the version can you check:
If the checks above confirms the theory then I’d request that the prowlarr v6 be deleted. I don’t think I have permissions to do that. |
hey @Diaoul, any feedback on the above? In the meantime, would you be so kind as to delete this version: |
@mreid-tt I have deleted lidarr v11 for you, please let me know of any others that you like to delete |
thanks @publicarray, I was hoping we could check the application logs and the file system for prowlarr to confirm the theory above. But if this is a hassle we can just delete the problematic version: |
@mreid-tt the folder was never created...
|
I have deleted the broken version |
We had problems uploading before (due to load) but never like this (missing folders/files)... |
Thanks for the confirmation.
Great! I'll proceed with a fresh build and upload. |
Hmm, okay. Perhaps a possible fix would be to adjust the code to verify that a folder is created (for both the package and version functions) before committing a new entry to the database. |
my2 cents but maybe only upload one file at a time. IIRC phython/postgresql are bad at working with multiple upload requests at the same time. This smells like a race condition |
That makes sense but with a GitHub build how would we do that? All the archs build concurrently. |
hm maybe build them without publishing on github and download the result. Then manually create the upload request with httpie. I know it's a workaround but we can also test if, for example you upload all packages at once if this causes this problem again. |
I think you may be correct. I did the new run and got this in one of the logs:
But then right after, I get the following in all the other archs:
We may have the same situation again. If you can, please remove prowlarr v6 again because I suspect we have re-created the same situation. I will try a manual upload one file at a time once you have.
Okay, thanks. I'll retry, one at a time now. |
we got one :)
|
Sending one at a time worked! All archs successful: https://github.com/mreid-tt/spksrc/actions/runs/4819585180 EDIT: I'll look into making a PR to verify that the folder has been created for both package and version. Running parallel processes does not seem to be an issue once the first one correctly creates the folder. |
Hi @Diaoul, now that we have conclusively proven that the issue is rooted in the version folder not being created, this suggests that this code block is not executed: Lines 260 to 263 in 66c4ac6
If this is the case, how could we make the function more resilient? Should the
Why might this be the case? Are the filesystem actions for directory creation and Interested in your thoughts before I create a PR so that the solution is effective. |
My guess is that if a version exist in the db, but the transaction isn't finished yet and another one is created at the same time we get this condition. Not sure why not at least one mkdir command is executed though. Maybe both transactions exit early because a version exists already? |
I also noted that if there is an exception, files are removed so what happens when a file is created and at the same time removed by a different request? |
Ahh! That's a good point, you are referring to this part of the code: Lines 265 to 272 in 66c4ac6
So, it's possible it was created just fine but there was some sort of exception (perhaps even in creating the icons or saving the build), then the entire version folder would be dumped. If this is the case then the version should also be removed from the database (or not added if this is done later). Need to better understand the database interactions to confirm. |
Yea. Imagine 2 requests. Both don't see a version so they both try to create one on the BD. The db will reject one as it already exist (consistency in the db). So one succeeds and one fails (creating an exception) removing the folder but the DB is updated to have a version. Just a theory. Filtering the exception for 'version already exist' and not removing files should prevent this. |
So I shared our problem with ChatGPT and it suggested some approaches which I have included in a proposed fix #100. I would appreciate a review in that PR. |
Thinking a bit about this, I am not sure why we need a clean up the directory structure 🤷 We could very well make sure the path always exists. |
I think I agree with @Diaoul here, using locks just overcomplicates things. I think just checking if there is an exception like "version already exists" and continuing the code anyway rather than cleaning up the folders is enough in this case. I'd like to mention that what I think is happening is a hypothesis and not proven yet. There could have been an issue with the redis or database connection. Unfortunately, there aren't time codes in the app logs. (unlikely the cause since the version exists in the database)
edit: never mind, I can get the timestamp from docker. |
Thanks for the additional perspectives. Taking a step back, I believe the issue is really about the filesystem and not the database. Re-looking at the code we have these operations to create folders: Line 259 in 66c4ac6
My understanding is that if the folder already exists, it will raise a |
I actually like the elegance of cleaning up the directory structure on an exception so you maintain a clean filesystem. We could adjust the code to something like this:
But I really believe that the exception on the try block is more likely caused by the folder already existing which can easily be mitigated with the existing code structure. |
@publicarray, looking back at the application logs you kindly provided, I believe this new theory is correct. In the logs we first see this error:
Immediately after that we have these errors and similar:
I believe this supports the theory that:
|
I am not sure why we would end up publishing multiple packages at the same time, maybe we can fix that instead? The code is full of race conditions and absolutely not ready to handle parallel uploads of the same packages. Also worth noting that it seems it's the first time it occurs in close to 8 years (I've lost count) of this code being in production. Unless we're confident the code will not break elsewhere when introducing a fix, I would be OK with a one-time procedure to manually fix this and document it somewhere 😬 |
From my understanding, I believe we have been doing this for over two years based on the age of the wiki page... Publish a package from Github. To my knowledge there isn't a way to tell the GitHub workflow to not run in parallel. As such, once we use it, there is the potential to be sending multiple builds.
I understand the hesitation in making such a fundamental change but I'd like to share another perspective:
|
Thanks for the additional context, it makes sense to try with this then 👍 |
For some history. The first Ci used travis and CircleCI processed one arch at a time. Also it was expected that every dev build on their own devices before publishing so that they would test the build locally as well (that hasn't changed) but with GitHub actions it became easier and we implemented the publishing part as well. So if you build locally and test the one or two architectures you don't need to build all of them to publish. Lowering the barrier and time spend building a bit. |
When attempting to publish a package the following is reported:
This appears to be happening consistently for moderately sized SPK files (50-75 MB) and does not appear to be based on time of day (server load).
The text was updated successfully, but these errors were encountered: