Possible race condition when reading retry file during housekeeping intervals #997
When we get a connection, the instance number is supposed to change... so the retry_file would be different. It's something to do with ensuring the instance number is properly propagated on startup.
But hang on... retrying a download on an AM message doesn't make any sense. We should likely turn off retry (remove the callback) for the AM receiver. The only reason to retry a download is because you lost the connection half-way... there is no recovering.
Retry does work, because we're downloading from the sarracenia in-line message contents that the AM sr3 plugin creates. I've already seen AM create a sarracenia message and then retry it for download. While it's true that the AM socket won't resend the file contents, the sarracenia message holds them and can be appended to the retry queue if the download ever fails.
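For illustration, a minimal sketch of that idea (hypothetical field names and retry file path, not the actual sr3 API): because the payload travels in-line with the message, a failed write can simply be re-queued and retried later without going back to the AM socket.

```python
# Hypothetical sketch, not the real sr3 code: a message that carries its
# payload in-line can be appended to a retry queue and "downloaded" again
# later without ever touching the AM socket.
import json
from pathlib import Path

RETRY_FILE = Path("retry_001.jsonl")  # assumed per-instance retry file name

def write_inline_content(msg: dict, dest_dir: Path) -> bool:
    """Write the in-line content to disk; return False if the write fails."""
    try:
        dest = dest_dir / msg["relPath"]
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(msg["content"]["value"])
        return True
    except OSError:  # e.g. disk full, permissions
        return False

def process(msg: dict, dest_dir: Path) -> None:
    if not write_inline_content(msg, dest_dir):
        # The message still holds the bulletin text, so queueing the message
        # itself is enough to make a later retry meaningful.
        with RETRY_FILE.open("a") as f:
            f.write(json.dumps(msg) + "\n")
```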
I get that it "works", but it's not really a retry... I mean you are "downloading" what you already have in the message. I don't understand how a first "download" could fail and a retry would succeed... oh, I guess disk full might do it. Do you know why the initial downloads failed, triggering the retry logic?
When there's a retry with sr3's AM plugin, it's always caused by the same error. I've already noticed it in the past, but ignored it because I thought retry was doing a good enough job of resending the files as is.
The error always happens at the beginning of the forked process, when the first message is created and published downstream. I think it fails because the AMQP connection is the same as the first instance's and it crashes as a result... not sure. Once a new batch of messages appears, sarracenia does the right thing and re-establishes the connection downstream.
@petersilva and I did some work and were able to fix the AMQP error I was encountering by moving the retry …
This is pretty serious... I think it is a blocker for replacing the old MM systems.
…rror handling of broken sockets
To clarify... the problem wasn't when the child was being initialized; it was that all children "thought" they were instance 1, because they were inheriting their initialization from instance 1 (the listener). The ps command output showed them as instance 1, and when they chose which instance files to create, they would pick the wrong ones. The solution was to have the children do an os.execl()... that is, to re-launch the process with the correct instance number assigned, so the logs, pid files, etc. all get properly assigned. This eliminates the race condition because there is only one instance 1. The shutdown also didn't quite work... the socket.recv() is blocking. Added timeout processing and short timeouts so the exit path works properly.
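A rough sketch of the two fixes described above, in simplified form; the sr3 command line, paths and helper names are assumptions, not the actual implementation.

```python
# Simplified illustration of the approach described above.
import os
import socket

def spawn_instance(instance_no: int) -> None:
    pid = os.fork()
    if pid == 0:
        # Child: re-exec the program so it starts fresh with its own instance
        # number, and therefore its own log, pid file and retry file, instead
        # of inheriting instance 1's state from the listener.
        os.execlp("sr3", "sr3", "--instance", str(instance_no),
                  "start", "flow/am_receiver")
    # Parent (the listener) keeps accepting connections.

def recv_with_timeout(conn: socket.socket, should_stop) -> bytes:
    # A bare blocking recv() prevents a clean shutdown; a short timeout lets
    # the loop periodically check whether it has been asked to exit.
    conn.settimeout(1.0)
    while not should_stop():
        try:
            return conn.recv(4096)
        except socket.timeout:
            continue
    return b""
```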
PR #1005 solves the original race condition.
I saw over the weekend that the AM server crashed due to the following error.
What I was able to find in the logs is that the retry file gets opened during a pass of the flow algorithm by the process that will soon die.
And just before that, another process also opens the retry file for reading.
What I'm thinking is that there is some kind of race condition happening, where the process that dies accesses the retry file just after the first one does, which causes the crash.
I think this is normally avoided with multiple instances by designating a retry file for each instance #. But with the AM server, we fork from the same process, so as a consequence they all use the same retry file.
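For context, a sketch of how per-instance retry file naming normally avoids the collision; the path pattern below is an assumption, not necessarily what sr3 uses.

```python
# Illustration only: per-instance retry files keep processes from touching
# the same file concurrently. Forked children that all still identify as
# instance 1 defeat this scheme and resolve to the same path.
import os

def retry_path(config: str, instance: int, state_dir: str = "~/.cache/sr3") -> str:
    # Distinct instance numbers map to distinct files.
    return os.path.join(os.path.expanduser(state_dir),
                        config, f"instance_{instance:02d}.retry")

print(retry_path("am_receiver", 1))  # .../am_receiver/instance_01.retry
print(retry_path("am_receiver", 2))  # .../am_receiver/instance_02.retry
```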
FYI, this only crashed the affected instance, and it was able to recover one second later. However, we got unlucky and lost 7/8 bulletins due to that crash, as there were lots of files being transferred on the socket at the time.