-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various Small Fixes #355
Various Small Fixes #355
Conversation
I still think this is a "red herring" and your test results with the new code changes point to a different problem unrelated to the Lock file. |
Testing latest commits now :) |
Good job, bud. So the success is repeatable (i.e. after every reboot now)? Talk to you in the evening. Have a good night, bud. |
Latest result without me mis-identifying the firmware:
|
I'll take a closer look & review in the evening. |
Goodnight buddy! |
Apologies for the delay. It's been a very busy day... I’ve reviewed the PR changes and while I understand what the 2 function calls do, I still don’t quite understand how they fix the problem that you were repeatedly duplicating last night (and early this morning) where 2 instances of the script were apparently being executed, one right after the other, resulting in the 2nd instance terminating quickly due to the Lock file (created by the 1st instance) with "age" of ZERO or ONE second. Would you mind explaining how you see this PR solving that particular problem? |
No worries buddy! I totally understand. Basically you got it, you pointed me in the right direction as soon as you said "you sure there isn't 2 executed?" The services-start script had 2 calls of MerlinAU inside, the first which is the regular addcronjob call inside services-start when the router reboots, etc normally to add the cron jobs. And the second call when we actually do an upgrade which is added to handle the post reboot email, etc. That second call was the one giving us the lock error, and failing to send the email, but in services-start the very first call (in order) was the addcronjob call. So while I couldn't see anything in the syslogs indicating that call existed. And I couldn't connect via SSH when that call was run. (Too early in the boot process to connect) I took a guess that the first call for addcronjob was running first and creating the lock file, and then milliseconds later services-start would call the next MerlinAU background job to do the post upgrade email and it would encounter the lock file. It became a race condition between the 2 calls since they are both backgrounds jobs. The random times it worked would be just because the second call sending the email before the first call created the lock file, but if the first call created the lock file first then the second call wouldn't send the email! This explains why when we added more code for WAN check, and more delay and debugging to the second call to troubleshoot, it actually slowed it down more compared to the first, so it would happen more consistently and caused the problem to get "worse" instead of being 50/50 or 60/40. The extra code was causing it to "lose the race". The solution tested 4 times now, is just remove the addcronjob hook from services-start when we do an upgrade, and add it back when we are done in the post upgrade hook. I must say this problem is the first time in encountered something like this before, so I really appreciated having you run through the logic with. |
OK, thanks for taking the time to explain & clarify. So yeah, I see the possibility of the "race condition" when more than one call to MerlinAU is added to the "services-start" script. That must be dealt with to avoid early termination. However, upon further review of the code, I see 2 issues with the current solution:
Solutions can be found for the above issues, but I think I'd like to modify the File Lock function and add code to handle the execution of 2 or more instances of the MerlinAU script so that it looks like they executed in a staggered pattern, separated by a few seconds. This would work even for other future calls we may add in the "services-start" script, so it would be a more "general" solution. |
I actually though about point 2, but didn't think about point 1. I was going to add some conditional to address point 2. But I like your idea more :) |
I believe this is also why when I added a sleep of 120 seconds, it worked. I was manually creating a staggered pattern :P |
Added a few more small fixes to review. |
OK, I'm back online. My apologies for the delay. Yesterday, we went to my brother's home to celebrate one of my nieces birthday, and we came back rather late - it was past midnight already - and I just crashed on the bed and went to sleep right away. Anyway, I'll start to review the PR now. Man, a lot of messages to go through, LOL ;>) |
No worries buddy, I'm around. No rush! Ignore most of my blabber on the one we were troubleshooting it ended up being the TLDR is the new WAN function was causing the issue on the node. Been tested multiple times with success, we know the issue was the multiple sessions causing the lock file to get screwey so I just removed the new function and all worked perfectly!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved and good to go for production!!
@ExtremeFiretop, |
Question, was there anything else you wanted to touch up or add? |
Gotcha! I'm heading out for 30 minutes anyways, I'll be back to chat in a few! |
Yep, working on it... |
Re-creation of PR: #352
Testing now.