Fix Post Update Release Lock Email Issue #352
Untested at this time.
Suggested fix so far:
This is still untested; it's just a theory about a potential solution to the problem. The reason I have trouble believing it's the WAN connection is the check for "nvram get ntp_ready", which already exists. That's why I also included a check for AMTM and increased the delays. Your opinion is appreciated.
Upon further review, I believe I found the root cause. The lock release currently only happens after the 3-minute wait; if the router reboots itself after 1 or 2 minutes, the lock hasn't been released and the email won't send upon reboot, since the lock file only goes stale after 10 minutes.
Yes, that's correct. The "services-start" script could be executed before the WAN connection is active. I thought checking the "ntp_ready" NVRAM var would be sufficient, but perhaps there are cases (outliers?) where that's not enough.
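For illustration, the kind of bounded wait on "ntp_ready" discussed above could be sketched as follows. This is a minimal sketch, not the script's actual code: the helper name `wait_until_ready` and the retry numbers in the usage comment are assumptions, and only `nvram get ntp_ready` comes from the thread.

```shell
#!/bin/sh
# Hypothetical helper: poll a readiness command until it prints "1"
# or the retry budget is exhausted.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
wait_until_ready() {
    max_tries="$1"; delay_secs="$2"; shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # Treat any output other than "1" (including an error) as "not ready".
        if [ "$("$@" 2>/dev/null)" = "1" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay_secs"
    done
    return 1
}

# Assumed usage on the router (values are illustrative only):
#   wait_until_ready 30 10 nvram get ntp_ready || echo "NTP never became ready"
```

Passing the command as arguments keeps the loop reusable for other readiness checks, and the attempt cap keeps a post-reboot hook from waiting forever.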
I don't think the "wan-event" script would have been a good spot to hook our post-reboot script calls.
Yeah, that would help.
The AMTM email config file is in the JFFS partition filesystem, which is mounted very early on, before services can ever be started. If the JFFS filesystem is not mounted or is mounted as read-only, the user has bigger problems to deal with.
Yes, that's what I thought too, but there might be some edge cases where that check might not be sufficient.
No, the Lock file cannot possibly be the root cause because the file is located in the temporary virtual disk (i.e. tmpfs is in RAM) where everything is wiped during a reboot, so the Lock file gets deleted and is not found after reboot.
It happens pretty consistently after a reboot, from what I can tell. I've only tested once, but looking back I can see that I didn't receive the "successful" email for the last production release either.
I'm not sure of a better location; it's either services-start (after everything has started), a WAN event (once WAN has started), or when AMTM is available (JFFS available). As I mentioned, we can't use post-mount for obvious reasons. If we want to move the hook call to "wan", we can specifically use wan-event, and for AMTM it would probably need to be "init-start"? The theory/benefit of moving it to one of those is that we would be sure those steps had completed (JFFS loaded and WAN connected), and we could also rule out services-start not being triggered due to a possible false error with a service at boot time (i.e. services-start is only triggered once ALL other services are started). To me that means that if a DDNS failure happened at boot due to some firmware error, it would stop the email from sending, even though we don't rely on the DDNS service, for example.
I see you included those changes in your PR :) Happy that stuck around! Although I noticed you didn't increase the sleep timers from 30 to 60. I figured it didn't hurt to add an additional static 30 seconds to the wait after a connection is ready. A lot of the time with AiMesh I notice the primary connection "flip-flops" while the nodes are reconnecting post-reboot. That is, the primary is first to come up and connect to WAN, and I can test connectivity from the primary or my desktop and confirm access; then, as the nodes come online and connect to the primary, we get little "blips" of no network connectivity at all (even on the primary) until all the nodes are connected. This gets worse the more nodes you have. It's quick; both my nodes connect within seconds, but that's seconds of network "unavailable" for 2 nodes.
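As an alternative to a longer static sleep, the "flip-flop" behaviour described above could be handled by requiring several consecutive successful connectivity checks before sending. The following is only a sketch under assumptions: `wait_until_stable`, the counts, and `PING_TARGET` are hypothetical names and values, not anything from the PR.

```shell
#!/bin/sh
# Hypothetical helper: succeed only once the check command has passed
# $1 times in a row; give up after $2 total attempts.
# $3 = seconds between attempts, rest = check command.
wait_until_stable() {
    need="$1"; max_tries="$2"; delay_secs="$3"; shift 3
    streak=0; tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$@" >/dev/null 2>&1; then
            streak=$((streak + 1))
            [ "$streak" -ge "$need" ] && return 0
        else
            # A single failed check (a "blip") resets the streak.
            streak=0
        fi
        tries=$((tries + 1))
        sleep "$delay_secs"
    done
    return 1
}

# Assumed usage (PING_TARGET is a placeholder for a host you trust):
#   wait_until_stable 3 30 5 ping -c 1 -W 2 "$PING_TARGET"
```

With these illustrative values, a connection that stays up passes after about 10 seconds, while one that keeps dropping is retried for up to roughly 30 attempts before the caller has to decide what to do.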
Fair enough, agreed; I just didn't want us to try to call those files and exit prematurely if they weren't available.
I see this is the route you took in your PR :) I'll review it in a moment.
Re-tested and confirmed this morning. Actually, I was thinking about this overnight: when /opt/etc/init.d/rc.unslung is called with "start", it loads a bunch of environment variables and paths for AMTM to function. Is it possible that we need to validate that those are ready/available? I've had a project before where the environment variables didn't load post-reboot and caused odd behavior, which is why I thought of it. After a reboot, the environment in which the script runs may not have all the necessary environment variables, paths, or functions loaded, right? We can also check the profile environment variables? [ -f /etc/profile ] && source /etc/profile
That's not how the "services-start" script works. The script is called in a non-blocking mode after all built-in services have been called and returned, regardless of success or failure. Remember that users don't always have all built-in services enabled (e.g. FTP, Samba, NFS, AiCloud, UPnP Media Server, etc.) so when called they do not have to succeed.
Got you!!
I'm not sure what you're referring to here. Our script does not have any Entware dependencies at all, whether Entware environment vars or paths. For email notifications, we need only a valid AMTM email configuration file - nothing else from AMTM.
The /etc/profile is used only for interactive shell sessions. When the script is executed following a post-reboot setup, it's launched in a non-interactive shell process.
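One quick way to see which kind of shell a script is running in is to look for the "i" flag in the special parameter `$-`. This is a generic shell sketch rather than anything from the script under discussion, and `shell_mode` is a hypothetical name.

```shell
#!/bin/sh
# Report whether the current shell is interactive by checking for the
# "i" flag in the special parameter $-.
shell_mode() {
    case "$-" in
        *i*) echo "interactive" ;;
        *)   echo "non-interactive" ;;
    esac
}

shell_mode
```

Logging this from a post-reboot hook versus typing it at an SSH prompt shows the difference directly, which helps explain why a command can behave differently in the two contexts.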
Right, but what does AMTM need to send the email? It likely relies on some of the environment variables, which we must wait for until they are loaded for it to function, correct?
Exactly!!! When I run this line in the interactive console:
it sends. But when it runs early and non-interactively, it crashes and creates the weird file I mentioned in your PR.
Genius, thank you for clarifying!
Cough Cough ;)
AMTM does not send the email; our script does all the work. I don't want to repeat myself but:
Again, all we need is to read a text file where variables have been defined by AMTM - that's it. Our script does all the heavy lifting to send emails. We don't rely on or depend on any other third-party script or environment vars & paths.
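To make the "read a text file of variables" point concrete, a pre-flight check along these lines could confirm the file is usable before attempting to send. It is a sketch under assumptions: it presumes the file contains plain shell variable assignments, and `config_is_usable` plus the `EXAMPLE_*` variable names are hypothetical, not AMTM's real names.

```shell
#!/bin/sh
# Hypothetical helper: return 0 only if the config file is readable,
# non-empty, and defines a non-empty value for every variable name
# passed after the path.
config_is_usable() {
    conf_file="$1"; shift
    [ -r "$conf_file" ] && [ -s "$conf_file" ] || return 1
    (
        # Source in a subshell so the caller's environment is untouched.
        . "$conf_file" || exit 1
        for var_name in "$@"; do
            eval "var_value=\"\${$var_name:-}\""
            [ -n "$var_value" ] || exit 1
        done
    )
}

# Assumed usage (path and variable names are placeholders):
#   config_is_usable "$EMAIL_CONF_PATH" EXAMPLE_TO EXAMPLE_SMTP || exit 1
```

A check like this needs no environment variables, paths, or Entware tooling; it only needs the file to exist on a mounted filesystem.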
I think you're looking at a "red herring." One clue is that the Lock file "age" is ZERO seconds, which means it was just created right before the MerlinAU script shown in the log started to execute. The file "age" would be at least 30 seconds if the Lock file had been created by a previous execution that went through the F/W Update process and then the router rebooted. That takes time - much longer than ZERO seconds. Are you absolutely sure that there is not another instance of the script that was launched before the instance that got terminated? I would triple-check and make sure that the services-start script is not launching more than one instance of the script. I really think something else is going on and it's not related to the Lock file.
I think you just solved this for us!
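The "age" reasoning above can be reproduced by hand. The sketch below is not the MerlinAU implementation: `file_age_secs` and `lock_state` are hypothetical helpers, and it assumes the local `date` accepts `-r FILE` (GNU and BusyBox builds generally do). The 600-second threshold mirrors the 10-minute stale period mentioned earlier in the thread.

```shell
#!/bin/sh
# Print the age of a file in whole seconds; fail if it does not exist.
# Assumes `date -r FILE` reports the file's modification time.
file_age_secs() {
    [ -e "$1" ] || return 1
    now_secs="$(date +%s)"
    mod_secs="$(date -r "$1" +%s)" || return 1
    echo "$((now_secs - mod_secs))"
}

# Classify a lock file as "none", "fresh" (probably held by a running
# instance), or "stale".
# $1 = lock file path, $2 = stale threshold in seconds.
lock_state() {
    age="$(file_age_secs "$1")" || { echo "none"; return 0; }
    if [ "$age" -ge "$2" ]; then
        echo "stale"
    else
        echo "fresh"
    fi
}

# Assumed usage (LOCK_FILE is a placeholder path):
#   lock_state "$LOCK_FILE" 600
```

A "fresh" result with an age near zero right after boot would point to a second instance having just created the lock, not to a leftover from before the reboot.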
Currently there is an issue where the email does not always send post-reboot after an upgrade.
It was initially reported by Tom (visortgw) and I was able to replicate it.
In short, the issue so far seems to be that services-start can be triggered before there is an active internet connection, or before AMTM is ready, which is required to send the email.