Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various Small Fixes #355

Merged
merged 33 commits into from
Nov 18, 2024
Merged

Various Small Fixes #355

merged 33 commits into from
Nov 18, 2024

Conversation

ExtremeFiretop
Copy link
Owner

@ExtremeFiretop ExtremeFiretop commented Nov 14, 2024

Re-creation of PR: #352

Testing now.

@ExtremeFiretop ExtremeFiretop added the bug Something isn't working label Nov 14, 2024
@Martinski4GitHub
Copy link
Collaborator

Still the same issue?

Nov 14 02:18:13 scMerlin: Waiting for NTP to sync...
Nov 14 02:18:13 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 02:18:13 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 02:18:13 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 02:18:13 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 02:18:13 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 02:18:13 YazDHCP: Trying again with directory [/jffs/configs/SavedUserIcons]
Nov 14 02:18:13 YazDHCP: *WARNING*: Temporary Backup directory [/jffs/configs/SavedUserIcons]
Nov 14 02:18:13 kernel: link up LAN1
Nov 14 02:18:13 [MerlinAU.sh] 5110: **ERROR**: The shell script 'MerlinAU.sh' is already running [Lock file: 0 secs.]
Nov 14 02:18:13 [MerlinAU.sh] 5110: Exiting...

is it possible we are having issues when initialization causing the AcquireLock function to fail somehow?

I still think this is a "red herring" and your test results with the new code changes point to a different problem unrelated to the Lock file.

@ExtremeFiretop
Copy link
Owner Author

ExtremeFiretop commented Nov 14, 2024

@Martinski4GitHub

Testing latest commits now :)

@ExtremeFiretop
Copy link
Owner Author

@Martinski4GitHub

This is confirmed working now! :)

image
(Ignore the error in the email; the actual firmware was production and I entered the name for the beta in the offline prompt)

Nov 14 04:30:15 custom_script: Running /jffs/scripts/services-start
Nov 14 04:30:15 dbg: ==================
Nov 14 04:30:15 kernel: link down LAN1
Nov 14 04:30:15 scMerlin: Waiting for NTP to sync...
Nov 14 04:30:15 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:30:15 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:30:15 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:30:15 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:30:16 kernel: link up LAN1
Nov 14 04:30:16 pppd[5552]: Plugin rp-pppoe.so loaded.
Nov 14 04:30:16 pppd[5552]: RP-PPPoE plugin version 3.11 compiled against pppd 2.4.7
Nov 14 04:30:16 pppd[5578]: pppd 2.4.7 started by Admin, uid 0
Nov 14 04:30:16 kernel: SCSI subsystem initialized
Nov 14 04:30:16 rc_service: cfg_server 4801:notify_rc update_nbr
Nov 14 04:30:16 rc_service: waitting "restart_firewall" via  ...
Nov 14 04:30:16 kernel: usbcore: registered new interface driver usb-storage
Nov 14 04:30:16 [MerlinAU.sh] 4964: Post-update email notification hook was deleted successfully from '/jffs/scripts/services-start' script.
Nov 14 04:30:16 kernel: scsi host0: uas

MerlinAU.sh Outdated Show resolved Hide resolved
@Martinski4GitHub
Copy link
Collaborator

@Martinski4GitHub

This is confirmed working now! :)
...

Nov 14 04:30:15 custom_script: Running /jffs/scripts/services-start
Nov 14 04:30:15 dbg: ==================
Nov 14 04:30:15 kernel: link down LAN1
Nov 14 04:30:15 scMerlin: Waiting for NTP to sync...
Nov 14 04:30:15 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:30:15 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:30:15 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:30:15 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:30:16 kernel: link up LAN1
Nov 14 04:30:16 pppd[5552]: Plugin rp-pppoe.so loaded.
Nov 14 04:30:16 pppd[5552]: RP-PPPoE plugin version 3.11 compiled against pppd 2.4.7
Nov 14 04:30:16 pppd[5578]: pppd 2.4.7 started by Admin, uid 0
Nov 14 04:30:16 kernel: SCSI subsystem initialized
Nov 14 04:30:16 rc_service: cfg_server 4801:notify_rc update_nbr
Nov 14 04:30:16 rc_service: waitting "restart_firewall" via  ...
Nov 14 04:30:16 kernel: usbcore: registered new interface driver usb-storage
Nov 14 04:30:16 [MerlinAU.sh] 4964: Post-update email notification hook was deleted successfully from '/jffs/scripts/services-start' script.
Nov 14 04:30:16 kernel: scsi host0: uas

Good job, bud. So the success is repeatable (i.e. after every reboot now)?
Let me know when you feel the PR is ready for review.
I'll be going to sleep in a couple of minutes so there's no rush. You can test & verify as much as you need to.

Talk to you in the evening. Have a good night, bud.

@ExtremeFiretop
Copy link
Owner Author

Latest result without me mis-identifying the firmware:

Nov 14 04:53:29 custom_script: Running /jffs/scripts/services-start
Nov 14 04:53:29 dbg: =====================
Nov 14 04:53:29 kernel: link down LAN1
Nov 14 04:53:29 scMerlin: Waiting for NTP to sync...
Nov 14 04:53:29 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:53:29 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:53:29 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:53:29 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:53:29 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:53:29 YazDHCP: Trying again with directory [/jffs/configs/SavedUserIcons]
Nov 14 04:53:29 YazDHCP: *WARNING*: Temporary Backup directory [/jffs/configs/SavedUserIcons]
Nov 14 04:53:30 rc_service: cfg_server 4803:notify_rc update_nbr
Nov 14 04:53:30 rc_service: waitting "restart_firewall" via  ...
Nov 14 04:53:30 kernel: link up LAN1
Nov 14 04:53:30 pppd[5549]: Plugin rp-pppoe.so loaded.
Nov 14 04:53:30 pppd[5549]: RP-PPPoE plugin version 3.11 compiled against pppd 2.4.7
Nov 14 04:53:30 pppd[5608]: pppd 2.4.7 started by Admin, uid 0
Nov 14 04:53:30 kernel: SCSI subsystem initialized
Nov 14 04:53:30 [MerlinAU.sh] 4988: Post-update email notification hook was deleted successfully from '/jffs/scripts/services-start' script.
Nov 14 04:53:30 kernel: usbcore: registered new interface driver usb-storage
Nov 14 04:53:30 [MerlinAU.sh] 4988: Cron job hook was added successfully to '/jffs/scripts/services-start' script.

image

@Martinski4GitHub
Copy link
Collaborator

Latest result without me mis-identifying the firmware:

I'll take a closer look & review in the evening.
Good night!!

@ExtremeFiretop
Copy link
Owner Author

Latest result without me mis-identifying the firmware:

I'll take a closer look & review in the evening. Good night!!

Goodnight buddy!

@ExtremeFiretop ExtremeFiretop changed the title Fix Post Update Release Lock Email Issue Fix For Post Upgrade Email Issue (Multi-Lock) Nov 14, 2024
@Martinski4GitHub
Copy link
Collaborator

@ExtremeFiretop.

Apologies for the delay. It's been a very busy day...

I’ve reviewed the PR changes and while I understand what the 2 function calls do, I still don’t quite understand how they fix the problem that you were repeatedly duplicating last night (and early this morning) where 2 instances of the script were apparently being executed, one right after the other, resulting in the 2nd instance terminating quickly due to the Lock file (created by the 1st instance) with "age" of ZERO or ONE second.

Would you mind explaining how you see this PR solving that particular problem?

@ExtremeFiretop
Copy link
Owner Author

ExtremeFiretop commented Nov 15, 2024

@ExtremeFiretop.

Apologies for the delay. It's been a very busy day...

I’ve reviewed the PR changes and while I understand what the 2 function calls do, I still don’t quite understand how they fix the problem that you were repeatedly duplicating last night (and early this morning) where 2 instances of the script were apparently being executed, one right after the other, resulting in the 2nd instance terminating quickly due to the Lock file (created by the 1st instance) with "age" of ZERO or ONE second.

Would you mind explaining how you see this PR solving that particular problem?

No worries buddy! I totally understand.
I was so tired last night by the end of our troubleshooting session I didn't have the energy to go over it then haha!

Basically you got it, you pointed me in the right direction as soon as you said "you sure there isn't 2 executed?"

The services-start script had 2 calls of MerlinAU inside, the first which is the regular addcronjob call inside services-start when the router reboots, etc normally to add the cron jobs. And the second call when we actually do an upgrade which is added to handle the post reboot email, etc.

That second call was the one giving us the lock error, and failing to send the email, but in services-start the very first call (in order) was the addcronjob call.

So while I couldn't see anything in the syslogs indicating that call existed. And I couldn't connect via SSH when that call was run. (Too early in the boot process to connect) I took a guess that the first call for addcronjob was running first and creating the lock file, and then milliseconds later services-start would call the next MerlinAU background job to do the post upgrade email and it would encounter the lock file.

It became a race condition between the 2 calls since they are both backgrounds jobs. The random times it worked would be just because the second call sending the email before the first call created the lock file, but if the first call created the lock file first then the second call wouldn't send the email!

This explains why when we added more code for WAN check, and more delay and debugging to the second call to troubleshoot, it actually slowed it down more compared to the first, so it would happen more consistently and caused the problem to get "worse" instead of being 50/50 or 60/40. The extra code was causing it to "lose the race".

The solution tested 4 times now, is just remove the addcronjob hook from services-start when we do an upgrade, and add it back when we are done in the post upgrade hook.

I must say this problem is the first time in encountered something like this before, so I really appreciated having you run through the logic with.

@Martinski4GitHub
Copy link
Collaborator

@ExtremeFiretop.
Apologies for the delay. It's been a very busy day...
I’ve reviewed the PR changes and while I understand what the 2 function calls do, I still don’t quite understand how they fix the problem that you were repeatedly duplicating last night (and early this morning) where 2 instances of the script were apparently being executed, one right after the other, resulting in the 2nd instance terminating quickly due to the Lock file (created by the 1st instance) with "age" of ZERO or ONE second.
Would you mind explaining how you see this PR solving that particular problem?

No worries buddy! I totally understand. I was so tired last night by the end of our troubleshooting session I didn't have the energy to go over it then haha!

Basically you got it, you pointed me in the right direction as soon as you said "you sure there isn't 2 executed?"

The services-start script had 2 calls of MerlinAU inside, the first which is the regular addcronjob call inside services-start when the router reboots, etc normally to add the cron jobs. And the second call when we actually do an upgrade which is added to handle the post reboot email, etc.

That second call was the one giving us the lock error, and failing to send the email, but in services-start the very first call (in order) was the addcronjob call.

So while I couldn't see anything in the syslogs indicating that call existed. And I couldn't connect via SSH when that call was run. (Too early in the boot process to connect) I took a guess that the first call for addcronjob was running first and creating the lock file, and then milliseconds later services-start would call the next MerlinAU background job to do the post upgrade email and it would encounter the lock file.

It became a race condition between the 2 calls since they are both backgrounds jobs. The random times it worked would be just because the second call sending the email before the first call created the lock file, but if the first call created the lock file first then the second call wouldn't send the email!

This explains why when we added more code for WAN check, and more delay and debugging to the second call to troubleshoot, it actually slowed it down more compared to the first, so it would happen more consistently and caused the problem to get "worse" instead of being 50/50 or 60/40. The extra code was causing it to "lose the race".

The solution tested 4 times now, is just remove the addcronjob hook from services-start when we do an upgrade, and add it back when we are done in the post upgrade hook.

I must say this problem is the first time in encountered something like this before, so I really appreciated having you run through the logic with.

OK, thanks for taking the time to explain & clarify. So yeah, I see the possibility of the "race condition" when more than one call to MerlinAU is added to the "services-start" script. That must be dealt with to avoid early termination. However, upon further review of the code, I see 2 issues with the current solution:

  1. Deleting the cron job hook for "addCronJob" when doing the F/W upgrade and then adding it back within the post-reboot update email notification function works only if email notifications have been enabled by the user.

    As we know, not all users have email configuration setup in AMTM, and not all users enable email notifications for our script. For such users, the cron job would not be restored after the automatic reboot because the email notification call was not set up.

  2. Even if email notifications are enabled in AMTM & our script, some users may not have the F/W update check enabled but your changes assume it’s always enabled by adding the cron job hook unconditionally.

Solutions can be found for the above issues, but I think I'd like to modify the File Lock function and add code to handle the execution of 2 or more instances of the MerlinAU script so that it looks like they executed in a staggered pattern, separated by a few seconds. This would work even for other future calls we may add in the "services-start" script, so it would be a more "general" solution.

@ExtremeFiretop
Copy link
Owner Author

ExtremeFiretop commented Nov 15, 2024

Solutions can be found for the above issues, but I think I'd like to modify the File Lock function and add code to handle the execution of 2 or more instances of the MerlinAU script so that it looks like they executed in a staggered pattern, separated by a few seconds. This would work even for other future calls we may add in the "services-start" script, so it would be a more "general" solution.

I actually though about point 2, but didn't think about point 1. I was going to add some conditional to address point 2. But I like your idea more :)

@ExtremeFiretop
Copy link
Owner Author

2 or more instances of the MerlinAU script so that it looks like they executed in a staggered pattern.

I believe this is also why when I added a sleep of 120 seconds, it worked. I was manually creating a staggered pattern :P

@ExtremeFiretop ExtremeFiretop changed the title Various Fixes Various Small Fixes Nov 16, 2024
@ExtremeFiretop
Copy link
Owner Author

@Martinski4GitHub

Added a few more small fixes to review.
This is ready for merger in my eyes! :D All has been tested with positive results.

MerlinAU.sh Outdated Show resolved Hide resolved
@Martinski4GitHub
Copy link
Collaborator

@Martinski4GitHub

Added a few more small fixes to review. This is ready for merger in my eyes! :D All has been tested with positive results.

OK, I'm back online. My apologies for the delay. Yesterday, we went to my brother's home to celebrate one of my nieces birthday, and we came back rather late - it was past midnight already - and I just crashed on the bed and went to sleep right away.

Anyway, I'll start to review the PR now. Man, a lot of messages to go through, LOL ;>)

@ExtremeFiretop
Copy link
Owner Author

@Martinski4GitHub
Added a few more small fixes to review. This is ready for merger in my eyes! :D All has been tested with positive results.

OK, I'm back online. My apologies for the delay. Yesterday, we went to my brother's home to celebrate one of my nieces birthday, and we came back rather late - it was past midnight already - and I just crashed on the bed and went to sleep right away.

Anyway, I'll start to review the PR now. Man, a lot of messages to go through, LOL ;>)

No worries buddy, I'm around. No rush!
I'm sure your brother enjoyed having you there to celebrate!

Ignore most of my blabber on the one we were troubleshooting it ended up being the TLDR is the new WAN function was causing the issue on the node.

Been tested multiple times with success, we know the issue was the multiple sessions causing the lock file to get screwey so I just removed the new function and all worked perfectly!!

Copy link
Collaborator

@Martinski4GitHub Martinski4GitHub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved and good to go for production!!

@Martinski4GitHub Martinski4GitHub merged commit c1fd054 into dev Nov 18, 2024
1 check passed
@Martinski4GitHub
Copy link
Collaborator

Approved and good to go for production!!

@ExtremeFiretop,
There are a few simple changes I'd like to make to have more log entries in the system log file while waiting for the timeouts.
Having more log entries can be very helpful when debugging issues that happen during the reboot sequence, IMO.

@ExtremeFiretop
Copy link
Owner Author

Approved and good to go for production!!

Question, was there anything else you wanted to touch up or add?
The official release of 3006.102. was released today.

@ExtremeFiretop
Copy link
Owner Author

Approved and good to go for production!!

@ExtremeFiretop, There are a few simple changes I'd like to make to have more log entries in the system log file while waiting for the timeouts. Having more log entries can be very helpful when debugging issues that happen during the reboot sequence, IMO.

Gotcha! I'm heading out for 30 minutes anyways, I'll be back to chat in a few!

@Martinski4GitHub
Copy link
Collaborator

Approved and good to go for production!!

Question, was there anything else you wanted to touch up or add? The official release of 3006.102. was released today.

Yep, working on it...

@ExtremeFiretop ExtremeFiretop deleted the ExtremeFiretop-PostUpdateEmailFix branch November 18, 2024 07:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants