Various Small Fixes #355

ExtremeFiretop · 2024-11-14T07:03:36Z

Re-creation of PR: #352

Testing now.

Martinski4GitHub · 2024-11-14T09:15:26Z

Still the same issue?

Nov 14 02:18:13 scMerlin: Waiting for NTP to sync...
Nov 14 02:18:13 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 02:18:13 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 02:18:13 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 02:18:13 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 02:18:13 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 02:18:13 YazDHCP: Trying again with directory [/jffs/configs/SavedUserIcons]
Nov 14 02:18:13 YazDHCP: *WARNING*: Temporary Backup directory [/jffs/configs/SavedUserIcons]
Nov 14 02:18:13 kernel: link up LAN1
Nov 14 02:18:13 [MerlinAU.sh] 5110: **ERROR**: The shell script 'MerlinAU.sh' is already running [Lock file: 0 secs.]
Nov 14 02:18:13 [MerlinAU.sh] 5110: Exiting...

Is it possible that we are having issues during initialization that cause the AcquireLock function to fail somehow?

I still think this is a "red herring" and your test results with the new code changes point to a different problem unrelated to the Lock file.

ExtremeFiretop · 2024-11-14T09:16:42Z

@Martinski4GitHub

Testing latest commits now :)

ExtremeFiretop · 2024-11-14T09:41:13Z

@Martinski4GitHub

This is confirmed working now! :)

(Ignore the error in the email; the actual firmware was production and I entered the name for the beta in the offline prompt)

Nov 14 04:30:15 custom_script: Running /jffs/scripts/services-start
Nov 14 04:30:15 dbg: ==================
Nov 14 04:30:15 kernel: link down LAN1
Nov 14 04:30:15 scMerlin: Waiting for NTP to sync...
Nov 14 04:30:15 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:30:15 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:30:15 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:30:15 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:30:16 kernel: link up LAN1
Nov 14 04:30:16 pppd[5552]: Plugin rp-pppoe.so loaded.
Nov 14 04:30:16 pppd[5552]: RP-PPPoE plugin version 3.11 compiled against pppd 2.4.7
Nov 14 04:30:16 pppd[5578]: pppd 2.4.7 started by Admin, uid 0
Nov 14 04:30:16 kernel: SCSI subsystem initialized
Nov 14 04:30:16 rc_service: cfg_server 4801:notify_rc update_nbr
Nov 14 04:30:16 rc_service: waitting "restart_firewall" via  ...
Nov 14 04:30:16 kernel: usbcore: registered new interface driver usb-storage
Nov 14 04:30:16 [MerlinAU.sh] 4964: Post-update email notification hook was deleted successfully from '/jffs/scripts/services-start' script.
Nov 14 04:30:16 kernel: scsi host0: uas

Martinski4GitHub · 2024-11-14T10:01:39Z

Good job, bud. So the success is repeatable (i.e. after every reboot now)?
Let me know when you feel the PR is ready for review.
I'll be going to sleep in a couple of minutes so there's no rush. You can test & verify as much as you need to.

Talk to you in the evening. Have a good night, bud.

ExtremeFiretop · 2024-11-14T10:06:13Z

Latest result without me mis-identifying the firmware:

Nov 14 04:53:29 custom_script: Running /jffs/scripts/services-start
Nov 14 04:53:29 dbg: =====================
Nov 14 04:53:29 kernel: link down LAN1
Nov 14 04:53:29 scMerlin: Waiting for NTP to sync...
Nov 14 04:53:29 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:53:29 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:53:29 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:53:29 YazDHCP: Trying again with directory [/opt/var/SavedUserIcons]
Nov 14 04:53:29 YazDHCP: **INFO**:  Backup directory [/opt/var/SavedUserIcons] NOT FOUND.
Nov 14 04:53:29 YazDHCP: Trying again with directory [/jffs/configs/SavedUserIcons]
Nov 14 04:53:29 YazDHCP: *WARNING*: Temporary Backup directory [/jffs/configs/SavedUserIcons]
Nov 14 04:53:30 rc_service: cfg_server 4803:notify_rc update_nbr
Nov 14 04:53:30 rc_service: waitting "restart_firewall" via  ...
Nov 14 04:53:30 kernel: link up LAN1
Nov 14 04:53:30 pppd[5549]: Plugin rp-pppoe.so loaded.
Nov 14 04:53:30 pppd[5549]: RP-PPPoE plugin version 3.11 compiled against pppd 2.4.7
Nov 14 04:53:30 pppd[5608]: pppd 2.4.7 started by Admin, uid 0
Nov 14 04:53:30 kernel: SCSI subsystem initialized
Nov 14 04:53:30 [MerlinAU.sh] 4988: Post-update email notification hook was deleted successfully from '/jffs/scripts/services-start' script.
Nov 14 04:53:30 kernel: usbcore: registered new interface driver usb-storage
Nov 14 04:53:30 [MerlinAU.sh] 4988: Cron job hook was added successfully to '/jffs/scripts/services-start' script.

Martinski4GitHub · 2024-11-14T10:14:07Z

I'll take a closer look & review in the evening.
Good night!!

ExtremeFiretop · 2024-11-14T10:16:43Z

Goodnight buddy!

Martinski4GitHub · 2024-11-15T06:22:34Z

@ExtremeFiretop.

Apologies for the delay. It's been a very busy day...

I’ve reviewed the PR changes and while I understand what the 2 function calls do, I still don’t quite understand how they fix the problem that you were repeatedly reproducing last night (and early this morning) where 2 instances of the script were apparently being executed, one right after the other, resulting in the 2nd instance terminating quickly due to the Lock file (created by the 1st instance) with "age" of ZERO or ONE second.

Would you mind explaining how you see this PR solving that particular problem?

ExtremeFiretop · 2024-11-15T06:45:23Z

No worries buddy! I totally understand.
I was so tired last night by the end of our troubleshooting session I didn't have the energy to go over it then haha!

Basically you got it, you pointed me in the right direction as soon as you said "you sure there isn't 2 executed?"

The services-start script had 2 calls to MerlinAU inside. The first is the regular addcronjob call, which runs whenever the router reboots to add the cron jobs. The second is added only when we actually do an upgrade, to handle the post-reboot email, etc.

That second call was the one giving us the lock error, and failing to send the email, but in services-start the very first call (in order) was the addcronjob call.

I couldn't see anything in the syslogs indicating that the first call existed, and I couldn't connect via SSH when that call was run (too early in the boot process to connect). So I took a guess that the first call for addcronjob was running first and creating the lock file, and then milliseconds later services-start would call the next MerlinAU background job to do the post-upgrade email and it would encounter the lock file.

It became a race condition between the 2 calls since they are both background jobs. The random times it worked would be just because the second call sent the email before the first call created the lock file, but if the first call created the lock file first then the second call wouldn't send the email!
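
To make the failure mode concrete, here is a minimal sketch of a lock-file check of the kind described above. It is not the actual MerlinAU AcquireLock code: `LOCK_FILE`, `MAX_LOCK_AGE_SECS`, `acquire_lock`, and `release_lock` are hypothetical names, and the check-then-create step is deliberately simple rather than atomic.

```shell
#!/bin/sh
# Hypothetical sketch of a simple lock-file check; NOT the actual MerlinAU code.
# LOCK_FILE and MAX_LOCK_AGE_SECS are illustrative names and values.
LOCK_FILE="${LOCK_FILE:-/tmp/demo_script.lock}"
MAX_LOCK_AGE_SECS=600

acquire_lock()
{
    if [ -f "$LOCK_FILE" ]
    then
        # The lock file stores its creation time in epoch seconds.
        lockAge="$(( $(date +%s) - $(cat "$LOCK_FILE") ))"
        if [ "$lockAge" -lt "$MAX_LOCK_AGE_SECS" ]
        then
            # A fresh lock means another instance is running: give up.
            echo "ERROR: script is already running [Lock file: $lockAge secs.]"
            return 1
        fi
        # Otherwise the lock is stale (e.g. a crashed instance): take it over.
    fi
    date +%s > "$LOCK_FILE"
    return 0
}

release_lock() { rm -f "$LOCK_FILE" ; }
```

With a check like this, whichever background job reaches `acquire_lock` second during boot sees a lock that is only 0 or 1 seconds old and exits, which would match the "Lock file: 0 secs." entry in the syslog.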

This explains why when we added more code for WAN check, and more delay and debugging to the second call to troubleshoot, it actually slowed it down more compared to the first, so it would happen more consistently and caused the problem to get "worse" instead of being 50/50 or 60/40. The extra code was causing it to "lose the race".

The solution, tested 4 times now, is to just remove the addcronjob hook from services-start when we do an upgrade, and add it back when we are done in the post-upgrade hook.
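
A rough sketch of that remove-then-restore approach, under stated assumptions: the hook line text, `HOOK_SCRIPT`, `remove_cron_hook`, and `add_cron_hook` are all hypothetical and do not reproduce the real MerlinAU hook or function names.

```shell
#!/bin/sh
# Hypothetical sketch; the hook text and function names are assumptions,
# not the actual MerlinAU implementation.
HOOK_SCRIPT="${HOOK_SCRIPT:-/jffs/scripts/services-start}"
CRON_HOOK_LINE='sh /path/to/script.sh addCronJob &  # cron hook (illustrative)'

remove_cron_hook()
{
    [ -f "$HOOK_SCRIPT" ] || return 0
    # Keep every line except exact matches of the hook
    # (-F fixed string, -x whole line, -v invert).
    # grep exits 1 when nothing is left to print, which is fine here.
    grep -vxF "$CRON_HOOK_LINE" "$HOOK_SCRIPT" > "${HOOK_SCRIPT}.tmp" || true
    # Write back in place so the script keeps its permissions.
    cat "${HOOK_SCRIPT}.tmp" > "$HOOK_SCRIPT"
    rm -f "${HOOK_SCRIPT}.tmp"
}

add_cron_hook()
{
    [ -f "$HOOK_SCRIPT" ] || printf '#!/bin/sh\n' > "$HOOK_SCRIPT"
    # Append only if the hook is missing, so repeated calls are harmless.
    grep -qxF "$CRON_HOOK_LINE" "$HOOK_SCRIPT" || \
        printf '%s\n' "$CRON_HOOK_LINE" >> "$HOOK_SCRIPT"
}
```

The idea would be to call `remove_cron_hook` just before the upgrade reboot and `add_cron_hook` once the post-upgrade work has finished, so only one call to the script is present in services-start during that boot.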

I must say this is the first time I've encountered something like this, so I really appreciated having you to run through the logic with.

Martinski4GitHub · 2024-11-15T08:23:54Z

OK, thanks for taking the time to explain & clarify. So yeah, I see the possibility of the "race condition" when more than one call to MerlinAU is added to the "services-start" script. That must be dealt with to avoid early termination. However, upon further review of the code, I see 2 issues with the current solution:

1. Deleting the cron job hook for "addCronJob" when doing the F/W upgrade and then adding it back within the post-reboot update email notification function works only if email notifications have been enabled by the user.
   As we know, not all users have email configuration set up in AMTM, and not all users enable email notifications for our script. For such users, the cron job would not be restored after the automatic reboot because the email notification call was not set up.

2. Even if email notifications are enabled in AMTM & our script, some users may not have the F/W update check enabled, but your changes assume it’s always enabled by adding the cron job hook unconditionally.

Solutions can be found for the above issues, but I think I'd like to modify the File Lock function and add code to handle the execution of 2 or more instances of the MerlinAU script so that it looks like they executed in a staggered pattern, separated by a few seconds. This would work even for other future calls we may add in the "services-start" script, so it would be a more "general" solution.
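
The staggered idea could be sketched roughly as follows. This is only an illustration under assumptions, not the actual change to the MerlinAU File Lock function: it swaps in an atomic `mkdir`-based lock in place of a lock file, and `LOCK_DIR`, `acquire_lock_staggered`, and `release_lock_staggered` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sketch of a "wait your turn" lock; names and timings are
# assumptions, not the MerlinAU implementation.
LOCK_DIR="${LOCK_DIR:-/tmp/demo_script.lockdir}"

acquire_lock_staggered()
{
    maxWaitSecs="${1:-30}"
    waitedSecs=0
    # mkdir is atomic: only one instance can create the directory.
    until mkdir "$LOCK_DIR" 2>/dev/null
    do
        if [ "$waitedSecs" -ge "$maxWaitSecs" ]
        then
            echo "ERROR: lock still held after $waitedSecs secs. Exiting..."
            return 1
        fi
        # Another instance holds the lock: wait a second and try again
        # instead of terminating right away.
        sleep 1
        waitedSecs="$((waitedSecs + 1))"
    done
    return 0
}

release_lock_staggered() { rmdir "$LOCK_DIR" 2>/dev/null ; }
```

With this shape, two instances launched back to back from "services-start" would run one after the other rather than the second one exiting, and the same would hold for any further calls added later.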

ExtremeFiretop · 2024-11-15T08:43:39Z

I actually thought about point 2, but didn't think about point 1. I was going to add some conditional to address point 2. But I like your idea more :)

ExtremeFiretop · 2024-11-15T09:15:36Z

"...2 or more instances of the MerlinAU script so that it looks like they executed in a staggered pattern."

I believe this is also why when I added a sleep of 120 seconds, it worked. I was manually creating a staggered pattern :P

ExtremeFiretop · 2024-11-17T00:28:56Z

@Martinski4GitHub

Added a few more small fixes to review.
This is ready to merge in my eyes! :D Everything has been tested with positive results.

Martinski4GitHub · 2024-11-18T02:57:04Z

OK, I'm back online. My apologies for the delay. Yesterday, we went to my brother's home to celebrate one of my nieces' birthday, and we came back rather late - it was past midnight already - and I just crashed on the bed and went to sleep right away.

Anyway, I'll start to review the PR now. Man, a lot of messages to go through, LOL ;>)

ExtremeFiretop · 2024-11-18T03:01:38Z

No worries buddy, I'm around. No rush!
I'm sure your brother enjoyed having you there to celebrate!

Ignore most of my blabber on the one we were troubleshooting; the TL;DR is that the new WAN function was causing the issue on the node.

It's been tested multiple times with success. We know the issue was the multiple sessions causing the lock file to get screwy, so I just removed the new function and everything worked perfectly!!

Martinski4GitHub

Approved and good to go for production!!

Martinski4GitHub · 2024-11-18T04:47:09Z

@ExtremeFiretop,
There are a few simple changes I'd like to make to have more log entries in the system log file while waiting for the timeouts.
Having more log entries can be very helpful when debugging issues that happen during the reboot sequence, IMO.

ExtremeFiretop · 2024-11-18T04:47:39Z

Question, was there anything else you wanted to touch up or add?
The official release of 3006.102. was released today.

ExtremeFiretop · 2024-11-18T04:48:16Z

Gotcha! I'm heading out for 30 minutes anyways, I'll be back to chat in a few!

Martinski4GitHub · 2024-11-18T04:48:28Z

Yep, working on it...

Update MerlinAU.sh

123831f

ExtremeFiretop requested a review from Martinski4GitHub as a code owner November 14, 2024 07:03

ExtremeFiretop assigned ExtremeFiretop and Martinski4GitHub Nov 14, 2024

Update MerlinAU.sh

2f802be

ExtremeFiretop added the "bug" label (Something isn't working) Nov 14, 2024

ExtremeFiretop closed this Nov 14, 2024

ExtremeFiretop reopened this Nov 14, 2024

ExtremeFiretop added 2 commits November 14, 2024 04:13

Update MerlinAU.sh

f5049c6

Update MerlinAU.sh

f9b5cf6

ExtremeFiretop added 4 commits November 14, 2024 04:21

Update MerlinAU.sh

37ef83a

Update MerlinAU.sh

f055bd9

Update MerlinAU.sh

92c35ee

Update MerlinAU.sh

eb83d07

ExtremeFiretop commented Nov 14, 2024

Update MerlinAU.sh

9a67caf

ExtremeFiretop changed the title ~~Fix Post Update Release Lock Email Issue~~ Fix For Post Upgrade Email Issue (Multi-Lock) Nov 14, 2024

Update MerlinAU.sh

cf6d0dc

ExtremeFiretop added 8 commits November 16, 2024 02:27

Update MerlinAU.sh

777f462

Merge branch 'dev' into ExtremeFiretop-PostUpdateEmailFix

7cbee1c

Update MerlinAU.sh

1ee9510

Update MerlinAU.sh

6518e67

Update MerlinAU.sh

543f26f

Update MerlinAU.sh

438c128

Update MerlinAU.sh

f8f0cd9

Update MerlinAU.sh

b58d525

ExtremeFiretop changed the title ~~Various Fixes~~ Various Small Fixes Nov 16, 2024

Update MerlinAU.sh

23fe688

ExtremeFiretop commented Nov 17, 2024

ExtremeFiretop added 3 commits November 16, 2024 19:06

Update MerlinAU.sh

42fdc44

Update MerlinAU.sh

0914b1b

Update MerlinAU.sh

e0a72f0

ExtremeFiretop commented Nov 17, 2024

Update MerlinAU.sh

5b3da64

ExtremeFiretop commented Nov 17, 2024

Update MerlinAU.sh

684665e

Martinski4GitHub approved these changes Nov 18, 2024

Martinski4GitHub merged commit c1fd054 into dev Nov 18, 2024
1 check passed

ExtremeFiretop deleted the ExtremeFiretop-PostUpdateEmailFix branch November 18, 2024 07:17

ExtremeFiretop mentioned this pull request Nov 18, 2024

Dev 1.3.5 as Next Stable Release #358

Merged

ExtremeFiretop mentioned this pull request Nov 30, 2024

Dev 1.3.7 as Next Stable Release #369

Merged
