Cron job not running after crashed once #23054

QuentinFarizon · 2019-05-29T14:28:46Z

Preconditions (*)

Magento EE 2.2.8
Crontab configured as per the documentation

Steps to reproduce (*)

An error makes the cron job indexer_update_all_views fail once (in my case, database unavailable)

Expected result (*)

Current run should be marked as failed in the table cron_schedule
Next run should run correctly and status be updated at the end in cron_schedule

Actual result (*)

Table cron_schedule is filled with pending jobs, no job for indexer_update_all_views is run (no output in var/log/cron.log, no status update in cron_schedule table.

Logs :
var/log/cron.log (last success + the error message)

[2019-05-29 11:25:10] report.INFO: Cron Job indexer_update_all_views is run [] []
[2019-05-29 11:28:03] report.ERROR: Cron Job indexer_update_all_views has an error: SQLSTATE[08S01]: Communication link failure: 1047 WSREP has not yet prepared node for application use, query was: SELECT `mview_state`.* FROM `mview_state` WHERE (`mview_state`.`view_id`='catalog_product_flat'). Statistics: {"sum":0,"count":1,"realmem":0,"emalloc":0,"realmem_start":182714368,"emalloc_start":180313880} [] []

=> And then no more logs about indexer_update_all_views, even if other jobs from the index group run correctly and output success in var/log/cron.log
Database recovered a minute after and query was OK

The text was updated successfully, but these errors were encountered:

m2-assistant · 2019-05-29T14:28:50Z

Hi @QuentinFarizonAfrimarket. Thank you for your report.
To help us process this issue please make sure that you provided the following information:

Summary of the issue
Information on your environment
Steps to reproduce
Expected and actual results

Please make sure that the issue is reproducible on the vanilla Magento instance following Steps to reproduce. To deploy vanilla Magento instance on our environment, please, add a comment to the issue:

@magento give me 2.3-develop instance - upcoming 2.3.x release

For more details, please, review the Magento Contributor Assistant documentation.

@QuentinFarizonAfrimarket do you confirm that you were able to reproduce the issue on vanilla Magento instance following steps to reproduce?

yes
no

Join Magento Community Engineering Slack and ask your questions in #github channel.

hostep · 2019-05-30T15:48:04Z

I agree, this definitely needs fixing.

Just FYI: there has recently been a proposal to review the current cron implementation and improve its stability: magento/architecture#171

QuentinFarizon · 2019-05-31T07:16:55Z

Hello @hostep thank you !
I agree with the conclusion of magento/architecture#171 : cron management must be more resilient, fireproof, and protected against periodic or permanent failing of one of the job codes.

I think I found a reproducible scenario that caused issue on my system :

Indexing job (group "index") configured as "no separate proces"
Indexing runs out of memory, process is killed by the system (or process crashes violently for another reason)
As process didn't have the occasion to set the job as "error", it stays in running
No new indexing job runs (because one is running)
Running jobs are cleaned after max(successLifetime, erroLifetime) which by default is 3 days (!) for group index

Consequences
=> No error or few errors in cron.log or other logs
=> No reporting on the back-office or cli, apart from indexer status keeping piling up
=> Index job marked as "running" for 3 days after a single error

Workaround :
=> Increase memory limit
=> Set error lifetime to something between 1 and 6 hours (should be sufficient for indexing)

Ideas :
=> Store host+pid in database to regularly check for crashed processes (when you're on the correct host)

QuentinFarizon · 2019-05-31T08:31:35Z

@hostep Furthermore, I don't think the database lock (created in ProcessCronQueueObserver::lockGroup) is ever released if process crash without having a chance to unlock it.

So it may never run again, what do you think ?

hostep · 2019-05-31T17:44:52Z

Running jobs are cleaned after max(successLifetime, erroLifetime) which by default is 3 days (!) for group index

This is not entirely true, since it doesn't clean up jobs with status running:

magento2/app/code/Magento/Cron/Observer/ProcessCronQueueObserver.php

Lines 496 to 499 in 48d8d43

    
           Schedule::STATUS_SUCCESS => $historySuccess * self::SECONDS_IN_MINUTE, 
        
           Schedule::STATUS_MISSED => $historyFailure * self::SECONDS_IN_MINUTE, 
        
           Schedule::STATUS_ERROR => $historyFailure * self::SECONDS_IN_MINUTE, 
        
           Schedule::STATUS_PENDING => max($historyFailure, $historySuccess) * self::SECONDS_IN_MINUTE,

As far as I know, running jobs are just never cleaned up, even if they are no longer actually running. And that's the biggest problem here.

=> Set error lifetime to something between 1 and 6 hours (should be sufficient for indexing)

That's a good point, the lifetime was already lowered a lot in 244a2e9, but for the indexing group, it probably makes sense to lower the default failure history even further.
Although, how many errors do you actually see with indexer jobs? Not that many I reckon? But it might still make sense to lower it, since that lifetime is also used to cleanup pending jobs.

Furthermore, I don't think the database lock (created in ProcessCronQueueObserver::lockGroup) is ever released if process crash without having a chance to unlock it.

I'm not exactly sure what happens with a lock acquired in a mysql server when the php code suddenly stops, I would think that the connection with mysql then closes and the lock is released automatically, but I'm not sure about this...

Btw: the locking is not used to manage the statuses of the jobs themselves as far as I know.
The reason why this locking was necessary (if I remember correctly) was because sometimes you could have situations where the same cron group was being executed multiple times in parallel, because the first run wasn't fully done in a single minute (Magento recommends crontab to be set so it runs every minute). And you don't want to have a new process trying to manage the statuses of jobs in the same cron group before the existing process has finished its work.

QuentinFarizon · 2019-06-03T08:06:33Z

Thanks @hostep

Indeed you're right, running jobs are never cleaned. I think it would make sens to clear the after $historyFailure or max($historyFailure, $historySuccess)
There are so many reason a process can fail (out of memory, server reboot, ...), having your indexing stuck until you clear the line in the database is a very high price to pay.

I think this requires a quick fix in 2.2 and 2.3 release lines, until your proposition for hardening crons is implemented. What do you think ?

Indeed, from my observations the mysql lock seems to be released, so it should'nt be an additional issue (needs confirmation)

QuentinFarizon · 2019-06-03T10:21:50Z

@magento give me 2.2.8 instance

magento-engcom-team · 2019-06-03T10:21:52Z

Hi @QuentinFarizonAfrimarket. Thank you for your request. I'm working on Magento 2.2.8-develop instance for you

QuentinFarizon · 2019-06-03T10:29:56Z

@magento give me 2.2.8 instance

magento-engcom-team · 2019-06-03T10:29:57Z

Hi @QuentinFarizonAfrimarket. Thank you for your request. I'm working on Magento 2.2.8 instance for you

QuentinFarizon · 2019-06-03T14:00:00Z

Hello @hostep

I have another issue : individual mview updates (like catalog_product_flat) are stuck in 'working' mode in the table mviews_state.
As they are not hold by any process, and are not cleaned (I believe) so they will be stuck forever.
So even with an indexer job running correctly, you can get individual views that are not updated following their changelogs.

I suspect that this modification : 5927a75 must also be applied to https://github.com/magento/magento2/blob/2.3-develop/lib/internal/Magento/Framework/Mview/View.php#L303

I can push a PR if you agree (for both issues)

hostep · 2019-06-03T20:31:51Z

... running jobs are never cleaned. I think it would make sens to clear the after $historyFailure or max($historyFailure, $historySuccess)
There are so many reason a process can fail (out of memory, server reboot, ...), having your indexing stuck until you clear the line in the database is a very high price to pay.

Agreed. There should be some way for Magento to detect failed cron jobs and not assume they are still running. But I'm not sure if the existing lifetime fields are the right solution here. Maybe a new field should be added for a new lifetime called 'max duration of running job' or something like that?

I suspect that this modification : 5927a75 must also be applied to https://github.com/magento/magento2/blob/2.3-develop/lib/internal/Magento/Framework/Mview/View.php#L303

The change to catching a Throwable instead of an Exception might prevent more possible problematic situations, but will not entirely fix all the possible problems here. If php runs out of memory, it will just stop and will just never go into the catch I think. You can possibly catch such situations with register_shutdown_function, but then you need to make sure you reserved a bunch of memory beforehand which you then have to free up so you still have memory enough to fix the state in the database. This sound pretty complex and not a very elegant solution.
And what do you do when the electricity fails just when the state in the database is set to working but before any work actually begun?
Would using database transactions work here for example?

It would be nice to have some feedback from some experts in here. @dmanners: I spoke to you on Imagine 2018 about this particular problem (amongst some other problems), but this still hasn't been properly fixed, can you try to pull in some experts from Magento in here to see how this problem can be solved?
The quick summary: we need a way for Magento to not keep cronjobs in state running or materialized views in the state working forever and ever. There must be a way for Magento to detect this and work around this somehow.

m2-assistant · 2019-06-04T06:30:05Z

Hi @M-A-X-I-M. Thank you for working on this issue.
In order to make sure that issue has enough information and ready for development, please read and check the following instruction: 👇

1. Verify that issue has all the required information. (Preconditions, Steps to reproduce, Expected result, Actual result).
Details
If the issue has a valid description, the label Issue: Format is valid will be added to the issue automatically. Please, edit issue description if needed, until label Issue: Format is valid appears.
2. Verify that issue has a meaningful description and provides enough information to reproduce the issue. If the report is valid, add Issue: Clear Description label to the issue by yourself.
3. Add Component: XXXXX label(s) to the ticket, indicating the components it may be related to.
4. Verify that the issue is reproducible on 2.3-develop branch
Details
- Add the comment @magento give me 2.3-develop instance to deploy test instance on Magento infrastructure.
- If the issue is reproducible on 2.3-develop branch, please, add the label Reproduced on 2.3.x.
- If the issue is not reproducible, add your comment that issue is not reproducible and close the issue and stop verification process here!

5. Verify that the issue is reproducible on 2.2-develop branch.
Details
- Add the comment @magento give me 2.2-develop instance to deploy test instance on Magento infrastructure.
- If the issue is reproducible on 2.2-develop branch, please add the label Reproduced on 2.2.x

Join Magento Community Engineering Slack and ask your questions in #github channel.

QuentinFarizon · 2019-06-04T09:02:32Z

The change to catching a Throwable instead of an Exception might prevent more possible problematic situations, but will not entirely fix all the possible problems here.

Yes agreed, but it would be easy to push as a first fix. I thought about register_shutdown_function, but indeed it's probably complicated to make database queries in it.
Database transactions should work yes, specifying to lock table cron_schedule but not lock the indexed table.

hostep · 2019-06-04T09:50:53Z

Yes agreed, but it would be easy to push as a first fix.

Cool, feel free to create a PR with this suggestion! It will most likely get approved.

QuentinFarizon · 2019-06-06T10:45:48Z

Even with my patch (#23125) I still have issues where no cron is currently running, it's not stuck in cron_schedule, but some mviews are stuck to 'working'

MariaDB [ivwooaagji7o4]> select * from cron_schedule where job_code  = 'indexer_update_all_views' order by executed_at desc limit 20;
+-------------+--------------------------+---------+----------+---------------------+---------------------+---------------------+---------------------+
| schedule_id | job_code                 | status  | messages | created_at          | scheduled_at        | executed_at         | finished_at         |
+-------------+--------------------------+---------+----------+---------------------+---------------------+---------------------+---------------------+
|    21779692 | indexer_update_all_views | success | NULL     | 2019-06-06 10:36:06 | 2019-06-06 10:39:00 | 2019-06-06 10:40:10 | 2019-06-06 10:40:21 |
|    21779689 | indexer_update_all_views | success | NULL     | 2019-06-06 10:36:06 | 2019-06-06 10:38:00 | 2019-06-06 10:39:06 | 2019-06-06 10:39:19 |
|    21779632 | indexer_update_all_views | success | NULL     | 2019-06-06 10:34:15 | 2019-06-06 10:37:00 | 2019-06-06 10:38:06 | 2019-06-06 10:38:19 |

MariaDB [ivwooaagji7o4]> select * from mview_state;
+----------+-----------------------------------+----------+---------+---------------------+------------+
| state_id | view_id                           | mode     | status  | updated             | version_id |
+----------+-----------------------------------+----------+---------+---------------------+------------+
|        1 | catalogsearch_fulltext            | enabled  | idle    | 2019-06-06 10:41:13 |   73367814 |
|        2 | catalog_category_product          | enabled  | idle    | 2019-06-06 10:41:11 |      69390 |
|        3 | catalog_product_category          | enabled  | working | 2019-06-06 08:52:35 |    8593557 |
|        4 | catalog_product_price             | enabled  | working | 2019-06-05 15:10:16 |    8879166 |
|        5 | catalog_product_attribute         | enabled  | idle    | 2019-06-06 10:41:11 |   30248796 |
|        6 | catalogrule_rule                  | enabled  | idle    | 2019-06-06 10:41:11 |          0 |
|        7 | catalogrule_product               | enabled  | idle    | 2019-06-06 10:41:11 |   44660022 |
|        8 | cataloginventory_stock            | enabled  | idle    | 2019-06-06 10:41:11 |   17497764 |
|        9 | targetrule_product_rule           | enabled  | working | 2019-06-05 15:10:19 |     920777 |
|       10 | targetrule_rule_product           | enabled  | idle    | 2019-06-06 10:41:13 |          0 |
|       11 | salesrule_rule                    | enabled  | idle    | 2019-06-06 10:41:11 |        528 |
|       12 | algolia_products                  | disabled | idle    | 2019-06-06 10:41:11 |   19351077 |
|       13 | algolia_categories                | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       14 | algolia_pages                     | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       15 | algolia_suggestions               | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       16 | algolia_additional_sections       | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       17 | algolia_queue_runner              | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       18 | catalog_category_flat             | enabled  | idle    | 2019-06-06 10:41:11 |      15264 |
|       23 | catalog_product_flat              | enabled  | working | 2019-06-05 15:10:14 |   21107781 |
|       27 | design_config_dummy               | enabled  | idle    | 2019-06-06 10:41:11 |          0 |
|       30 | customer_dummy                    | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       33 | algolia_delete_products           | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       36 | afrimarket_catalog_basket_product | enabled  | idle    | 2019-06-06 10:41:12 |    4258158 |
|       39 | afrimarket_catalog_ranking        | enabled  | idle    | 2019-06-06 10:41:19 |   24287076 |
|       42 | afrimarket_catalog_margin         | enabled  | idle    | 2019-06-06 10:41:19 |     138633 |
+----------+-----------------------------------+----------+---------+---------------------+------------+

I have no ideas how this can happen, I'm open to suggestions

mattheo-geoffray · 2019-06-06T12:24:35Z

@QuentinFarizonAfrimarket Hi!
Did you try this fix?
Issue: https://github.com/magento/magento2/pull/23079/files
Fix: #23077
We have the same issue with millions of lines in changelog tables especially catalog_product_flat_cl maybe it is the same for you.
It is stuck in \Magento\Framework\Mview\View::update processing millions of versions in chunck

QuentinFarizon · 2019-06-06T14:14:31Z

Hello, yes I have applied the fix #23077 and recreated triggers : #23079 (comment)

It certainly helps, but we sill have issues, also with catalog_product_flat (temporary index table seems to disappear during indexation). Not sure how it is related to mview getting stuck since it's supposed to be catched by this PR

Ctucker9233 · 2019-06-13T22:37:51Z

@hostep I've noticed in my own environment that any jobs that are running and never complete are marked as missed. A method that clears missed jobs after a determined lifetime would also solve this issue.

hostep · 2019-06-14T18:00:25Z

@Ctucker9233: reading the code I'm not seeing how that can happen

A job is only set to missed when it tries to run a pending job, and if the scheduled time was missed by some offset:

magento2/app/code/Magento/Cron/Observer/ProcessCronQueueObserver.php

Lines 294 to 297 in 8e16abc

    
           if ($scheduledTime < $currentTime - $scheduleLifetime) { 
        
               $schedule->setStatus(Schedule::STATUS_MISSED); 
        
               throw new \Exception(sprintf('Cron Job %s is missed at %s', $jobCode, $schedule->getScheduledAt())); 
        
           }

This _runJob method only gets called from the processPendingJobs method. Which only processes jobs with state pending.

So you shouldn't see running jobs getting changed to missed somehow, unless I'm missing something or you are using some custom code which is reponsible for this?

And there is already some code to cleanup missed jobs as far as I can see:

magento2/app/code/Magento/Cron/Observer/ProcessCronQueueObserver.php

Line 497 in 8e16abc

Schedule::STATUS_MISSED => $historyFailure * self::SECONDS_IN_MINUTE,

Ctucker9233 · 2019-06-14T19:11:31Z

@hostep I misscommunicated. I didn't mean that I am seeing running jobs switch their status to missed. What I meant was that in my environment at least a lot of missing jobs seem to accumulate and that can also cause the cron to crash or have a deadlock. The code that looks like it's meant to clean up missed jobs also seems to be not functioning in my case.

driskell · 2020-09-24T07:39:15Z

The PR in #28007 should fix this. It clears up things stuck in running and mitigates deadlocks so the cleanups can finish and stop the cron_schedule table growing exponentially.

magento-engcom-team · 2020-12-22T00:35:56Z

Hi @QuentinFarizon. Thank you for your report.
The issue has been fixed in #28007 by @driskell in 2.4-develop branch
Related commit(s):

The fix will be available with the upcoming 2.4.3 release.

magento-engcom-team added the Issue: Format is valid Gate 1 Passed. Automatic verification of issue format passed label May 29, 2019

max-tkachuk self-assigned this Jun 4, 2019

QuentinFarizon mentioned this issue Jun 4, 2019

Catch throwables in mview updating #23124

Closed

4 tasks

ghost assigned QuentinFarizon Jun 4, 2019

QuentinFarizon mentioned this issue Jun 4, 2019

Catch throwables in mview updating #23125

Merged

4 tasks

ghost assigned QuentinFarizon and unassigned QuentinFarizon Jul 16, 2019

ghost unassigned QuentinFarizon and max-tkachuk Sep 27, 2019

sdzhepa added the Progress: PR in progress label Mar 26, 2020

driskell mentioned this issue Apr 28, 2020

Cron cleanup repeatedly hits deadlocks on large environments where groups can overlap #28007

Merged

4 tasks

ihor-sviziev added the Severity: S1 Affects critical data or functionality and forces users to employ a workaround. label Jun 5, 2020

VladimirZaets added the Priority: P2 A defect with this priority could have functionality issues which are not to expectations. label Aug 7, 2020

m2-community-project bot added Progress: PR Created Indicates that Pull Request has been created to fix issue and removed Progress: PR in progress labels Sep 23, 2020

m2-community-project bot assigned driskell Sep 24, 2020

m2-community-project bot added Progress: PR Created Indicates that Pull Request has been created to fix issue and removed Progress: PR Created Indicates that Pull Request has been created to fix issue labels Sep 24, 2020

ghost added Progress: PR in progress and removed Progress: PR Created Indicates that Pull Request has been created to fix issue labels Oct 20, 2020

magento-engcom-team added the Fixed in 2.4.x The issue has been fixed in 2.4-develop branch label Dec 22, 2020

magento-engcom-team closed this as completed Dec 22, 2020

m2-community-project bot added Progress: PR in progress Progress: done and removed Progress: PR in progress labels Dec 22, 2020

engcom-Hotel added this to High Priority Backlog Aug 19, 2024

engcom-Hotel moved this to Pull Request In Progress in High Priority Backlog Aug 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cron job not running after crashed once #23054

Cron job not running after crashed once #23054

QuentinFarizon commented May 29, 2019

m2-assistant bot commented May 29, 2019

hostep commented May 30, 2019

QuentinFarizon commented May 31, 2019 •

edited

Loading

QuentinFarizon commented May 31, 2019

hostep commented May 31, 2019 •

edited

Loading

QuentinFarizon commented Jun 3, 2019

QuentinFarizon commented Jun 3, 2019 •

edited

Loading

magento-engcom-team commented Jun 3, 2019

QuentinFarizon commented Jun 3, 2019

magento-engcom-team commented Jun 3, 2019

QuentinFarizon commented Jun 3, 2019

hostep commented Jun 3, 2019

m2-assistant bot commented Jun 4, 2019

QuentinFarizon commented Jun 4, 2019

hostep commented Jun 4, 2019

QuentinFarizon commented Jun 6, 2019

mattheo-geoffray commented Jun 6, 2019 •

edited

Loading

QuentinFarizon commented Jun 6, 2019 •

edited

Loading

Ctucker9233 commented Jun 13, 2019

hostep commented Jun 14, 2019

Ctucker9233 commented Jun 14, 2019 •

edited

Loading

driskell commented Sep 24, 2020

magento-engcom-team commented Dec 22, 2020

Cron job not running after crashed once #23054

Cron job not running after crashed once #23054

Comments

QuentinFarizon commented May 29, 2019

Preconditions (*)

Steps to reproduce (*)

Expected result (*)

Actual result (*)

m2-assistant bot commented May 29, 2019

hostep commented May 30, 2019

QuentinFarizon commented May 31, 2019 • edited Loading

QuentinFarizon commented May 31, 2019

hostep commented May 31, 2019 • edited Loading

QuentinFarizon commented Jun 3, 2019

QuentinFarizon commented Jun 3, 2019 • edited Loading

magento-engcom-team commented Jun 3, 2019

QuentinFarizon commented Jun 3, 2019

magento-engcom-team commented Jun 3, 2019

QuentinFarizon commented Jun 3, 2019

hostep commented Jun 3, 2019

m2-assistant bot commented Jun 4, 2019

QuentinFarizon commented Jun 4, 2019

hostep commented Jun 4, 2019

QuentinFarizon commented Jun 6, 2019

mattheo-geoffray commented Jun 6, 2019 • edited Loading

QuentinFarizon commented Jun 6, 2019 • edited Loading

Ctucker9233 commented Jun 13, 2019

hostep commented Jun 14, 2019

Ctucker9233 commented Jun 14, 2019 • edited Loading

driskell commented Sep 24, 2020

magento-engcom-team commented Dec 22, 2020

QuentinFarizon commented May 31, 2019 •

edited

Loading

hostep commented May 31, 2019 •

edited

Loading

QuentinFarizon commented Jun 3, 2019 •

edited

Loading

mattheo-geoffray commented Jun 6, 2019 •

edited

Loading

QuentinFarizon commented Jun 6, 2019 •

edited

Loading

Ctucker9233 commented Jun 14, 2019 •

edited

Loading