Cron job not running after crashing once #23054

Closed
QuentinFarizon opened this issue May 29, 2019 · 24 comments · Fixed by #28007

Labels
  • Fixed in 2.4.x - The issue has been fixed in 2.4-develop branch
  • Issue: Format is valid - Gate 1 Passed. Automatic verification of issue format passed
  • Priority: P2 - A defect with this priority could have functionality issues which are not to expectations.
  • Progress: done
  • Progress: PR in progress
  • Severity: S1 - Affects critical data or functionality and forces users to employ a workaround.

Comments

@QuentinFarizon

Preconditions (*)

Magento EE 2.2.8
Crontab configured as per the documentation

Steps to reproduce (*)

  • An error makes the cron job indexer_update_all_views fail once (in my case, database unavailable)

Expected result (*)

  • Current run should be marked as failed in the table cron_schedule
  • Next run should run correctly and status be updated at the end in cron_schedule

Actual result (*)

The cron_schedule table fills up with pending jobs, and no job for indexer_update_all_views is run (no output in var/log/cron.log, no status update in the cron_schedule table).

Logs:
var/log/cron.log (last success + the error message)

[2019-05-29 11:25:10] report.INFO: Cron Job indexer_update_all_views is run [] []
[2019-05-29 11:28:03] report.ERROR: Cron Job indexer_update_all_views has an error: SQLSTATE[08S01]: Communication link failure: 1047 WSREP has not yet prepared node for application use, query was: SELECT `mview_state`.* FROM `mview_state` WHERE (`mview_state`.`view_id`='catalog_product_flat'). Statistics: {"sum":0,"count":1,"realmem":0,"emalloc":0,"realmem_start":182714368,"emalloc_start":180313880} [] []

=> And then no more logs about indexer_update_all_views, even though other jobs from the index group run correctly and log success in var/log/cron.log.
The database recovered a minute later and the query was OK.

@magento-engcom-team magento-engcom-team added the Issue: Format is valid Gate 1 Passed. Automatic verification of issue format passed label May 29, 2019
@m2-assistant

m2-assistant bot commented May 29, 2019

Hi @QuentinFarizonAfrimarket. Thank you for your report.
To help us process this issue please make sure that you provided the following information:

  • Summary of the issue
  • Information on your environment
  • Steps to reproduce
  • Expected and actual results

Please make sure that the issue is reproducible on the vanilla Magento instance following Steps to reproduce. To deploy vanilla Magento instance on our environment, please, add a comment to the issue:

@magento give me 2.3-develop instance - upcoming 2.3.x release

For more details, please, review the Magento Contributor Assistant documentation.

@QuentinFarizonAfrimarket do you confirm that you were able to reproduce the issue on vanilla Magento instance following steps to reproduce?

  • yes
  • no

@hostep
Contributor

hostep commented May 30, 2019

I agree, this definitely needs fixing.

Just FYI: there has recently been a proposal to review the current cron implementation and improve its stability: magento/architecture#171

@QuentinFarizon
Author

QuentinFarizon commented May 31, 2019

Hello @hostep, thank you!
I agree with the conclusion of magento/architecture#171: cron management must be more resilient, fireproof, and protected against the periodic or permanent failure of one of the job codes.

I think I found a reproducible scenario that causes this issue on my system:

  • Indexing job (group "index") configured with "no separate process"
  • Indexing runs out of memory and the process is killed by the system (or the process crashes violently for another reason)
  • As the process didn't get the chance to set the job status to "error", it stays in "running"
  • No new indexing job runs (because one appears to be running)
  • Running jobs are cleaned after max(successLifetime, errorLifetime), which by default is 3 days (!) for the index group

Consequences:
=> No errors, or few errors, in cron.log or other logs
=> No reporting in the back office or CLI, apart from the indexer backlog piling up
=> Index job marked as "running" for 3 days after a single error

Workarounds:
=> Increase the memory limit
=> Set the error lifetime to something between 1 and 6 hours (should be sufficient for indexing); see the command below
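
For that second workaround, the failure-history lifetime can be set per cron group (the value is in minutes). If I have the config path right, something like:

# Assumption: the index group's failure-history lifetime lives under this path
bin/magento config:set system/cron/index/history_failure_lifetime 60
bin/magento cache:flush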

Ideas:
=> Store host + pid in the database to regularly check for crashed processes (when running on the correct host); a rough sketch follows
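
A rough sketch of that idea (the hostname/pid fields are hypothetical, not existing core columns; posix_kill needs the posix extension, and sending signal 0 only tests whether the process exists):

// When the job starts, record which process is running it (hypothetical columns):
$schedule->setData('hostname', gethostname());
$schedule->setData('pid', getmypid());
$schedule->save();

// During cleanup, on the same host, reap rows whose process has died:
if ($schedule->getData('hostname') === gethostname()
    && !posix_kill((int) $schedule->getData('pid'), 0)
) {
    $schedule->setStatus(\Magento\Cron\Model\Schedule::STATUS_ERROR);
    $schedule->setMessages('Process died without updating its status');
    $schedule->save();
}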

@QuentinFarizon
Author

@hostep Furthermore, I don't think the database lock (created in ProcessCronQueueObserver::lockGroup) is ever released if the process crashes without having a chance to unlock it.

So it may never run again. What do you think?

@hostep
Contributor

hostep commented May 31, 2019

Running jobs are cleaned after max(successLifetime, errorLifetime), which by default is 3 days (!) for the index group

This is not entirely true, since it doesn't clean up jobs with status running:

Schedule::STATUS_SUCCESS => $historySuccess * self::SECONDS_IN_MINUTE,
Schedule::STATUS_MISSED => $historyFailure * self::SECONDS_IN_MINUTE,
Schedule::STATUS_ERROR => $historyFailure * self::SECONDS_IN_MINUTE,
Schedule::STATUS_PENDING => max($historyFailure, $historySuccess) * self::SECONDS_IN_MINUTE,

As far as I know, running jobs are just never cleaned up, even if they are no longer actually running. And that's the biggest problem here.
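
Hypothetically, a first step could be as small as adding a matching entry for running jobs to that same array (the lifetime chosen here is only an illustration, not an existing config field):

// Sketch: also expire jobs stuck in "running", using the longer of the two lifetimes
Schedule::STATUS_RUNNING => max($historyFailure, $historySuccess) * self::SECONDS_IN_MINUTE,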

=> Set the error lifetime to something between 1 and 6 hours (should be sufficient for indexing)

That's a good point; the lifetime was already lowered a lot in 244a2e9, but for the indexing group it probably makes sense to lower the default failure history even further.
Although, how many errors do you actually see with indexer jobs? Not that many, I reckon? But it might still make sense to lower it, since that lifetime is also used to clean up pending jobs.

Furthermore, I don't think the database lock (created in ProcessCronQueueObserver::lockGroup) is ever released if the process crashes without having a chance to unlock it.

I'm not exactly sure what happens to a lock acquired in a MySQL server when the PHP code suddenly stops. I would think that the connection to MySQL then closes and the lock is released automatically, but I'm not sure about this...

Btw: the locking is not used to manage the statuses of the jobs themselves as far as I know.
The reason why this locking was necessary (if I remember correctly) was because sometimes you could have situations where the same cron group was being executed multiple times in parallel, because the first run wasn't fully done in a single minute (Magento recommends crontab to be set so it runs every minute). And you don't want to have a new process trying to manage the statuses of jobs in the same cron group before the existing process has finished its work.
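
For reference, if the lock manager uses MySQL named locks under the hood (which I believe it does on a default setup), the behaviour is easy to verify by hand (the lock name here is illustrative):

-- Session A: acquire a named lock (returns 1 on success, without waiting)
SELECT GET_LOCK('cron_group_index', 0);
-- Session B: check which connection currently holds it
SELECT IS_USED_LOCK('cron_group_index');
-- If session A's process dies, MySQL releases the lock automatically;
-- RELEASE_LOCK is only needed for an orderly unlock in the same session:
SELECT RELEASE_LOCK('cron_group_index');
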

@QuentinFarizon
Author

Thanks @hostep

Indeed you're right, running jobs are never cleaned. I think it would make sense to clear them after $historyFailure or max($historyFailure, $historySuccess).
There are so many reasons a process can fail (out of memory, server reboot, ...) that having your indexing stuck until you clear the row in the database is a very high price to pay.

I think this requires a quick fix in the 2.2 and 2.3 release lines, until your proposal for hardening crons is implemented. What do you think?

Indeed, from my observations the MySQL lock seems to be released, so it shouldn't be an additional issue (needs confirmation).

@QuentinFarizon
Author

QuentinFarizon commented Jun 3, 2019

@magento give me 2.2.8 instance

@magento-engcom-team
Contributor

Hi @QuentinFarizonAfrimarket. Thank you for your request. I'm working on Magento 2.2.8-develop instance for you

@QuentinFarizon
Author

@magento give me 2.2.8 instance

@magento-engcom-team
Contributor

Hi @QuentinFarizonAfrimarket. Thank you for your request. I'm working on Magento 2.2.8 instance for you

@QuentinFarizon
Author

Hello @hostep

I have another issue: individual mview updates (like catalog_product_flat) are stuck in 'working' mode in the mview_state table.
As they are not held by any process, and (I believe) are never cleaned up, they will be stuck forever.
So even with the indexer job running correctly, you can get individual views that are no longer updated from their changelogs.

I suspect that this modification, 5927a75, must also be applied to https://github.com/magento/magento2/blob/2.3-develop/lib/internal/Magento/Framework/Mview/View.php#L303

I can push a PR if you agree (for both issues).
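
For context, a condensed sketch of what that change in \Magento\Framework\Mview\View::update() would look like (the real method does considerably more around this):

try {
    $this->getState()->setStatus(View\StateInterface::STATUS_WORKING)->save();
    // ... walk the changelog in batches and reindex ...
    $this->getState()->setStatus(View\StateInterface::STATUS_IDLE)->save();
} catch (\Throwable $exception) { // was: catch (\Exception $exception)
    // Reset the state so the view doesn't stay 'working' forever
    $this->getState()->setStatus(View\StateInterface::STATUS_IDLE)->save();
    throw $exception;
}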

@hostep
Contributor

hostep commented Jun 3, 2019

... running jobs are never cleaned. I think it would make sense to clear them after $historyFailure or max($historyFailure, $historySuccess).
There are so many reasons a process can fail (out of memory, server reboot, ...) that having your indexing stuck until you clear the row in the database is a very high price to pay.

Agreed. There should be some way for Magento to detect failed cron jobs rather than assume they are still running. But I'm not sure the existing lifetime fields are the right solution here; maybe a new field should be added for a new lifetime, called 'max duration of running job' or something like that?

I suspect that this modification, 5927a75, must also be applied to https://github.com/magento/magento2/blob/2.3-develop/lib/internal/Magento/Framework/Mview/View.php#L303

The change to catching a Throwable instead of an Exception might prevent more problematic situations, but it will not fix all the possible problems here. If PHP runs out of memory, it simply stops and never reaches the catch block, I think. You can possibly catch such situations with register_shutdown_function, but then you need to reserve a chunk of memory beforehand and free it in the handler, so that you still have enough memory left to fix the state in the database. This sounds pretty complex and not a very elegant solution.
And what do you do when the electricity fails just after the state in the database is set to working, but before any work has actually begun?
Would using database transactions work here, for example?
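
For what it's worth, a minimal sketch of that shutdown-function idea, assuming the database connection is still usable after the fatal error ($connection and $scheduleId stand in for whatever the cron observer has at hand):

// Reserve ~1 MB up front so the handler has memory to work with after an OOM
$reserved = str_repeat(' ', 1024 * 1024);
register_shutdown_function(function () use (&$reserved, $connection, $scheduleId) {
    $reserved = null; // give the reserved memory back
    $error = error_get_last();
    if ($error !== null && $error['type'] === E_ERROR) { // fatal error, e.g. OOM
        $connection->update(
            'cron_schedule',
            ['status' => 'error', 'messages' => $error['message']],
            ['schedule_id = ?' => $scheduleId]
        );
    }
});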

It would be nice to have some feedback from some experts in here. @dmanners: I spoke to you on Imagine 2018 about this particular problem (amongst some other problems), but this still hasn't been properly fixed, can you try to pull in some experts from Magento in here to see how this problem can be solved?
The quick summary: we need a way for Magento to not keep cronjobs in state running or materialized views in the state working forever and ever. There must be a way for Magento to detect this and work around this somehow.

@max-tkachuk max-tkachuk self-assigned this Jun 4, 2019
@m2-assistant

m2-assistant bot commented Jun 4, 2019

Hi @M-A-X-I-M. Thank you for working on this issue.
In order to make sure that issue has enough information and ready for development, please read and check the following instruction: 👇

  • 1. Verify that issue has all the required information. (Preconditions, Steps to reproduce, Expected result, Actual result).

    Details: If the issue has a valid description, the label Issue: Format is valid will be added to the issue automatically. Please edit the issue description if needed, until the label Issue: Format is valid appears.

  • 2. Verify that issue has a meaningful description and provides enough information to reproduce the issue. If the report is valid, add Issue: Clear Description label to the issue by yourself.

  • 3. Add Component: XXXXX label(s) to the ticket, indicating the components it may be related to.

  • 4. Verify that the issue is reproducible on 2.3-develop branch

    Details: Add the comment @magento give me 2.3-develop instance to deploy a test instance on Magento infrastructure.
    - If the issue is reproducible on 2.3-develop branch, please add the label Reproduced on 2.3.x.
    - If the issue is not reproducible, add a comment that the issue is not reproducible, close the issue, and stop the verification process here!

  • 5. Verify that the issue is reproducible on 2.2-develop branch.
    Details: Add the comment @magento give me 2.2-develop instance to deploy a test instance on Magento infrastructure.
    - If the issue is reproducible on 2.2-develop branch, please add the label Reproduced on 2.2.x

@QuentinFarizon
Author

The change to catching a Throwable instead of an Exception might prevent more problematic situations, but it will not fix all the possible problems here.

Yes, agreed, but it would be easy to push as a first fix. I thought about register_shutdown_function, but indeed it's probably complicated to make database queries in it.
Database transactions should work, yes, locking the cron_schedule table but not the indexed table.

@hostep
Contributor

hostep commented Jun 4, 2019

Yes, agreed, but it would be easy to push as a first fix.

Cool, feel free to create a PR with this suggestion! It will most likely get approved.

@QuentinFarizon
Author

Even with my patch (#23125) I still have issues: no cron is currently running, nothing is stuck in cron_schedule, but some mviews are stuck in 'working'.

MariaDB [ivwooaagji7o4]> select * from cron_schedule where job_code  = 'indexer_update_all_views' order by executed_at desc limit 20;
+-------------+--------------------------+---------+----------+---------------------+---------------------+---------------------+---------------------+
| schedule_id | job_code                 | status  | messages | created_at          | scheduled_at        | executed_at         | finished_at         |
+-------------+--------------------------+---------+----------+---------------------+---------------------+---------------------+---------------------+
|    21779692 | indexer_update_all_views | success | NULL     | 2019-06-06 10:36:06 | 2019-06-06 10:39:00 | 2019-06-06 10:40:10 | 2019-06-06 10:40:21 |
|    21779689 | indexer_update_all_views | success | NULL     | 2019-06-06 10:36:06 | 2019-06-06 10:38:00 | 2019-06-06 10:39:06 | 2019-06-06 10:39:19 |
|    21779632 | indexer_update_all_views | success | NULL     | 2019-06-06 10:34:15 | 2019-06-06 10:37:00 | 2019-06-06 10:38:06 | 2019-06-06 10:38:19 |
MariaDB [ivwooaagji7o4]> select * from mview_state;
+----------+-----------------------------------+----------+---------+---------------------+------------+
| state_id | view_id                           | mode     | status  | updated             | version_id |
+----------+-----------------------------------+----------+---------+---------------------+------------+
|        1 | catalogsearch_fulltext            | enabled  | idle    | 2019-06-06 10:41:13 |   73367814 |
|        2 | catalog_category_product          | enabled  | idle    | 2019-06-06 10:41:11 |      69390 |
|        3 | catalog_product_category          | enabled  | working | 2019-06-06 08:52:35 |    8593557 |
|        4 | catalog_product_price             | enabled  | working | 2019-06-05 15:10:16 |    8879166 |
|        5 | catalog_product_attribute         | enabled  | idle    | 2019-06-06 10:41:11 |   30248796 |
|        6 | catalogrule_rule                  | enabled  | idle    | 2019-06-06 10:41:11 |          0 |
|        7 | catalogrule_product               | enabled  | idle    | 2019-06-06 10:41:11 |   44660022 |
|        8 | cataloginventory_stock            | enabled  | idle    | 2019-06-06 10:41:11 |   17497764 |
|        9 | targetrule_product_rule           | enabled  | working | 2019-06-05 15:10:19 |     920777 |
|       10 | targetrule_rule_product           | enabled  | idle    | 2019-06-06 10:41:13 |          0 |
|       11 | salesrule_rule                    | enabled  | idle    | 2019-06-06 10:41:11 |        528 |
|       12 | algolia_products                  | disabled | idle    | 2019-06-06 10:41:11 |   19351077 |
|       13 | algolia_categories                | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       14 | algolia_pages                     | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       15 | algolia_suggestions               | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       16 | algolia_additional_sections       | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       17 | algolia_queue_runner              | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       18 | catalog_category_flat             | enabled  | idle    | 2019-06-06 10:41:11 |      15264 |
|       23 | catalog_product_flat              | enabled  | working | 2019-06-05 15:10:14 |   21107781 |
|       27 | design_config_dummy               | enabled  | idle    | 2019-06-06 10:41:11 |          0 |
|       30 | customer_dummy                    | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       33 | algolia_delete_products           | disabled | idle    | 2019-06-06 10:41:11 |          0 |
|       36 | afrimarket_catalog_basket_product | enabled  | idle    | 2019-06-06 10:41:12 |    4258158 |
|       39 | afrimarket_catalog_ranking        | enabled  | idle    | 2019-06-06 10:41:19 |   24287076 |
|       42 | afrimarket_catalog_margin         | enabled  | idle    | 2019-06-06 10:41:19 |     138633 |
+----------+-----------------------------------+----------+---------+---------------------+------------+

I have no idea how this can happen; I'm open to suggestions.
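
In the meantime, the manual recovery for the dump above is a direct reset of the stuck rows (the one-day cutoff is an example; column and status names are as shown in mview_state):

-- Release views stuck in 'working' so the next run can pick them up again
UPDATE mview_state
SET status = 'idle'
WHERE status = 'working'
  AND updated < NOW() - INTERVAL 1 DAY;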

@mattheo-geoffray

mattheo-geoffray commented Jun 6, 2019

@QuentinFarizonAfrimarket Hi!
Did you try this fix?
Issue: https://github.com/magento/magento2/pull/23079/files
Fix: #23077
We have the same issue with millions of rows in the changelog tables, especially catalog_product_flat_cl; maybe it is the same for you.
It gets stuck in \Magento\Framework\Mview\View::update, processing millions of versions in chunks.
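
A quick way to check whether a view has that kind of backlog is to compare its stored version against the changelog (assuming the standard _cl table layout with a version_id column):

-- How many changelog entries has catalog_product_flat not yet processed?
SELECT COUNT(*) AS pending_versions
FROM catalog_product_flat_cl
WHERE version_id > (
    SELECT version_id FROM mview_state WHERE view_id = 'catalog_product_flat'
);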

@QuentinFarizon
Author

QuentinFarizon commented Jun 6, 2019

Hello, yes, I have applied the fix #23077 and recreated the triggers: #23079 (comment)

It certainly helps, but we still have issues, also with catalog_product_flat (the temporary index table seems to disappear during indexation). I'm not sure how that relates to the mview getting stuck, since that case is supposed to be caught by this PR.

@Ctucker9233

@hostep I've noticed in my own environment that any jobs that are running and never complete are marked as missed. A method that clears missed jobs after a determined lifetime would also solve this issue.

@hostep
Contributor

hostep commented Jun 14, 2019

@Ctucker9233: reading the code, I don't see how that can happen.

A job is only set to missed when Magento tries to run a pending job and finds that its scheduled time was missed by more than the schedule lifetime:

if ($scheduledTime < $currentTime - $scheduleLifetime) {
    $schedule->setStatus(Schedule::STATUS_MISSED);
    throw new \Exception(sprintf('Cron Job %s is missed at %s', $jobCode, $schedule->getScheduledAt()));
}

This _runJob method only gets called from the processPendingJobs method, which only processes jobs with state pending.

So you shouldn't see running jobs getting changed to missed, unless I'm missing something or you are using some custom code which is responsible for this?

And there is already some code to clean up missed jobs, as far as I can see:

Schedule::STATUS_MISSED => $historyFailure * self::SECONDS_IN_MINUTE,

@Ctucker9233

Ctucker9233 commented Jun 14, 2019

@hostep I miscommunicated. I didn't mean that I'm seeing running jobs switch their status to missed. What I meant was that, in my environment at least, a lot of missed jobs seem to accumulate, and that can also cause the cron to crash or deadlock. The code that looks like it's meant to clean up missed jobs also seems not to be functioning in my case.
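
A quick way to check for that kind of accumulation is a status breakdown of the table (plain SQL, nothing Magento-specific):

SELECT status, COUNT(*) AS row_count
FROM cron_schedule
GROUP BY status
ORDER BY row_count DESC;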

@ghost ghost assigned QuentinFarizon and unassigned QuentinFarizon Jul 16, 2019
@ghost ghost unassigned QuentinFarizon and max-tkachuk Sep 27, 2019
@ihor-sviziev ihor-sviziev added the Severity: S1 Affects critical data or functionality and forces users to employ a workaround. label Jun 5, 2020
@VladimirZaets VladimirZaets added the Priority: P2 A defect with this priority could have functionality issues which are not to expectations. label Aug 7, 2020
@m2-community-project m2-community-project bot added Progress: PR Created Indicates that Pull Request has been created to fix issue and removed Progress: PR in progress labels Sep 23, 2020
@m2-community-project m2-community-project bot added Progress: PR Created Indicates that Pull Request has been created to fix issue and removed Progress: PR Created Indicates that Pull Request has been created to fix issue labels Sep 24, 2020
@driskell
Contributor

The PR in #28007 should fix this. It clears up jobs stuck in running and mitigates deadlocks, so the cleanups can finish and stop the cron_schedule table from growing exponentially.

@ghost ghost added Progress: PR in progress and removed Progress: PR Created Indicates that Pull Request has been created to fix issue labels Oct 20, 2020
@magento-engcom-team magento-engcom-team added the Fixed in 2.4.x The issue has been fixed in 2.4-develop branch label Dec 22, 2020