Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect metric at the dashboard overview #621

Closed
ghobaty opened this issue Jun 22, 2019 · 8 comments · Fixed by #633
Closed

Incorrect metric at the dashboard overview #621

ghobaty opened this issue Jun 22, 2019 · 8 comments · Fixed by #633
Labels

Comments

@ghobaty
Copy link
Contributor

ghobaty commented Jun 22, 2019

  • Horizon Version: 3.2.3
  • Laravel Version: 5.7.28
  • PHP Version: 7.2.19
  • Redis Driver & Version: predis/predis 1.1.1
  • Database Driver & Version: MySQL 5.7

Description:

I believe the dashboard displays an incorrect label or number for the failed jobs metric.
Screen Shot 2019-06-22 at 14 35 15
In the screenshot, 19 is the number of jobs failed within the last hour and not last 7 days.
Not sure if it is related, but my horizon trim configuration is:

'trim' => [
    'recent' => 60,
    'failed' => 10080,
],
@driesvints driesvints added the bug label Jun 24, 2019
@shawnhind
Copy link

shawnhind commented Jun 24, 2019

I've also seen this value go down between reloads as if the number of failed jobs in the last 7 days has decreased, meanwhile new failed jobs are actually still occurring. Would this be related? I guess this would just be because it's actually over the last hour that is decreasing despite more jobs failing just because the last hour's failed jobs are no longer being counted.

@JayBizzle
Copy link
Contributor

We are also seeing the same issue...

These screenshots were taken on the 13th June...

Screenshot 2019-06-13 at 20 17 36

Says there is 1 failed job in the last 7 days...

Screenshot 2019-06-13 at 20 17 52

...but surely it should say 6?

@derekmd
Copy link
Contributor

derekmd commented Jul 7, 2019

The problem is (by default) only the past hour of failed jobs are included in the dashboard count. The labelled "FAILED JOBS PAST 7 DAYS" period is from config('horizon.trim.failed') but the metric is actually filtered by config('horizon.trim.recent').

I put some branches together, one of them can be put up as a pull request.

Option 1 - make count behave as already labelled

option-1-week

Use config('horizon.trim.failed') (default 7 days) when counting how many minutes ago to include. This will make the dashboard count exactly match the number of rows under the "Failed Jobs" tab for URI /horizon/failed.

Proposed branch: 3.0...derekmd:dashboard-7-day-failed-job-count

Option 2 - fix label to match metric count being show

option-2-hour

Keep the current config('horizon.trim.recent') count, correcting the label to show "FAILED JOBS PAST HOUR". I personally don't find such a metric useful since not many devs view this page every hour. I would assume a queue status check-in every day or two would be the typical use case.

Branch: 3.0...derekmd:fix-dashboard-failed-jobs-duration-label

Option 3 - add a new configuration variable

option-3-any-config

Add a new config('horizon.trim.recent_failed') that can be a value between config('horizon.trim.recent') and config('horizon.trim.failed'), by default set to 24 hours. Reason being:

  • config('horizon.trim.recent') ("JOBS PAST HOUR" on the dashboard) gives an idea of queue throughput and it doesn't grow Redis memory use too much.
  • config('horizon.trim.failed') ("Failed Jobs" page results) has a longer time period that indicate trends and historical context for application exceptions.
  • A dashboard failed jobs count gives an idea of something that should be looked at today.
    • If a 7 day count is shown, devs may be bothered by a failing count metric not resetting back to 0 shortly after the issue is fixed.
    • Instead you have to remember, "what was the failure count yesterday? Has it lowered?"
    • A 12hr or 24hr config allows the above two metrics to continue as normal while allowing devs to choose how they want this counter to behave.

Branch: 3.0...derekmd:dashboard-failed-job-count-config-filter

Cast a Vote

If there isn't a consensus what to do by Tuesday, I'll pull request option 1 since it has the least friction.

@JayBizzle
Copy link
Contributor

Option 1 for me 👍

@themsaid
Copy link
Member

themsaid commented Jul 8, 2019

I wonder what @halaei thinks here.

@shawnhind
Copy link

Option 1 for me

@halaei
Copy link
Contributor

halaei commented Jul 9, 2019

Option 1 for me too

@mfn
Copy link
Contributor

mfn commented Jul 9, 2019

1️⃣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants