-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix retry number handling #1230
Conversation
Coverage remained the same at 66.381% when pulling 6d97da2e3ea47ca27ff9f99cf94a74a3f513e31b on jlowin:attempt-message into 0e325d9 on airbnb:master. |
Coverage remained the same at 66.381% when pulling 73d87dc13ba26aa3597caf35f8c90bb83e4db1ad on jlowin:attempt-message into 0e325d9 on airbnb:master. |
Thanks for the great commit message. It really helped reading through what you have changed. Question: else where try_number is compared to task.retries to mark a task failed (larger than). As you are resetting try_number to 0, will tasks still be marked failed? |
@bolkedebruin that's a great question -- I need to address that case because right now a task would put it up for retries forever. However, looking at the code, I think there's a larger bug here than just a logging message. In the current code, try_number never gets reset. That means tasks that are re-run (either by force or just because they failed) will not honor their own retry instructions. Their "first" run will already appear to be over the retry limit. In quick testing this can put the backfilljob into an infinite loop -- I need to see if #1220 is related. I'm going to mark this as a bug and add a unit test for retries with my next commit. |
3a3be08
to
699cd9d
Compare
A task’s try_number is unbounded (incremented by 1 on every run) so it needs to be adjusted both for logging and for seeing if a task has eclipsed the retry cap. Rerunning a task (either because it failed or with the `force` option) not only leads to nonsensical error messages (“Attempt 2 of 1”) but also would never kick off a retry attempt (because try_number > retries). The solution is to mod the `try_number` with `retries` to keep everything sensible. Fixed: use the correct attempt number when logging Fixed: log when tasks are queued (log message was being created but not logged) Fixed: situation where tasks being run after the first time would not be put up for retry
|
Coverage increased (+0.09%) to 66.64% when pulling 699cd9d296e3c4a8f72b59408920fc462ccbfc28 on jlowin:attempt-message into c14d2ea on airbnb:master. |
@bolkedebruin glad you pointed that out. There was indeed an issue with re-running tasks that had |
Nice! |
The number of run attempts was always incremented by 1, so forcing a
task to run or running a task after it had failed resulted in
nonsensical logs like “Attempt 2 out of 1”. Here, we use the mod
function to reset the try_number back to zero whenever it reaches its
maximum number of retries. That way the next run (if it happens) gets
logged correctly.
Also, the log for queued tasks was never actually being logged — that’s
fixed.
Closes #1229