Fix retry number handling #1230

jlowin · 2016-03-28T04:48:53Z

The number of run attempts was always incremented by 1, so forcing a
task to run or running a task after it had failed resulted in
nonsensical logs like “Attempt 2 out of 1”. Here, we use the mod
function to reset the try_number back to zero whenever it reaches its
maximum number of retries. That way the next run (if it happens) gets
logged correctly.

Also, the log for queued tasks was never actually being logged — that’s
fixed.

Closes #1229

landscape-bot · 2016-03-28T04:57:39Z

Code quality remained the same when pulling 73d87dc on jlowin:attempt-message into 0e325d9 on airbnb:master.

coveralls · 2016-03-28T05:01:24Z

Coverage remained the same at 66.381% when pulling 6d97da2e3ea47ca27ff9f99cf94a74a3f513e31b on jlowin:attempt-message into 0e325d9 on airbnb:master.

coveralls · 2016-03-28T05:08:02Z

Coverage remained the same at 66.381% when pulling 73d87dc13ba26aa3597caf35f8c90bb83e4db1ad on jlowin:attempt-message into 0e325d9 on airbnb:master.

bolkedebruin · 2016-03-28T09:25:37Z

Thanks for the great commit message. It really helped reading through what you have changed. Question: else where try_number is compared to task.retries to mark a task failed (larger than). As you are resetting try_number to 0, will tasks still be marked failed?

jlowin · 2016-03-28T12:30:15Z

@bolkedebruin that's a great question -- I need to address that case because right now a task would put it up for retries forever.

However, looking at the code, I think there's a larger bug here than just a logging message. In the current code, try_number never gets reset. That means tasks that are re-run (either by force or just because they failed) will not honor their own retry instructions. Their "first" run will already appear to be over the retry limit. In quick testing this can put the backfilljob into an infinite loop -- I need to see if #1220 is related.

I'm going to mark this as a bug and add a unit test for retries with my next commit.

A task’s try_number is unbounded (incremented by 1 on every run) so it needs to be adjusted both for logging and for seeing if a task has eclipsed the retry cap. Rerunning a task (either because it failed or with the `force` option) not only leads to nonsensical error messages (“Attempt 2 of 1”) but also would never kick off a retry attempt (because try_number > retries). The solution is to mod the `try_number` with `retries` to keep everything sensible. Fixed: use the correct attempt number when logging Fixed: log when tasks are queued (log message was being created but not logged) Fixed: situation where tasks being run after the first time would not be put up for retry

landscape-bot · 2016-03-28T19:46:41Z

Repository health decreased by 0.18% when pulling 8051aa9 on jlowin:attempt-message into c14d2ea on airbnb:master.

13 new problems were found (including 0 errors and 2 code smells).
7 problems were fixed (including 0 errors and 3 code smells).

coveralls · 2016-03-28T19:50:40Z

Coverage increased (+0.09%) to 66.64% when pulling 699cd9d296e3c4a8f72b59408920fc462ccbfc28 on jlowin:attempt-message into c14d2ea on airbnb:master.

coveralls · 2016-03-28T20:23:07Z

Coverage increased (+0.09%) to 66.64% when pulling 8051aa9 on jlowin:attempt-message into c14d2ea on airbnb:master.

coveralls · 2016-03-28T20:25:00Z

Coverage increased (+0.09%) to 66.64% when pulling 8051aa9 on jlowin:attempt-message into c14d2ea on airbnb:master.

jlowin · 2016-03-28T20:43:55Z

@bolkedebruin glad you pointed that out. There was indeed an issue with re-running tasks that had retries > 0 -- they simply wouldn't retry. It should be fixed now, plus unit tests (pending Travis's OK.)

bolkedebruin · 2016-03-29T11:47:19Z

Nice!

jlowin force-pushed the attempt-message branch from 6d97da2 to 73d87dc Compare March 28, 2016 04:56

jlowin mentioned this pull request Mar 28, 2016

Fix "Attempt 2 out of 1" messages #1229

Closed

jlowin self-assigned this Mar 28, 2016

jlowin added the kind:bug This is a clearly a bug label Mar 28, 2016

jlowin changed the title ~~Report correct number of task attempts in log~~ Fix retry number handling Mar 28, 2016

jlowin force-pushed the attempt-message branch 2 times, most recently from 3a3be08 to 699cd9d Compare March 28, 2016 19:40

jlowin force-pushed the attempt-message branch from 699cd9d to 8051aa9 Compare March 28, 2016 19:45

bolkedebruin merged commit be2a181 into apache:master Mar 29, 2016

jlowin mentioned this pull request Mar 29, 2016

Use refactored utils module in unit test imports #1244

Merged

jlowin deleted the attempt-message branch March 29, 2016 13:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix retry number handling #1230

Fix retry number handling #1230

jlowin commented Mar 28, 2016

landscape-bot commented Mar 28, 2016

coveralls commented Mar 28, 2016

coveralls commented Mar 28, 2016

bolkedebruin commented Mar 28, 2016

jlowin commented Mar 28, 2016

landscape-bot commented Mar 28, 2016

coveralls commented Mar 28, 2016

coveralls commented Mar 28, 2016

coveralls commented Mar 28, 2016

jlowin commented Mar 28, 2016

bolkedebruin commented Mar 29, 2016

Fix retry number handling #1230

Fix retry number handling #1230

Conversation

jlowin commented Mar 28, 2016

landscape-bot commented Mar 28, 2016

coveralls commented Mar 28, 2016

coveralls commented Mar 28, 2016

bolkedebruin commented Mar 28, 2016

jlowin commented Mar 28, 2016

landscape-bot commented Mar 28, 2016

coveralls commented Mar 28, 2016

coveralls commented Mar 28, 2016

coveralls commented Mar 28, 2016

jlowin commented Mar 28, 2016

bolkedebruin commented Mar 29, 2016