Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix retry number handling #1230

Merged
merged 1 commit into from
Mar 29, 2016
Merged

Conversation

jlowin
Copy link
Member

@jlowin jlowin commented Mar 28, 2016

The number of run attempts was always incremented by 1, so forcing a
task to run or running a task after it had failed resulted in
nonsensical logs like “Attempt 2 out of 1”. Here, we use the mod
function to reset the try_number back to zero whenever it reaches its
maximum number of retries. That way the next run (if it happens) gets
logged correctly.

Also, the log for queued tasks was never actually being logged — that’s
fixed.

Closes #1229

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 73d87dc on jlowin:attempt-message into 0e325d9 on airbnb:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 66.381% when pulling 6d97da2e3ea47ca27ff9f99cf94a74a3f513e31b on jlowin:attempt-message into 0e325d9 on airbnb:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 66.381% when pulling 73d87dc13ba26aa3597caf35f8c90bb83e4db1ad on jlowin:attempt-message into 0e325d9 on airbnb:master.

@bolkedebruin
Copy link
Contributor

Thanks for the great commit message. It really helped reading through what you have changed. Question: else where try_number is compared to task.retries to mark a task failed (larger than). As you are resetting try_number to 0, will tasks still be marked failed?

@jlowin
Copy link
Member Author

jlowin commented Mar 28, 2016

@bolkedebruin that's a great question -- I need to address that case because right now a task would put it up for retries forever.

However, looking at the code, I think there's a larger bug here than just a logging message. In the current code, try_number never gets reset. That means tasks that are re-run (either by force or just because they failed) will not honor their own retry instructions. Their "first" run will already appear to be over the retry limit. In quick testing this can put the backfilljob into an infinite loop -- I need to see if #1220 is related.

I'm going to mark this as a bug and add a unit test for retries with my next commit.

@jlowin jlowin self-assigned this Mar 28, 2016
@jlowin jlowin added the kind:bug This is a clearly a bug label Mar 28, 2016
@jlowin jlowin changed the title Report correct number of task attempts in log Fix retry number handling Mar 28, 2016
@jlowin jlowin force-pushed the attempt-message branch 2 times, most recently from 3a3be08 to 699cd9d Compare March 28, 2016 19:40
A task’s try_number is unbounded (incremented by 1 on every run) so
it needs to be adjusted both for logging and for seeing if a task
has eclipsed the retry cap. Rerunning a task (either because it
failed or with the `force` option) not only leads to nonsensical
error messages (“Attempt 2 of 1”) but also would never kick off a
retry attempt (because try_number > retries). The solution is to mod
the `try_number` with `retries` to keep everything sensible.

Fixed: use the correct attempt number when logging
Fixed: log when tasks are queued (log message was being created but
not logged)
Fixed: situation where tasks being run after the first time would
not be put up for retry
@landscape-bot
Copy link

Code Health
Repository health decreased by 0.18% when pulling 8051aa9 on jlowin:attempt-message into c14d2ea on airbnb:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.09%) to 66.64% when pulling 699cd9d296e3c4a8f72b59408920fc462ccbfc28 on jlowin:attempt-message into c14d2ea on airbnb:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.09%) to 66.64% when pulling 8051aa9 on jlowin:attempt-message into c14d2ea on airbnb:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.09%) to 66.64% when pulling 8051aa9 on jlowin:attempt-message into c14d2ea on airbnb:master.

@jlowin
Copy link
Member Author

jlowin commented Mar 28, 2016

@bolkedebruin glad you pointed that out. There was indeed an issue with re-running tasks that had retries > 0 -- they simply wouldn't retry. It should be fixed now, plus unit tests (pending Travis's OK.)

@bolkedebruin
Copy link
Contributor

Nice!

@bolkedebruin bolkedebruin merged commit be2a181 into apache:master Mar 29, 2016
@jlowin jlowin deleted the attempt-message branch March 29, 2016 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug This is a clearly a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants