Zombie tasks able to acquire locks after failure #1715

Closed
drcrallen opened this issue Sep 9, 2015 · 2 comments · Fixed by #1740

@drcrallen

We had an issue where a disconnect occurred somewhere for long enough that the overlord logged a task as failed while the task was still running. After 45 minutes or so, the task submitted a lock acquire request and it was granted, but since the task had already been marked as failed, it was never properly cleaned up.

These log lines are from the overlord:

2015-09-09T07:33:58,102 INFO [RemoteTaskRunner-Scheduled-Cleanup--0] io.druid.indexing.overlord.RemoteTaskRunner - Running scheduled cleanup for Worker[REDACTED:8080]
2015-09-09T07:33:58,106 INFO [RemoteTaskRunner-Scheduled-Cleanup--0] io.druid.indexing.overlord.RemoteTaskRunner - Failing task[index_realtime_REDACTED]
2015-09-09T07:33:58,106 INFO [RemoteTaskRunner-Scheduled-Cleanup--0] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_realtime_REDACTED
2015-09-09T07:33:58,109 INFO [RemoteTaskRunner-Scheduled-Cleanup--0] io.druid.indexing.overlord.RemoteTaskRunner - Can't shutdown! No worker running task index_realtime_REDACTED
2015-09-09T07:33:58,110 INFO [RemoteTaskRunner-Scheduled-Cleanup--0] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_REDACTED to status: TaskStatus{id=index_realtime_REDACTED, status=FAILED, duration=-1}
2015-09-09T07:33:58,132 INFO [RemoteTaskRunner-Scheduled-Cleanup--0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_REDACTED] from TaskLock[index_realtime_REDACTED]
2015-09-09T08:15:15,428 INFO [qtpREDACTED] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_REDACTED]: LockAcquireAction{interval=2015-09-09T08:00:00.000Z/2015-09-09T09:00:00.000Z}
2015-09-09T08:15:15,428 INFO [qtpREDACTED] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_realtime_REDACTED, dataSource=REDACTED, interval=2015-09-09T08:00:00.000Z/2015-09-09T09:00:00.000Z, version=2015-09-09T08:15:15.428Z}, taskIds=[]}
2015-09-09T08:15:15,428 INFO [qtpREDACTED] io.druid.indexing.overlord.TaskLockbox - Added task[index_realtime_REDACTED] to TaskLock[index_realtime_REDACTED]

Suffice it to say, this is a long-running task that acquires multiple 1-hour locks over the course of its execution.

@drcrallen

This is against 0.8.0.

@gianm

gianm commented Sep 14, 2015

Spoke with @drcrallen and @nishantmonu51 about this. We thought it would make sense for the TaskLockbox to have a Set of active task ids, which the TaskQueue is responsible for keeping up to date. Then the TaskLockbox could reject lock requests for tasks that are not active.

The TaskQueue already has a reference to the lockbox so this is perhaps not too much of a stretch.
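
A minimal sketch of that idea, for illustration only. `SketchTaskLockbox` and its method names are hypothetical stand-ins, not Druid's actual classes or the code from the eventual fix. Intervals are plain strings, conflicts between different task groups are ignored, and a plain `IllegalStateException` is assumed for the rejection.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for a lockbox that tracks active tasks. */
class SketchTaskLockbox {
    // Ids of tasks the queue currently considers active.
    private final Set<String> activeTasks = new HashSet<>();
    // Interval (as a plain string) -> ids of tasks holding a lock on it.
    private final Map<String, Set<String>> locks = new HashMap<>();

    /** Called by the queue when a task becomes active. */
    public synchronized void add(String taskId) {
        activeTasks.add(taskId);
    }

    /** Called by the queue when a task reaches a terminal status. */
    public synchronized void remove(String taskId) {
        activeTasks.remove(taskId);
        // Release every lock the task still holds.
        for (Set<String> holders : locks.values()) {
            holders.remove(taskId);
        }
    }

    /** Grants a lock only if the requesting task is still active. */
    public synchronized void lock(String taskId, String interval) {
        if (!activeTasks.contains(taskId)) {
            throw new IllegalStateException(
                "Unable to grant lock to inactive task[" + taskId + "]");
        }
        locks.computeIfAbsent(interval, k -> new HashSet<>()).add(taskId);
    }

    /** Reports whether the task currently holds a lock on the interval. */
    public synchronized boolean holdsLock(String taskId, String interval) {
        Set<String> holders = locks.get(interval);
        return holders != null && holders.contains(taskId);
    }
}
```

In this sketch the queue would call `add` when it starts managing a task and `remove` when it records a terminal status such as FAILED. A zombie task that later calls `lock` then gets an exception instead of a lock that nothing will ever release.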

@drcrallen drcrallen added the Bug label Sep 14, 2015
@nishantmonu51 nishantmonu51 self-assigned this Sep 14, 2015
nishantmonu51 added a commit to metamx/druid that referenced this issue Sep 24, 2015
fixes apache#1715
- TaskLockBox has a set of active tasks
- lock requests throw an exception if they are from a task not in the active task set.
- TaskQueue is responsible for updating the active task set on tasklockbox

fix apache#1715

review comment

remove duplicate line

use ISE instead

organise imports