Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor dequeue transaction handling #55

Merged
merged 4 commits into from
Nov 11, 2024
Merged

Conversation

maxcountryman
Copy link
Owner

This refactor encapsulates pending task dequeue operations within their own transactions, updating the task row state to prevent duplicate processing by other callers. By doing so, we ensure that state changes are immediately visible, accurately reflecting task ownership throughout its processing lifecycle.

Additionally, tasks now include a configurable heartbeat interval and a record of the last heartbeat. Workers periodically update the task row to indicate ongoing liveness. Should a task’s heartbeat become stale, the dequeue method can select it for reassignment.

It's important to note that a missed heartbeat alone does not definitively indicate task abandonment, as a worker might resume processing after a temporary delay. To guard against this type of partial failure, workers also acquire a transaction-level advisory lock on the task ID. As long as a worker's transaction remains active, this lock prevents other workers from processing the same task, ensuring exclusive ownership and consistent processing even across intermittent failures.

A notable benefit of these changes is that task progress states are fully utilized and in-progress tasks are visible globally. Furthermore, transaction overhead is reduced as a dequeue's transaction is only held for the duration of obtaining an available task. That said, a second transaction is still maintained for the duration of execution so long-running tasks still benefit from decomposition into e.g. multiple job steps.

This refactor encapsulates pending task dequeue operations within their
own transactions, updating the task row state to prevent duplicate
processing by other callers. By doing so, we ensure that state changes
are immediately visible, accurately reflecting task ownership throughout
its processing lifecycle.

Additionally, tasks now include a configurable heartbeat interval and a
record of the last heartbeat. Workers periodically update the task row
to indicate ongoing liveness. Should a task’s heartbeat become stale,
the dequeue method can select it for reassignment.

It's important to note that a missed heartbeat alone does not
definitively indicate task abandonment, as a worker might resume
processing after a temporary delay. To guard against this type of
partial failure, workers also acquire a transaction-level advisory
lock on the task ID. As long as a worker's transaction remains active,
this lock prevents other workers from processing the same task,
ensuring exclusive ownership and consistent processing even across
intermittent failures.

A notable benefit of these changes is that task progress states are fully
utilized and in-progress tasks are visible globally. Furthermore,
transaction overhead is reduced as a dequeue's transaction is only held
for the duration of obtaining an available task. That said, a second
transaction is still maintained for the duration of execution so
long-running tasks still benefit from decomposition into e.g. multiple
job steps.
@maxcountryman
Copy link
Owner Author

@kirillsalykin this addresses the fact that "in-progress" hasn't be used and increases visibility overall. I think this is also somewhat closer to how e.g. pg-boss approaches things.

In cases where a lock can't be acquired, workers will return early and
try again later. This is important because only one worker can execute a
task and failure to acquire the lock invalidates the assumption that the
task is still in an available state.
@kirillsalykin
Copy link

kirillsalykin commented Nov 11, 2024

I might be wrong (please correct me if so), but this change reduces consistency.
Imagine this scenario - work being started, task marked as in_progress, then the worker get killed w/o possibility to update the database (for instance connection drops) and task stays in in_progress state - it will never be picked up again...

UPDATED:
ah, I see what you did here, if task misses hearbeat - it considered failed.

sorry for the noice, clear now!

PS I would read description before making comments

update underway.task
set updated_at = now(),
last_heartbeat_at = now()
where id = $1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still confuses me, I'd expect id to be the "real" id, eg it is unique

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unique per queue name.

In part this is a semantical quality that's "more correct" but there's also some practical reasons to do it this way. For example, this means the same task (i.e. per ID) can be in multiple queues but that for each queue only one unique copy can exist.

Copy link

@kirillsalykin kirillsalykin Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means the same task (i.e. per ID) can be in multiple queues

This confuses me even more, I would expect id = identity = lifecycle, so it makes it tricky to figure out which queue and how updates task status...
but lets take this discussion out of the PR, look good to me, btw :)

@kirillsalykin
Copy link

this makes code (and approach) slightly more complicated, but I think not keeping tx opened for the duration of task is a good thing...

@maxcountryman
Copy link
Owner Author

Agree on all points--I wish it weren't more complex of course but I think it's worth it for the observability gained. I also trust that projects like pg-boss have had longer to mature and so there's probably good reasons for some of their more significant design decisions which we can also benefit from.

@maxcountryman maxcountryman merged commit b9d584e into main Nov 11, 2024
4 checks passed
@maxcountryman maxcountryman deleted the transaction-refactor branch November 13, 2024 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants