refactor dequeue transaction handling #55

maxcountryman · 2024-11-10T18:41:03Z

This refactor encapsulates pending task dequeue operations within their own transactions, updating the task row state to prevent duplicate processing by other callers. By doing so, we ensure that state changes are immediately visible, accurately reflecting task ownership throughout its processing lifecycle.

Additionally, tasks now include a configurable heartbeat interval and a record of the last heartbeat. Workers periodically update the task row to indicate ongoing liveness. Should a task’s heartbeat become stale, the dequeue method can select it for reassignment.

It's important to note that a missed heartbeat alone does not definitively indicate task abandonment, as a worker might resume processing after a temporary delay. To guard against this type of partial failure, workers also acquire a transaction-level advisory lock on the task ID. As long as a worker's transaction remains active, this lock prevents other workers from processing the same task, ensuring exclusive ownership and consistent processing even across intermittent failures.

A notable benefit of these changes is that task progress states are fully utilized and in-progress tasks are visible globally. Furthermore, transaction overhead is reduced as a dequeue's transaction is only held for the duration of obtaining an available task. That said, a second transaction is still maintained for the duration of execution so long-running tasks still benefit from decomposition into e.g. multiple job steps.

This refactor encapsulates pending task dequeue operations within their own transactions, updating the task row state to prevent duplicate processing by other callers. By doing so, we ensure that state changes are immediately visible, accurately reflecting task ownership throughout its processing lifecycle. Additionally, tasks now include a configurable heartbeat interval and a record of the last heartbeat. Workers periodically update the task row to indicate ongoing liveness. Should a task’s heartbeat become stale, the dequeue method can select it for reassignment. It's important to note that a missed heartbeat alone does not definitively indicate task abandonment, as a worker might resume processing after a temporary delay. To guard against this type of partial failure, workers also acquire a transaction-level advisory lock on the task ID. As long as a worker's transaction remains active, this lock prevents other workers from processing the same task, ensuring exclusive ownership and consistent processing even across intermittent failures. A notable benefit of these changes is that task progress states are fully utilized and in-progress tasks are visible globally. Furthermore, transaction overhead is reduced as a dequeue's transaction is only held for the duration of obtaining an available task. That said, a second transaction is still maintained for the duration of execution so long-running tasks still benefit from decomposition into e.g. multiple job steps.

maxcountryman · 2024-11-10T18:45:44Z

@kirillsalykin this addresses the fact that "in-progress" hasn't be used and increases visibility overall. I think this is also somewhat closer to how e.g. pg-boss approaches things.

In cases where a lock can't be acquired, workers will return early and try again later. This is important because only one worker can execute a task and failure to acquire the lock invalidates the assumption that the task is still in an available state.

kirillsalykin · 2024-11-11T08:46:08Z

I might be wrong (please correct me if so), but this change reduces consistency.
Imagine this scenario - work being started, task marked as in_progress, then the worker get killed w/o possibility to update the database (for instance connection drops) and task stays in in_progress state - it will never be picked up again...

UPDATED:
ah, I see what you did here, if task misses hearbeat - it considered failed.

sorry for the noice, clear now!

PS I would read description before making comments

kirillsalykin · 2024-11-11T08:48:40Z

src/queue.rs

+            update underway.task
+            set updated_at = now(),
+                last_heartbeat_at = now()
+            where id = $1


this still confuses me, I'd expect id to be the "real" id, eg it is unique

It is unique per queue name.

In part this is a semantical quality that's "more correct" but there's also some practical reasons to do it this way. For example, this means the same task (i.e. per ID) can be in multiple queues but that for each queue only one unique copy can exist.

this means the same task (i.e. per ID) can be in multiple queues

This confuses me even more, I would expect id = identity = lifecycle, so it makes it tricky to figure out which queue and how updates task status...
but lets take this discussion out of the PR, look good to me, btw :)

kirillsalykin · 2024-11-11T09:06:33Z

this makes code (and approach) slightly more complicated, but I think not keeping tx opened for the duration of task is a good thing...

maxcountryman · 2024-11-11T13:54:27Z

Agree on all points--I wish it weren't more complex of course but I think it's worth it for the observability gained. I also trust that projects like pg-boss have had longer to mature and so there's probably good reasons for some of their more significant design decisions which we can also benefit from.

maxcountryman added 2 commits November 10, 2024 11:04

ensure last_heartbeat_at on dequeue

01cb957

maxcountryman force-pushed the transaction-refactor branch from 21fc7ef to 01cb957 Compare November 10, 2024 19:59

encapsulate lock key on in-progress task

0426f70

maxcountryman force-pushed the transaction-refactor branch from c10bb14 to 0426f70 Compare November 10, 2024 21:29

kirillsalykin reviewed Nov 11, 2024

View reviewed changes

maxcountryman merged commit b9d584e into main Nov 11, 2024
4 checks passed

maxcountryman deleted the transaction-refactor branch November 13, 2024 23:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor dequeue transaction handling #55

refactor dequeue transaction handling #55

maxcountryman commented Nov 10, 2024

maxcountryman commented Nov 10, 2024

kirillsalykin commented Nov 11, 2024 •

edited

Loading

kirillsalykin Nov 11, 2024

maxcountryman Nov 11, 2024

kirillsalykin Nov 11, 2024 •

edited

Loading

kirillsalykin commented Nov 11, 2024

maxcountryman commented Nov 11, 2024

refactor dequeue transaction handling #55

refactor dequeue transaction handling #55

Conversation

maxcountryman commented Nov 10, 2024

maxcountryman commented Nov 10, 2024

kirillsalykin commented Nov 11, 2024 • edited Loading

kirillsalykin Nov 11, 2024

Choose a reason for hiding this comment

maxcountryman Nov 11, 2024

Choose a reason for hiding this comment

kirillsalykin Nov 11, 2024 • edited Loading

Choose a reason for hiding this comment

kirillsalykin commented Nov 11, 2024

maxcountryman commented Nov 11, 2024

kirillsalykin commented Nov 11, 2024 •

edited

Loading

kirillsalykin Nov 11, 2024 •

edited

Loading