User's Guide 04:02 Pro: Safer Queueing
This is a feature of TaskBotJS Pro.
TaskBotJS, by default, uses Redis's BRPOP command to fetch jobs from all configured queues. This is a simple, high-performance operation that can watch multiple queues with a single call. However, this comes with a downside: BRPOP removes the job from Redis as soon as it is fetched, so in the (unlikely, but definitely not impossible) case where TaskBotJS crashes in an unrecoverable manner while holding a job, that job is irretrievably lost. Another concern arises at shutdown. When TaskBotJS receives a SIGTERM or a SIGINT, it will attempt to return its in-flight jobs to the queue, but it can fail to do so. For example, a network interruption may break the connection to Redis, or the server running TaskBotJS may shut down before the service can finish its cleanup tasks.
For most applications, this is an unfortunate but tolerable edge case; it's very unlikely that TaskBotJS will lose jobs in a healthy environment. But, sometimes, one in a million is next Tuesday, and users' tolerance of this possibility can vary. To that end, TaskBotJS Pro supports safer queueing, which uses the Redis RPOPLPUSH command to stash a dequeued job in a separate list owned by the worker doing the dequeueing. When the job is resolved (successfully or not), it is removed from this list. When a worker shuts down in an unclean manner, these jobs are then left in this list and other workers can pick up the pieces.
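The stash-and-acknowledge flow described above can be sketched as follows. This is a minimal illustration, not TaskBotJS Pro's implementation: `FakeRedis` is a hypothetical in-memory stand-in that mimics the list semantics of LPUSH, RPOPLPUSH, and LREM, and the key names are invented for the example.

```javascript
// In-memory stand-in for a few Redis list commands. Illustrative only:
// this is not TaskBotJS code, and the key names below are hypothetical.
class FakeRedis {
  constructor() {
    this.lists = new Map();
  }

  list(key) {
    if (!this.lists.has(key)) this.lists.set(key, []);
    return this.lists.get(key);
  }

  // LPUSH: insert at the head (index 0) of the list.
  lpush(key, value) {
    this.list(key).unshift(value);
  }

  // RPOPLPUSH: move the tail of `source` to the head of `dest` in one step.
  rpoplpush(source, dest) {
    const src = this.list(source);
    if (src.length === 0) return null;
    const value = src.pop();
    this.list(dest).unshift(value);
    return value;
  }

  // LREM with a count of 1: remove the first occurrence of `value`.
  lrem(key, value) {
    const items = this.list(key);
    const index = items.indexOf(value);
    if (index === -1) return 0;
    items.splice(index, 1);
    return 1;
  }
}

const redis = new FakeRedis();
const queueKey = "queue:default";       // hypothetical key name
const workingKey = "worker:w1:working"; // hypothetical per-worker list

redis.lpush(queueKey, "job-1");
redis.lpush(queueKey, "job-2");

// Dequeue: the oldest job moves into the worker's own list, so it is
// never outside the datastore while the worker holds it.
const job = redis.rpoplpush(queueKey, workingKey);
const heldWhileRunning = [...redis.list(workingKey)];

// ... the job runs here ...

// Resolve: remove the job from the worker's list, whether it succeeded or failed.
redis.lrem(workingKey, job);
```

If the worker died between the `rpoplpush` and the `lrem`, `job-1` would still be sitting in `worker:w1:working`, which is what lets another worker find and requeue it.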
It's important to realize that this is not safe queueing, but safer. When jobs are orphaned, we need to prioritize getting them done ASAP, so we want to put each one at the front of its queue (the end that workers pop from). However, Redis lacks a command to do this atomically, so we must POP the job from the dead worker's list and then RPUSH it onto the queue. It is thus possible, though considerably less likely than without safer queueing, to lose a job in the milliseconds where the job is out of Redis. It is also, and separately, possible for a job to have been completed by the dying worker before it could be acknowledged in Redis, and so another worker might re-do the same work. This is why it's important for jobs to be designed for idempotency.
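As a rough illustration of what designing for idempotency can look like, the sketch below records which job IDs have already been handled and turns a redelivered job into a no-op. Everything here is hypothetical (the handler name, the job shape, and the in-memory `Set`); a real job would keep this record in durable storage, ideally updated together with the side effect itself.

```javascript
// Hypothetical job handler; none of these names come from TaskBotJS.
// A real system would keep this record in durable storage, not in memory.
const processedJobIds = new Set();
const sentEmails = [];

function sendWelcomeEmail(job) {
  // Skip work that an earlier (possibly crashed) attempt already finished.
  if (processedJobIds.has(job.id)) return false;

  sentEmails.push(job.args.address); // the side effect we must not repeat
  processedJobIds.add(job.id);
  return true;
}

const welcomeJob = { id: "abc123", args: { address: "user@example.com" } };
const firstRun = sendWelcomeEmail(welcomeJob);  // performs the work
const secondRun = sendWelcomeEmail(welcomeJob); // a redelivery does nothing
```

With a guard like this, a job that is requeued after its original worker died post-completion does no harm when a second worker picks it up.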
Safer queueing also incurs some performance penalty and load on the Redis server. If you need this feature, be prepared to provision for it.
(These bits of code are extracted from the example project's configuration, which is worth reading for this and all Pro features.)
Two steps are necessary to enable safer queueing. First, we need to specify
the reliable option to the intake.
config.intake = {
  type: "weighted",
  reliable: true,
  timeoutSeconds: 1,
  queues: [
    { name: "critical", weight: 5 },
    { name: "default", weight: 3 },
    { name: "low", weight: 2 }
  ]
};
Once that's done, we need to activate the orphan plugin, which will check
intermittently for workers that are no longer reporting a heartbeat to the
datastore because of a crash, a network partition, or the like. You can check a
worker's last heartbeat in the control panel or by invoking
Client.getWorkerInfo().
config.orphan.enabled = true;
config.orphan.polling.interval = { seconds: 3 };
config.orphan.polling.splay = { seconds: 1 };
config.orphan.requeueAge = { seconds: 30 };
It is important to note that the example project uses very accelerated
timings to make it easier to show off the features of TaskBotJS Pro. The default
polling time for the orphan plugin ranges between 30 and 34 minutes, with a
default requeueAge of 30 minutes. Those defaults are almost certainly fine for
the vast majority of projects, but you can tune them if necessary.
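To make the timing settings more concrete, here is one way they could interact. This is a sketch under stated assumptions, not the plugin's actual logic: it assumes that splay is a random extra delay added to the polling interval (presumably so that workers do not all poll at once), and that a worker is treated as orphaned once its last heartbeat is older than requeueAge. The helper names are invented for illustration.

```javascript
// Convert a duration object such as { seconds: 30 } to milliseconds.
// Only the `seconds` field shown in the example configuration is handled.
function toMs(duration) {
  return (duration.seconds || 0) * 1000;
}

// ASSUMPTION: splay is a random extra delay in [0, splay) added to the interval.
// `random` is injectable so the result can be made deterministic.
function nextPollDelayMs(polling, random = Math.random) {
  return toMs(polling.interval) + Math.floor(random() * toMs(polling.splay));
}

// ASSUMPTION: a worker counts as orphaned once its last heartbeat is
// older than requeueAge.
function isOrphaned(lastHeartbeatMs, nowMs, requeueAge) {
  return nowMs - lastHeartbeatMs > toMs(requeueAge);
}

// Values taken from the example project's accelerated configuration.
const examplePolling = { interval: { seconds: 3 }, splay: { seconds: 1 } };
const exampleRequeueAge = { seconds: 30 };

const shortestDelay = nextPollDelayMs(examplePolling, () => 0);  // 3000 ms
const midDelay = nextPollDelayMs(examplePolling, () => 0.5);     // 3500 ms
const staleWorker = isOrphaned(0, 31000, exampleRequeueAge);     // true
const liveWorker = isOrphaned(0, 5000, exampleRequeueAge);       // false
```

Under this reading, the example project polls every 3 to 4 seconds and requeues the jobs of any worker that has been silent for more than 30 seconds.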