I've been thinking about a few abstractions that may be helpful. I'll focus on the current set of jobs we have on crates.io, but my main insight is that database connections are probably the most limited resource. Of course memory and thread usage matter as well, but swirl should make it simple to share a single small database connection pool across a set of jobs with varying, and possibly intertwined, behavior.
My proposal is to add the ability to create Job groups. Each group of one or more jobs is responsible for scheduling its jobs, and each job belongs to exactly one group. Groups could also be used to set behavior that applies to a whole set of jobs, such as timeouts or conditions for sending alerts.
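As a very rough sketch of what this could look like (purely hypothetical; none of these names or types exist in swirl today), a group might bundle a scheduling policy together with the per-group settings mentioned above:

```rust
use std::time::Duration;

/// How a group schedules the jobs that belong to it.
pub enum SchedulingPolicy {
    /// Run jobs one at a time, in enqueue order, using at most one
    /// database connection for the whole group.
    Queue,
    /// Run jobs concurrently, limited only by the shared connection
    /// pool and thread pool (roughly today's behavior).
    Parallel,
    /// Run a single job in series, re-enqueuing the next repetition
    /// after the current run finishes.
    Repeated { interval: Duration },
}

/// Behavior that applies to every job in the group.
pub struct JobGroup {
    pub name: &'static str,
    pub policy: SchedulingPolicy,
    /// Give up on a job that runs longer than this.
    pub timeout: Option<Duration>,
    /// Alert if a job in this group stays enqueued longer than this.
    pub alert_after: Option<Duration>,
}
```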
Group Types
I'm proposing 3 initial group types:
Queue
A queue of jobs run serially. When running a job, a connection will be taken from the pool, but at most a single connection will be used by the queue at any given time. Swirl could potentially even provide a facility to ensure this queuing behavior holds when multiple swirl instances run in parallel (although I would consider that out of scope for now).
This group type would be applicable for jobs that interact with a global singleton resource.
For our publish and yank jobs, that resource is a git repository. There is no point in attempting to spawn multiple index update jobs in parallel. If two or more jobs from this queue were spawned at the same time, the additional jobs would block on the mutex while holding extra database connections that could be used by other jobs.
Possible customization:
- On job failure, we could retry the failed job (with backoff) rather than move on to the next job in the queue. This way index updates are applied in chronological order, even if there is an intermittent network issue. (The downside is that a job failing for a non-network reason would block the following jobs in the queue, which might otherwise be able to make progress.)
- Alerting: if the queue of index jobs hasn't been drained in the last 5 minutes, then alert; GitHub might be down. (Currently we alert if any job, globally, remains in the queue for too long, but different limits could potentially be scoped to individual job groups.)
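To make this concrete, here is how the index queue might be described using the hypothetical JobGroup/SchedulingPolicy sketch from earlier; the timeout value is made up, and the retry-with-backoff option isn't modeled:

```rust
fn index_queue() -> JobGroup {
    JobGroup {
        name: "index-updates",
        // Publish and yank jobs contend on the same git repository, so run
        // them one at a time and hold at most one database connection.
        policy: SchedulingPolicy::Queue,
        // Illustrative: give up on a git operation that hangs this long.
        timeout: Some(Duration::from_secs(30)),
        // Alert if the queue hasn't drained in 5 minutes; GitHub might be down.
        alert_after: Some(Duration::from_secs(5 * 60)),
    }
}
```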
Parallel
A set of independent jobs which can be run in parallel. This is basically the current behavior.
On crates.io, readme rendering would fit in this group. This group type would also be useful for sending emails in the background.
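Using the same hypothetical types, this group needs no extra constraints:

```rust
fn readme_and_email_group() -> JobGroup {
    JobGroup {
        name: "background",
        // Independent jobs: run as many concurrently as the pool allows.
        policy: SchedulingPolicy::Parallel,
        timeout: None,
        alert_after: None,
    }
}
```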
Repeated
Similar to a queue, this group would run its job in series, but swirl could also automatically handle enqueuing the next repetition after the current job finishes.
On crates.io, the job to update download counts would fit into this group.
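In the same hypothetical sketch, the group itself would carry the repetition interval (the ten-minute value here is made up):

```rust
fn update_downloads_group() -> JobGroup {
    JobGroup {
        name: "update-downloads",
        // swirl re-enqueues the next run automatically after each one finishes.
        policy: SchedulingPolicy::Repeated { interval: Duration::from_secs(10 * 60) },
        timeout: None,
        alert_after: None,
    }
}
```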
Summary
This is a rough outline of possible ideas, but I think it makes sense to provide some mechanism for the developer to customize how jobs are scheduled and to set different alerting rules for a group of jobs. We probably want to alert more promptly during a GitHub outage than we do if there is a delay in publishing readmes to S3.
Currently, if a batch of crates is published during a GitHub outage and the git operations hit a network timeout (say each job hangs for 30 seconds before failing), the index jobs can starve other jobs of database connections. By defining a queue of related jobs, effectively letting swirl know about our internal Mutex, swirl could schedule these jobs more efficiently.