Make the job runner a bit more resilient to slow jobs or other errors #1804

Merged: 3 commits merged into rust-lang:master on Aug 21, 2019

Conversation

@sgrif (Contributor) commented Aug 14, 2019

A brief incident was caused by #1798. A band-aid fix is in place, and
#1803 (included in this branch) makes it possible to apply similar
band-aids in the future without requiring a rebuild of the code. This
commit attempts to better address the root problem though.

The short version (which is expanded on below, but not required to
understand this commit or why it's needed) is that update_downloads
takes longer than our job timeout to run. When we moved that task to a
background job, we did not increase the number of concurrent jobs, nor
did we increase the timeout. This meant that swirl timed out trying to
start new jobs, and our behavior in that case was to crash the process.
This would mean that update_downloads never completes, and remains at
the front of the queue. This PR addresses all 3 of the problematic
cases.

  • Increasing concurrency
    • When this system was added, the only jobs we had were index updates.
      These want to be serial, so we set the thread pool size to 1. We
      added readme renderings, which probably should have been parallel,
      but only happen with crate publishes anyway so it was fine.
      update_downloads always takes longer than the timeout to run
      though. We can't have it block everything else while it's running.

      The main downside to this is that index updates are no longer
      guaranteed to run in serial, which means that if two crates are
      uploaded simultaneously one job will fail and will have to wait for
      a retry to update the index. In theory if a crate happened to be
      uploaded at the exact instant of the retry 7 or 8 times in a row
      this could even result in getting paged. This is exceptionally
      unlikely, and I'm not concerned about it for now. As more features
      land in swirl we may want to move index updates to their own queue
      or tweak the retry behavior on that job though.

      Swirl will eventually handle this for us by default, and we should
      use its defaults once that lands.

  • Increasing the default timeout
    • 10s was a bit too aggressive. Fundamentally there is always some
      condition under which we hit this timeout, and the reason for
      hitting it is that we are receiving more jobs than we can process
      (either because of the volume of jobs, or because our jobs are too
      slow).

      The most common reason we would hit this is that all threads are
      occupied by a job which takes longer than the timeout to execute.
      Increasing the concurrency makes this less likely to occur since our
      jobs are low volume, but we were actually seeing this crash before
      the addition of update_downloads, meaning that our other jobs are
      sometimes taking >10s to run. Increasing the concurrency beyond 2
      would make it extremely unlikely we will ever hit this, but since we
      theoretically can with a burst of crate uploads at any concurrency,
      I've also upped the timeout.

  • Rebuild the runner a few times before crashing the process
    • This is the most important change, though it's the only one that
      wouldn't fix the problem by itself. The first two changes address
      why the problem occurred, this last change addresses why it placed
      us in an unrecoverable state.

      What would happen is we would time out trying to start another job
      after update_downloads, and then the process would crash. This
      would mean that update_downloads would never complete, so as soon
      as we restarted, we'd just try to run it again (I may also change
      swirl to increment the retry counter before even beginning to run
      the job, but there are issues with that which are out of scope for
      this commit to discuss).

      This commit changes the behavior to instead build a new runner
      (which means a new thread pool and DB pool) up to 5 times before
      crashing the process. This means that any spawned threads will get a
      bit more time to run before the process itself crashes, so any jobs
      clogging the runner still get a chance to complete. I've opted to
      have a hard limit on the number of failures in the runner to avoid
      potentially unbounded growth in DB connections. We do still want to
      eventually fail, since being unable to start jobs can indicate
      issues that are only solved by starting a new process or moving to
      another physical machine. (A rough sketch of this restart loop
      follows this list.)
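
A minimal sketch of what that restart behavior looks like (this is not the
actual crates.io code; `Runner`, its constructor, and the concrete thread
count and timeout values below are stand-ins for swirl's real API and our
real configuration):

```rust
// Sketch only: build a fresh runner (new thread pool + DB pool) after a
// failure, and only crash the process after several consecutive failures.
use std::time::Duration;

struct Runner; // stand-in for swirl's runner type

impl Runner {
    fn build(_thread_count: usize, _timeout: Duration) -> Runner {
        // The real constructor also creates a thread pool and a DB connection
        // pool, which is why the number of rebuilds is capped.
        Runner
    }

    fn run_all_pending_jobs(&self) -> Result<(), String> {
        // Placeholder for swirl's method discussed later in this description.
        Ok(())
    }
}

fn main() {
    const MAX_FAILURES: usize = 5;
    let mut failures = 0;

    loop {
        // Each iteration gets a brand new runner rather than reusing one whose
        // threads may still be clogged by a slow job.
        let runner = Runner::build(2, Duration::from_secs(30));

        match runner.run_all_pending_jobs() {
            Ok(()) => {} // keep looping and looking for more jobs
            Err(err) => {
                failures += 1;
                eprintln!("failed to run jobs (failure {failures}): {err}");
                if failures >= MAX_FAILURES {
                    panic!("job runner failed {} times; giving up", MAX_FAILURES);
                }
            }
        }
    }
}
```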

More specific technical details on the issue that are not required to review this PR, but may be interesting
--

I've written this issue up at sgrif/swirl#16
as well.

The main entry point for a Swirl runner today is run_all_pending_jobs.
This method is fairly low level. The intent is to eventually add a
"reasonable defaults" binary shipped with swirl, probably somewhat based
on what crates.io needs here. This method will run in a loop, attempting
to fully saturate its thread pool on each iteration. It will check the
number of available threads, spawning that many tasks.

Each task that is spawned will quickly communicate back to the
coordinator via an mpsc channel. The coordinator keeps track of how many
messages it's expecting (we get exactly 1 message per spawned task). If
we aren't currently expecting any messages, and there are also 0
available threads, we will attempt to spawn 1 task no matter what. This
is to ensure we don't loop forever waiting for a free thread, and
respect the given timeout.

We do this in a loop until we hear from a thread that there was no job
available, or receive an error (caused by a thread being unable to get a
DB connection, an error loading the job from the DB [which should only
happen if the DB has gone away], or if we time out waiting to hear back
at all).

That's exactly what happened in this case. We would see 1 available
thread, spawn 1 task, and have 1 pending message. The worker would
communicate back that it got a job. We'd loop. There are 0 available
threads. We are expecting 0 messages, so we spawn 1 task anyway. We are
now expecting 1 pending message. We block waiting for it. The only way
we will receive a message is for the job we started in the first
iteration to complete before the timeout. It doesn't, so
run_all_pending_jobs returns an error. Our runner was calling
.expect on that, so the process crashes.
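
To make that control flow easier to follow, here is a heavily simplified,
hypothetical sketch of one pass of the coordination (not swirl's actual code;
`Report`, `try_claim_job`, and `one_pass` are made-up names). The timeout
branch at the bottom is the error that our runner turned into a crash:

```rust
// One pass of "spawn tasks, then expect exactly one message per task".
// The real run_all_pending_jobs repeats this until a worker reports that the
// queue is empty or an error occurs.
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

enum Report {
    GotJob,         // the worker claimed a job and is now running it
    NoJobAvailable, // the queue is empty
}

fn try_claim_job() -> Report {
    // Placeholder for "get a DB connection and lock a pending job".
    Report::NoJobAvailable
}

fn one_pass(available_threads: usize, timeout: Duration) -> Result<(), &'static str> {
    let (tx, rx) = mpsc::channel();

    // Spawn one task per available thread, but always at least one so we never
    // wait forever for a thread to free up; the receive timeout below bounds
    // how long we block instead.
    let to_spawn = available_threads.max(1);
    for _ in 0..to_spawn {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send(try_claim_job());
        });
    }

    // Expect exactly one message per spawned task. In swirl the tasks go to a
    // bounded thread pool, so when every pool thread is stuck in a slow job the
    // extra task never gets to report back and this receive times out (plain
    // thread::spawn here does not model that saturation).
    for _ in 0..to_spawn {
        match rx.recv_timeout(timeout) {
            Ok(Report::GotJob) => {}
            Ok(Report::NoJobAvailable) => return Ok(()),
            Err(_) => return Err("timed out waiting to hear back from a worker"),
        }
    }
    Ok(())
}

fn main() {
    println!("{:?}", one_pass(2, Duration::from_secs(30)));
}
```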

This shows several issues both in the configuration that was being used
by crates.io, and also in Swirl itself. I discussed the configuration
issues above, but there are also questions WRT Swirl's design. The first
issue is whether this case should be separated from not getting a
response from the worker at all. The latter should never happen under
reasonable circumstances, so my gut is that we can assume that if it
does happen, it was due to this case...

The second issue is that this put us in an unrecoverable state
rather than causing one class of issues to fail to run. This could be
prevented by increasing the retry counter outside of a transaction
before running the job. This has issues though, which are out of scope
for this commit, but basically boil down to introducing non-atomic
pieces to an otherwise atomic operation.

sgrif added 2 commits August 14, 2019 12:46
An incident was caused by rust-lang#1798. There is a description below if you're
interested, but this PR does not fix the problem. However, the band-aid
fix to get things running again is to increase the timeout for the job
runner. When responding to an incident, waiting for a full rebuild to
change this is not acceptable. This replaces the hard-coded value with
an environment variable so we can quickly change this on the fly in the
future.
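
Concretely, the change amounts to something like the following (the variable
name and default here are illustrative, not necessarily what crates.io ended
up using):

```rust
// Read the job-runner timeout from the environment, falling back to a default
// instead of a value baked into the binary.
use std::time::Duration;

fn job_start_timeout() -> Duration {
    let seconds = std::env::var("JOB_RUNNER_TIMEOUT")
        .ok()
        .and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(30); // fallback when the variable is unset or unparsable
    Duration::from_secs(seconds)
}

fn main() {
    println!("job start timeout: {:?}", job_start_timeout());
}
```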

Description of the actual problem that this does not fix
--

The problem was that the `update_downloads` job takes longer than the
timeout we had set for jobs to begin running. So swirl would start the
`update_downloads` job, try to spawn another worker, and then time out
waiting to hear from that worker whether it got a job or not. So we would
crash the process, the job would be incomplete, and we'd just start the
whole thing over again.

There are several real fixes for this, and I will open a PR that is some
combination of all of them. Ultimately, each of these fixes just increases
the number of slow concurrent jobs that can be run before we hit the
timeout and the problem re-appears, but that's fundamentally always
going to be the case... If we are getting more jobs than we can process,
we do need to get paged so we can remedy the situation. Still, any or
all of these will be the "real" fix:

- Increasing the number of concurrent jobs
- Increasing the timeout
- Re-building the runner before crashing
  - The reason this would fix the issue is that by not crashing the
    process, we give the spawned threads a chance to finish. We do still
    want to *eventually* crash the process, as there might be something
    inherent to this process or machine preventing the jobs from
    running, but starting with a new thread/connection pool a few times
    gives things a better chance to recover on their own.

@rust-highfive

r? @jtgeibel

(rust_highfive has picked a reviewer for you, use r? to override)

@jtgeibel (Member):

> The main downside to this is that index updates are no longer
> guaranteed to run in serial, which means that if two crates are
> uploaded simultaneously one job will fail and will have to wait for
> a retry to update the index.

I think I'm a bit confused here. At first I was thinking this was referring to the retry loop (that used to be in commit_and_push and is now managed by swirl retrying the job) and that 2 index updates could potentially make changes to the file system in parallel. Fortunately, we do wrap the git repo in a Mutex and for both the add_repo and yank jobs the first action is to obtain a lock on the repo. Importantly, in both cases this occurs before the code mutates anything on the file system. So I think the existing locking will ensure that git operations are serialized and that we aren't relying on a job failure and retry here.

I still want to dig into a few things, but if my understanding is correct I think these changes look fine to me. I think the biggest operational risk probably occurs when there is a GitHub outage. In that scenario we could have a backlog of git operations that are blocked, causing a delay in running independent jobs like update-downloads.
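
For readers less familiar with that code path, the locking being described has
roughly this shape (illustrative names, not the actual crates.io types): both
index-touching jobs take the same lock before mutating anything on disk, so the
git operations serialize even when the jobs themselves run on separate threads
of the same process.

```rust
use std::sync::Mutex;

struct Repository; // stand-in for the checked-out index handle

fn add_crate_to_index(repo: &Mutex<Repository>) {
    let _guard = repo.lock().unwrap(); // acquired before any filesystem mutation
    // ... write the index entry, commit, and push while holding the lock ...
}

fn yank_crate(repo: &Mutex<Repository>) {
    let _guard = repo.lock().unwrap(); // same lock, so the two jobs serialize
    // ... flip the yanked flag, commit, and push ...
}

fn main() {
    let repo = Mutex::new(Repository);
    add_crate_to_index(&repo);
    yank_crate(&repo);
}
```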

@sgrif (Contributor, Author) commented Aug 15, 2019 via email

@sgrif (Contributor, Author) commented Aug 15, 2019

You're right that the mutex will prevent multiple git jobs from racing as long as they're run on the same machine (which for now they will be), I had forgotten that was there.

@sgrif (Contributor, Author) commented Aug 21, 2019

We just had another incident from this. @jtgeibel Is this good to go?

@jtgeibel (Member):

I have a few suggestions long-term (but don't have time to type them up at the moment). This PR is definitely an improvement, merging.

@bors r+

@bors (Contributor) commented Aug 21, 2019

📌 Commit ad2bfe7 has been approved by jtgeibel

bors added a commit that referenced this pull request Aug 21, 2019
Make the job runner a bit more resilient to slow jobs or other errors

@bors (Contributor) commented Aug 21, 2019

⌛ Testing commit ad2bfe7 with merge aa9b61c...

@bors (Contributor) commented Aug 21, 2019

☀️ Test successful - checks-travis
Approved by: jtgeibel
Pushing aa9b61c to master...

bors merged commit ad2bfe7 into rust-lang:master on Aug 21, 2019
sgrif added a commit to sgrif/crates.io that referenced this pull request Apr 16, 2020
This changes the behavior of the `update_downloads` background job from
processing all rows serially to spawning a smaller job for each 1000
rows that need to be processed. This shortens the amount of time that
any one job runs (making us less likely to hit timeouts in the runner
and encounter issues that rust-lang#2267 and rust-lang#1804 addressed). More importantly,
it means that we are able to do more in parallel, reducing the overall
time it takes to count downloads.
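
A rough sketch of that split (hypothetical names; the real implementation
enqueues swirl background jobs and uses Diesel for the queries):

```rust
// The parent update_downloads job gathers the rows that still need counting
// and enqueues one child job per 1000 of them, so several worker threads can
// process chunks in parallel.
const BATCH_SIZE: usize = 1_000;

fn split_into_batches(version_ids_needing_update: Vec<i32>) -> Vec<Vec<i32>> {
    version_ids_needing_update
        .chunks(BATCH_SIZE)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // Stand-in for "SELECT ... FROM version_downloads WHERE downloads != counted".
    let ids: Vec<i32> = (1..=2500).collect();
    for batch in split_into_batches(ids) {
        // In the real code this would enqueue a background job for the batch.
        println!("enqueue update_downloads batch of {} rows", batch.len());
    }
}
```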

About the Problem
===

There are two main thresholds we care about for how long this job takes
to run:

- If it takes longer than the interval at which we enqueue this job
(typically every 10 minutes, currently every hour due to the issues this
PR addresses), we can end up with two instances of it running in
parallel. This causes downloads to get double counted, and the jobs tend
to contend for row locks and slow each other down. The double counting
will be corrected the next time the job runs. This only tends to happen
if a crawler downloads a large number of crates in rapid succession,
causing the rows we have to process to increase from our normal volume
of ~10k per hour to ~150k. When this occurs, we're likely to hit the
second threshold.

- If it takes longer than `$MAX_JOB_TIME` (currently set to 60 for the
reasons below, defaults to 15), I will be paged. This has been happening
much more frequently as of late (which is why that env var is currently
at 60 minutes). It's unclear if this is because crawlers are downloading
large volumes of crates more frequently, or if we're just seeing normal
volume push us over 15 minutes to process serially.

Splitting into smaller jobs doesn't directly help either of those
thresholds, but being able to process rows in parallel does, since the
overall time this takes to complete will go down dramatically (currently
by a factor of 4, but we can probably set the number of threads to
higher than CPU cores and still see benefits since we're I/O bound).

Based on extremely anecdotal, non-scientific measurements ("I ran
`select count(*) from version_downloads where downloads != counted`
roughly once a minute, a few times, while the job was churning through
>100k rows"), we can process roughly ~4k rows per minute, which seems about
right for 6 queries per row. We can substantially increase throughput if
we reduce this to one round trip, but for now we can expect this to take
roughly 15 seconds per batch. The longest I've ever seen this job take
(and I get paged if it takes too long, I've 100% seen the longest run
times) is just over an hour. Since this should reduce it by *at least* a
factor of 4, this will mean the time it takes to run if every version
was downloaded at least once since the last run will be around 15
minutes. If we can bring this down to a single round trip per row, that
should further reduce it to around 2.5 minutes.
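
For reference, the arithmetic behind those estimates, using only the numbers
quoted above:

```rust
fn main() {
    let rows_per_minute = 4_000.0_f64;                        // observed serial throughput
    let seconds_per_batch = 1_000.0 / rows_per_minute * 60.0; // ~15 s per 1000-row batch
    let worst_run_minutes = 60.0;                             // longest observed run: just over an hour
    let with_four_workers = worst_run_minutes / 4.0;          // ~15 min at a 4x speedup
    let one_round_trip = with_four_workers / 6.0;             // ~2.5 min if 6 queries per row become 1
    println!("{seconds_per_batch} s/batch, {with_four_workers} min, {one_round_trip} min");
}
```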

Since this means we'll use all available worker threads in parallel, it
also means that even if we have `update_downloads` queued again before
the previous run completed, it's unlikely to ever be looking at the same
rows in parallel, since the batches from the second run wouldn't be
handled until all but worker_count - 1 batches from the first run have
completed.

Drawbacks
===

There are two main drawbacks to this commit:

- Since we no longer process rows serially before running
`update_recent_crate_downloads`, the data in `recent_crate_downloads`
will reflect the *previous* run of `update_downloads`, meaning it's
basically always 10-20 minutes behind. This is a regression over a few
months ago, where it was typically 3-13 minutes behind, but an
improvement over today, where it's 3-63 minutes behind.

- The entire background queue will be blocked while `update_downloads`
runs. This was the case prior to rust-lang#1804. At the time of that commit, we
did not consider blocking publishes to be a problem. We added the
additional thread (assuming only one would be taken by
`update_downloads` at any given time) to prevent the runner from
crashing because it couldn't tell if progress was being made. That won't
be an issue with this commit (since we're always going to make progress
in relatively small chunks), but does mean that index updates will
potentially be delayed by as much as 15 minutes in the worst case.
(this number may be higher than is realistic since we've only observed
>1 hour runs with the job set to queue hourly, meaning more rows to
process per run). Typically the delay will only be at most 30 seconds.

If I wasn't getting paged almost every day, I'd say this PR should be
blocked on the second issue (which is resolved by adding queue priority
to swirl). But given the operational load this issue is causing, I think
increasing the worst case delay for index updates is a reasonable
tradeoff for now.

Impl details
===

I've written the test in a sorta funky way, adding functions to get a
connection in and out of a test DB pool. This was primarily so I could
change the tests to queue the job, and then run any pending jobs,
without too much churn (this would otherwise require having the runner
own the connection, and putting any uses of the connection in braces
since we'd have to fetch it from the pool each time).

This relies on an update to swirl (which is not in master at the time of
writing this commit) for ease of testing. Testing `update_downloads`
after this change requires actually running the background job. At the
time of writing this, on master that would mean needing to construct a
`background_jobs::Environment`, which involves cloning git indexes. The
update to swirl means we can have the jobs take a connection directly,
changing their environment type to `()`, making them much easier to
test.
Turbo87 pushed a commit to sgrif/crates.io that referenced this pull request Dec 6, 2020