aggregate job metrics #686

Merged
merged 3 commits into from
Dec 13, 2023

bloodearnest
Member

  • Collect and store per-job aggregate metrics
  • Expose metrics from the executor
  • Expose metrics in the sync protocol

Stores and updates running totals/aggregates of metrics whilst a job is
EXECUTING.

To keep executor state separate from scheduler state, we store this
state in a separate db file altogether. This prevents lock contention on
the main db file. Unlike the last time we did this, it should not grow
to excessive size, as we only store 1 small row per job.
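
As a rough illustration of the "one small row per job in a separate db file" idea, here is a minimal sketch using the standard-library `sqlite3` module. The table name, column names, and the helper functions `open_metrics_db`, `write_job_metrics` and `read_job_metrics` are assumptions made for this example, not necessarily what the PR implements.

```python
import json
import sqlite3


def open_metrics_db(path=":memory:"):
    """Open a metrics database kept apart from the main scheduler db.

    The default is an in-memory db so the sketch runs anywhere; a real
    deployment would pass a path to a separate file (hypothetical here).
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, metrics TEXT)"
    )
    return conn


def write_job_metrics(conn, job_id, metrics):
    """Overwrite the single row for this job, so the table never grows per tick."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO jobs (id, metrics) VALUES (?, ?)",
            (job_id, json.dumps(metrics)),
        )


def read_job_metrics(conn, job_id):
    """Return the stored metrics dict, or an empty dict if none were recorded."""
    row = conn.execute(
        "SELECT metrics FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}


conn = open_metrics_db()
write_job_metrics(conn, "job-1", {"cpu_peak": 10.0})
write_job_metrics(conn, "job-1", {"cpu_peak": 25.0})  # replaces, does not append
```

Because each update replaces the job's existing row, the number of rows is bounded by the number of jobs rather than by the number of samples taken.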

For each metric, we calculate and store: latest sample, cumsum, current
mean (measured at this point in time), and peak value.
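
The four aggregates listed above can be sketched as a small running-total structure. This is only an illustration: the class name `MetricAggregate` and its fields are hypothetical, and the sketch assumes the mean is the cumulative sum divided by the number of samples, which the description does not specify.

```python
from dataclasses import dataclass


@dataclass
class MetricAggregate:
    """Running aggregates for one metric (hypothetical names)."""

    sample: float = 0.0  # latest sample seen
    cumsum: float = 0.0  # sum of all samples so far
    mean: float = 0.0    # mean as of the latest update
    peak: float = 0.0    # largest sample seen
    count: int = 0       # number of samples (assumption: needed for the mean)

    def update(self, value: float) -> None:
        """Fold one new sample into the running aggregates."""
        self.sample = value
        self.cumsum += value
        self.count += 1
        self.mean = self.cumsum / self.count
        self.peak = max(self.peak, value)


cpu = MetricAggregate()
for reading in (10.0, 30.0, 20.0):
    cpu.update(reading)
```

Only the aggregates are kept, not the individual samples, which is what keeps the stored state small.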

We also add them to the tick trace spans, as they can be useful there.

This modifies the local executor's get_status() to return any collected
metrics.

This is not used right now, but in future we might want to base
scheduling decisions on it.
@bloodearnest bloodearnest merged commit 95de4e6 into main Dec 13, 2023
12 checks passed
@bloodearnest bloodearnest deleted the aggregate-job-metrics branch December 13, 2023 10:40
lucyb added a commit to opensafely-core/job-server that referenced this pull request Jan 15, 2024
Job runner is now returning summary statistics for jobs (see
[PR](opensafely-core/job-runner#686)). This allows us
to see things like the average and peak CPU and memory. We expect these fields
to change and be added to as we learn what's useful (or not), so the field has
been created as a JSON field to give us plenty of flexibility.
iaindillingham added a commit to opensafely-core/job-server that referenced this pull request Jan 24, 2024
This adds a "Job metrics" card to the Job Detail page for authenticated
(logged in) users that displays CPU and memory usage statistics, when
available (see below).

In addition to mean and peak, which are displayed here, Job Runner
collects sample and cumsum for both CPU and memory usage statistics
(opensafely-core/job-runner#686). However, I don't think sample and
cumsum usage statistics are useful to users: sample, because we (and so
they) don't know when the sample was taken; cumsum, because we don't
know how many samples were taken (and if we did, then all we would be
able to compute would be the mean, which Job Runner has computed for
us).

As #3998 states, usage statistics are not available for historic jobs
(i.e. prior to 17/01/2024). In this case, `job.metrics == None`. They
are also not available for jobs that are neither running nor succeeded.
In this case, `job.metrics == {}`. We don't distinguish these cases,
displaying "-" to users in both cases.
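
The "don't distinguish these cases" behaviour can be sketched as a single falsy check. The helper name `display_metric` and the key `cpu_mean` are hypothetical, chosen for illustration only; the real card may be rendered differently.

```python
def display_metric(metrics, key):
    """Return a display string for one metric, or "-" when unavailable.

    Both `None` (historic jobs) and `{}` (no metrics collected) are falsy,
    so one check covers both cases.
    """
    if not metrics:
        return "-"
    value = metrics.get(key)
    return "-" if value is None else str(value)


historic = display_metric(None, "cpu_mean")
empty = display_metric({}, "cpu_mean")
available = display_metric({"cpu_mean": 12.5}, "cpu_mean")
```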

Closes #3998
iaindillingham added a commit to opensafely-core/job-server that referenced this pull request Jan 26, 2024
This adds a "Job metrics" card to the Job Detail page for users with one
or more roles. This card displays CPU and memory usage statistics, when
available (see below).

In addition to mean and peak, which are displayed here, Job Runner
collects sample and cumsum for both CPU and memory usage statistics
(opensafely-core/job-runner#686). However, I don't think sample and
cumsum usage statistics are useful to users: sample, because we (and so
they) don't know when the sample was taken; cumsum, because we (and so
they) don't know how many samples were taken (and if we did, then all we
would be able to compute would be the mean, which Job Runner has
computed for us).

As #3998 states, usage statistics are not available for historic jobs
(i.e. prior to 17/01/2024). In this case, `job.metrics == None`. They
are also not available for some other types of job. In these cases,
`job.metrics == {}`. We don't distinguish these cases, displaying "-" to
users in both cases.

Closes #3998