aggregate job metrics #686
Merged
bloodearnest (Member) commented on Dec 11, 2023
- Collect and store per-job aggregate metrics
- Expose metrics from the executor
- Expose metrics in the sync protocol
bloodearnest force-pushed the aggregate-job-metrics branch 4 times, most recently from f7ad151 to fe900d3 (December 11, 2023 12:45)
evansd approved these changes on Dec 12, 2023
Stores and updates running totals/aggregates of metrics whilst a job is EXECUTING. To keep executor state separate from scheduler state, we store this state in a separate db file altogether. This prevents lock contention on the main db file. Unlike the last time we did this, it should not grow to excessive size, as we only store one small row per job. For each metric, we calculate and store: the latest sample, the cumulative sum (cumsum), the current mean (measured at this point in time), and the peak value. We also add them to the tick trace spans, as they can be useful there.
This modifies the local executor's get_status() to return any collected metrics. This is not used right now, but in future we might want to base scheduling decisions on it.
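As a rough illustration of the aggregation described in this commit message, here is a minimal Python sketch. It is not the code from this PR: the `MetricAggregate` class, the `job_metrics` table, the `save_metrics`/`load_metrics` helpers, and the explicit `count` field are all hypothetical, and the mean assumes equally weighted samples, which may differ from how job-runner actually computes it. It keeps one small row per job in a separate SQLite database.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class MetricAggregate:
    """Running aggregates for one metric (hypothetical structure)."""

    sample: float = 0.0  # latest sample seen
    cumsum: float = 0.0  # running total of all samples
    mean: float = 0.0    # mean as of the latest update
    peak: float = 0.0    # largest sample seen so far
    count: int = 0       # assumption: number of samples, used for the mean

    def update(self, value: float) -> None:
        self.sample = value
        self.cumsum += value
        self.count += 1
        self.mean = self.cumsum / self.count
        self.peak = max(self.peak, value)


def save_metrics(conn: sqlite3.Connection, job_id: str, aggregates: dict) -> None:
    """Overwrite the single row for this job, so the table stays small."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_metrics (job_id TEXT PRIMARY KEY, metrics TEXT)"
    )
    payload = json.dumps({name: asdict(agg) for name, agg in aggregates.items()})
    conn.execute(
        "INSERT OR REPLACE INTO job_metrics (job_id, metrics) VALUES (?, ?)",
        (job_id, payload),
    )
    conn.commit()


def load_metrics(conn: sqlite3.Connection, job_id: str) -> dict:
    """Return the stored aggregates, or an empty dict if none exist."""
    row = conn.execute(
        "SELECT metrics FROM job_metrics WHERE job_id = ?", (job_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```

Calling `update()` on every tick and then `save_metrics()` replaces the job's row instead of appending a new one, which is what keeps the metrics database from growing with the number of samples.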
bloodearnest force-pushed the aggregate-job-metrics branch from fe900d3 to c70cbc6 (December 13, 2023 10:33)
This was referenced Dec 14, 2023
lucyb added a commit to opensafely-core/job-server that referenced this pull request on Jan 15, 2024
Job runner is now returning summary statistics for jobs (see [PR](opensafely-core/job-runner#686)). This allows us to see things like the average and peak CPU and memory. We expect these fields to change and be added to as we learn what's useful (or not), so the field has been created as a JSON field to give us plenty of flexibility.
iaindillingham added a commit to opensafely-core/job-server that referenced this pull request on Jan 24, 2024
This adds a "Job metrics" card to the Job Detail page for authenticated (logged in) users that displays CPU and memory usage statistics, when available (see below). In addition to mean and peak, which are displayed here, Job Runner collects sample and cumsum for both CPU and memory usage statistics (opensafely-core/job-runner#686). However, I don't think sample and cumsum usage statistics are useful to users: sample, because we (and so they) don't know when the sample was taken; cumsum, because we don't know how many samples were taken (and if we did, then all we would be able to compute would be the mean, which Job Runner has computed for us). As #3998 states, usage statistics are not available for historic jobs (i.e. prior to 17/01/2024). In this case, `job.metrics == None`. They are also not available for jobs that are neither running nor succeeded. In this case, `job.metrics == {}`. We don't distinguish these cases, displaying "-" to users in both cases. Closes #3998
iaindillingham added a commit to opensafely-core/job-server that referenced this pull request on Jan 26, 2024
This adds a "Job metrics" card to the Job Detail page for users with one or more roles. This card displays CPU and memory usage statistics, when available (see below). In addition to mean and peak, which are displayed here, Job Runner collects sample and cumsum for both CPU and memory usage statistics (opensafely-core/job-runner#686). However, I don't think sample and cumsum usage statistics are useful to users: sample, because we (and so they) don't know when the sample was taken; cumsum, because we (and so they) don't know how many samples were taken (and if we did, then all we would be able to compute would be the mean, which Job Runner has computed for us). As #3998 states, usage statistics are not available for historic jobs (i.e. prior to 17/01/2024). In this case, `job.metrics == None`. They are also not available for some other types of job. In these cases, `job.metrics == {}`. We don't distinguish these cases, displaying "-" to users in both cases. Closes #3998
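The display rule described in this commit message, where both `None` and `{}` are shown as "-", could be sketched in Python as follows. This is an illustration only, not the job-server implementation: the `format_metric` helper and the `cpu_mean` key are hypothetical names.

```python
def format_metric(metrics, key):
    """Return a display string for one metric, or "-" when it is unavailable.

    `metrics` may be None (historic jobs) or {} (jobs without collected
    metrics); both cases are deliberately treated the same way.
    """
    if not metrics:  # covers both None and an empty dict
        return "-"
    value = metrics.get(key)
    return "-" if value is None else f"{value:.1f}"
```

For example, `format_metric(None, "cpu_mean")` and `format_metric({}, "cpu_mean")` both return "-", while a populated dict returns the value formatted to one decimal place.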