Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect, store and send aggregate metrics, take 2 #691

Merged
merged 5 commits into from
Dec 20, 2023

Conversation

bloodearnest
Copy link
Member

@bloodearnest bloodearnest commented Dec 13, 2023

This PR contains the 3 reverted commits #686, and then an additional commit to fix the use of a global variable to cache the connection string in multithreaded application. This is known to be Bad Juju, and I regret the error.

The sync, run and stats thread all have a db connection to the metrics
db, so this cause sqlite to barf.

This was not caught in the test suite because we do not currently have
a functional test that runs the full multithreaded service and tries to
run a job with it. This is in theory possible (it's what the
a backend-server tests for the test backend do), but its also hard to
set up and slow and requires lxc currently, so I have not done that
here.

Once this lands, I will test it does fix the multithreaded issue by
using the backend-server functional tests to fully exercise it.

The commit message for the fix is:

This removes the use of a process global, and adds support for readonly
connections to help reduce contention on the metrics db with
multithreaded access.

Adding readonly mode, and the initial PR feedback, led to a re-working
of the connection management. Rather than re-use
lib.database.get_connection(), we instead implement our own version.
This gives us control, and allows us to efficiently assure appropriate
set up as well as cleanly handle readonly connections.

One small chnage with this - we cannot use an in-memory db in testing
for the metrics db, as mode=memory is not compatible with setting
mode=ro. Each test still gets their own db, but its on disk.

@bloodearnest bloodearnest force-pushed the fix-metrics-db-global branch 3 times, most recently from ed81264 to c3280ea Compare December 13, 2023 17:34
Copy link
Contributor

@evansd evansd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have spotted the issue with that global because I had exactly the same problem myself a while back which is why get_connection() now uses a thread-local.

I'm not quite convinced the changes here are the right approach though. There's a test failure with:

sqlite3.OperationalError: cannot change into wal mode from within a transaction

But I think that indicates that get_connection() isn't the right place to do this set up.

jobrunner/lib/database.py Outdated Show resolved Hide resolved
jobrunner/record_stats.py Outdated Show resolved Hide resolved
jobrunner/lib/database.py Outdated Show resolved Hide resolved
jobrunner/record_stats.py Show resolved Hide resolved
Stores and updates running totals/aggregates of metrics whilst a job is
EXECUTING.

To keep executor state separate from scheduler state, we store this
state in a separate db file altogether. This prevents lock contention on
the main db file. Unlike the last time we did this, it should not grow
to exessive size, as we only store 1 small row per job.

For each metric, we calculate and store: lastest sample, cumsum, current
mean (measured at this point in time), and peak value.

We also add them to the tick trace spans as can be useful.
This modifies local executor get_status() to return any collected
metrics.

This is not used right now, but potentially in future we might want to
make scheduling decisions on it.
@bloodearnest
Copy link
Member Author

Ok, I've completely reworking this, after we reverted the old.

This PR now includes the 3 reverted commits, and then 1 additional commit, which changes the connection management to a) avoid process global and b) support readonly mode

@bloodearnest bloodearnest changed the title Fix metrics db connection global Collect, store and send aggregate metrics Dec 15, 2023
@bloodearnest bloodearnest changed the title Collect, store and send aggregate metrics Collect, store and send aggregate metrics, take 2 Dec 15, 2023
This removes the use of a process global, and adds support for readonly
connections to help reduce contention on the metrics db with
multithreaded access.

Adding readonly mode, and the initial PR feedback, led to a re-working
of the connection management. Rather than re-use
`lib.database.get_connection()`, we instead implement our own version.
This gives us control, and allows us to efficiently assure appropriate
set up as well as cleanly handle readonly connections.

One small chnage with this - we cannot use an in-memory db in testing
for the metrics db, as `mode=memory` is not compatible with setting
`mode=ro`. Each test still gets their own db, but its on disk.
Copy link
Contributor

@evansd evansd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this makes sense. Small str/Path type confusion which will need fixing though.

jobrunner/record_stats.py Outdated Show resolved Hide resolved
@bloodearnest bloodearnest merged commit 591abfd into main Dec 20, 2023
12 checks passed
@bloodearnest bloodearnest deleted the fix-metrics-db-global branch December 20, 2023 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants