Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't cache sqlite connections for macros #1930

Merged

Conversation

CosmicHorrorDev
Copy link
Contributor

Fixes #1929

This works around the issue mentioned in #1929 by not including SQLite for the cached connection. I've confirmed using the repro provided in the issue that this no longer runs into the problem of the database reporting that it's busy for several seconds after being reset when journaling is turned off.

It is worth noting that I still don't really know what the root cause of the issue is...


Random other things I've noticed regarding the issue:

  • I'm not able to repro nearly as much using check instead of clippy
    • Reproed 10/10 times with clippy, but 1/10 with check
  • Even with the fix I still noticed lingering -shm and -wal files very rarely after running the repro script. I'm not really sure how to inspect them, so I don't know why they occasionally stick around

@abonander
Copy link
Collaborator

The LOCKED error is likely some statement not getting reset properly.

@CosmicHorrorDev
Copy link
Contributor Author

Any pointers on where to look and how to debug the locking?

I'm not familiar with the whole migrations section of the codebase

@abonander
Copy link
Collaborator

I've spent the last couple hours day digging into it myself.

Something that's annoyingly misleading: database is locked is the error message returned by SQLITE_BUSY, not SQLITE_LOCKED (which is database table is locked, smdh). To disambiguate, I added the actual integer code to the error printout for SqliteError.

SQLITE_BUSY means another process has acquired an exclusive lock on the database because it's preparing to write to it. That doesn't really make a lot of sense though, as we shouldn't be doing any writing since we just do an EXPLAIN and maybe a couple SELECTs.

What I thought was happening was that because we're doing EXPLAIN INSERT INTO ..., the normal autocommit logic still holds and begins an exclusive write transaction which acquires either a RESERVED or EXCLUSIVE lock on the database, preventing any other readers or writers.

What's really weird, though, is that we set a busy handler with sqlite3_busy_timeout() which should sleep as long as the database is locked, but no matter how much I adjust the timeout it still sporadically errors out with database is locked. However, a closer reading of the docs reveals that the user-specified busy handler is only called when a table is locked by another connection, and looking at the usage in SQLite's source code confirms that.

After some more experimenting and simplifying the repro, I found that the error appears to occur in setting the initial PRAGMAs on the connection.

Digging further, I think it's specifically due to the PRAGMA journal_mode = off; and the fact that the database was created in the WAL journaling mode (by sqlx database reset). Switching the journaling mode requires taking an EXCLUSIVE (write) lock on the database because it first needs to apply any pending changes in the WAL to maintain consistency: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

However, this is happening concurrently to other connections either having an EXCLUSIVE lock and applying the new journaling mode or having SHARED locks against the database for executing the EXPLAINs and so the statement dies with SQLITE_BUSY.

This also explains why it would fail the first time but succeed the second, because one of the concurrent build tasks already changed the journaling mode and so no exclusive lock is necessary.

So I added code in https://github.com/launchbadge/sqlx/blob/main/sqlx-core/src/sqlite/options/connect.rs#L13 to execute each PRAGMA separately, retrying it in a backoff loop up until a hardcoded timeout if it receives SQLITE_BUSY. It had a noticeable effect but it's still returning the error after the timeout, so I think there's more investigation needed here.

However, upon learning more about journaling modes, I think there's a significant mitigation we can apply here: we should not be sending PRAGMA journal_mode unless the user specifically requested it, and it should definitely not default to WAL for databases created by sqlx-cli.

I think the WAL-by-default is a holdover from when we were experimenting with trying to make SQLite nonblocking without needing a worker thread. WAL mode is very cool and potentially very fast, but it comes with tradeoffs that the user should probably consider, and doesn't necessarily apply to every use case. It really should be the user's choice whether to enable it or not.

Fortunately, it shouldn't be a breaking change to drop it as the default as any database created in WAL mode will retain the setting unless explicitly changed with PRAGMA journal_mode = ....

@CosmicHorrorDev
Copy link
Contributor Author

Thanks a lot for covering all of that! It clears up a lot of questions/confusion I had. It certainly covers a lot of information that I wasn't aware of, and is a much deeper dive into the root cause analysis than I would have been able to manage 😅

I'll leave this PR open as a possible work-around, but feel free to close it once the change for journal_mode, and potentially anything else that may be required, is confirmed to fix the issue (or you can close it if you just feel like closing it of course)

@abonander
Copy link
Collaborator

This is still probably a decent fix for the temporary files being left around even when in WAL mode, however. I would add conn.close().await?; after line 209 to ensure the files are cleaned up.

@CosmicHorrorDev
Copy link
Contributor Author

I added the explicit close, but there are still occasionally temporary files after running the repro script (in 4 runs of 20). It's worth noting that this same behavior sometimes happens back on sqlx v0.5.13, so it's not new behavior at the very least

@abonander
Copy link
Collaborator

If it exits early with an error then the database might not get closed properly.

I might do a refactor that lets the SQLite codepath execute directly in blocking mode so the database always gets cleaned up properly since it would be owned by the main thread instead of a daemon thread.

@abonander abonander merged commit d9fd21c into launchbadge:main Jul 9, 2022
@abonander abonander mentioned this pull request Jul 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0.6.0: query macro fails on sqlite with "error returned from database: database is locked"
2 participants