Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclean close leaves Litestream in weird state #510

Closed
hifi opened this issue Oct 21, 2023 · 5 comments · Fixed by #519
Closed

Unclean close leaves Litestream in weird state #510

hifi opened this issue Oct 21, 2023 · 5 comments · Fixed by #519

Comments

@hifi
Copy link
Collaborator

hifi commented Oct 21, 2023

I've been using Litestream as a library and tried a couple of things to abuse it since the usage pattern is a bit different than when using it as a separate process.

What I'm seeing is that the final sync at the beginning of Close() will fail due to this because I had sqlite3 cli just writing while closing (and rolling in the WAL): cannot verify wal state: stat foo/bar.db-wal: no such file or directory. This might be fine to ignore but the caller doesn't know it and calling Close() again will fail because the underlying database connection has already been closed and will always return an error at that point.

This is very much an edge case of an edge case and it shouldn't ever really happen but the way database Close() is structured it's designed to ignore all errors and just continue but it might be better to return early and allow calling it more than once to retry if it's a passing issue or handle the errors more gracefully. Now because it's designed to ignore errors the caller needs to assume the database is somewhat closed and ignore the returned error and at most log it.

@benbjohnson
Copy link
Owner

Do you know if it's your sqlite3 CLI or Litestream that's removing the WAL? Litestream sets the SQLITE_FCNTL_PERSIST_WAL to avoid removing the WAL on close so maybe you need to set it with your client as well? I'm not sure if that's possible with the CLI though.

https://github.com/benbjohnson/litestream//blob/c1ae96818888693bddcf4913bd8dcb9baa55eb05/litestream.go#L55C1-L64

@hifi
Copy link
Collaborator Author

hifi commented Oct 21, 2023

It's definitely the cli that's removing the WAL but Litestream not closing cleanly after that is slightly problematic because it's hard to know from the caller if the database has been closed or not as the function returns an error and can't be called again.

Definitely not something that one would expect in normal circumstances but not being able to retry or "safely" ignore the error from Close() is what I'm looking at here.

@hifi
Copy link
Collaborator Author

hifi commented Oct 26, 2023

I'm currently testing a race where I start replicating and call Sync() once and immediately after I run sqlite3 once that writes and it somehow succeeds to roll in the WAL but that shouldn't be possible if there's a read lock by Litestream, right? I've verified the read lock is in fact acquired successfully before the CLI goes in and does damage.

It doesn't always happen so there seems to be a race somewhere that I can't prevent by calling Sync() manually once on startup. My current theory is the read lock doesn't prevent rolling in the WAL file for whatever reason and is basically no-op for that sometimes but not always.

@Jacob2161
Copy link

Jacob2161 commented Nov 3, 2023

After a bit of testing, it seems that to be safe one should always use .filectrl persist_wal 1 when connecting to a database that Litestream is replicating.

To do this on startup, one can put the meta-command in a sqlite init script:

Create a sqlite init script sqlite3.init

PRAGMA wal_autocheckpoint = 0;
.filectrl persist_wal 1
sqlite3 --init sqlite3.init database.sqlite

In my case, I created a wrapper script called /usr/bin/sqlite3 and moved my sqlite3 binary to /usr/bin/sqlite3.real to hopefully prevent myself from triggering this issue.

I'm pretty satisfied with this solution for now.

Thanks for raising this issue and the response!

hifi added a commit to beeper/litestream that referenced this issue Nov 12, 2023
If Litestream is used as a library and it can be closed and
re-opened against the same database this leaking fd can break
SQLite database file locking. During normal cli operation this
race cannot happen as the whole process will exit after closing
databases.

Closing Litestream and opening again right after will leave the
unclosed file handle out-of-scope and free to garbage collect. Go
tries to be helpful and closes the file handle during GC to prevent
it leaking permanently but that will cause a race with POSIX
advisory locks.

This order of operations will release the _real_ lock the embedded
SQLite of Litestream is holding while leaving it clueless. That
will allow the main application to take a new lock on the file and
eventually both may be checkpointing the same WAL into the main
database file causing permanent corruption.

https://www.sqlite.org/howtocorrupt.html#_posix_advisory_locks_canceled_by_a_separate_thread_doing_close_

Fixes benbjohnson#510
@hifi
Copy link
Collaborator Author

hifi commented Nov 12, 2023

This was actually worse than I thought and eventually lead to database corruption, opened #519 with a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants