Unclean close leaves Litestream in weird state #510

hifi · 2023-10-21T06:33:40Z

I've been using Litestream as a library and tried a couple of things to abuse it since the usage pattern is a bit different than when using it as a separate process.

What I'm seeing is that the final sync at the beginning of Close() will fail due to this because I had sqlite3 cli just writing while closing (and rolling in the WAL): cannot verify wal state: stat foo/bar.db-wal: no such file or directory. This might be fine to ignore but the caller doesn't know it and calling Close() again will fail because the underlying database connection has already been closed and will always return an error at that point.

This is very much an edge case of an edge case and it shouldn't ever really happen but the way database Close() is structured it's designed to ignore all errors and just continue but it might be better to return early and allow calling it more than once to retry if it's a passing issue or handle the errors more gracefully. Now because it's designed to ignore errors the caller needs to assume the database is somewhat closed and ignore the returned error and at most log it.

The text was updated successfully, but these errors were encountered:

benbjohnson · 2023-10-21T16:13:55Z

Do you know if it's your sqlite3 CLI or Litestream that's removing the WAL? Litestream sets the SQLITE_FCNTL_PERSIST_WAL to avoid removing the WAL on close so maybe you need to set it with your client as well? I'm not sure if that's possible with the CLI though.

https://github.com/benbjohnson/litestream//blob/c1ae96818888693bddcf4913bd8dcb9baa55eb05/litestream.go#L55C1-L64

hifi · 2023-10-21T19:10:23Z

It's definitely the cli that's removing the WAL but Litestream not closing cleanly after that is slightly problematic because it's hard to know from the caller if the database has been closed or not as the function returns an error and can't be called again.

Definitely not something that one would expect in normal circumstances but not being able to retry or "safely" ignore the error from Close() is what I'm looking at here.

hifi · 2023-10-26T06:31:02Z

I'm currently testing a race where I start replicating and call Sync() once and immediately after I run sqlite3 once that writes and it somehow succeeds to roll in the WAL but that shouldn't be possible if there's a read lock by Litestream, right? I've verified the read lock is in fact acquired successfully before the CLI goes in and does damage.

It doesn't always happen so there seems to be a race somewhere that I can't prevent by calling Sync() manually once on startup. My current theory is the read lock doesn't prevent rolling in the WAL file for whatever reason and is basically no-op for that sometimes but not always.

Jacob2161 · 2023-11-03T02:38:07Z

After a bit of testing, it seems that to be safe one should always use .filectrl persist_wal 1 when connecting to a database that Litestream is replicating.

To do this on startup, one can put the meta-command in a sqlite init script:

Create a sqlite init script sqlite3.init

PRAGMA wal_autocheckpoint = 0;
.filectrl persist_wal 1

sqlite3 --init sqlite3.init database.sqlite

In my case, I created a wrapper script called /usr/bin/sqlite3 and moved my sqlite3 binary to /usr/bin/sqlite3.real to hopefully prevent myself from triggering this issue.

I'm pretty satisfied with this solution for now.

Thanks for raising this issue and the response!

If Litestream is used as a library and it can be closed and re-opened against the same database this leaking fd can break SQLite database file locking. During normal cli operation this race cannot happen as the whole process will exit after closing databases. Closing Litestream and opening again right after will leave the unclosed file handle out-of-scope and free to garbage collect. Go tries to be helpful and closes the file handle during GC to prevent it leaking permanently but that will cause a race with POSIX advisory locks. This order of operations will release the _real_ lock the embedded SQLite of Litestream is holding while leaving it clueless. That will allow the main application to take a new lock on the file and eventually both may be checkpointing the same WAL into the main database file causing permanent corruption. https://www.sqlite.org/howtocorrupt.html#_posix_advisory_locks_canceled_by_a_separate_thread_doing_close_ Fixes benbjohnson#510

hifi · 2023-11-12T11:05:03Z

This was actually worse than I thought and eventually lead to database corruption, opened #519 with a fix.

devinivy mentioned this issue Nov 9, 2023

Exposing file control operations WiseLibs/better-sqlite3#1101

Open

hifi mentioned this issue Nov 12, 2023

Always release long-running fd on db close #519

Merged

benbjohnson closed this as completed in #519 Nov 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unclean close leaves Litestream in weird state #510

Unclean close leaves Litestream in weird state #510

hifi commented Oct 21, 2023

benbjohnson commented Oct 21, 2023

hifi commented Oct 21, 2023

hifi commented Oct 26, 2023 •

edited

Loading

Jacob2161 commented Nov 3, 2023 •

edited

Loading

hifi commented Nov 12, 2023

Unclean close leaves Litestream in weird state #510

Unclean close leaves Litestream in weird state #510

Comments

hifi commented Oct 21, 2023

benbjohnson commented Oct 21, 2023

hifi commented Oct 21, 2023

hifi commented Oct 26, 2023 • edited Loading

Jacob2161 commented Nov 3, 2023 • edited Loading

hifi commented Nov 12, 2023

hifi commented Oct 26, 2023 •

edited

Loading

Jacob2161 commented Nov 3, 2023 •

edited

Loading