Spontaneous "cannot verify wal state" errors #395

Closed
ryanpbrewster opened this issue Jul 1, 2022 · 3 comments

ryanpbrewster commented Jul 1, 2022

I'm running litestream in a fly.io environment using live replication (based on https://github.com/benbjohnson/litestream-read-replica-example). It works well on startup, but after a few hours it reliably gets into a broken state where it emits logs like this:

2022-07-01T03:33:51Z app[de16976d] ewr [info]/app/data/mylinks.db: sync error, retrying: cannot verify wal state: stat /app/data/mylinks.db-wal: no such file or directory                                         

I'm invoking my app via the litestream --exec flag:

/app/bin/litestream replicate \
  --config /app/config/$DB_MODE.yaml \
  --exec "/app/bin/golinks_backend --database $DB_FILE --mode $DB_MODE --address [::]:8080"

I can reliably fix the issue by restarting the container. In fact, I am currently working around it by piping litestream's output into head -n 1000 so that it automatically exits (and is then restarted) whenever it begins spamming logs (see the sketch below).
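For reference, the workaround looks roughly like this (a sketch only; the entrypoint script is illustrative and it assumes litestream writes its logs to stdout/stderr):

#!/bin/sh
# entrypoint.sh (illustrative): head closes the pipe after 1000 log lines,
# so litestream dies on its next write (SIGPIPE) and the container restarts.
/app/bin/litestream replicate \
  --config /app/config/$DB_MODE.yaml \
  --exec "/app/bin/golinks_backend --database $DB_FILE --mode $DB_MODE --address [::]:8080" \
  2>&1 | head -n 1000

(The "Main child exited normally with code: 141" line in the logs further down is consistent with this: 141 = 128 + SIGPIPE.)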

A bit of extra info in case it helps:
I am running litestream v0.4.0-beta.2 installed via Docker with

FROM litestream/litestream:0.4.0-beta.2 AS litestream

The output of litestream version is

# /app/bin/litestream version
80f8de4d9ed16b7ae6f607bde85d90916784de7d

The file it's complaining about actually does exist (at least at the points in time when I am manually inspecting it):

# ls /app/data
lost+found  mylinks.db  mylinks.db-litestream  mylinks.db-shm  mylinks.db-wal

/app/config/primary.yaml looks like

addr: ":9090"
dbs:
  - path: "$DB_FILE"

I have a replica running in another region, and that replica's config looks like

dbs:
  - path: "$DB_FILE"
    upstream:
      url: http://${FLY_PRIMARY_REGION}.${FLY_APP_NAME}.internal:9090

This seems to happen mostly when the database is idle, but I can't be too confident. I've had sessions that lasted multiple hours before hitting this error, and some that hit it within 10 minutes.

@ryanpbrewster (Author)

Found some old logs. I may be mistaken; this might be an issue that occurs on startup, since I'm having a hard time telling when fly is restarting the process. Here's an example of a 10-minute run:

2022-07-01T03:34:06.836 app[de16976d] ewr [info] Starting init (commit: 5556859)...
2022-07-01T03:34:06.850 app[de16976d] ewr [info] Mounting /dev/vdc at /app/data w/ uid: 0, gid: 0 and chmod 0755
2022-07-01T03:34:06.927 app[de16976d] ewr [info] Preparing to run: `/app/bin/entrypoint.sh` as root
2022-07-01T03:34:06.945 app[de16976d] ewr [info] 2022/07/01 03:34:06 listening on [fdaa:0:3159:a7b:ab2:1:2ec6:2]:22 (DNS: [fdaa::3]:53)
2022-07-01T03:34:06.978 app[de16976d] ewr [info] Starting instance as primary 
2022-07-01T03:34:06.978 app[de16976d] ewr [info] Starting Litestream & application
2022-07-01T03:34:09.240 app[de16976d] ewr [info] http: stream generation mismatch, using primary position: client.pos=97ad2d23ba7d9ef0/000000000000001e:0000000000000000
2022-07-01T03:34:09.240 app[de16976d] ewr [info] http: stream connected: pos=8f0f4cf5a30a0870/0000000000000000:0000000000000000 snapshot=true
2022-07-01T03:44:37.408 app[de16976d] ewr [info] litestream 80f8de4d9ed16b7ae6f607bde85d90916784de7d
2022-07-01T03:44:37.408 app[de16976d] ewr [info] initialized db: /app/data/mylinks.db
2022-07-01T03:44:37.408 app[de16976d] ewr [info] http server running at http://localhost:9090
2022-07-01T03:44:37.408 app[de16976d] ewr [info] litestream initialization complete
2022-07-01T03:44:37.408 app[de16976d] ewr [info] 2022-07-01T03:34:07.176382Z INFO golinks_backend: Listening on [::]:8080
2022-07-01T03:44:37.408 app[de16976d] ewr [info] /app/data/mylinks.db: init: cannot determine last wal position, clearing generation; primary wal header: EOF
2022-07-01T03:44:37.408 app[de16976d] ewr [info] /app/data/mylinks.db: sync: new generation "8f0f4cf5a30a0870", no generation exists
2022-07-01T03:44:37.408 app[de16976d] ewr [info] /app/data/mylinks.db: sync error, retrying: cannot verify wal state: stat /app/data/mylinks.db-wal: no such file or directory 

I cannot account for the 10-minute gap between Starting instance as primary and the first log line from my application, golinks_backend: Listening on [::]:8080. Is it possible that litestream initialization is taking 10+ minutes? This database is extremely small:

# du -sh /app/data/*
16K     /app/data/lost+found
36K     /app/data/mylinks.db
2.3M    /app/data/mylinks.db-litestream
32K     /app/data/mylinks.db-shm
8.0K    /app/data/mylinks.db-wal

@ryanpbrewster (Author)

Another data point:

  • I cannot really reproduce this locally. When I run two litestream instances on my dev machine (roughly as sketched below), everything starts up instantly and seems completely immune to the issues that plague my fly.io setup.
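For what it's worth, the local setup looks roughly like this (a sketch only; paths, port, and the mode flag are illustrative, adapted from the configs above):

mkdir -p /tmp/ls-primary /tmp/ls-replica

cat > /tmp/ls-primary/primary.yaml <<EOF
addr: ":9090"
dbs:
  - path: "/tmp/ls-primary/mylinks.db"
EOF

cat > /tmp/ls-replica/replica.yaml <<EOF
dbs:
  - path: "/tmp/ls-replica/mylinks.db"
    upstream:
      url: http://localhost:9090
EOF

# primary: run the app under litestream; replica: follow the primary's stream
litestream replicate --config /tmp/ls-primary/primary.yaml \
  --exec "/app/bin/golinks_backend --database /tmp/ls-primary/mylinks.db --mode primary --address [::]:8080" &
litestream replicate --config /tmp/ls-replica/replica.yaml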

I see logs like this each time the fly.io VM restarts:

2022-07-01T04:27:56Z app[de16976d] ewr [info]/app/data/mylinks.db: sync error, retrying: cannot verify wal state: stat /app/data/mylinks.db-wal: no such file or directory
2022-07-01T04:27:57Z app[de16976d] ewr [info]Main child exited normally with code: 141
2022-07-01T04:27:57Z app[de16976d] ewr [info]Starting clean up.
2022-07-01T04:27:57Z app[de16976d] ewr [info]Umounting /dev/vdc from /app/data
2022-07-01T04:27:57Z app[de16976d] ewr [info]error umounting /app/data: EBUSY: Device or resource busy, retrying in a bit

I'm wondering whether fly is detaching the volume forcibly, causing some kind of disk issue that interferes with the next startup.

I'm gonna try to attach the fly.io disk to a debugging VM and run litestream manually to see if that changes the behavior.

@benbjohnson (Owner)

Thanks for the testing, @ryanpbrewster. I'm closing this issue per the live replication announcement here: #8 (comment)
