Spontaneous "cannot verify wal state" errors #395

Closed
ryanpbrewster opened this issue Jul 1, 2022 · 3 comments

ryanpbrewster commented Jul 1, 2022

I'm running litestream in a fly.io environment using live replication (based on https://github.com/benbjohnson/litestream-read-replica-example). It works well on startup, but after a few hours it reliably gets into a broken state where it emits logs like this:

2022-07-01T03:33:51Z app[de16976d] ewr [info]/app/data/mylinks.db: sync error, retrying: cannot verify wal state: stat /app/data/mylinks.db-wal: no such file or directory                                         

I'm invoking my app via the litestream --exec flag:

/app/bin/litestream replicate \
  --config /app/config/$DB_MODE.yaml \
  --exec "/app/bin/golinks_backend --database $DB_FILE --mode $DB_MODE --address [::]:8080"

I can reliably fix the issue by restarting the container. In fact, I am currently working around it by piping litestream's output into head -n 1000 so that it automatically exits (and is then restarted) whenever it begins spamming logs (see the sketch below).
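For reference, the workaround looks roughly like this (a sketch only; the entrypoint script is illustrative and it assumes litestream writes its logs to stdout/stderr):

#!/bin/sh
# entrypoint.sh (illustrative): head closes the pipe after 1000 log lines,
# so litestream dies on its next write (SIGPIPE) and the container restarts.
/app/bin/litestream replicate \
  --config /app/config/$DB_MODE.yaml \
  --exec "/app/bin/golinks_backend --database $DB_FILE --mode $DB_MODE --address [::]:8080" \
  2>&1 | head -n 1000

(The "Main child exited normally with code: 141" line in the logs further down is consistent with this: 141 = 128 + SIGPIPE.)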

A bit of extra info in case it helps:
I am running litestream v0.4.0-beta.2 installed via Docker with

FROM litestream/litestream:0.4.0-beta.2 AS litestream

The output of litestream version is

# /app/bin/litestream version
80f8de4d9ed16b7ae6f607bde85d90916784de7d

The file it's complaining about actually does exist (at least at the points in time when I am manually inspecting it):

# ls /app/data
lost+found  mylinks.db  mylinks.db-litestream  mylinks.db-shm  mylinks.db-wal

/app/config/primary.yaml looks like

addr: ":9090"
dbs:
  - path: "$DB_FILE"

I have a replica running in another region, and that replica's config looks like

dbs:
  - path: "$DB_FILE"
    upstream:
      url: http://${FLY_PRIMARY_REGION}.${FLY_APP_NAME}.internal:9090

This seems to happen mostly when the database is idle, but I can't be too confident. I've had sessions that lasted multiple hours before hitting this error, and some that hit it within 10 minutes.

@ryanpbrewster (Author)

Found some old logs. I may be mistaken; this might be an issue that occurs on startup, since I'm having a hard time telling when fly is restarting the process. Here's an example of a 10-minute run:

2022-07-01T03:34:06.836 app[de16976d] ewr [info] Starting init (commit: 5556859)...
2022-07-01T03:34:06.850 app[de16976d] ewr [info] Mounting /dev/vdc at /app/data w/ uid: 0, gid: 0 and chmod 0755
2022-07-01T03:34:06.927 app[de16976d] ewr [info] Preparing to run: `/app/bin/entrypoint.sh` as root
2022-07-01T03:34:06.945 app[de16976d] ewr [info] 2022/07/01 03:34:06 listening on [fdaa:0:3159:a7b:ab2:1:2ec6:2]:22 (DNS: [fdaa::3]:53)
2022-07-01T03:34:06.978 app[de16976d] ewr [info] Starting instance as primary 
2022-07-01T03:34:06.978 app[de16976d] ewr [info] Starting Litestream & application
2022-07-01T03:34:09.240 app[de16976d] ewr [info] http: stream generation mismatch, using primary position: client.pos=97ad2d23ba7d9ef0/000000000000001e:0000000000000000
2022-07-01T03:34:09.240 app[de16976d] ewr [info] http: stream connected: pos=8f0f4cf5a30a0870/0000000000000000:0000000000000000 snapshot=true
2022-07-01T03:44:37.408 app[de16976d] ewr [info] litestream 80f8de4d9ed16b7ae6f607bde85d90916784de7d
2022-07-01T03:44:37.408 app[de16976d] ewr [info] initialized db: /app/data/mylinks.db
2022-07-01T03:44:37.408 app[de16976d] ewr [info] http server running at http://localhost:9090
2022-07-01T03:44:37.408 app[de16976d] ewr [info] litestream initialization complete
2022-07-01T03:44:37.408 app[de16976d] ewr [info] 2022-07-01T03:34:07.176382Z INFO golinks_backend: Listening on [::]:8080
2022-07-01T03:44:37.408 app[de16976d] ewr [info] /app/data/mylinks.db: init: cannot determine last wal position, clearing generation; primary wal header: EOF
2022-07-01T03:44:37.408 app[de16976d] ewr [info] /app/data/mylinks.db: sync: new generation "8f0f4cf5a30a0870", no generation exists
2022-07-01T03:44:37.408 app[de16976d] ewr [info] /app/data/mylinks.db: sync error, retrying: cannot verify wal state: stat /app/data/mylinks.db-wal: no such file or directory 

I cannot account for the 10-minute gap between Starting instance as primary and the first log line from my application, golinks_backend: Listening on [::]:8080. Is it possible that litestream initialization is taking 10+ minutes? This database is extremely small:

# du -sh /app/data/*
16K     /app/data/lost+found
36K     /app/data/mylinks.db
2.3M    /app/data/mylinks.db-litestream
32K     /app/data/mylinks.db-shm
8.0K    /app/data/mylinks.db-wal

@ryanpbrewster (Author)

Another data point:

  • I cannot really reproduce this locally. When I run two litestream instances on my dev machine (roughly as sketched below), everything starts up instantly and seems completely immune to the issues that plague my fly.io setup.
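For what it's worth, the local setup looks roughly like this (a sketch only; paths, port, and the mode flag are illustrative, adapted from the configs above):

mkdir -p /tmp/ls-primary /tmp/ls-replica

cat > /tmp/ls-primary/primary.yaml <<EOF
addr: ":9090"
dbs:
  - path: "/tmp/ls-primary/mylinks.db"
EOF

cat > /tmp/ls-replica/replica.yaml <<EOF
dbs:
  - path: "/tmp/ls-replica/mylinks.db"
    upstream:
      url: http://localhost:9090
EOF

# primary: run the app under litestream; replica: follow the primary's stream
litestream replicate --config /tmp/ls-primary/primary.yaml \
  --exec "/app/bin/golinks_backend --database /tmp/ls-primary/mylinks.db --mode primary --address [::]:8080" &
litestream replicate --config /tmp/ls-replica/replica.yaml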

I see logs like this each time the fly.io VM restarts:

2022-07-01T04:27:56Z app[de16976d] ewr [info]/app/data/mylinks.db: sync error, retrying: cannot verify wal state: stat /app/data/mylinks.db-wal: no such file or directory
2022-07-01T04:27:57Z app[de16976d] ewr [info]Main child exited normally with code: 141
2022-07-01T04:27:57Z app[de16976d] ewr [info]Starting clean up.
2022-07-01T04:27:57Z app[de16976d] ewr [info]Umounting /dev/vdc from /app/data
2022-07-01T04:27:57Z app[de16976d] ewr [info]error umounting /app/data: EBUSY: Device or resource busy, retrying in a bit

I'm wondering whether fly is detaching the volume forcibly, causing some kind of disk issue that interferes with the next startup.

I'm gonna try to attach the fly.io disk to a debugging VM and run litestream manually to see if that changes the behavior.

@benbjohnson (Owner)

Thanks for the testing, @ryanpbrewster. I'm closing this issue per the live replication announcement here: #8 (comment)
