Bug: secondary redbean SQLite WAL connections lock after first write #1228

goldenstein64 · 2024-06-17T20:47:53Z

Contact Details

What happened?

If two processes are connected to the same database with WAL enabled, a write operation on one connection will block all other connections until the writer disconnects.

Here is a reproduction of the issue along with its output, executed with redbean -F FILE -i

unix.rmrf('some.db')

do
  local ddl_db = assert(lsqlite3.open('some.db'))
  ddl_db:exec('PRAGMA journal_mode=WAL')
  ddl_db:close()
end

local pid = assert(unix.fork()) -- run this in 2 processes
local pname = pid == 0 and 'child:' or 'parent:'
local db = assert(lsqlite3.open('some.db'))

-- make the parent sleep for one second
if pid ~= 0 then
  unix.nanosleep(1)
end

print(pname, db:exec('PRAGMA optimize')) -- execute a write operation (I think, still blocks)
print(pname, db:error_message())

-- if executed successfully, sleep for 15 seconds
if db:error_code() == lsqlite3.OK then
  unix.nanosleep(15)
end
db:close()
unix.exit()

child:  0
child:  not an error
parent: 15
parent: locking protocol

The parent: messages appear after ~13 seconds. Making the write explicitly finish with BEGIN TRANSACTION / END TRANSACTION or COMMIT doesn't alter the program's behavior. A more obvious repro would probably be creating a table in the ddl_db block and having both processes insert values into the table.

The expected output for this program would likely be:

child:  0
child:  not an error
parent: 0
parent: not an error

Version

redbean 2.2

What operating system are you seeing the problem on?

Windows

Relevant log output

> ./redbean.com -F test.lua -i --strace
SYS  29048            714'348 bell system five system call support 430 magnums loaded on the new technology
SYS  29048            808'858 __morph_begin()
SYS  29048          1'433'336 __morph_end()

> ./redbean.com -F test.lua -i --ftrace
error: --ftrace failed to open symbol table
child:  0
child:  not an error
parent: 15
parent: locking protocol

The text was updated successfully, but these errors were encountered:

pkulchenko · 2024-06-17T23:11:26Z

Thank you for the test case to reproduce the issue. I ran into the same issue some time ago and even posted my example to the SQLite forum, but given that cosmo implementation on Windows is different from the "original" sqlite one (as it's using the same code that runs on unix), there wasn't much response.

I can provide an additional case that may help with troubleshooting. Try commenting out the removal of the file on line one and then the "close" call will make all the difference between successful and failing cases:

--unix.rmrf('some.db')

do
  local ddl_db = assert(lsqlite3.open('some.db'))
  ddl_db:exec('PRAGMA journal_mode=WAL')
--  ddl_db:close() -- succeeds without close() and fails with close() call
end

local pid = assert(unix.fork()) -- run this in 2 processes
local pname = pid == 0 and 'child:' or 'parent:'
local db = assert(lsqlite3.open('some.db'))

-- make the parent sleep for one second
if pid ~= 0 then
  unix.nanosleep(1)
end

print(pname, db:exec('PRAGMA optimize')) -- execute a write operation (I think, still blocks)
print(pname, db:error_message())

-- if executed successfully, sleep for 15 seconds
if db:error_code() == lsqlite3.OK then
  unix.nanosleep(15)
end
db:close()

print(pname, "done")
if pid ~= 0 then unix.wait(pid) end -- added to reap zombies

unix.exit()

Somehow calling close() affects the structures that the process has before forking, which "breaks" the locking mechanism. I tried to troubleshoot it some time ago and even compare with the same code that succeeds on Linux, but couldn't find the culprit.

I'm open to ideas/suggestions, as it affects one of my projects.

ncruces · 2024-06-24T11:57:37Z

You're getting an SQLITE_PROTOCOL error, which points to SQLite being blocked on a WAL lock that SQLite never expects to be held for very long (so the busy handler is not involved).

See this which is called in a loop here. I'd look for places where WAL_RETRY is returned.

If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.

goldenstein64 · 2024-08-23T22:00:28Z

I think the bug got fixed in 3.0.0? The repro now prints the expected output.

Although I should note that the some.db file had to be regenerated, which may be an issue for people who are adapting a redbean 2.2 project (which wouldn't be working great anyway since this bug would manifest in them already).

goldenstein64 · 2024-08-24T05:44:34Z

Nevermind, I just realized that ddl_db:close() was commented out in my version of the repro file... oops

pkulchenko · 2024-09-06T04:40:47Z

See this which is called in a loop here. I'd look for places where WAL_RETRY is returned.
If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.

@ncruces, thank you for the details. Do you have any further suggestions on what to check or how to troubleshoot this? It does work for me without closing the DB, so there is something that happens as part of this process that affects the subsequent processing, but I so far has not been able to find what that may be.

ncruces · 2024-09-06T09:11:09Z

No. I will note that what you're trying to do here is fraught with issues.

I hope you've read sqlite.org/howtocorrupt.html, especially the entire section 2; also this.

goldenstein64 added the medium severity Used to report medium severity bugs (e.g. Malfunctioning Features but still useable) label Jun 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: secondary redbean SQLite WAL connections lock after first write #1228

Bug: secondary redbean SQLite WAL connections lock after first write #1228

goldenstein64 commented Jun 17, 2024 •

edited

Loading

pkulchenko commented Jun 17, 2024

ncruces commented Jun 24, 2024

goldenstein64 commented Aug 23, 2024 •

edited

Loading

goldenstein64 commented Aug 24, 2024 •

edited

Loading

pkulchenko commented Sep 6, 2024

ncruces commented Sep 6, 2024

Bug: secondary redbean SQLite WAL connections lock after first write #1228

Bug: secondary redbean SQLite WAL connections lock after first write #1228

Comments

goldenstein64 commented Jun 17, 2024 • edited Loading

Contact Details

What happened?

Version

What operating system are you seeing the problem on?

Relevant log output

pkulchenko commented Jun 17, 2024

ncruces commented Jun 24, 2024

goldenstein64 commented Aug 23, 2024 • edited Loading

goldenstein64 commented Aug 24, 2024 • edited Loading

pkulchenko commented Sep 6, 2024

ncruces commented Sep 6, 2024

goldenstein64 commented Jun 17, 2024 •

edited

Loading

goldenstein64 commented Aug 23, 2024 •

edited

Loading

goldenstein64 commented Aug 24, 2024 •

edited

Loading