Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: secondary redbean SQLite WAL connections lock after first write #1228

Open
goldenstein64 opened this issue Jun 17, 2024 · 6 comments
Open
Labels
medium severity Used to report medium severity bugs (e.g. Malfunctioning Features but still useable)

Comments

@goldenstein64
Copy link

goldenstein64 commented Jun 17, 2024

Contact Details

goldnstein64@gmail.com

What happened?

If two processes are connected to the same database with WAL enabled, a write operation on one connection will block all other connections until the writer disconnects.

Here is a reproduction of the issue along with its output, executed with redbean -F FILE -i

unix.rmrf('some.db')

do
  local ddl_db = assert(lsqlite3.open('some.db'))
  ddl_db:exec('PRAGMA journal_mode=WAL')
  ddl_db:close()
end

local pid = assert(unix.fork()) -- run this in 2 processes
local pname = pid == 0 and 'child:' or 'parent:'
local db = assert(lsqlite3.open('some.db'))

-- make the parent sleep for one second
if pid ~= 0 then
  unix.nanosleep(1)
end

print(pname, db:exec('PRAGMA optimize')) -- execute a write operation (I think, still blocks)
print(pname, db:error_message())

-- if executed successfully, sleep for 15 seconds
if db:error_code() == lsqlite3.OK then
  unix.nanosleep(15)
end
db:close()
unix.exit()
child:  0
child:  not an error
parent: 15
parent: locking protocol

The parent: messages appear after ~13 seconds. Making the write explicitly finish with BEGIN TRANSACTION / END TRANSACTION or COMMIT doesn't alter the program's behavior. A more obvious repro would probably be creating a table in the ddl_db block and having both processes insert values into the table.

The expected output for this program would likely be:

child:  0
child:  not an error
parent: 0
parent: not an error

Version

redbean 2.2

What operating system are you seeing the problem on?

Windows

Relevant log output

> ./redbean.com -F test.lua -i --strace
SYS  29048            714'348 bell system five system call support 430 magnums loaded on the new technology
SYS  29048            808'858 __morph_begin()
SYS  29048          1'433'336 __morph_end()

> ./redbean.com -F test.lua -i --ftrace
error: --ftrace failed to open symbol table
child:  0
child:  not an error
parent: 15
parent: locking protocol
@goldenstein64 goldenstein64 added the medium severity Used to report medium severity bugs (e.g. Malfunctioning Features but still useable) label Jun 17, 2024
@pkulchenko
Copy link
Collaborator

Thank you for the test case to reproduce the issue. I ran into the same issue some time ago and even posted my example to the SQLite forum, but given that cosmo implementation on Windows is different from the "original" sqlite one (as it's using the same code that runs on unix), there wasn't much response.

I can provide an additional case that may help with troubleshooting. Try commenting out the removal of the file on line one and then the "close" call will make all the difference between successful and failing cases:

--unix.rmrf('some.db')

do
  local ddl_db = assert(lsqlite3.open('some.db'))
  ddl_db:exec('PRAGMA journal_mode=WAL')
--  ddl_db:close() -- succeeds without close() and fails with close() call
end

local pid = assert(unix.fork()) -- run this in 2 processes
local pname = pid == 0 and 'child:' or 'parent:'
local db = assert(lsqlite3.open('some.db'))

-- make the parent sleep for one second
if pid ~= 0 then
  unix.nanosleep(1)
end

print(pname, db:exec('PRAGMA optimize')) -- execute a write operation (I think, still blocks)
print(pname, db:error_message())

-- if executed successfully, sleep for 15 seconds
if db:error_code() == lsqlite3.OK then
  unix.nanosleep(15)
end
db:close()

print(pname, "done")
if pid ~= 0 then unix.wait(pid) end -- added to reap zombies

unix.exit()

Somehow calling close() affects the structures that the process has before forking, which "breaks" the locking mechanism. I tried to troubleshoot it some time ago and even compare with the same code that succeeds on Linux, but couldn't find the culprit.

I'm open to ideas/suggestions, as it affects one of my projects.

@ncruces
Copy link

ncruces commented Jun 24, 2024

You're getting an SQLITE_PROTOCOL error, which points to SQLite being blocked on a WAL lock that SQLite never expects to be held for very long (so the busy handler is not involved).

See this which is called in a loop here. I'd look for places where WAL_RETRY is returned.

If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.

@goldenstein64
Copy link
Author

goldenstein64 commented Aug 23, 2024

I think the bug got fixed in 3.0.0? The repro now prints the expected output.

Although I should note that the some.db file had to be regenerated, which may be an issue for people who are adapting a redbean 2.2 project (which wouldn't be working great anyway since this bug would manifest in them already).

@goldenstein64
Copy link
Author

goldenstein64 commented Aug 24, 2024

Nevermind, I just realized that ddl_db:close() was commented out in my version of the repro file... oops

@pkulchenko
Copy link
Collaborator

See this which is called in a loop here. I'd look for places where WAL_RETRY is returned.
If this is happening when opening a new connection, I suspect the DMS lock might be involved. The Windows and Unix implementations are very different. Trying to do Windows by translating Unix syscalls is likely to introduce races here.

@ncruces, thank you for the details. Do you have any further suggestions on what to check or how to troubleshoot this? It does work for me without closing the DB, so there is something that happens as part of this process that affects the subsequent processing, but I so far has not been able to find what that may be.

@ncruces
Copy link

ncruces commented Sep 6, 2024

No. I will note that what you're trying to do here is fraught with issues.

I hope you've read sqlite.org/howtocorrupt.html, especially the entire section 2; also this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
medium severity Used to report medium severity bugs (e.g. Malfunctioning Features but still useable)
Projects
None yet
Development

No branches or pull requests

3 participants