Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"threads.c" error when running binaries or docker image for amd64 devel version #3569

Closed
laurenceisla opened this issue Jun 3, 2024 · 18 comments
Labels

Comments

@laurenceisla
Copy link
Member

laurenceisla commented Jun 3, 2024

Environment

  • PostgreSQL version: PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 12.3.0, 64-bit
  • PostgREST version: v12.1 (devel)
  • PostgREST docker version: devel
  • Operating system: NixOS 23.11

Description of issue

Executing the binary or docker release for the devel branch , returns an error when trying to connect to the database.

Local binary:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
./postgrest
03/Jun/2024:13:19:01 -0500: Starting PostgREST 12.1...
03/Jun/2024:13:19:01 -0500: Listening on port 3000
03/Jun/2024:13:19:01 -0500: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
fish: Job 1, 'PGRST_DB_URI=postgres://postgre…' terminated by signal PGRST_DB_ANON_ROLE=postgres \ (PGRST_DB_LOG_LEVEL=info \)
fish: Job ./postgrest, 'SIGABRT' terminated by signal Abort ()

Docker:

sudo docker run --net=host --rm --name test-db-conn -p 3000:3000 \
            -e PGRST_DB_URI="postgres://postgres@localhost:5432/postgres" \
            -e PGRST_DB_ANON_ROLE="postgres" \
            -e PGRST_LOG_LEVEL="info" \
            postgrest/postgrest:devel
03/Jun/2024:18:21:03 +0000: Starting PostgREST 12.1...
03/Jun/2024:18:21:03 +0000: Listening on port 3000
03/Jun/2024:18:21:03 +0000: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)

Running the main branch from a Nix environment does not return this error:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
postgrest-run
03/Jun/2024:13:18:45 -0500: Starting PostgREST 12.1 (b6c6f2b)...
03/Jun/2024:13:18:45 -0500: Attempting to connect to the database...
03/Jun/2024:13:18:45 -0500: Listening on port 3000
03/Jun/2024:13:18:45 -0500: Successfully connected to PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 12.3.0, 64-bit
...
@steve-chavez
Copy link
Member

steve-chavez commented Jun 3, 2024

Local binary:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
postgrest-run

There seems to be a mistake above, postgrest-run doesn't work but I can reproduce with the binary. Like so:

PGRST_DB_URI=postgres://postgres@localhost:5432/postgres \
PGRST_DB_ANON_ROLE=postgres \
PGRST_DB_LOG_LEVEL=info \
./postgrest

03/Jun/2024:13:40:53 -0500: Starting PostgREST 12.1...
03/Jun/2024:13:40:53 -0500: Listening on port 3000
03/Jun/2024:13:40:53 -0500: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
Aborted (core dumped)

Edit: Corrected the issue description.

@wolfgangwalther
Copy link
Member

wolfgangwalther commented Jun 3, 2024

So, essentially you are saying that the static build fails to run, right?

Can't reproduce, yet:

PGRST_DB_LOG_LEVEL=info postgrest-with-postgresql-15 /nix/store/7c9y713ayzvj1812bhiac38f0wkvvcfv-postgrest-static-x86_64-unknown-linux-musl-12.1/bin/postgrest
postgrest-with-postgresql-15: You can connect with: psql 'postgres:///postgres?host=/run/user/1000/postgrest/postgrest-with-postgresql-15-2tR/socket' -U postgres
postgrest-with-postgresql-15: You can tail the logs with: tail -f /run/user/1000/postgrest/postgrest-with-postgresql-15-2tR/db.log
03/Jun/2024:20:39:36 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:39:36 +0200: Attempting to connect to the database...
03/Jun/2024:20:39:36 +0200: Listening on port 3000
03/Jun/2024:20:39:36 +0200: Listening for notifications on the "pgrst" channel
03/Jun/2024:20:39:36 +0200: Successfully connected to PostgreSQL 15.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.0, 64-bit
03/Jun/2024:20:39:36 +0200: Config reloaded
03/Jun/2024:20:39:36 +0200: Schema cache queried in 1.9 milliseconds
03/Jun/2024:20:39:36 +0200: Schema cache loaded 0 Relations, 0 Relationships, 0 Functions, 0 Domain Representations, 4 Media Type Handlers

Can't reproduce by downloading the binary either.

@wolfgangwalther
Copy link
Member

Can reproduce with the downloaded binary and a local PG (non nix tooled)

PGRST_DB_LOG_LEVEL=info PGRST_DB_URI=postgres://postgres@localhost:5432/postgres PGRST_DB_ANON_ROLE=postgres ./postgrest
03/Jun/2024:20:46:46 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:46:46 +0200: Listening on port 3000
03/Jun/2024:20:46:46 +0200: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
[1]    1424861 IOT instruction (core dumped)  PGRST_DB_LOG_LEVEL=info PGRST_DB_URI= PGRST_DB_ANON_ROLE=postgres ./postgrest

And with the fresh build of nix-build -A postgrestStatic as well:

PGRST_DB_LOG_LEVEL=info PGRST_DB_URI=postgres://postgres@localhost:5432/postgres PGRST_DB_ANON_ROLE=postgres /nix/store/7c9y713ayzvj1812bhiac38f0wkvvcfv-postgrest-static-x86_64-unknown-linux-musl-12.1/bin/postgrest
03/Jun/2024:20:47:42 +0200: Starting PostgREST 12.1...
03/Jun/2024:20:47:42 +0200: Listening on port 3000
03/Jun/2024:20:47:42 +0200: Attempting to connect to the database...
Assertion failed: *(&once->n) != 4 (threads.c: k5_once: 564)
[1]    1427036 IOT instruction (core dumped)  PGRST_DB_LOG_LEVEL=info PGRST_DB_URI= PGRST_DB_ANON_ROLE=postgres

This should allow bisecting.

@steve-chavez
Copy link
Member

steve-chavez commented Jun 3, 2024

Can reproduce with the downloaded binary and a local PG (non nix tooled)

Same for me, it doesn't happen when using the nix postgrest-with-postgresql-* pg, only when using my distro postgres. What's strange is that it doesn't even log the database version, it just dies.

@steve-chavez
Copy link
Member

steve-chavez commented Jun 3, 2024

@wolfgangwalther
Copy link
Member

Interesting: The error doesn't seem to happen every time. It fails sometimes and sometimes doesn't - on the same commit.

@wolfgangwalther
Copy link
Member

git bisect points to 3cf5656 as the first bad commit.

@steve-chavez
Copy link
Member

Ah damn, it likely is the exception throwing and catching I did

tryRethrow :: IO () -> IO ()
tryRethrow action = do
act <- try action
whenLeft act (\ex -> do
AppState.putIsListenerOn appState False
observer $ DBListenFail dbChannel (Right $ Left ex)
unless configDbPoolAutomaticRecovery $ do
killThread mainThreadId
throw ex)

Somehow it must be affecting some state on the C lib. I will try a different approach.

Why this doesn't happen on the local nix postgres though? Because it connects through unix sockets?

@laurenceisla
Copy link
Member Author

Why this doesn't happen on the local nix postgres though? Because it connects through unix sockets?

I don't think so. Using postgrest-run in the nix environment but connecting to my local database over TCP works OK (it's my last example).

@wolfgangwalther
Copy link
Member

I think it needs the combination of static executable + TCP to break.

@steve-chavez
Copy link
Member

steve-chavez commented Jun 4, 2024

So it wasn't the exception catching/throwing, I just removed that and it fails the same.

Another change on 3cf5656 is that now the listener and the connection worker start in parallel.

Before the listener waited for the connection worker, here 3cf5656#diff-f6074642dc32a94e9e1997b21b0df631b2d3f098e0a81673edc22b89a86f49edL534-L538. But that was only done to reuse the exponential backoff retrying from the connection worker, I didn't know it would have this effect. I guess the kerberos lib doesn't work with parallel connections?

@wolfgangwalther
Copy link
Member

Hm, not sure whether this is about kerberos (but yes, could be). Could also be about libpq itself:

https://www.postgresql.org/docs/current/libpq-threading.html

One thread restriction is that no two threads attempt to manipulate the same PGconn object at the same time. In particular, you cannot issue concurrent commands from different threads through the same connection object.

Maybe something like that is happening accidentally?

@wolfgangwalther
Copy link
Member

wolfgangwalther commented Jun 4, 2024

Oh, on the same page later on:

If you are using Kerberos inside your application (in addition to inside libpq), you will need to do locking around Kerberos calls because Kerberos functions are not thread-safe. See function PQregisterThreadLock in the libpq source code for a way to do cooperative locking between libpq and your application.

Ah, that doesn't seem relevant, though, because we are not using kerberos additionally. But it could explain why libpq throws up in that area when maybe the thing from above happens?

steve-chavez added a commit to steve-chavez/postgrest that referenced this issue Jun 5, 2024
Clears the limitation mentioned on

PostgREST#3536

The Listener no longer uses the https://hackage.haskell.org/package/retry
package and instead uses a much simpler IORef in AppState for the
delays.

Additionally it no longer uses exception throwing/catching, which
is rather messy and brings some
concerns(PostgREST#3569 (comment)).
steve-chavez added a commit that referenced this issue Jun 5, 2024
Clears the limitation mentioned on

#3536

The Listener no longer uses the https://hackage.haskell.org/package/retry
package and instead uses a much simpler IORef in AppState for the
delays.

Additionally it no longer uses exception throwing/catching, which
is rather messy and brings some
concerns(#3569 (comment)).
@wolfgangwalther
Copy link
Member

I tested the current main branch with disabled gssapi / krb5. This makes the error go away, too.

The question is whether we can find a way / configuration where we can have your changes and gssapi-support?

@wolfgangwalther
Copy link
Member

For now, I pushed a revert of the GSSAPI commit. I will try this again, once we get a truly static libpq build via pkgsStatic - this could be related.

@steve-chavez
Copy link
Member

steve-chavez commented Jun 5, 2024

Oh wait.. I have a fix ready now that brings back waiting between the listener and connection worker..

@wolfgangwalther
Copy link
Member

Happy to have that in, if it improves the code. But I would still not want to release the GSSAPI change right now. I'm not 100% convinced that this can not cause problems elsewhere, maybe when two pool connections start up at the same time or something like that.

I actually kind of like the fact that we have a reproducer for this case, so I'm fine with keeping it as is.

steve-chavez added a commit to steve-chavez/postgrest that referenced this issue Jun 5, 2024
Brings back the the signaling/waiting between the connection pool and
the Listener.

Prevents the GSSAPI error shown on PostgREST#3569
@steve-chavez
Copy link
Member

Yeah, I think it's worth adding it back. If anything it reduces wastefully retrying both listener and the connection worker when postgres is completely down. Doing it on #3573.

maybe when two pool connections start up at the same time or something like that.

Right, that could still be trouble.

steve-chavez added a commit that referenced this issue Jun 5, 2024
Brings back the the signaling/waiting between the connection pool and
the Listener.

Prevents the GSSAPI error shown on #3569
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants