Reapply #8644 #9242

Open
wants to merge 18 commits into
base: master

aakselrod
Copy link
Contributor

@aakselrod aakselrod commented Nov 2, 2024

Change Description

Fix #9229 by reapplying #8644 and

  • correctly handling serialization errors in the batch package
  • splitting batch requests into their own transactions for postgres db backend to reduce serialization errors
  • correctly handling errors that were previously ignored or not passed through in the channeldb package
  • treating "current transaction is aborted" errors as serialization errors, since they show up in a subsequent call to postgres after an earlier serialization error was hit and ignored
  • tuning the db-instance postgres flags in Makefile per @djkazic's recommendations
  • setting the maxconnections parameter for postgres DBs to 20 instead of 50 by default

Steps to Test

See the failing itests prior to the fix, and the passing itests after the fix.

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error paths) are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

Copy link
Contributor

coderabbitai bot commented Nov 2, 2024

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@aakselrod
Copy link
Contributor Author

Waiting to push fix commit until CI completes for the reapplication.

@aakselrod aakselrod force-pushed the reapply-8644 branch 2 times, most recently from 7a40c4a to 1e7b192 Compare November 2, 2024 03:14
@aakselrod aakselrod changed the title from "Reapply 8644" to "Reapply #8644" Nov 2, 2024
@aakselrod
Copy link
Contributor Author

Note the postgres itests fail mostly on opening, announcing, and closing channels. This is due to the use of the batch package to batch announcement update writes, while batch doesn't handle serialization errors correctly.

@aakselrod
Copy link
Contributor Author

Looks like there are still a couple of itests failing. Will keep working on this next week.

@aakselrod
Copy link
Contributor Author

The error message "not enough elements in RWConflictPool to record a read/write conflict" tells me that postgres might not be running with enough resources to handle the itests with 8644 reapplied.

@bhandras bhandras self-requested a review November 2, 2024 06:41
@saubyk saubyk added database Related to the database/storage of LND postgres kvdb labels Nov 2, 2024
@Roasbeef
Copy link
Member

Roasbeef commented Nov 4, 2024

This looks relevant, re some of the errors I see in the latest CI run: https://stackoverflow.com/a/42303225

@Roasbeef
Copy link
Member

Roasbeef commented Nov 4, 2024

Perhaps part of the issue is with the ON CONFLICT clause in many of the queries:

  • ON CONFLICT Clause with Partial Indexes: Your use of ON CONFLICT with a partial unique index and a WHERE clause may not be matching the index as expected. This can lead to unexpected behavior in conflict resolution and contribute to serialization failures.
  • Conflict Resolution: The ON CONFLICT clause relies on unique indexes or constraints to detect conflicts. If the clause doesn't perfectly match an existing unique index or constraint, PostgreSQL cannot efficiently perform conflict resolution, leading to increased chances of serialization failures.
  • Match ON CONFLICT Clause to Unique Index: Ensure that the ON CONFLICT clause exactly matches the unique index or constraint without additional WHERE clauses that might prevent PostgreSQL from recognizing the conflict.

Based on the SO link above, we might also be lacking some needed indexes.

@aakselrod
Copy link
Contributor Author

aakselrod commented Nov 4, 2024

With closing the channel and a couple of other tests, I'm seeing logs similar to:

2024-11-01 22:13:15.503 [ERR] GRPH builder.go:1020: unable to prune routing table: unknown postgres error: ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)
2024-11-01 22:13:15.504 [ERR] GRPH builder.go:849: unable to prune graph with closed channels: unknown postgres error: ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)

when I reproduce locally, as well as in the CI logs. I'm going to pull on that thread first...

On the test config side, also seeing these:

2024-11-02 04:28:59.361 UTC [19976] ERROR:  out of shared memory
2024-11-02 04:28:59.361 UTC [19976] HINT:  You might need to increase max_pred_locks_per_transaction.

I think the first issue above is a code issue and the second is a config issue. Together with the other config issue from my earlier comment, these are the three major failures still happening. I think the ON CONFLICT/index audit is also a good idea to ensure we minimize serialization errors, but I'd like to solve at least the code issue with the graph pruning first.

@Roasbeef
Copy link
Member

Roasbeef commented Nov 4, 2024

2024-11-01 22:13:15.504 [ERR] GRPH builder.go:849: unable to prune graph with closed channels

This looks like a case where we continue when we get an error, instead of checking it for a serialization error, and returning if it is.

@aakselrod
Copy link
Contributor Author

This looks like a case where we continue when we get an error, instead of checking it for a serialization error, and returning if it is.

Yep, looking into why that isn't caught by the panic/recover mechanism.

@aakselrod
Copy link
Contributor Author

It was actually a lack of error checking in delChannelEdgeUnsafe() when calling updateEdgePolicyDisabledIndex(). This caused the serialization error to be ignored, and the next error is the one I pasted above. Submitting a fix commit below.
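
To illustrate why the ignored error surfaced later as "current transaction is aborted", here is a minimal Go sketch. It is not lnd code: `fakeTx`, `deleteEdgeUnchecked`, and `deleteEdgeChecked` are hypothetical names, and the fake transaction only imitates the postgres behavior that every statement after a failed one is rejected until the transaction ends.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the two postgres errors discussed in this thread.
var (
	errSerialization = errors.New("could not serialize access")
	errAborted       = errors.New("current transaction is aborted")
)

// fakeTx is a hypothetical stand-in for a postgres transaction: once any
// statement fails, every later statement fails with errAborted.
type fakeTx struct {
	failNext bool
	aborted  bool
}

// exec pretends to run one statement inside the transaction.
func (tx *fakeTx) exec(stmt string) error {
	if tx.aborted {
		return errAborted
	}
	if tx.failNext {
		tx.failNext = false
		tx.aborted = true
		return errSerialization
	}
	return nil
}

// deleteEdgeUnchecked drops the error from the first statement, so the
// caller only ever sees the follow-on errAborted.
func deleteEdgeUnchecked(tx *fakeTx) error {
	tx.exec("update disabled index") // error silently ignored
	return tx.exec("delete edge")
}

// deleteEdgeChecked returns the first error, so a retry loop can
// recognize it as a serialization failure.
func deleteEdgeChecked(tx *fakeTx) error {
	if err := tx.exec("update disabled index"); err != nil {
		return err
	}
	return tx.exec("delete edge")
}

func main() {
	fmt.Println(deleteEdgeUnchecked(&fakeTx{failNext: true}))
	fmt.Println(deleteEdgeChecked(&fakeTx{failNext: true}))
}
```

In the unchecked variant the retryable error is lost and only the secondary error reaches the caller, which matches the log lines pasted earlier in the thread.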

@aakselrod
Copy link
Contributor Author

Looks better as far as the errors on closing channels. Will keep working tomorrow to eliminate the other errors.

@Roasbeef
Copy link
Member

Roasbeef commented Nov 6, 2024

2024-11-05 03:38:04.408 UTC [14305] ERROR: out of shared memory
2024-11-05 03:38:04.408 UTC [14305] HINT: You might need to increase max_pred_locks_per_transaction.

Hmm, so we don't have great visibility into how much memory these CI machines have. Perhaps we need to modify the connection settings to reduce the number of active connections, and also tune params like max_pred_locks_per_transaction.

@djkazic has been working on a postgres+lnd tuning/perf guide, that I think we can eventually check directly into lnd.

@Roasbeef
Copy link
Member

Roasbeef commented Nov 6, 2024

This is also very funky:

// Check to see if a bucket with this key exists.
var dummy int
row, cancel := b.tx.QueryRow(
	"SELECT 1 FROM "+b.table+" WHERE "+parentSelector(b.id)+
		" AND key=$1 AND value IS NULL", key,
)
defer cancel()
err := row.Scan(&dummy)
switch {
// No bucket exists, proceed to deletion of the key.
case err == sql.ErrNoRows:
case err != nil:
	return err
// Bucket exists.
default:
	return walletdb.ErrIncompatibleValue
}
_, err = b.tx.Exec(
	"DELETE FROM "+b.table+" WHERE key=$1 AND "+
		parentSelector(b.id)+" AND value IS NOT NULL",
	key,
)
if err != nil {
	return err
}

We do two queries to just delete: select to see if exists, then delete. Instead of just trying to delete.

Stepping back a minute: perhaps the issue is with this flawed KV abstraction we have. Perhaps we should just re-create a better hierarchical KV table from scratch. We use sqlc elsewhere so we can gain by having a unified set of light abstractions over what we want to do on the SQL layer.

@Roasbeef
Copy link
Member

Roasbeef commented Nov 6, 2024

Here's another instance of duplicated work in CreateBucket:

	// Check to see if the bucket already exists.
	var (
		value *[]byte
		id    int64
	)
	row, cancel := b.tx.QueryRow(
		"SELECT id,value FROM "+b.table+" WHERE "+parentSelector(b.id)+
			" AND key=$1", key,
	)
	defer cancel()
	err := row.Scan(&id, &value)
	switch {
	case err == sql.ErrNoRows:
	case err == nil && value == nil:
		return nil, walletdb.ErrBucketExists
	case err == nil && value != nil:
		return nil, walletdb.ErrIncompatibleValue
	case err != nil:
		return nil, err
	}
	// Bucket does not yet exist, so create it. Postgres will generate a
	// bucket id for the new bucket.
	row, cancel = b.tx.QueryRow(
		"INSERT INTO "+b.table+" (parent_id, key) "+
			"VALUES($1, $2) RETURNING id", b.id, key,
	)
	defer cancel()
	err = row.Scan(&id)
	if err != nil {
		return nil, err
	}
	return newReadWriteBucket(b.tx, &id), nil
}

We select to see if it exists, then potentially do the insert. Instead, we can just do an UPSERT, then use RETURNING to give us the bucket id and key we care about, so it's a single query.
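
A rough sketch of the single-statement idea, using only the standard `database/sql` package. It assumes a unique index covers `(parent_id, key)` for nested buckets, and `createBucket`, `insertBucketQuery`, `errKeyExists`, and the table name `kv_table` are hypothetical names for illustration, not the kvdb implementation. With `ON CONFLICT DO NOTHING`, a conflicting insert returns no row, so `RETURNING id` yields `sql.ErrNoRows` when the key is already taken.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// errKeyExists is a hypothetical sentinel for this sketch.
var errKeyExists = errors.New("key already exists")

// insertBucketQuery builds the single-statement insert. ON CONFLICT DO
// NOTHING suppresses the unique violation, and RETURNING then yields no
// row when the key was already present.
func insertBucketQuery(table string) string {
	return "INSERT INTO " + table + " (parent_id, key) " +
		"VALUES($1, $2) ON CONFLICT DO NOTHING RETURNING id"
}

// createBucket tries the insert directly instead of selecting first.
func createBucket(ctx context.Context, tx *sql.Tx, table string,
	parentID int64, key []byte) (int64, error) {

	var id int64
	err := tx.QueryRowContext(
		ctx, insertBucketQuery(table), parentID, key,
	).Scan(&id)
	switch {
	case errors.Is(err, sql.ErrNoRows):
		// Conflict: a bucket or a value already uses this key. A
		// follow-up SELECT is only needed on this path to tell
		// the two cases apart.
		return 0, errKeyExists
	case err != nil:
		return 0, err
	}
	return id, nil
}

func main() {
	// Without a live database we can only show the generated SQL.
	fmt.Println(insertBucketQuery("kv_table"))
}
```

The common path (bucket does not exist yet) then costs one round trip. Distinguishing "bucket exists" from "incompatible value" still needs a second query, but only after a conflict.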

@Roasbeef
Copy link
Member

Roasbeef commented Nov 6, 2024

I think the way the sequence is implemented may also be problematic: we have the sequence field directly in the table, which means table locks may need to be held. The sequence gets incremented a lot for stuff like payments or invoices. We may be able to instead split that out into another table that can be updated independently of the main table:

// Sequence returns the current sequence number for this bucket without
// incrementing it.
func (b *readWriteBucket) Sequence() uint64 {
	if b.id == nil {
		panic("sequence not supported on top level bucket")
	}
	var seq int64
	row, cancel := b.tx.QueryRow(
		"SELECT sequence FROM "+b.table+" WHERE id=$1 "+
			"AND sequence IS NOT NULL",
		b.id,
	)
	defer cancel()
	err := row.Scan(&seq)
	switch {
	case err == sql.ErrNoRows:
		return 0
	case err != nil:
		panic(err)
	}
	return uint64(seq)
}

@aakselrod
Copy link
Contributor Author

aakselrod commented Nov 6, 2024

I've been able to reduce (but not fully eliminate) the "out of shared memory" and "not enough elements in RWConflictPool" errors locally by changing the -N parameter of the lnd-postgres container to 200, and changing the default maxconnections value in lnd to 20. This follows from this comment about how the RWConflictPool is allocated.

I've also tried treating these errors and "current transaction is aborted" as serialization errors, since they generally happen when too many transactions are conflicting, and that seemed to reduce the number of test failures.

In addition, I've found one more place where we get the "current transaction is aborted" errors due to lack of error handling, and added error handling there.

I pushed these changes above for discussion. My next step is to try to reduce the number of conflicts based on @Roasbeef's suggestions above. I'm going on vacation for the rest of the week until next Tuesday, so will keep working on this then.

@aakselrod
Copy link
Contributor Author

I think treating the OOM errors as serialization errors ended up being a mistake. Going to take that out and push when this run is done. In addition, I'm trying doubling the max_pred_locks_per_transaction value from the default (64->128).

@aakselrod
Copy link
Contributor Author

It seems to be reproducible locally for me, should be able to have it fixed today.

@aakselrod
Copy link
Contributor Author

aakselrod commented Nov 19, 2024

So I think there's an issue here with the fact that btcwallet requests locks inside transactions, and then re-requests them on retry. When the inner function panics due to a serialization error, the lock is never released, so the retry just tries acquiring the lock when it's already held, causing a deadlock. An example is SetSyncedTo() being called by the blockchain synchronization goroutine while SyncedTo() is called by e.g. a GetInfo RPC call in an itest. The SetSyncedTo panics and retries without ever releasing the lock it holds (or vice versa).

I think for now the answer is to keep a global lock facility and only turn it on for walletdb and then either a) fix the issue in btcwallet separately or b) figure out a mechanism other than a panic to do retries where errors aren't returned, such as in Get(); perhaps a channel would work instead. I think that can be saved for later, as walletdb isn't as performance-critical as e.g. channeldb. I'm pushing a quick and dirty fix commit, but we can make this better (add an option for turning it on instead of doing it in newPostgresBackend and matching on the db prefix) if it's acceptable.
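
A small Go sketch of the failure mode described above, under the assumption that the lock is released with an explicit (non-deferred) unlock. It is not btcwallet code: `leakyUpdate`, `safeUpdate`, `recoverErr`, and `runDemo` are hypothetical names. It shows that a recovered panic leaves the mutex held unless the unlock was deferred. `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errSerialization = errors.New("could not serialize access")

// recoverErr runs f and converts a panic into a returned error, similar
// in spirit to a panic/recover based retry wrapper.
func recoverErr(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			if e, ok := r.(error); ok {
				err = e
				return
			}
			err = fmt.Errorf("%v", r)
		}
	}()
	f()
	return nil
}

// leakyUpdate unlocks explicitly, so a panic in body skips the Unlock.
func leakyUpdate(mu *sync.Mutex, body func()) {
	mu.Lock()
	body()
	mu.Unlock()
}

// safeUpdate defers the Unlock, so it still runs while a panic unwinds.
func safeUpdate(mu *sync.Mutex, body func()) {
	mu.Lock()
	defer mu.Unlock()
	body()
}

// stillLocked reports whether mu is currently held.
func stillLocked(mu *sync.Mutex) bool {
	if mu.TryLock() {
		mu.Unlock()
		return false
	}
	return true
}

// runDemo panics inside both variants and reports which mutex stays held.
func runDemo() (leakyHeld, safeHeld bool) {
	failing := func() { panic(errSerialization) }

	var leaky, safe sync.Mutex
	_ = recoverErr(func() { leakyUpdate(&leaky, failing) })
	_ = recoverErr(func() { safeUpdate(&safe, failing) })

	return stillLocked(&leaky), stillLocked(&safe)
}

func main() {
	leakyHeld, safeHeld := runDemo()
	fmt.Println("leaky still locked:", leakyHeld)
	fmt.Println("safe still locked:", safeHeld)
}
```

A retry that calls `leakyUpdate` again on the same mutex would block forever, which is the shape of the deadlock described in this comment.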

@Roasbeef
Copy link
Member

When the inner function panics due to a serialization error, the lock is never released, so the retry just tries acquiring the lock when it's already held

Why does this inner function panic? Is this another instance where we aren't properly catching the error? Or is it that we actually have panics in kvdb that trigger when errors fall through normal pattern matching?

@Roasbeef
Copy link
Member

Ah ok, I see now that the serialization error handling in general is based around recovering after panics to retry a transaction:

lnd/kvdb/sqlbase/db.go

Lines 168 to 198 in a101950

// catchPanic executes the specified function. If a panic occurs, it is returned
// as an error value.
func catchPanic(f func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			switch data := r.(type) {
			case error:
				err = data
			default:
				err = errors.New(fmt.Sprintf("%v", data))
			}
			// Before we issue a critical log which'll cause the
			// daemon to shut down, we'll first check if this is a
			// DB serialization error. If so, then we don't need to
			// log as we can retry safely and avoid tearing
			// everything down.
			if sqldb.IsSerializationError(sqldb.MapSQLError(err)) {
				log.Tracef("Detected db serialization error "+
					"via panic: %v", err)
			} else {
				log.Criticalf("Caught unhandled error: %v", r)
			}
		}
	}()
	err = f()
	return
}

lnd/kvdb/sqlbase/db.go

Lines 238 to 246 in a101950

execTxBody := func(tx sqldb.Tx) error {
	kvTx, ok := tx.(*readWriteTx)
	if !ok {
		return fmt.Errorf("expected *readWriteTx, got %T", tx)
	}
	reset()
	return catchPanic(func() error { return f(kvTx) })
}

I'm not sure why we went in that direction historically. At this kvdb emulation level, we can just pass through that error (not panic), then rely on the normal serialization error handling:

lnd/sqldb/interfaces.go

Lines 268 to 284 in a101950

if bodyErr := txBody(tx); bodyErr != nil {
	// Roll back the transaction, then attempt a random
	// backoff and try again if the error was a
	// serialization error.
	if err := rollbackTx(tx); err != nil {
		return MapSQLError(err)
	}
	dbErr := MapSQLError(bodyErr)
	if IsSerializationError(dbErr) {
		if waitBeforeRetry(i) {
			continue
		}
	}
	return dbErr
}
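
As a rough illustration of relying on returned errors rather than panics, the following Go sketch retries a transaction body only when it returns a serialization error. `execWithRetry` and `errSerialization` are hypothetical names, and rollback and backoff are reduced to a comment. It is a simplified model of the loop quoted above, not the sqldb implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errSerialization   = errors.New("could not serialize access")
	errRetriesExceeded = errors.New("db tx retries exceeded")
)

// execWithRetry runs body up to maxRetries times. Only serialization
// errors trigger another attempt; any other error is returned as-is.
func execWithRetry(maxRetries int, body func(attempt int) error) error {
	for i := 0; i < maxRetries; i++ {
		err := body(i)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errSerialization) {
			return err
		}
		// A real implementation would roll back the transaction
		// and wait for a randomized backoff before retrying.
	}
	return errRetriesExceeded
}

func main() {
	attempts := 0
	err := execWithRetry(5, func(int) error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("tx body: %w", errSerialization)
		}
		return nil
	})
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

Because the body returns its error normally, any deferred cleanup inside it (such as unlocking a mutex) runs before the next attempt.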

@aakselrod
Copy link
Contributor Author

aakselrod commented Nov 20, 2024

I'll be off tomorrow but I'll see if I can refactor this to avoid panics later this week. I think the biggest reason is that Get() and other functions for cursors/nested buckets don't return an error and are usually called within transactions (inside the functions that do return errors).

I can likely use a channel or mutex/error field in the tx struct to pass back errors instead of a panic in these cases, and then the deferred unlocks should be executed.

Also we can sort of rely on using the "in a failed tx" errors I've started treating as serialization errors in case something after a Get() does return an error, but it's not guaranteed.
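
One way the mutex/error-field idea could look, as a hedged Go sketch. `sketchTx`, `setErr`, `internalErr`, `get`, and `runTx` are hypothetical names and do not reflect the actual readWriteTx layout. The point is that a method without an error return records the failure on the transaction, and the wrapper reports that recorded error after the body finishes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errSerialization = errors.New("could not serialize access")

// sketchTx is a hypothetical transaction wrapper for this example.
type sketchTx struct {
	mu  sync.Mutex
	err error // first error seen by a method that cannot return one
}

// setErr records the first internal error and keeps it.
func (tx *sketchTx) setErr(err error) {
	tx.mu.Lock()
	defer tx.mu.Unlock()
	if tx.err == nil {
		tx.err = err
	}
}

// internalErr returns the recorded error, if any.
func (tx *sketchTx) internalErr() error {
	tx.mu.Lock()
	defer tx.mu.Unlock()
	return tx.err
}

// get mimics a KV Get() with no error return: on a query failure it
// records the error and returns nil instead of panicking. The queryErr
// parameter simulates the result of the underlying SQL query.
func (tx *sketchTx) get(key string, queryErr error) []byte {
	if queryErr != nil {
		tx.setErr(queryErr)
		return nil
	}
	return []byte("value-of-" + key)
}

// runTx runs the body and then prefers the internal error, so a retry
// loop sees the serialization failure even if the body returned nil.
func runTx(body func(tx *sketchTx) error) error {
	tx := &sketchTx{}
	err := body(tx)
	if ierr := tx.internalErr(); ierr != nil {
		return ierr
	}
	return err
}

func main() {
	err := runTx(func(tx *sketchTx) error {
		_ = tx.get("k", errSerialization)
		return nil
	})
	fmt.Println(errors.Is(err, errSerialization)) // prints "true"
}
```

Since nothing panics, deferred unlocks in the calling code run on the normal return path.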

@aakselrod
Copy link
Contributor Author

aakselrod commented Nov 20, 2024

I think I have a refactor using a mutex/error in the readWriteTx struct working correctly. Iterating through some tests locally to make sure I don't hit any snags, then pushing for a full CI run. Nope, not fully working yet (seems to have the same problem which is weird since it should have eliminated the deadlocks), but I won't get to keep going on it until Thursday. Going to push to get a full CI run anyway so I have logs to come back to.

@Roasbeef
Copy link
Member

Looks like we have a clean run of bitcoind-postgres with the latest push? Nice work!

Will check out the set of core commits now. I still think we can likely restructure the queries and KV-table, but we can save that for another time.

Copy link
Member

@Roasbeef Roasbeef left a comment

Nice work with this PR! I can tell some serious tenacity went into iterating on this PR to get to the point it's at now.

batch/batch.go Outdated
failIdx = i
}

return dbErr
Copy link
Member

Do we still want to return the non-mapped version? For when IsSerializationError is false.

@@ -2622,8 +2622,14 @@ func (c *ChannelGraph) delChannelEdgeUnsafe(edges, edgeIndex, chanIndex,

  // As part of deleting the edge we also remove all disabled entries
  // from the edgePolicyDisabledIndex bucket. We do that for both directions.
- updateEdgePolicyDisabledIndex(edges, cid, false, false)
- updateEdgePolicyDisabledIndex(edges, cid, true, false)
+ err = updateEdgePolicyDisabledIndex(edges, cid, false, false)
Copy link
Member

👍

@@ -17,6 +17,11 @@ var (
// ErrRetriesExceeded is returned when a transaction is retried more
// than the max allowed valued without a success.
ErrRetriesExceeded = errors.New("db tx retries exceeded")

postgresErrMsgs = []string{
Copy link
Member

Style nit: missing godoc comment.

@@ -21,6 +21,7 @@ var (
postgresErrMsgs = []string{
"could not serialize access",
"current transaction is aborted",
"not enough elements in RWConflictPool",
Copy link
Member

IIUC, this gets returned when the instance runs out of shared memory. I wager this is popping up mainly due to the constrained environment that the CI runners execute on.

Copy link
Contributor Author

That's right but I think we can retry this one specifically whereas the out of shared memory error tends to be less retriable. That's why I'm not detecting the error code, but only the string.
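
A minimal sketch of string-based matching, for illustration only. The three messages are the ones listed in the diff above. `isRetriableMsg` and `isRetriable` are hypothetical helpers, not the sqldb API. Matching on message text lets the RWConflictPool error be retried while "out of shared memory" is not.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// retriableMsgs mirrors the strings listed in the diff above.
var retriableMsgs = []string{
	"could not serialize access",
	"current transaction is aborted",
	"not enough elements in RWConflictPool",
}

// isRetriableMsg reports whether msg contains one of the known
// retriable error strings.
func isRetriableMsg(msg string) bool {
	for _, m := range retriableMsgs {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

// isRetriable applies the same check to an error value.
func isRetriable(err error) bool {
	return err != nil && isRetriableMsg(err.Error())
}

func main() {
	fmt.Println(isRetriable(errors.New("ERROR: out of shared memory")))
	fmt.Println(isRetriable(errors.New(
		"ERROR: not enough elements in RWConflictPool to record " +
			"a read/write conflict",
	)))
}
```

Substring matching is brittle if the server's message text changes, so this trades robustness for the ability to separate the two errors by their wording.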

@@ -38,7 +38,7 @@ const (
SqliteBackend = "sqlite"
DefaultBatchCommitInterval = 500 * time.Millisecond

- defaultPostgresMaxConnections = 50
+ defaultPostgresMaxConnections = 20
Copy link
Member

With the other fixes later in this commit, if we revert this (back to 50) do things still pass?

If not, then we may want to implement clamping for an upper limit here.

Copy link
Contributor Author

Will check and see if I can tune to be OK for 50.

Makefile (resolved)

// Apply each request in the batch in its own transaction. Requests that
// fail will be retried by the caller.
for _, req := range b.reqs {
Copy link
Member

Perhaps we should just bypass the batch scheduler altogether for postgres?

IIRC, we added it originally to speed up initial graph sync for bbolt, by reducing the number of total transactions we did.

Copy link
Contributor Author

I might be able to take this commit out altogether, will check to see after I've fixed the last deadlock I'm working on now. Otherwise, I'll refactor to just skip the batch scheduler for postgres, should be simpler.
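
For reference, the per-request approach from the code comment in this thread ("apply each request in the batch in its own transaction") could be sketched roughly as follows. `request`, `applyEach`, `runTx`, and `demo` are hypothetical names, and the transaction is simulated by simply invoking the body. This is not the batch package's code.

```go
package main

import (
	"errors"
	"fmt"
)

// request is a hypothetical stand-in for one queued batch request.
type request struct {
	name   string
	update func() error
}

// applyEach runs every request through runTx on its own, so one
// conflicting request no longer forces the whole batch to be redone.
// It returns the requests that failed so the caller can retry them.
func applyEach(reqs []request, runTx func(func() error) error) []request {
	var failed []request
	for _, req := range reqs {
		if err := runTx(req.update); err != nil {
			failed = append(failed, req)
		}
	}
	return failed
}

// demo applies three requests, one of which fails, and returns the
// names of the requests left for the caller to retry.
func demo() []string {
	// runTx would begin, run, and commit a DB transaction; here it
	// only invokes the body.
	runTx := func(body func() error) error { return body() }

	reqs := []request{
		{name: "a", update: func() error { return nil }},
		{name: "b", update: func() error {
			return errors.New("could not serialize access")
		}},
		{name: "c", update: func() error { return nil }},
	}

	var names []string
	for _, req := range applyEach(reqs, runTx) {
		names = append(names, req.name)
	}
	return names
}

func main() {
	fmt.Println(demo()) // prints "[b]"
}
```

Smaller transactions touch fewer rows each, which is the motivation given in the PR description for splitting batches on the postgres backend.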

return catchPanic(func() error { return f(kvTx) })

err := f(kvTx)
// Return the internal error first in case we need to retry and
Copy link
Member

Style nit: missing a newline above.

@@ -1624,14 +1625,25 @@ func (s *UtxoSweeper) monitorFeeBumpResult(resultChan <-chan *BumpResult) {
}

case <-s.quit:
- log.Debugf("Sweeper shutting down, exit fee " +
- "bump handler")
+ log.Debugf("Sweeper shutting down, exit fee "+
Copy link
Member

Temp commit that can be dropped?

Copy link
Contributor Author

@aakselrod aakselrod Nov 22, 2024

Yeah, was hoping to get this deadlock in CI, but the deadlock happened in another test that didn't produce this output.

I'm able to reproduce the deadlock and think I've figured out how it happens. Running some tests to ensure it's fixed, then if it stays good, will submit a small PR to btcwallet with the fix. I lied, still working on a fix.

Copy link
Contributor Author

@aakselrod aakselrod left a comment

I'll clean this up shortly, but responding to a few comments.

Note that the failure in CI from the previous push is the same deadlock we've seen before in htlc_timeout_resolver_extract_preimage_(remote|local) but in a different test, so it didn't end up showing the goroutine dump. But I think I've tracked it down and will submit a PR to btcwallet to fix it. I think I have a way to track it down, but am still working on it. It's definitely in waddrmgr.

@Roasbeef
Copy link
Member

Re the shared memory issue, I think we can get around that by bumping up the size of the CI instance we use for these postgres tests: https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/running-jobs-on-larger-runners

@aakselrod
Copy link
Contributor Author

aakselrod commented Nov 23, 2024

I think I've found the deadlock. With more than one DB transaction allowed in parallel for btcwallet, we're running into a deadlock similar to the following. This example is from the UTXO sweeper tests, but it can happen in other situations as well.

In one goroutine, the UTXO sweeper calls NewAddress() on the wallet. This:

  • Starts a DB transaction
  • Calls (*wallet.wallet) newAddress(), which
  • Calls FetchScopedKeyManager() on the waddrmgr.Manager, which
  • Calls RLock() and RUnlock() on the waddrmgr.Manager mutex and returns the requested scoped manager, holding no locks
  • Then it calls NextExternalAddresses() on the waddrmgr.ScopedManager, which
  • Calls Lock() on the ScopedManager's mutex, defers the Unlock(), and
  • Calls nextAddresses() on the ScopedManager, which
  • Calls WatchOnly() on the parent Manager, which
  • Calls RLock() and RUnlock() on the Manager's mutex

So while the top-level Manager's mutex is locked first, it's immediately unlocked. The ScopedManager lock is then held while the Manager is locked/unlocked.

In another goroutine, we see the sweeper call GetTransactionDetails() on the wallet, which eventually:

  • Starts a DB transaction
  • Within that transaction, calls (*waddrmgr.Manager) Address(), which
  • Calls RLock() on the Manager, and defers the RUnlock()
  • Iterates through each ScopedManager to look for a matching address, which
  • Calls RLock() and then RUnlock() on the ScopedManager to check if the requested address is cached, and if not
  • Calls Lock() and then Unlock() on the ScopedManager to try to load/cache the requested address from the DB

So the sequence in this case is that the Manager lock is held and the ScopedManagers are locked/unlocked.

This has previously been mitigated by the fact that each of these happens inside a database transaction, which never ran in parallel. However, with parallel DB transactions made possible by this change, the inner deadlock is exposed.

I'll submit a PR next week to btcwallet to fix this, and then clean up this PR/respond to the comments above.

Labels
database Related to the database/storage of LND kvdb postgres
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tracking issue: kvdb/postgres-Remove global application level lock
4 participants