Skip to content

Fix database commit timeout - sync hangs on write phase#593

Merged
bokelley merged 2 commits intomainfrom
bokelley/test-timeout-fixes
Oct 24, 2025
Merged

Fix database commit timeout - sync hangs on write phase#593
bokelley merged 2 commits intomainfrom
bokelley/test-timeout-fixes

Conversation

@bokelley
Copy link
Collaborator

Problem

AccuWeather syncs were hanging on the database write phase, not GAM API calls. The sync would complete discovery successfully but then hang indefinitely on "Writing Targeting Keys to DB".

Root Cause

The _flush_batch() method calls self.db.commit() which can block indefinitely if there's:

  • Lock contention on the GAMInventory table
  • Large transaction size
  • Database connection issues

Our previous timeout fixes only covered GAM API discovery operations, not database operations.

Solution

Added a 120-second timeout to the database commit operation in _flush_batch()

Impact

  • Database commits will timeout after 120s instead of hanging forever
  • Clear error message about lock contention or large transaction
  • Sync can be retried without manual intervention
  • Completes the timeout coverage: GAM API + database operations

Testing

Observed in production:

  • Sync ran for 34+ minutes, stuck on phase 5/7 (Writing Targeting Keys to DB)
  • Progress: count 0, Writing Targeting Keys to DB
  • Database commit was hanging (not GAM API)

With this fix, database commits will timeout gracefully instead of hanging indefinitely.

Related: #587 (GAM API timeouts)

…hangs

The sync was hanging on database commit (not GAM API calls). This adds a
120-second timeout to the _flush_batch commit operation to prevent
indefinite hangs caused by lock contention or large transactions.

Root cause: AccuWeather syncs were hanging on 'Writing Targeting Keys to DB'
phase because db.commit() can block indefinitely if there's lock contention.

Fixes the issue where GAM API discovery completed successfully but database
writes hung forever.
…syncs

Root cause: Database connections timeout after ~15min of inactivity, but sync
holds same session for 30+ min. When connection is lost, queries hang waiting
for response that never comes.

Fixes:
1. Add 'SELECT 1' connection keep-alive before critical queries
2. Add expire_all() to clear stale session state
3. Catch OperationalError/DBAPIError for lost connections
4. Better error messages indicating connection loss vs lock contention

This is the correct 'out of the box' solution - we commit after each batch
(short transactions) but use connection keep-alive for long-lived sessions.

No need to create new sessions per batch - that's wasteful. The session can
be long-lived as long as we verify connection health periodically.
@bokelley bokelley merged commit 0518f4b into main Oct 24, 2025
8 checks passed
danf-newton pushed a commit to Newton-Research-Inc/salesagent that referenced this pull request Nov 24, 2025
* Add 120s timeout to database commit operations to prevent indefinite hangs

The sync was hanging on database commit (not GAM API calls). This adds a
120-second timeout to the _flush_batch commit operation to prevent
indefinite hangs caused by lock contention or large transactions.

Root cause: AccuWeather syncs were hanging on 'Writing Targeting Keys to DB'
phase because db.commit() can block indefinitely if there's lock contention.

Fixes the issue where GAM API discovery completed successfully but database
writes hung forever.

* Add connection keep-alive and better error handling for long-running syncs

Root cause: Database connections timeout after ~15min of inactivity, but sync
holds same session for 30+ min. When connection is lost, queries hang waiting
for response that never comes.

Fixes:
1. Add 'SELECT 1' connection keep-alive before critical queries
2. Add expire_all() to clear stale session state
3. Catch OperationalError/DBAPIError for lost connections
4. Better error messages indicating connection loss vs lock contention

This is the correct 'out of the box' solution - we commit after each batch
(short transactions) but use connection keep-alive for long-lived sessions.

No need to create new sessions per batch - that's wasteful. The session can
be long-lived as long as we verify connection health periodically.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant