Skip to content

Migration runner leaks idle-in-transaction sessions on postgres (v3.13.9) #51

@andrevanzuydam

Description

@andrevanzuydam

Summary

Tina4 v3.13's migration runner opens a transaction for the migration-tracking lookup queries and never closes/commits it. With psycopg2 (autocommit off by default) the connection is then left in idle in transaction state indefinitely. Over time these leak across pod restarts and eventually exhaust the postgres connection pool — at which point new boots fail with:

FATAL: remaining connection slots are reserved for roles with the SUPERUSER attribute

Tina4's autodiscovery opens a fresh connection per src/app/*.py it loads, so when the pool is saturated every module load fails and src/routes/* never register. The HTTP server still answers /health-check (probe passes → pod marked Ready), but every actual app route returns 404 — a silent "ready but broken" state.

Versions

  • tina4_python 3.13.9
  • psycopg2 (postgres driver, transactions=on by default)
  • Python 3.13 / Alpine
  • PostgreSQL 14 / 16 (reproduced on both)

Reproduction

  1. Boot a Tina4 v3.13 app against a postgres DB
  2. Look at pg_stat_activity shortly after startup:
SELECT pid, state, age(now(), xact_start) AS tx_age, substr(query, 1, 80)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND application_name = ''  -- the app
ORDER BY backend_start;

You'll see at least one session like:

pid       state                last_query                                                  tx_age
2371466   idle in transaction  SELECT MAX(batch) as max_batch FROM tina4_migration         8m

The transaction stays open for the lifetime of the connection (we've observed sessions sitting in this state for 2+ days, eventually being killed by idle_in_transaction_session_timeout).

Smoking gun in source

Two unwrapped reads in tina4_python/migration/runner.py:

  • Line 297 — _get_next_batch:
    row = db.fetch_one("SELECT MAX(batch) as max_batch FROM tina4_migration")
  • Line 491:
    row = db.fetch_one("SELECT MAX(batch) as max_batch FROM tina4_migration WHERE passed = 1")

Neither path commits or rolls back, and the db.fetch_one helper appears to start an implicit transaction (psycopg2 default). The transaction therefore sits open holding row-level locks on tina4_migration and a connection slot.

Impact

  • Each fresh pod start leaks (at minimum) one idle in transaction connection.
  • On a busy cluster with frequent restarts, the pool fills in days. We hit FATAL: remaining connection slots are reserved after ~89 restarts over 38 hours on a postgres with max_connections=100.
  • Symptoms are confusing because the pod looks healthy: liveness/readiness pass (just /health-check), but actual app routes 404 because the module autodiscovery silently failed during boot when connections were saturated.

Workaround we're using

Set a per-role timeout so postgres auto-kills the zombies:

ALTER ROLE <app_role> SET idle_in_transaction_session_timeout = '300s';

This stops the leak from saturating the pool, but doesn't fix the underlying transaction-handling bug.

Suggested fix

Wrap the read-only lookups in explicit with db.transaction(): ... (or commit immediately), or run them in autocommit mode since they're plain SELECTs that don't need transactional isolation. Same treatment likely needed in the analogous spots in tina4_python/database/ where helper-style db.fetch_one/db.fetch_all start implicit transactions.

Happy to send a PR if the team confirms the intended fix shape.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions