Goal
Make stale-lock recovery safe so two concurrent ensure runs can never both believe they hold a session lock.
Why it matters
In the blocking-wait path (locking.py:104-111), when a waiter sees the holder PID is dead it calls remove_lock(session) which unconditionally lock_path.unlink() (:308-326) WITHOUT re-checking the lock is still unheld. Between get_lock_holder() reading a dead PID and the unlink, a different process can open the same path, acquire the flock, and write its own PID; the racing waiter then deletes that live holder's lock file, and both proceed on different inodes each thinking they hold the lock. is_session_locked also has a TOCTOU window (acquire-then-immediately-release flock, :158-159). The module exists precisely to guarantee one ensure per session (docstring). Scope: only the opt-in --wait-lock path triggers this; the interleave is rare; this code is untested (test_locking.py covers only path-sanitization). But it's a correctness hole in a safety primitive — duplicate dispatch / racing send-keys / clobbered state.
Proposed approach
Preferred: drop the unlink-based recovery entirely and rely on flock's kernel-level auto-release on holder death — a dead PID's flock is already released, so the next LOCK_EX|LOCK_NB poll simply succeeds without any file deletion (the manual cleanup is both unnecessary and unsafe). Minimal: replace the blind remove_lock in the dead-holder branch with remove_stale_lock (which already gates on is_session_locked at :303), and re-check liveness under the about-to-be-held flock. Add a regression test for the dead-holder + concurrent-acquire interleave.
Effort / risk
Effort M. Risk: medium — concurrency code; test carefully. The flock-only approach is the cleanest and removes the most code (fits No-Backwards-Compat).
Goal
Make stale-lock recovery safe so two concurrent
ensureruns can never both believe they hold a session lock.Why it matters
In the blocking-wait path (
locking.py:104-111), when a waiter sees the holder PID is dead it callsremove_lock(session)which unconditionallylock_path.unlink()(:308-326) WITHOUT re-checking the lock is still unheld. Betweenget_lock_holder()reading a dead PID and theunlink, a different process can open the same path, acquire the flock, and write its own PID; the racing waiter then deletes that live holder's lock file, and both proceed on different inodes each thinking they hold the lock.is_session_lockedalso has a TOCTOU window (acquire-then-immediately-release flock,:158-159). The module exists precisely to guarantee oneensureper session (docstring). Scope: only the opt-in--wait-lockpath triggers this; the interleave is rare; this code is untested (test_locking.pycovers only path-sanitization). But it's a correctness hole in a safety primitive — duplicate dispatch / racing send-keys / clobbered state.Proposed approach
Preferred: drop the unlink-based recovery entirely and rely on flock's kernel-level auto-release on holder death — a dead PID's flock is already released, so the next
LOCK_EX|LOCK_NBpoll simply succeeds without any file deletion (the manual cleanup is both unnecessary and unsafe). Minimal: replace the blindremove_lockin the dead-holder branch withremove_stale_lock(which already gates onis_session_lockedat:303), and re-check liveness under the about-to-be-held flock. Add a regression test for the dead-holder + concurrent-acquire interleave.Effort / risk
Effort M. Risk: medium — concurrency code; test carefully. The flock-only approach is the cleanest and removes the most code (fits No-Backwards-Compat).