fix: prevent consumer stall on dead Redis connections #33

Description

@korutx

Context

Tracking issue for PR #32.

On 2026-06-30 14:13 UTC, GKE maintenance replaced the Redis pod shortly after the stream-processor pod started. Existing pooled TCP connections pointed at the old Redis pod IP and became black holes. Because Redis clients had no socket timeout, device workers hung forever on awaited Redis commands, queues filled, Pulsar consumption stopped, and the pod stayed apparently healthy.

Impact: live view was down fleet-wide for roughly two days, with a large backlog on persistent://streamhub/tracking/frames and stale playlists whose retained segment files had already expired.

Scope

  • Add fail-fast Redis client timeouts for playlist and session stores.
  • Configure socket_timeout, socket_connect_timeout, and health_check_interval via environment settings.
  • Ensure dead Redis connections raise instead of hanging forever.
  • Let redis-py reconnect on subsequent calls after timeout/failure.
  • Make per-device workers catch segment-generation failures per iteration.
  • Preserve pending frames so generation can retry on the next tick.
  • Ensure session-store refresh failures do not block segment generation when the segment itself can still be produced.
  • Deregister workers, queues, device state, and active-device gauge state when workers exit.
  • Keep graceful shutdown behavior intact.
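The timeout items above could be wired up roughly as follows. This is a minimal sketch, not the code from the PR: the environment variable names, the default values, and the helper names `redis_timeout_kwargs` and `make_redis_client` are assumptions, and the asyncio client is assumed only because the incident describes awaited Redis commands. The keyword arguments `socket_timeout`, `socket_connect_timeout`, and `health_check_interval` are the redis-py options named in the scope.

```python
import os
from typing import Any, Mapping

# Hypothetical variable names and defaults; the issue only says "REDIS_* settings".
_DEFAULTS = {
    "REDIS_SOCKET_TIMEOUT": 5.0,
    "REDIS_SOCKET_CONNECT_TIMEOUT": 2.0,
    "REDIS_HEALTH_CHECK_INTERVAL": 30.0,
}


def redis_timeout_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Translate REDIS_* environment settings into redis-py keyword arguments."""

    def read(name: str) -> float:
        # Fall back to the (assumed) default when the variable is unset.
        return float(env.get(name, _DEFAULTS[name]))

    return {
        "socket_timeout": read("REDIS_SOCKET_TIMEOUT"),
        "socket_connect_timeout": read("REDIS_SOCKET_CONNECT_TIMEOUT"),
        "health_check_interval": read("REDIS_HEALTH_CHECK_INTERVAL"),
    }


def make_redis_client(url: str, env: Mapping[str, str] = os.environ):
    """Build an asyncio client whose commands raise on a dead socket instead of hanging."""
    # Imported lazily so the settings helper above has no third-party dependency.
    import redis.asyncio as redis

    return redis.Redis.from_url(url, **redis_timeout_kwargs(env))
```

With a socket timeout set, a command sent over a black-holed connection should fail with a timeout error after the configured interval, and the pool can open a fresh connection on the next call.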

Acceptance Criteria

  • Redis commands against dead connections time out instead of blocking forever.
  • Redis socket/connect timeout and health-check values are configurable through REDIS_* settings.
  • A generation failure does not permanently kill a device worker while leaving its queue registered.
  • Pending frames survive a failed generation attempt and can be retried.
  • Session-store failures are logged/counted but do not prevent segment generation when storage/encoding can proceed.
  • Workers deregister themselves on exit so later frames can recreate them.
  • Active-device gauge state remains balanced across worker restarts/shutdown.
  • Tests cover generation continuing after session-store failure and worker state cleanup.
  • Existing tests, ruff, and formatting pass.
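The worker-related criteria above can be illustrated with a small asyncio sketch. Everything here is hypothetical (the `DeviceWorkers` class, the `generate` and `refresh_session` callables, the `None` shutdown sentinel, and an integer standing in for the active-device gauge); it is not the stream-processor's actual code. It also simplifies retry timing: a failed generation is retried when the next frame arrives rather than on a timer tick.

```python
import asyncio
import logging

log = logging.getLogger("worker")


class DeviceWorkers:
    """Hypothetical registry of per-device workers, queues, and gauge state."""

    def __init__(self, generate, refresh_session):
        self.generate = generate                # async: (device_id, frames) -> None
        self.refresh_session = refresh_session  # async: (device_id) -> None
        self.queues = {}
        self.tasks = {}
        self.active_devices = 0                 # stands in for the active-device gauge

    def submit(self, device_id, frame):
        """Queue a frame, (re)creating the worker if none is registered."""
        if device_id not in self.tasks:
            self.queues[device_id] = asyncio.Queue()
            self.active_devices += 1
            self.tasks[device_id] = asyncio.create_task(self._run(device_id))
        self.queues[device_id].put_nowait(frame)

    async def _run(self, device_id):
        queue, pending = self.queues[device_id], []
        try:
            while True:
                frame = await queue.get()
                if frame is None:  # sentinel: graceful shutdown
                    return
                pending.append(frame)
                try:
                    await self.refresh_session(device_id)
                except Exception:
                    # Session-store failure is logged but does not block generation.
                    log.exception("session refresh failed for %s", device_id)
                try:
                    await self.generate(device_id, list(pending))
                except Exception:
                    # Keep pending frames so the next iteration can retry.
                    log.exception("generation failed for %s; will retry", device_id)
                    continue
                pending.clear()
        finally:
            # Deregister on any exit so later frames can recreate the worker.
            self.queues.pop(device_id, None)
            self.tasks.pop(device_id, None)
            self.active_devices -= 1
```

The `finally` block runs on normal return, unexpected errors, and cancellation alike, which is what keeps the gauge balanced and prevents a dead worker from leaving its queue registered.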

Follow-Up

Consider adding a liveness probe or alert tied to consumption progress, such as msgRateOut=0 with a growing backlog, so stalled consumers cannot remain undetected while the metrics port still answers.
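One way such a check might look, as a rough sketch only: the `Sample` type, the `is_stalled` function, and the three-sample window are assumptions, and how the samples are obtained (for example by scraping subscription stats) is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of a subscription: dispatch rate and backlog size."""

    msg_rate_out: float
    backlog: int


def is_stalled(samples: Sequence[Sample], min_samples: int = 3) -> bool:
    """True when nothing was dispatched and the backlog grew across the whole window."""
    if len(samples) < min_samples:
        return False  # not enough evidence yet
    no_output = all(s.msg_rate_out == 0 for s in samples)
    growing = all(b.backlog > a.backlog for a, b in zip(samples, samples[1:]))
    return no_output and growing
```

Requiring both conditions over several consecutive samples avoids flagging an idle topic (zero rate, flat backlog) or a consumer that is merely slow (non-zero rate, growing backlog).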

Metadata

Labels

bug (Something isn't working)