fix: prevent consumer stall on dead Redis connections #33

## Context

Tracking issue for PR #32: https://github.com/microboxlabs/python-stream-processor/pull/32

On 2026-06-30 14:13 UTC, GKE maintenance replaced the Redis pod shortly after the stream-processor pod started. Existing pooled TCP connections still pointed at the old Redis pod IP and became black holes. Because the Redis clients had no socket timeout, device workers hung forever on awaited Redis commands. Their queues filled, Pulsar consumption stopped, and the pod continued to appear healthy.

Impact: live view was down fleet-wide for roughly two days, with a large backlog on `persistent://streamhub/tracking/frames` and stale playlists whose retained segment files had already expired.

## Scope

- Add fail-fast Redis client timeouts for playlist and session stores.
- Configure `socket_timeout`, `socket_connect_timeout`, and `health_check_interval` via environment settings.
- Ensure dead Redis connections raise instead of hanging forever.
- Let redis-py reconnect on subsequent calls after timeout/failure.
- Make per-device workers catch segment-generation failures per iteration.
- Preserve pending frames so generation can retry on the next tick.
- Ensure session-store refresh failures do not block segment generation when the segment itself can still be produced.
- Deregister workers, queues, device state, and active-device gauge state when workers exit.
- Keep graceful shutdown behavior intact.
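
The timeout settings in the list above could be wired up roughly as follows. This is a minimal sketch, not the code in the PR: the helper name `redis_timeout_kwargs`, the exact `REDIS_*` variable names, and the default values are assumptions. Only the three keyword names (`socket_timeout`, `socket_connect_timeout`, `health_check_interval`) come from this issue and from redis-py.

```python
import os
from typing import Mapping


def redis_timeout_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build fail-fast redis-py keyword arguments from REDIS_* variables.

    Variable names and defaults are assumptions made for this sketch.
    """
    return {
        # Upper bound, in seconds, on any single socket read/write.
        "socket_timeout": float(env.get("REDIS_SOCKET_TIMEOUT", "5")),
        # Upper bound, in seconds, on establishing a new TCP connection.
        "socket_connect_timeout": float(env.get("REDIS_SOCKET_CONNECT_TIMEOUT", "3")),
        # Seconds a pooled connection may sit idle before redis-py pings it on next use.
        "health_check_interval": int(env.get("REDIS_HEALTH_CHECK_INTERVAL", "30")),
    }


# Usage (needs redis-py installed; not executed here):
#   import redis.asyncio as redis
#   client = redis.Redis.from_url(url, **redis_timeout_kwargs())
```

With a finite `socket_timeout`, a command sent over a black-holed connection should raise a timeout error instead of awaiting forever, and a later call can obtain a fresh connection from the pool.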

## Acceptance Criteria

- Redis commands against dead connections time out instead of blocking forever.
- Redis socket/connect timeout and health-check values are configurable through `REDIS_*` settings.
- A generation failure does not permanently kill a device worker or leave its queue registered without a worker to drain it.
- Pending frames survive a failed generation attempt and can be retried.
- Session-store failures are logged/counted but do not prevent segment generation when storage/encoding can proceed.
- Workers deregister themselves on exit so later frames can recreate them.
- Active-device gauge state remains balanced across worker restarts/shutdown.
- Tests cover generation continuing after session-store failure and worker state cleanup.
- Existing tests, ruff, and formatting pass.
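
The worker-related criteria above can be illustrated with a small asyncio sketch. Everything here is hypothetical: the `DeviceWorkers` class, its attribute names, and the bounded `ticks` loop are not taken from the codebase, and the integer counter only stands in for the active-device gauge.

```python
import asyncio
import logging

log = logging.getLogger("worker-sketch")


class DeviceWorkers:
    """Hypothetical registry of per-device workers (illustrative names only)."""

    def __init__(self, generate, refresh_session):
        self.generate = generate                # async (device_id, frames) -> None
        self.refresh_session = refresh_session  # async (device_id) -> None
        self.queues = {}                        # device_id -> asyncio.Queue
        self.active_devices = 0                 # stand-in for the gauge

    async def run(self, device_id, ticks):
        queue = self.queues.setdefault(device_id, asyncio.Queue())
        self.active_devices += 1
        pending = []
        try:
            for _ in range(ticks):  # a real worker would loop until shutdown
                while not queue.empty():
                    pending.append(queue.get_nowait())
                if not pending:
                    continue
                try:
                    await self.generate(device_id, list(pending))
                except Exception:
                    # Keep the pending frames so the next tick retries them.
                    log.exception("generation failed for %s", device_id)
                    continue
                pending.clear()
                try:
                    await self.refresh_session(device_id)
                except Exception:
                    # The segment was already produced; log and carry on.
                    log.warning("session refresh failed for %s", device_id)
        finally:
            # Runs on normal exit, on errors, and on cancellation.
            self.queues.pop(device_id, None)
            self.active_devices -= 1
```

Catching `Exception` rather than `BaseException` leaves `asyncio.CancelledError` free to propagate, so cancelling the task during shutdown still works, and the `finally` block keeps the registry and the counter balanced on every exit path.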

## Follow-Up

Consider adding a liveness probe or alert tied to consumption progress, such as `msgRateOut=0` with a growing backlog, so stalled consumers cannot remain undetected while the metrics port still answers.
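
One possible shape for such a check, as a sketch only: the `ConsumptionWatchdog` class, its threshold, and the idea of passing in a backlog count are assumptions, and how the backlog would be obtained (for example from Pulsar topic stats) is left open.

```python
import time


class ConsumptionWatchdog:
    """Hypothetical progress tracker that a liveness endpoint could consult."""

    def __init__(self, max_idle_seconds=120.0, clock=time.monotonic):
        self._clock = clock
        self._max_idle = max_idle_seconds
        self._last_progress = clock()

    def record_message(self):
        """Call whenever a message has been consumed and handled."""
        self._last_progress = self._clock()

    def is_stalled(self, backlog):
        """True when messages are waiting but none was handled recently."""
        idle = self._clock() - self._last_progress
        return backlog > 0 and idle > self._max_idle
```

A liveness handler could return a failing status while `is_stalled(...)` is true, so the pod gets restarted even though the metrics port still answers. An idle consumer with an empty backlog is deliberately not treated as stalled.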

