Add upstream health poller and per-model circuit breaker #6

Description

@xodapi

Why

Failures seen from Droid/OpenCode often originate upstream and are specific to one model, rather than coming from the local proxy. The proxy should classify repeated 429, 5xx, and timeout signals per model and react to them.

Scope

  • Background health poller with safe low-frequency checks.
  • Per-model state: available, limited, retry, error, untested.
  • Temporary circuit breaker with reset timers.
  • Dashboard and proxy-status should expose breaker state.
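
One possible shape for the per-model state and breaker is sketched below. It is only an illustration under stated assumptions: the state names come from the list above, but the mapping from signals to states (429 to limited, 5xx to error, timeout to retry), the class name `ModelBreaker`, the `failure_threshold` and `reset_after_s` parameters, and their defaults are all hypothetical and not taken from the proxy's code. The clock is injected so the reset timer can be driven without sleeping.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ModelState(Enum):
    AVAILABLE = "available"
    LIMITED = "limited"
    RETRY = "retry"
    ERROR = "error"
    UNTESTED = "untested"


def classify(status: Optional[int]) -> ModelState:
    """Map an upstream HTTP status to a state (None means timeout).

    The mapping is an assumption made for this sketch.
    """
    if status is None:
        return ModelState.RETRY
    if status == 429:
        return ModelState.LIMITED
    if 500 <= status <= 599:
        return ModelState.ERROR
    return ModelState.AVAILABLE


@dataclass
class ModelBreaker:
    failure_threshold: int = 3        # hypothetical default
    reset_after_s: float = 60.0       # hypothetical default
    clock: Callable[[], float] = time.monotonic
    state: ModelState = ModelState.UNTESTED
    failures: int = 0
    opened_at: Optional[float] = None

    def record(self, status: Optional[int]) -> None:
        """Record one upstream result and trip the breaker if needed."""
        self.state = classify(status)
        if self.state is ModelState.AVAILABLE:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold and self.opened_at is None:
            self.opened_at = self.clock()

    def is_open(self) -> bool:
        """True while the model should be kept out of rotation."""
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Reset timer elapsed: let traffic through again and mark the
            # model untested until the next real result arrives.
            self.opened_at = None
            self.failures = 0
            self.state = ModelState.UNTESTED
            return False
        return True
```

Keeping one `ModelBreaker` per model name (for example in a dict) gives the dashboard and proxy-status a single place to read `state` and `is_open()` from.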

Acceptance criteria

  • A model with repeated transient failures is temporarily removed from round-robin.
  • Reset/recovery is visible in /metrics.
  • Tests cover state transitions without real network calls.
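
The first and third criteria could be exercised roughly as follows. This is a sketch, not the proxy's actual selection logic: the `RoundRobin` class, the `is_open` predicate, the model names, and the `open_until` table are hypothetical stand-ins. A fake clock replaces real time, so no network calls or sleeps are needed.

```python
from typing import Callable, Iterable, List, Optional


class RoundRobin:
    """Rotate over models, skipping any whose breaker is currently open."""

    def __init__(self, models: Iterable[str], is_open: Callable[[str], bool]):
        self._models: List[str] = list(models)
        self._is_open = is_open
        self._next = 0

    def pick(self) -> Optional[str]:
        n = len(self._models)
        for offset in range(n):
            i = (self._next + offset) % n
            model = self._models[i]
            if not self._is_open(model):
                self._next = (i + 1) % n
                return model
        return None  # every model is tripped (or the list is empty)


# Fake clock: tests advance now[0] instead of sleeping.
now = [0.0]
# Hypothetical breaker table: model-b is tripped and resets at t=30.
open_until = {"model-b": 30.0}


def is_open(model: str) -> bool:
    return now[0] < open_until.get(model, 0.0)


rr = RoundRobin(["model-a", "model-b", "model-c"], is_open)
picks_while_open = [rr.pick() for _ in range(4)]   # model-b is skipped
now[0] = 30.0                                      # reset timer elapses
picks_after_reset = [rr.pick() for _ in range(3)]  # model-b is back
```

Passing the breaker check in as a predicate keeps the rotation logic independent of how breaker state is stored, which makes each piece testable in isolation.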

Metadata

Assignees

No one assigned

    Labels

    area: observability (Metrics, diagnostics, status, alerting)
    area: reliability (Runtime reliability, retries, failover, watchdogs)
    enhancement (New feature or request)
    priority: p1 (High priority)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
