Add PII-safe health and Prometheus metrics endpoints #30

Description

@Justinabox

Motivation

Callstack is intended to run unattended on Raspberry Pi + GSM/LTE modem deployments, but the HTTP server currently has SMS/USSD routes only. Operators have no low-friction way to answer basic production questions such as: is the process alive, is the modem connected, how many SMS sends failed, are delivery reports arriving, is signal quality degrading, and did auto-reconnect start flapping?

This decomposes the PII-safe metrics slice from #9 into one implementable PR: add a small health/readiness endpoint and a Prometheus-style metrics endpoint without exposing phone numbers, SMS bodies, SIM identifiers, API keys, or webhook URLs.

User journey

  1. An operator starts Callstack on a Pi with the HTTP server enabled.
  2. A local service manager or monitoring agent polls /healthz to decide whether the process is alive and whether the modem is ready.
  3. Prometheus or a lightweight scraper polls /metrics to collect counters/gauges for SMS, delivery reports, calls, signal quality, reconnects, and uptime.
  4. The operator can diagnose modem/network problems from aggregate metrics without leaking private SMS/call data into logs or monitoring labels.

API / UX sketch

Health endpoint:

GET /healthz

Example response:

{
  "status": "ready",
  "modem_connected": true,
  "uptime_seconds": 1234.5,
  "sms_store_ready": true
}

Suggested status semantics:

  • 200 {"status":"ready"} when the server is alive and the modem is connected/initialized.
  • 503 {"status":"degraded"} when the server is alive but modem state is disconnected/reconnecting/unknown.
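One way to keep these semantics testable without a running server is to compute the status code and body in a pure function and let the route handler wrap the result. The sketch below is illustrative only: the helper name `build_health`, its parameters, and the use of a monotonic start timestamp are assumptions, not existing Callstack code.

```python
import time
from typing import Optional, Tuple


def build_health(
    modem_connected: Optional[bool],
    started_at: float,
    sms_store_ready: bool = True,
    now: Optional[float] = None,
) -> Tuple[int, dict]:
    """Return (http_status, json_body) for /healthz (hypothetical helper).

    modem_connected may be None when the modem state is unknown; anything
    other than an explicit True is reported as degraded.
    """
    if now is None:
        now = time.monotonic()
    ready = modem_connected is True
    body = {
        "status": "ready" if ready else "degraded",
        "modem_connected": ready,
        "uptime_seconds": round(now - started_at, 1),
        "sms_store_ready": sms_store_ready,
    }
    return (200 if ready else 503), body
```

An aiohttp handler could then return `web.json_response(body, status=status)`, so the 200/503 decision stays in one place.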

Metrics endpoint:

GET /metrics

Example text format:

# HELP callstack_uptime_seconds Seconds since HTTP app startup.
# TYPE callstack_uptime_seconds gauge
callstack_uptime_seconds 1234.5
# HELP callstack_sms_received_total Total completed inbound SMS messages.
# TYPE callstack_sms_received_total counter
callstack_sms_received_total 42

Initial metric names can stay dependency-free and manually rendered; adding prometheus-client is not required for this first slice.

Technical approach

  1. Add a small in-process stats collector, either in server.py or a focused helper such as callstack/metrics.py.
  2. Initialize it in create_app(modem, ...) and store it in app["callstack_metrics"] or an equivalent explicit location.
  3. Subscribe to typed events on modem.bus and increment counters/gauges for:
    • inbound SMS completed events,
    • outbound SMS sent events,
    • SMS delivery report statuses,
    • call state transitions / active call gauge,
    • signal quality RSSI/BER last values,
    • modem disconnect/reconnect counts,
    • USSD response count,
    • HTTP request counts by route/status class if this stays small.
  4. Add /healthz and /metrics routes in create_app.
  5. Keep labels static and low-cardinality. Do not label by phone number, sender, recipient, SMS body, webhook URL, SIM identifier, API key, modem serial/IMEI, or raw error strings.
  6. Decide endpoint auth behavior consistently with #4 (Secure HTTP server startup instead of disabling auth by default). If #4 has not landed yet, make the PR explicitly fail closed for non-loopback binds or require the same API-key middleware for /metrics; do not add unauthenticated network observability by accident.
  7. Ensure tests use fake modem/event bus objects and pytest-aiohttp; no real serial ports, SIM, carrier network, or Prometheus server should be required.
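For step 6, the fail-closed rule could be expressed as a small predicate evaluated at startup. The sketch below is only one possible policy: `is_loopback_host` and `observability_allowed` are hypothetical names, and the real decision should follow whatever #4 settles on.

```python
import ipaddress
from typing import Optional, Sequence


def is_loopback_host(host: str) -> bool:
    """True for 'localhost' or a literal loopback IP address."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; treat unknown hostnames as non-loopback.
        return False


def observability_allowed(bind_host: str, api_keys: Optional[Sequence[str]]) -> bool:
    """Allow /healthz and /metrics only with API keys or a loopback bind."""
    return bool(api_keys) or is_loopback_host(bind_host)
```

With this rule, binding to `0.0.0.0` without API keys would refuse to expose the endpoints instead of serving them unauthenticated.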

Affected modules and tests

Likely files:

  • Modify: server.py — add routes and wire stats collector.
  • Create (optional): callstack/metrics.py — collector/rendering helpers if keeping server.py small.
  • Create: tests/test_metrics.py — route behavior, event-driven counters, PII-safety assertions.
  • Modify: tests/test_api_auth.py only if auth middleware behavior for /healthz and /metrics needs explicit coverage.
  • Docs follow-up: README HTTP operations section after implementation.

Existing context:

  • server.py already imports aiohttp, owns create_app(modem, api_keys=None), and registers SMS/USSD routes.
  • callstack.events.types already defines IncomingSMSEvent, SMSSentEvent, SMSDeliveryReportEvent, CallStateEvent, SignalQualityEvent, ModemDisconnectedEvent, ModemReconnectedEvent, and USSDResponseEvent.
  • Modem tracks _connected internally today; this PR can expose a safe read-only property if needed rather than reaching into private state from the server.
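A collector fed by these events might look like the sketch below. The event classes here are stand-ins with guessed fields, not the real types from callstack.events.types; the metric name `callstack_sms_delivery_reports_total` and the status bucket names are also assumptions. Wiring the handlers to modem.bus is left out because it depends on the bus's actual subscription API.

```python
from collections import Counter
from dataclasses import dataclass


# Stand-in events for illustration only; real field names may differ.
@dataclass
class FakeIncomingSMS:
    sender: str
    body: str


@dataclass
class FakeDeliveryReport:
    recipient: str
    status: str


# Fixed label values keep cardinality bounded (assumed status names).
KNOWN_DELIVERY_STATUSES = frozenset({"delivered", "failed", "pending"})


class StatsCollector:
    """Aggregate counters keyed by (metric name, static label tuple)."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()

    def on_incoming_sms(self, event: FakeIncomingSMS) -> None:
        # Count only; sender and body are never stored.
        self.counters[("callstack_sms_received_total", ())] += 1

    def on_delivery_report(self, event: FakeDeliveryReport) -> None:
        # Collapse unexpected carrier strings into one bucket so raw text
        # never becomes a label value.
        status = event.status if event.status in KNOWN_DELIVERY_STATUSES else "other"
        key = ("callstack_sms_delivery_reports_total", (("status", status),))
        self.counters[key] += 1
```

Because handlers read only the fields they need and map them onto a fixed vocabulary, phone numbers and message text cannot reach the rendered output.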

Hardware / modem caveats

  • Metrics must be useful even when no modem hardware is present in tests. Use event injection and fake modem state.
  • Signal quality values can be unknown or stale; expose clear unknown/absence semantics rather than pretending a modem recently reported.
  • Monitoring output must avoid PII and secrets by design. Metrics systems are often broadly readable on a LAN.
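The unknown/stale semantics for signal quality could be modeled with a last-value gauge that returns nothing when the reading is unknown or too old, so the renderer can omit the sample. This sketch assumes the event carries raw AT+CSQ values, where 99 means "not known or not detectable"; the class name `LastValueGauge` and the `max_age_seconds` parameter are hypothetical.

```python
import time
from typing import Optional

CSQ_UNKNOWN = 99  # AT+CSQ value for "not known or not detectable"


class LastValueGauge:
    """Hold the last reported value and hide it once unknown or stale."""

    def __init__(self, max_age_seconds: float) -> None:
        self.max_age_seconds = max_age_seconds
        self._value: Optional[int] = None
        self._updated_at: Optional[float] = None

    def set(self, raw: int, now: Optional[float] = None) -> None:
        # Store None for the sentinel so it is never exported as a number.
        self._value = None if raw == CSQ_UNKNOWN else raw
        self._updated_at = time.monotonic() if now is None else now

    def current(self, now: Optional[float] = None) -> Optional[int]:
        """Return the value, or None if never set, unknown, or stale."""
        if self._value is None or self._updated_at is None:
            return None
        now = time.monotonic() if now is None else now
        if now - self._updated_at > self.max_age_seconds:
            return None
        return self._value
```

When `current()` returns None, the metrics renderer can skip the sample line entirely rather than emit a misleading zero.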

Acceptance criteria

  • GET /healthz returns JSON with process liveness, modem readiness/degraded state, and uptime.
  • GET /metrics returns Prometheus-compatible text with # HELP/# TYPE lines and stable metric names.
  • Counters/gauges cover at least SMS received, SMS sent, delivery report statuses, active call state, last signal RSSI/BER, modem reconnect/disconnect counts, and uptime.
  • Metrics labels/values do not include phone numbers, SMS bodies, webhook URLs, SIM identifiers, API keys, modem IMEI/serial values, or unbounded raw error strings.
  • Tests prove metrics update after emitting representative typed events on the event bus.
  • Tests prove health returns degraded/non-200 when modem readiness is false/unknown.
  • Endpoint auth/loopback behavior is explicitly tested and consistent with the HTTP security policy from #4 (Secure HTTP server startup instead of disabling auth by default).
  • No real modem hardware is required for tests.
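The PII-safety criterion can be checked by injecting events that carry known sentinel values and asserting that none of them appear in the /metrics output. The helpers below are a rough sketch with hypothetical names; the label regex is deliberately simple and does not handle escaped quotes inside label values.

```python
import re
from typing import Iterable, List, Set

# Matches label pairs such as status="delivered" on sample lines.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def find_leaks(metrics_text: str, sentinels: Iterable[str]) -> List[str]:
    """Return every sentinel string that appears in the metrics output."""
    return [s for s in sentinels if s and s in metrics_text]


def label_values(metrics_text: str) -> Set[str]:
    """Collect label values from sample lines, skipping # comment lines."""
    values: Set[str] = set()
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue
        values.update(value for _, value in LABEL_RE.findall(line))
    return values
```

A test in tests/test_metrics.py could then assert `find_leaks(text, [fake_number, fake_body]) == []` and that `label_values(text)` is a subset of an explicit allow-list.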

Exact verification gates

git diff --check
PYTHONPATH=. uv run --no-project --with pytest --with pytest-asyncio --with pytest-aiohttp --with pyserial-asyncio --with aiosqlite pytest tests/test_metrics.py tests/test_api_auth.py -q
PYTHONPATH=. uv run --no-project --with pytest --with pytest-asyncio --with pytest-aiohttp --with pyserial-asyncio --with aiosqlite pytest tests/ -q

If adding any dependency, also run the packaging-oriented gate and record the exact result:

uv run --with pytest --with pytest-asyncio --with pytest-aiohttp --with pyserial-asyncio --with aiosqlite pytest tests/ -q

Non-goals

  • Building a dashboard UI.
  • Exporting per-message/per-phone-number labels.
  • Durable metric persistence across process restarts.
  • Alerting rules or Prometheus deployment configuration.
  • WebSocket realtime streaming; track that separately from this metrics slice.
