## Observability Checklist

- [ ] Logging:
  - [ ] Structured logging (JSON) for all services
  - [ ] Centralized logging (Grafana Loki or ELK)
  - [ ] Log rotation and retention policies
  - [ ] Structured log format (timestamp, level, service, trace_id, message)
  - [ ] Log aggregation dashboard (Grafana)
- [ ] Metrics:
  - [ ] Prometheus metrics endpoint for each service
  - [ ] Custom application metrics (request latency, error rates, queue depths)
  - [ ] Infrastructure metrics (CPU, memory, disk, network via node-exporter/cAdvisor)
  - [ ] Prometheus server + Alertmanager
  - [ ] Grafana dashboards for each service + system overview
- [ ] Tracing:
  - [ ] Distributed tracing (Jaeger/Zipkin/Tempo)
  - [ ] OpenTelemetry instrumentation for all services
  - [ ] Trace context propagation across services
  - [ ] Trace sampling configuration
- [ ] Alerting:
  - [ ] Alert rules for critical metrics (downtime, error rates, latency, resource exhaustion)
  - [ ] Alert routing (PagerDuty, Slack, Email, Telegram)
  - [ ] Runbook links in alerts
- [ ] Add observability stack to docker-compose.yml:
  - [ ] Loki + Promtail
  - [ ] Prometheus + Alertmanager
  - [ ] Grafana
  - [ ] Tempo/Jaeger for traces
  - [ ] Node-exporter + cAdvisor
- [ ] Add observability documentation to docs/
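
The structured log format item (timestamp, level, service, trace_id, message) could be sketched with Python's standard `logging` module as below. This is a minimal illustration, not the project's implementation: the `JsonFormatter` and `build_logger` names, the `checkout` service name, and the convention of passing `trace_id` through `extra=` are all assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # assumed: one service name per process

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # ISO 8601 timestamp in UTC, taken from the record's creation time
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # assumed convention: trace_id arrives via `extra=`; None if absent
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger that writes one JSON line per record to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream)  # defaults to stderr when None
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order accepted", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Writing one JSON object per line to stdout or stderr keeps the service independent of the log backend, so a collector such as Promtail can ship the lines to Loki without the application knowing about either.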