[Observability] Add logging, metrics, and tracing infrastructure #17

Description

@OneByJorah

Observability Checklist

  • Logging:
    • Structured logging (JSON) for all services
    • Centralized logging (Grafana Loki or ELK)
    • Log rotation and retention policies
    • Standard log fields in every entry (timestamp, level, service, trace_id, message)
    • Log aggregation dashboard (Grafana)
  • Metrics:
    • Prometheus metrics endpoint for each service
    • Custom application metrics (request latency, error rates, queue depths)
    • Infrastructure metrics (CPU, memory, disk, network via node-exporter and cAdvisor)
    • Prometheus server + Alertmanager
    • Grafana dashboards for each service + system overview
  • Tracing:
    • Distributed tracing (Jaeger/Zipkin/Tempo)
    • OpenTelemetry instrumentation for all services
    • Trace context propagation across services
    • Trace sampling configuration
  • Alerting:
    • Alert rules for critical metrics (downtime, error rates, latency, resource exhaustion)
    • Alert routing (PagerDuty, Slack, Email, Telegram)
    • Runbook links in alerts
  • Add observability stack to docker-compose.yml:
    • Loki + Promtail
    • Prometheus + Alertmanager
    • Grafana
    • Tempo/Jaeger for traces
    • Node-exporter + cAdvisor
  • Add observability documentation to docs/
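
To make the logging items concrete, here is a minimal sketch of a JSON log formatter that emits the fields listed in the checklist (timestamp, level, service, trace_id, message). It uses only the Python standard library. The names `JsonFormatter`, `build_logger`, `current_trace_id`, and the service name `orders-api` are hypothetical and chosen for illustration. How the trace ID gets set (for example by OpenTelemetry middleware) is left open, and the services in this repo may well use a different language or logging library.

```python
import json
import logging
from contextvars import ContextVar
from datetime import datetime, timezone

# Hypothetical holder for the current request's trace ID. In a real service
# this would be set by tracing middleware; here it is set by hand.
current_trace_id: ContextVar[str] = ContextVar("current_trace_id", default="")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "trace_id": current_trace_id.get(),
            "message": record.getMessage(),
        }
        # Attach the traceback only when the record carries exception info.
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger for `service` that writes JSON lines to `stream`.

    `stream=None` falls back to stderr, the StreamHandler default.
    """
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from re-emitting as plain text
    return logger


if __name__ == "__main__":
    log = build_logger("orders-api")
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    log.info("order %s accepted", 42)
```

Writing one JSON object per line to stdout or stderr keeps the service unaware of the log backend: a collector such as Promtail can tail the container output and ship it to Loki, and the `trace_id` field gives Grafana something to link logs to traces with.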
