ADR-0021: Centralize backend observability bootstrap and structured telemetry
- Status: Accepted
- Date: 2026-03-25
- Deciders: avm
- Supersedes:
- Superseded by:
Related ADRs
Context
The high-level ADR-0004 already establishes observability as a first-class platform concern, but the template still needs a concrete backend baseline that is hard to misuse. Without a centralized runtime pattern, projects degrade quickly:
- tracing, metrics, logging and error tracking get initialized in random places such as main.py, endpoints or workers;
- request correlation becomes inconsistent between logs, traces and error events;
- service metadata, OTLP endpoints and DSN values get duplicated or hardcoded outside the settings layer;
- teams fall back to print, stdlib loggers and ad-hoc middleware because there is no single platform path;
- custom exception handlers and framework integrations start fighting each other instead of composing cleanly.
The maximum template therefore needs one canonical observability bootstrap that stays settings-driven, platform-owned and machine-enforced.
Decision
Backend adopts a centralized observability baseline under src/backend/core/observability.
Platform bootstrap rules:
- setup_observability(app, settings) is the single orchestration entrypoint for runtime observability;
- logging, tracing, metrics, request context and error tracking are implemented as separate modules with narrow public setup_* functions;
- application bootstrap calls the orchestration entrypoint once and does not inline instrumentation logic in routes, repositories or main.py;
- service name, service version, deployment environment, OTLP endpoint, metrics path and GlitchTip DSN come strictly from the settings layer;
- incoming HTTP context carries both request_id and correlation_id; when the client omits correlation_id, the platform derives it from request_id for a deterministic default.
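The bootstrap and request-context rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the template's actual code: the ObservabilitySettings fields, the X-Request-ID and X-Correlation-ID header names, the resolve_request_context helper, the idempotency flag and the empty setup_* bodies are all hypothetical. Only the setup_observability(app, settings) entrypoint and the setup_* naming come from this ADR, and "derives it from request_id" is interpreted here as reusing the request id unchanged.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:
    """Hypothetical settings slice; field names are assumptions."""

    service_name: str
    service_version: str
    environment: str
    otlp_endpoint: Optional[str] = None
    metrics_path: str = "/metrics"
    glitchtip_dsn: Optional[str] = None


@dataclass(frozen=True)
class RequestContext:
    request_id: str
    correlation_id: str


def resolve_request_context(headers: Mapping[str, str]) -> RequestContext:
    """Build the per-request context from incoming headers.

    Header names are assumptions. A missing correlation id falls back to
    the request id, so the default is deterministic for a given request.
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    request_id = normalized.get("x-request-id") or uuid.uuid4().hex
    correlation_id = normalized.get("x-correlation-id") or request_id
    return RequestContext(request_id=request_id, correlation_id=correlation_id)


# Placeholder entrypoints; in the real layout each would live in its own
# module under src/backend/core/observability.
def setup_logging(settings: ObservabilitySettings) -> None: ...
def setup_request_context(app: Any) -> None: ...
def setup_tracing(app: Any, settings: ObservabilitySettings) -> None: ...
def setup_metrics(app: Any, settings: ObservabilitySettings) -> None: ...
def setup_error_tracking(settings: ObservabilitySettings) -> None: ...


def setup_observability(app: Any, settings: ObservabilitySettings) -> None:
    """Single orchestration entrypoint, called once from application bootstrap."""
    if getattr(app, "_observability_ready", False):
        return  # a second call is a no-op, which avoids double instrumentation
    setup_logging(settings)
    setup_request_context(app)
    setup_tracing(app, settings)
    setup_metrics(app, settings)
    setup_error_tracking(settings)
    app._observability_ready = True
```

Keeping the orchestration function free of instrumentation details means main.py only needs one call, and each setup_* module can be tested in isolation.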
Logging rules:
- structured JSON logging via structlog is the canonical logging baseline for backend runtime code;
- backend modules must obtain loggers through the platform wrapper, not through direct logging.getLogger or raw structlog setup;
- print/pprint are forbidden in backend runtime code;
- logs must include timestamp, log level, message, service, environment, request id, correlation id and active trace/span identifiers when available;
- observability processors must redact obvious sensitive fields such as authorization and cookie headers when they appear in log payloads.
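The redaction rule can be illustrated with a processor that follows the structlog processor calling convention (logger, method_name, event_dict). This is a sketch rather than the platform's actual processor: the function name, the deny-list contents beyond authorization and cookie, and the placeholder string are assumptions.

```python
from __future__ import annotations

from typing import Any, Dict, Mapping

# Assumed deny-list; the ADR names only authorization and cookie headers,
# set-cookie is added here as an illustration.
SENSITIVE_KEYS = frozenset({"authorization", "cookie", "set-cookie"})
REDACTED = "[REDACTED]"


def _redact(value: Any) -> Any:
    """Return a copy of value with sensitive mapping keys masked, recursively."""
    if isinstance(value, Mapping):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else _redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Tuples become lists, which matches how JSON output renders them.
        return [_redact(item) for item in value]
    return value


def redact_sensitive_fields(
    logger: Any, method_name: str, event_dict: Mapping[str, Any]
) -> Dict[str, Any]:
    """structlog-style processor that masks sensitive keys at any nesting depth."""
    return _redact(event_dict)
```

In a structlog pipeline such a processor would typically sit in the processors list passed to structlog.configure, before the final structlog.processors.JSONRenderer, so that redaction happens before serialization.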
Tracing and metrics rules:
- OpenTelemetry uses a TracerProvider with Resource attributes for service.name, service.version and deployment.environment;
- trace export uses OTLP with BatchSpanProcessor and a settings-driven endpoint;
- FastAPI, SQLAlchemy and Redis instrumentation are enabled only through the centralized tracing module and must avoid double instrumentation;
- Prometheus metrics are exposed through a settings-driven endpoint and coexist with existing health endpoints without replacing them.
Error tracking rules:
- GlitchTip-compatible error tracking uses sentry-sdk and remains optional by configuration;
- custom exception handlers keep the typed HTTP error contract from ADR-0020 and enrich captured events with request id, correlation id and trace correlation data;
- error tracking must not hardcode DSN, environment or release values and must stay safe when disabled.
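The "optional by configuration" and "safe when disabled" requirements might look like the sketch below. sentry_sdk.init with dsn, environment and release is the documented sentry-sdk entrypoint; the function signatures, the tag names and the correlation_tags helper are hypothetical.

```python
from __future__ import annotations

from typing import Dict, Optional


def setup_error_tracking(dsn: Optional[str], environment: str, release: str) -> bool:
    """Initialize sentry-sdk only when a DSN is configured.

    All three values are expected to come from the settings layer.
    Returns False, with no import and no side effects, when disabled.
    """
    if not dsn:
        return False
    import sentry_sdk  # lazy import keeps the disabled path dependency-free

    sentry_sdk.init(dsn=dsn, environment=environment, release=release)
    return True


def correlation_tags(
    request_id: str, correlation_id: str, trace_id: Optional[str] = None
) -> Dict[str, str]:
    """Tags an exception handler could attach to a captured event."""
    tags = {"request_id": request_id, "correlation_id": correlation_id}
    if trace_id:
        tags["trace_id"] = trace_id
    return tags
```

A custom exception handler could apply these tags with sentry_sdk.set_tag for each key and value before calling sentry_sdk.capture_exception, then return the typed HTTP error response unchanged.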
Consequences
Positive
- tracing, logs, metrics and handled error events now share the same request-level correlation path;
- application and platform code get one obvious observability integration path instead of multiple competing patterns;
- settings remain the single source of truth for external endpoints and release metadata;
- local hooks and CI can statically forbid regressions back to print and ad-hoc loggers.
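One way such a static check could work is an AST scan run from a local hook or CI job. The sketch below uses only the standard library; the rule table, messages and function name are assumptions, and a real setup might rely on an existing linter rule instead.

```python
from __future__ import annotations

import ast
from typing import List, Optional, Tuple

# Assumed rule table: call name -> message. The observability package itself
# would need to be excluded from the scan, since it owns the logger setup.
FORBIDDEN_CALLS = {
    "print": "print() is forbidden in backend runtime code",
    "pprint": "pprint() is forbidden in backend runtime code",
    "pprint.pprint": "pprint() is forbidden in backend runtime code",
    "logging.getLogger": "obtain loggers through the platform wrapper",
}


def _call_name(func: ast.expr) -> Optional[str]:
    """Return 'name' or 'module.attr' for simple call targets, else None."""
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return None


def find_violations(source: str, filename: str = "<string>") -> List[Tuple[int, str]]:
    """Return (line, message) pairs for forbidden calls, sorted by line."""
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Call):
            name = _call_name(node.func)
            if name in FORBIDDEN_CALLS:
                violations.append((node.lineno, FORBIDDEN_CALLS[name]))
    return sorted(violations)
```

A hook would call find_violations on each changed backend file and fail when the returned list is non-empty.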
Negative
- backend runtime gains a non-trivial platform module that must be maintained together with dependency versions;
- more instrumentation means more care is required around sampling, redaction and noise budgets.
Neutral
- individual exporters and processors can evolve over time, provided the centralized bootstrap and settings-driven ownership model remain intact.
Alternatives considered
- initialize logging, tracing, metrics and sentry directly inside main.py;
- let each bounded context own its own middleware and logger bootstrap;
- use plain stdlib logging without JSON structure or trace correlation;
- keep error tracking fully implicit and rely only on framework defaults despite custom exception handlers.
Follow-up work
- [ ] add worker/task context propagation rules so background jobs can carry request and trace correlation explicitly
- [ ] extend platform docs with log field schema and redaction expectations
- [ ] add contract-level tests for exported telemetry shape once external collectors are wired in integration environments