ADR-0021: Centralize backend observability bootstrap and structured telemetry

Status: Accepted
Date: 2026-03-25
Deciders: avm
Supersedes:
Superseded by:

Context

High-level ADR-0004 already fixes observability as a first-class platform concern, but the template still needs a concrete backend baseline that is hard to misuse. Without a centralized runtime pattern, projects degrade quickly:

tracing, metrics, logging and error tracking get initialized in random places such as main.py, endpoints or workers;
request correlation becomes inconsistent between logs, traces and error events;
service metadata, OTLP endpoints and DSN values get duplicated or hardcoded outside the settings layer;
teams fall back to print, stdlib loggers and ad-hoc middleware because there is no single platform path;
custom exception handlers and framework integrations start fighting each other instead of composing cleanly.

The maximum template therefore needs one canonical observability bootstrap that stays settings-driven, platform-owned and machine-enforced.

Decision

Backend adopts a centralized observability baseline under src/backend/core/observability.

Platform bootstrap rules:

setup_observability(app, settings) is the single orchestration entrypoint for runtime observability;
logging, tracing, metrics, request context and error tracking are implemented as separate modules with narrow public setup_* functions;
application bootstrap calls the orchestration entrypoint once and does not inline instrumentation logic in routes, repositories or main.py;
service name, service version, deployment environment, OTLP endpoint, metrics path and GlitchTip DSN come strictly from the settings layer.
incoming HTTP context carries both request_id and correlation_id; when the client omits correlation_id, the platform derives it from request_id for a deterministic default.

Logging rules:

structured JSON logging via structlog is the canonical logging baseline for backend runtime code;
backend modules must obtain loggers through the platform wrapper, not through direct logging.getLogger or raw structlog setup;
print/pprint are forbidden in backend runtime code;
logs must include timestamp, log level, message, service, environment, request id, correlation id and active trace/span identifiers when available;
observability processors must redact obvious sensitive fields such as authorization and cookie headers when they appear in log payloads.

Tracing and metrics rules:

OpenTelemetry uses a TracerProvider with Resource attributes for service.name, service.version and deployment.environment;
trace export uses OTLP with BatchSpanProcessor and a settings-driven endpoint;
FastAPI, SQLAlchemy and Redis instrumentation are enabled only through the centralized tracing module and must avoid double instrumentation;
Prometheus metrics are exposed through a settings-driven endpoint and coexist with existing health endpoints without replacing them.

Error tracking rules:

GlitchTip-compatible error tracking uses sentry-sdk and remains optional by configuration;
custom exception handlers keep the typed HTTP error contract from ADR-0020 and enrich captured events with request id, correlation id and trace correlation data;
error tracking must not hardcode DSN, environment or release values and must stay safe when disabled.

Consequences

Positive

tracing, logs, metrics and handled error events now share the same request-level correlation path;
application and platform code get one obvious observability integration path instead of multiple competing patterns;
settings remain the single source of truth for external endpoints and release metadata;
local hooks and CI can statically forbid regressions back to print and ad-hoc loggers.

Negative

backend runtime gains a non-trivial platform module that must be maintained together with dependency versions;
more instrumentation means more care is required around sampling, redaction and noise budgets.

Neutral

individual exporters and processors can evolve over time, provided the centralized bootstrap and settings-driven ownership model remain intact.

Alternatives considered

initialize logging, tracing, metrics and sentry directly inside main.py;
let each bounded context own its own middleware and logger bootstrap;
use plain stdlib logging without JSON structure or trace correlation;
keep error tracking fully implicit and rely only on framework defaults despite custom exception handlers.

Follow-up work

[ ] add worker/task context propagation rules so background jobs can carry request and trace correlation explicitly
[ ] extend platform docs with log field schema and redaction expectations
[ ] add contract-level tests for exported telemetry shape once external collectors are wired in integration environments

ADR-0021: Centralize backend observability bootstrap and structured telemetry ​

Related ADRs ​

Context ​

Decision ​

Consequences ​

Positive ​

Negative ​

Neutral ​

Alternatives considered ​

Follow-up work ​