Service-Layer Performance Budgets¶

Scope¶

This document defines performance budgets for the heavy service-layer sync and job paths that already have explicit runtime ownership, locks or tracked-operation boundaries.

The budgets are intentionally practical:

target is the normal operating envelope;
alert means the path is degrading and should page/emit an operational signal;
hard is the maximum tolerated runtime before the path must be split, chunked or rescheduled.

For TaskIQ jobs, the hard budget must never exceed the effective Redis lock timeout for the same path.

Global Rules¶

A service/job path that is forecast to exceed its hard budget must be chunked or delegated to a longer-running workflow, not left inline.
Alert budget breaches should produce telemetry before lock timeouts are exhausted.
Long-running sweeps must preserve restartability and partial-progress safety.
Budget reviews happen when a path changes batching strategy, upstream provider mix or fan-out behavior.

Budget Matrix¶

Domain / path	Target	Alert	Hard	Operational note
`market_data` coin history sync (`run_coin_history_job`, keyed coin sync)	`5m`	`15m`	`30m`	Hard budget aligns with `COIN_HISTORY_LOCK_TIMEOUT_SECONDS`; if a single coin needs longer, split backfill windows.
`market_data` latest history refresh sweep	`5m`	`10m`	`15m`	Must stay inside `HISTORY_REFRESH_LOCK_TIMEOUT_SECONDS`; provider stalls should fall back to later refresh rather than hold the global sweep lock.
`market_data` history backfill sweep	`15m`	`45m`	`60m`	Uses the longest global lock; wide backfills should chunk by symbol groups and resume safely.
`market_structure` single-source poll	`30s`	`90s`	`120s`	Must fit the per-source poll lock.
`market_structure` enabled-source poll sweep	`2m`	`4m`	`5m`	Sweep must not monopolize the global enabled-poll lock.
`market_structure` source health refresh	`30s`	`2m`	`3m`	Health refresh is lightweight; sustained overruns indicate source-count growth or hidden IO drift.
`news` single-source poll	`30s`	`90s`	`120s`	Bound by per-source poll lock and upstream provider responsiveness.
`news` enabled-source poll sweep	`2m`	`4m`	`5m`	Cursor-driven sweeps should remain incremental; oversized source sets must shard rather than extend the lock.
`portfolio` balance sync	`1m`	`3m`	`4m`	Hard budget equals `PORTFOLIO_SYNC_LOCK_TIMEOUT_SECONDS`; exchange fan-out must stay bounded.
`predictions` pending-evaluation sweep	`1m`	`4m`	`5m`	If evaluation volume grows, batch windows must split rather than hold the singleton evaluation lock.
`anomalies` enrichment / sector / market-structure scans	`1m`	`4m`	`5m`	Keyed scans are safe to rerun; long anomaly batches should partition by trigger tuple.
`hypothesis_engine` evaluation sweep	`1m`	`4m`	`5m`	Operation-tracked evaluation should surface long-running status before the singleton lock is exhausted.
`patterns` bootstrap / statistics / market-structure refresh	`30m`	`90m`	`120m`	These are intentionally heavyweight analytics jobs; over-budget runs must checkpoint and resume instead of expanding lock time.
`patterns` discovery / strategy discovery	`60m`	`180m`	`240m`	Discovery is the longest-running path; hard budget aligns with the existing 4h locks and should not grow further without repartitioning.

Review Checklist¶

Does the path have a target/alert/hard budget recorded here?
Is the hard budget still below or equal to the real runtime lock/operation boundary?
If the path exceeded alert budget, was the cause batching, provider latency or hidden cross-domain fan-out?
If the path approaches hard budget, can it be chunked or resumed without violating ADR 0010 and ADR 0014?