Accuracy

How accurate is Outlay — honestly.

The first question every eng and finance leader asks. Here's the straight answer: allocating what you've already spent is exact; forecasting future work is a measured range — and we backtest it on your own data instead of quoting a number.

"How accurate?" is two questions

Conflating them is how vendors lose credibility. We keep them separate:

  1. Allocating spend that already happened: where did last quarter's AI cost go? This is arithmetic, not prediction.
  2. Forecasting spend that hasn't happened yet: what will this quarter, or this backlog, cost? This is genuinely a range.

Allocating past spend: essentially exact

Token counts come straight from your provider's admin/usage API, and we cost them with a cache-aware model: cache reads bill at ~0.1× and cache writes at ~1.25× of the base input price. Naive trackers that ignore these rates get the total wrong by 5–10× on agentic workloads. So the dollars are computed, not estimated. The only open question on the backward-looking side is coverage, meaning the fraction of spend we could tie to a real ticket. We report that number plainly, including what we couldn't attribute (reconciled to your invoice, never dropped).
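To make the arithmetic concrete, here is a minimal sketch of cache-aware costing using the multipliers quoted above. The Usage fields, the function names, the token counts, and the prices are all illustrative assumptions, not Outlay's actual schema or any provider's rate card.

```python
from dataclasses import dataclass

# Multipliers quoted in the text, relative to the base input price.
CACHE_READ_MULTIPLIER = 0.1
CACHE_WRITE_MULTIPLIER = 1.25


@dataclass(frozen=True)
class Usage:
    """Token counts for one request (field names are hypothetical)."""
    uncached_input_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    output_tokens: int


def cache_aware_cost(u: Usage, input_price: float, output_price: float) -> float:
    """Dollar cost with cache reads and writes billed at their own rates.

    Prices are in dollars per million tokens.
    """
    input_equivalent = (
        u.uncached_input_tokens
        + CACHE_READ_MULTIPLIER * u.cache_read_tokens
        + CACHE_WRITE_MULTIPLIER * u.cache_write_tokens
    )
    return (input_equivalent * input_price + u.output_tokens * output_price) / 1_000_000


def naive_cost(u: Usage, input_price: float, output_price: float) -> float:
    """One possible naive tracker: every input token billed at the base rate."""
    all_input = u.uncached_input_tokens + u.cache_read_tokens + u.cache_write_tokens
    return (all_input * input_price + u.output_tokens * output_price) / 1_000_000


# A made-up agentic turn where most of the context is re-read from cache.
turn = Usage(
    uncached_input_tokens=2_000,
    cache_read_tokens=180_000,
    cache_write_tokens=4_000,
    output_tokens=1_000,
)

# Placeholder prices chosen for the example only.
exact = cache_aware_cost(turn, input_price=3.0, output_price=15.0)
naive = naive_cost(turn, input_price=3.0, output_price=15.0)
print(f"cache-aware ${exact:.4f}  naive ${naive:.4f}  ratio {naive / exact:.1f}x")
```

With these made-up numbers the naive figure comes out roughly six times the cache-aware one. The gap grows with the share of tokens served from cache, which is why long agentic loops are where it shows up most.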

Forecasting future work: a measured range

Per-item AI cost is heavy-tailed — the same "bugfix" might be one cheap call or a 40-turn agentic loop. So we never hand you a single number for a single ticket; we give a p10–p90 band. But here's the key: the aggregate is far more predictable than any one item. Across a quarter of work, the per-item over- and under-shoots partially cancel (that's why our roadmap total is variance-pooled, not a naive sum of worst cases). Realistically:

  • A single ticket: wide. Treat the band, not the midpoint, as the answer.
  • A quarter, a team, a portfolio: tighter. Useful for budgeting, in the spirit of forecasting cloud spend or sprint velocity. Directionally reliable, not penny-precise.
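The pooling effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not Outlay's model: per-ticket cost is drawn from a lognormal distribution (chosen only because it is heavy-tailed), tickets are treated as independent, and the parameters and ticket counts are invented. Real tickets are correlated to some degree, so the cancellation in practice is partial.

```python
import random
import statistics


def p10_p50_p90(samples):
    """Return the 10th, 50th and 90th percentiles of a sample."""
    deciles = statistics.quantiles(samples, n=10)  # nine cut points
    return deciles[0], deciles[4], deciles[8]


def simulate_ticket_cost(rng):
    """One heavy-tailed per-ticket cost draw (lognormal is an assumption)."""
    return rng.lognormvariate(1.0, 1.2)


def relative_band_width(samples):
    """(p90 - p10) / p50: band width measured against the median."""
    p10, p50, p90 = p10_p50_p90(samples)
    return (p90 - p10) / p50


rng = random.Random(42)  # fixed seed so the sketch is repeatable
TICKETS_PER_QUARTER = 200
TRIALS = 2_000

# Distribution of a single ticket's cost.
single = [simulate_ticket_cost(rng) for _ in range(TRIALS)]

# Distribution of a whole quarter's total cost.
quarters = [
    sum(simulate_ticket_cost(rng) for _ in range(TICKETS_PER_QUARTER))
    for _ in range(TRIALS)
]

# Naive roadmap total: assume every ticket lands at its own p90.
naive_worst_case = TICKETS_PER_QUARTER * p10_p50_p90(single)[2]
# Pooled roadmap total: p90 of the simulated quarter totals.
pooled_p90 = p10_p50_p90(quarters)[2]

print(f"single ticket band width: {relative_band_width(single):.2f}x median")
print(f"quarter total band width: {relative_band_width(quarters):.2f}x median")
print(f"sum of per-ticket p90s: {naive_worst_case:.0f}  pooled p90: {pooled_p90:.0f}")
```

Under these assumptions the quarter's band is much narrower relative to its median than a single ticket's, and the pooled p90 sits well below the sum of per-ticket p90s. For independent items the variances add, so the spread of the total grows more slowly than the total itself.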

We measure it on your data — we don't assert it

This is the part that matters. Outlay backtests its own forecast on your closed tickets (leave-one-out cross-validation) and reports the median error before you trust it. If the forecast isn't good enough on your data, you'll see that in week one — we won't hide it. Most tools quote a savings or accuracy figure from someone else's deployment; we show you the number from yours.

The estimate also states its own coverage: work types with too little history are counted and reported as ungrounded, not guessed. An honest "we can't ground this yet" beats a confident wrong number.
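A leave-one-out backtest of this kind can be sketched in a few lines. The forecaster here is a deliberately simple stand-in (the median cost of the other closed tickets of the same work type), and the MIN_HISTORY threshold, the function name, and the sample data are hypothetical. It shows the shape of the procedure, not Outlay's actual model.

```python
import statistics
from collections import defaultdict

MIN_HISTORY = 5  # hypothetical cut-off below which a work type is not forecast


def leave_one_out_backtest(closed_tickets):
    """Backtest a per-work-type median forecast on closed tickets.

    closed_tickets: list of (work_type, actual_cost) pairs, costs > 0.
    Returns (median absolute percentage error, tickets scored,
    tickets left ungrounded for lack of history).
    """
    by_type = defaultdict(list)
    for work_type, cost in closed_tickets:
        by_type[work_type].append(cost)

    errors = []
    ungrounded = 0
    for costs in by_type.values():
        if len(costs) < MIN_HISTORY:
            ungrounded += len(costs)  # counted, not guessed
            continue
        for i, actual in enumerate(costs):
            held_in = costs[:i] + costs[i + 1:]  # every ticket except the i-th
            forecast = statistics.median(held_in)
            errors.append(abs(forecast - actual) / actual)

    median_error = statistics.median(errors) if errors else None
    return median_error, len(errors), ungrounded


# Made-up history: five bugfixes (one of them an expensive outlier)
# and two migrations, which is too few to forecast.
history = [
    ("bugfix", 1.0), ("bugfix", 1.2), ("bugfix", 0.8), ("bugfix", 1.1), ("bugfix", 6.0),
    ("migration", 40.0), ("migration", 55.0),
]
median_error, scored, ungrounded = leave_one_out_backtest(history)
print(f"median error {median_error:.0%} on {scored} tickets, {ungrounded} ungrounded")
```

Each ticket is forecast only from the tickets that were not it, so the reported error reflects how the forecaster would have done on work it had not seen. The two migration tickets are tallied as ungrounded instead of being scored against a guess.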

Estimating work that isn't built yet

To budget against a roadmap, you need to cost work that has no history of its own. The estimate is only as good as what you feed it, so Outlay reads three things:

  • Your realized history: teaches the cost model (per-work-type distributions + cost-per-point).
  • The business requirements: acceptance criteria, integrations, and scope, read from the ticket.
  • Any design docs: where the real complexity lives (a new service, a migration, multi-tenant work).

From those, each planned item is classified, sized, and placed within your team's own historical cost range for that work type. Each estimate carries a confidence that rises with the input: high with story points plus a fitted size model; medium when sized from requirements and design docs; low or declined on a bare title or a work type with no history. Outlay also tells you what to add to tighten the estimate. So a sprint or epic becomes a compute budget with an honest confidence interval, and the more scope you give it, the tighter that interval gets.
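The confidence tiers described above could be expressed as a simple rule table. The sketch below is one possible reading, not Outlay's implementation: the PlannedItem fields and both function names are hypothetical, "declined" is mapped to a work type with no history and "low" to a bare title, and either requirements or a design doc is treated as enough for "medium".

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    DECLINED = "declined"


@dataclass(frozen=True)
class PlannedItem:
    """Inputs available for one planned item (field names are hypothetical)."""
    title: str
    closed_tickets_of_type: int          # history for this work type
    story_points: Optional[int] = None
    has_fitted_size_model: bool = False
    has_requirements: bool = False
    has_design_doc: bool = False


def confidence_for(item: PlannedItem) -> Confidence:
    """Map the available inputs to a confidence tier."""
    if item.closed_tickets_of_type == 0:
        return Confidence.DECLINED       # nothing to ground the estimate in
    if item.story_points is not None and item.has_fitted_size_model:
        return Confidence.HIGH
    if item.has_requirements or item.has_design_doc:
        return Confidence.MEDIUM
    return Confidence.LOW                # bare title only


def suggestions(item: PlannedItem) -> List[str]:
    """List the inputs that would tighten the estimate if added."""
    tips = []
    if item.closed_tickets_of_type == 0:
        tips.append("build history for this work type")
    if item.story_points is None:
        tips.append("add story points")
    if not item.has_requirements:
        tips.append("add acceptance criteria and scope")
    if not item.has_design_doc:
        tips.append("link a design doc")
    return tips


bare = PlannedItem(title="Fix login redirect", closed_tickets_of_type=12)
print(confidence_for(bare).value, suggestions(bare))
```

Keeping the tier rules separate from the suggestions means the same item can report both where it stands and what would move it up a tier.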

What makes it more — and less — accurate

More accurate: more history per work type · stable workflows and model mix · estimating at the quarter/team level rather than per ticket · good attribution coverage · story points on your backlog (size conditioning measurably tightens the estimate).

Less accurate, and the biggest factor is change: AI usage per developer has grown roughly 18× in well under a year, and the band on a forecast trained on history widens when the regime shifts (a new agent adopted, autonomy turned up, a vendor price change). We're upfront that a fast-moving team's forecast carries a wider band, and the backtest will show it.

The honest bottom line

We won't promise a crystal ball. What we promise is the first real forward visibility you have: a quarter budget with a measured error bar, and a flag on the epic about to blow its estimate weeks before the invoice. Even a ±20% forecast that warns you early is far better than the blank page you have today. And unlike tools that quote a figure from someone else's deployment, we tell you exactly how good it is on your numbers.

See the accuracy on your own numbers.

In a two-week pilot we backtest the forecast on your real history and show you the error before you trust a dollar of it.