Cost per accepted change

A survey

How teams measure AI cost

Every team trying to measure AI's effect on its delivery pipeline ends up choosing among a handful of approaches, most of them partial. This page walks through the landscape: what each option captures, what it misses, and where cost per accepted change (CPAC) fits.

The eight approaches in current use

1. Token / API cost

The most direct cost signal: how much you spend on the LLM provider per period. Promoted by the FinOps Foundation as the entry-level AI cost metric, and surfaced by every major model vendor.

What it catches: raw model spend; runaway agents; per-feature unit economics.
What it misses: the labor cost of producing and reviewing the work the model generates; the cost of rework when it ships defects. Token cost is a real input to delivery cost, but never the whole picture.
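
As a minimal sketch of what this metric actually sums, assuming usage is billed per million input and output tokens. The UsageRecord shape, the model name "model-a", and the prices are hypothetical and do not reflect any vendor's real price list:

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Token counts for one model over one billing period (hypothetical shape)."""
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical USD prices per million tokens; real prices vary by vendor, model, and date.
PRICES_PER_MILLION: dict[str, dict[str, float]] = {
    "model-a": {"input": 3.00, "output": 15.00},
}


def token_cost(records: list[UsageRecord], prices: dict[str, dict[str, float]]) -> float:
    """Raw model spend: each record's tokens multiplied by that model's per-token price."""
    total = 0.0
    for record in records:
        price = prices[record.model]
        total += record.input_tokens / 1_000_000 * price["input"]
        total += record.output_tokens / 1_000_000 * price["output"]
    return total
```

For 2,000,000 input and 500,000 output tokens at these hypothetical prices, token_cost returns 13.5. Nothing in the sum touches engineering or review time, which is exactly the gap described above.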

2. Volume metrics: lines of code, PRs merged, commits

The default in many engineering analytics products. "AI code share" — the percentage of merged code attributed to AI suggestions — is a 2024–2026 variant.

What it catches: activity. Easy to compute, easy to chart.
What it misses: everything that matters. PR count rises with AI even when delivered value does not; "AI code share" can rise while quality and stability fall. These are vanity metrics. They are also the metrics most commonly cited in vendor marketing.

3. Acceptance rate

The fraction of AI suggestions that a developer accepts. Tracked by Copilot, Cursor, and most other coding assistants. Often reported alongside "characters inserted from AI."

What it catches: short-term developer agreement with the model. A useful product-engineering signal for vendors.
What it misses: whether the accepted suggestion survived review, whether it shipped, whether it caused a defect three days later. Acceptance is a leading indicator; it is not an outcome.

4. DORA Four Keys

Deployment frequency, lead time for changes, change failure rate, and time to restore service. Defined by DORA (DevOps Research and Assessment); popularized by Accelerate (Forsgren, Humble, Kim, 2018) and the annual State of DevOps reports.

What it catches: how a delivery system behaves. Industry-standard, well-defended, mature instrumentation.
What it misses: what that behavior costs. A team can have excellent DORA numbers and a terrible cost picture if velocity is bought with disproportionate review and rework — exactly the AI-augmented failure mode. DORA and CPAC are complementary.

5. SPACE framework

Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow. A multi-dimensional developer-productivity framework from Microsoft Research and GitHub (Forsgren, Storey, Maddila, Zimmermann, Houck, Butler, 2021).

What it catches: a balanced view of productivity that resists single-metric gaming.
What it misses: a unit-cost number a CFO can act on. SPACE is the right framing for developer experience research; it is not a finance-facing metric.

6. DevEx scores

Developer Experience surveys — pulse scores on flow, feedback loops, and cognitive load. Promoted by DX, Faros, and others; aligned with the SPACE tradition.

What it catches: friction and toil. Strong leading indicator of attrition and burnout. Useful for diagnosing where AI tooling is helping or hurting team experience.
What it misses: dollars. DevEx data tells you where to invest; it does not tell you what your AI program costs per unit of delivered value.

7. Self-reported productivity surveys

"How much faster are you with AI?" — surveyed by BCG, McKinsey, GitHub Octoverse, Stack Overflow's annual developer survey, and others. Headline numbers in this category are commonly in the 20–55% range.

What it catches: sentiment, perception, what teams will say about AI.
What it misses: reality. In a randomized controlled trial of 16 experienced open-source developers across 246 real tasks, METR found that allowing AI tools increased completion time by 19%, while the same developers self-reported a 20% speedup. The perception-reality gap was large enough to invert the conclusion. Self-report is not a measurement; it is a hypothesis. (METR has since noted that experienced developers are increasingly unwilling to work without AI, which biases any new measurement of the gap.)

8. FinOps cost-to-serve

The unit cost of running software — cost per request, per active user, per transaction. The mature, board-defensible discipline of cloud cost management. See the FinOps Foundation.

What it catches: operational unit economics, well-instrumented and well-understood.
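What it misses: the cost of producing the software. FinOps measures the right side of the deployment boundary, the cost of running what has already shipped; cost per accepted change measures the left side, the cost of getting changes shipped in the first place.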

Where cost per accepted change fits

Cost per accepted change does not replace any of the metrics above. It sits one layer above them.

The right operating posture is to report cost per accepted change as the headline number and pair it with two or three of these leading indicators for diagnosis. Without CPAC, the indicators do not roll up to anything a CFO can act on. Without the indicators, CPAC moves without explaining why.

Defending the choices behind the formula

Why "accepted" rather than "merged"

A merged pull request is not a unit of value if it gets reverted next week. "Merged" describes the moment the diff landed; "accepted" requires that the diff stayed in production through the measurement window. The denominator should reflect the work that stuck, not the work that was attempted.
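
A minimal sketch of the distinction, assuming each merge record carries a merge timestamp and an optional revert timestamp, and assuming a change cannot be classified until its window has fully elapsed. The function name is_accepted and the 14-day window used below are hypothetical; window selection is covered in the FAQ.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_accepted(
    merged_at: datetime,
    reverted_at: Optional[datetime],
    window: timedelta,
    now: datetime,
) -> Optional[bool]:
    """Classify one merged change against the measurement window.

    True:  the change stayed in production for the whole window (accepted).
    False: the change was reverted inside the window (merged, not accepted).
    None:  the window has not elapsed yet, so the change cannot be counted.
    """
    window_end = merged_at + window
    if reverted_at is not None and reverted_at < window_end:
        return False
    if now < window_end:
        return None
    return True
```

With window=timedelta(days=14), a change merged on 1 January and reverted on 5 January returns False; the same change with no revert returns None on 10 January and True from 15 January onward. Returning None rather than True keeps in-flight changes out of the denominator until they have had the chance to fail.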

Why "and stayed there"

Without this clause, CPAC could be gamed by shipping recklessly and counting every merge. With it, the metric self-corrects: silent escapes that get quietly fixed are excluded from the denominator and their fix cost is added to the numerator. This is the rework defense, built into the metric instead of bolted on.
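
The rework defense can be sketched as follows, assuming each merged change has already been classified as escaped (reverted or quietly fixed inside the window) or not, and assuming delivery_cost is the period's total across all cost components but excludes the fix-up work, which is added per change here to avoid double counting. The Change record and cpac function are hypothetical names, not a published implementation.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """One merged change, already classified against the window (hypothetical shape)."""
    units: int              # size-normalized units for this change
    escaped: bool           # True if reverted or quietly fixed inside the window
    fix_cost: float = 0.0   # cost of the follow-up fix for an escaped change


def cpac(delivery_cost: float, changes: list[Change]) -> float:
    """Cost per accepted change with the "stayed there" clause applied.

    Assumes delivery_cost covers all cost components for the period but not
    the fix-up work, which is added here per escaped change.
    """
    # Escapes contribute their fix cost to the numerator...
    numerator = delivery_cost + sum(c.fix_cost for c in changes if c.escaped)
    # ...and nothing to the denominator.
    denominator = sum(c.units for c in changes if not c.escaped)
    if denominator == 0:
        raise ValueError("no accepted changes in this period")
    return numerator / denominator
```

With a hypothetical delivery cost of 10,000, two changes that stayed (1 and 2 units) and one 1-unit escape with a 500 fix-up cost, cpac returns 10,500 / 3 = 3,500. Counting every merge instead would give 10,000 / 4 = 2,500, the flattering number the clause is meant to prevent.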

Why all five cost components, not just model cost

Model cost is a small fraction of total delivery cost in most organizations. Engineering and review time dominate the numerator. A "cost per change" metric that only counts model spend would understate the true cost by an order of magnitude — and would let teams optimize for the wrong thing.

Why "per change" and not "per developer-hour"

Developer-hours are an input. Accepted changes are an output. Cost-per-output is the right shape for a metric that asks "is this investment paying off?" Cost-per-input answers a different, more inward-facing question and is more easily gamed by changing how time is counted.

Why a 500-line normalization on "change"

Without size normalization, the denominator could be gamed in either direction. A team shipping one massive 5,000-line merge as a single "accepted change" would add the same one unit to its denominator as a team shipping a 100-line bug fix, even though the merge carries fifty times as many lines of accepted work. Conversely, a team could inflate its denominator by splitting trivial changes into ever-smaller PRs.

The 500-line threshold cuts both knots and is grounded in what humans can actually verify: reviewer comprehension, defect detection, and the willingness to leave substantive comments all drop sharply past about 400 lines in a single PR. 500 is a clean round number that sits just past that cliff, providing a sane buffer while remaining recognizably "one substantial change" in most engineering cultures.

A PR of 1–500 lines counts as 1 unit; a larger PR of N lines counts as ⌈ N / 500 ⌉ units. The rule applies uniformly to AI-assisted and non-AI work so comparisons remain meaningful. See the FAQ for examples.
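
The rule reduces to an integer ceiling division, since ⌈N / 500⌉ is already 1 for any N from 1 to 500. A minimal sketch, assuming lines_changed is whatever line count the team has agreed on (this page does not define whether it means added plus deleted lines, or whether generated files are excluded); the function name change_units is hypothetical:

```python
def change_units(lines_changed: int, threshold: int = 500) -> int:
    """Size-normalized units for one PR: ceil(lines_changed / threshold).

    Any PR of 1..threshold lines is 1 unit; larger PRs add one unit per
    started block of threshold lines.
    """
    if lines_changed < 1:
        raise ValueError("a change must touch at least one line")
    # Integer ceiling division, avoiding floating-point rounding.
    return (lines_changed + threshold - 1) // threshold
```

The 5,000-line merge from the example above counts as 10 units, the 100-line bug fix as 1, and a 501-line PR as 2.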

Why not a productivity index that combines several signals

Composite indices look defensible and rarely are: every weighting choice is contested, every component drifts, and the index is hard to explain. CPAC is one number with a known formula. It is easy to recompute, easy to compare, and easy to challenge — three properties an index does not have.

Common critiques, addressed

"Acceptance is not correctness."

True. CPAC's defense is the "stayed there" window: lengthening the window catches more silent escapes; counting fix-up cost in the numerator catches the rest. CPAC is a defensible proxy for correctness, not a claim of perfect correctness. See the FAQ for window selection guidance.

"This punishes ambitious teams."

It does the opposite. A team that ships ambitious work that survives in production gets a lower CPAC than a team that ships timid work that gets reverted. The metric rewards kept ambition.

"What about value, not just cost?"

Value is captured upstream of CPAC, by what teams choose to work on, and downstream of CPAC, by FinOps cost-to-serve and revenue analytics. CPAC is the production-cost layer in between. It does not claim to be the only number on the dashboard.

"This is just X renamed."

It is not. None of the eight approaches above carries all five cost components in the numerator, the "stayed there" clause in the denominator, and the unit-output shape that maps directly to FinOps. If it were already named, the cost-management conversation around AI would look very different than it does in 2026.


Disagree with anything on this page? Open an issue at github.com/brennhill/cost-per-accepted-change/issues. Refinements, alternative framings, and additional approaches to survey are all welcome.