Engineering metric

Mean Time to Restore. How fast you recover — not just how rarely you break.

MTTR is how long it takes to get back to normal when something breaks. It's the DORA metric that captures a truth most teams under-invest in: you will never prevent every failure, so your ability to recover matters as much as your ability to avoid breaking in the first place. Elite teams restore service in under an hour: a failure goes out, gets detected, and gets reverted before most customers even notice. Slow recovery turns a small failure into a long outage. Fast recovery is what makes bold shipping safe.

What it is

The average time from a production failure to full recovery. It measures response capability, not prevention. A team that breaks occasionally but recovers in minutes is in far better shape than one that breaks rarely but takes days to come back.

Measurement period

Per incident, averaged.

Measured across incidents as an average (watch the median too — one ugly outage skews the mean). Elite teams restore in under an hour.

Formula

MTTR = Total recovery time / Number of incidents

Lower is better. Track median alongside mean — a single long outage distorts the average.
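As a minimal sketch of the formula, the snippet below computes mean and median recovery time from a small incident log. The timestamps are invented for illustration, and `recovery_minutes` is a hypothetical helper, not part of any tool mentioned here.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incident log: (failure started, service fully restored).
incidents = [
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 10, 20)),    # 20 minutes
    (datetime(2024, 3, 11, 14, 0), datetime(2024, 3, 11, 14, 45)),  # 45 minutes
    (datetime(2024, 3, 19, 9, 0), datetime(2024, 3, 19, 9, 30)),    # 30 minutes
    (datetime(2024, 3, 25, 22, 0), datetime(2024, 3, 26, 22, 0)),   # 24 hours
]


def recovery_minutes(started, restored):
    """Minutes between the start of a failure and full recovery."""
    return (restored - started).total_seconds() / 60


durations = [recovery_minutes(s, r) for s, r in incidents]

# MTTR = total recovery time / number of incidents.
mttr_mean = mean(durations)      # 383.75 minutes, pulled up by the 24 hour outage
mttr_median = median(durations)  # 37.5 minutes, closer to the typical incident
```

In this made-up log, three of four incidents resolve in under an hour, yet the mean lands above six hours. That is the distortion the median guards against.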

When to review

Per incident + trend.

Review every significant incident and watch the trend. A climbing MTTR means detection or recovery tooling is falling behind.
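One way to watch the trend, sketched under assumptions: group recovery times by month and flag a run of consecutive increases. The records, the three-month window, and the names `monthly_mttr` and `is_climbing` are all illustrative choices, not a prescribed method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (month, recovery time in minutes).
records = [
    ("2024-01", 30), ("2024-01", 50),
    ("2024-02", 45), ("2024-02", 75),
    ("2024-03", 90), ("2024-03", 110),
]


def monthly_mttr(records):
    """Mean recovery time per month, ordered by month."""
    by_month = defaultdict(list)
    for month, minutes in records:
        by_month[month].append(minutes)
    return {month: mean(values) for month, values in sorted(by_month.items())}


def is_climbing(values, periods=3):
    """True if the last `periods` values are strictly increasing."""
    recent = values[-periods:]
    return len(recent) == periods and all(a < b for a, b in zip(recent, recent[1:]))


trend = monthly_mttr(records)
climbing = is_climbing(list(trend.values()))
```

A flag like this is only a prompt to look closer. With few incidents per month, a single outlier can fake a trend, so it may be worth running the same check on the median.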

Why it matters

You can't prevent every failure. You can recover fast.

There's a seductive but impossible goal in engineering: never break anything. No team that ships at any reasonable pace achieves it, and chasing it leads to fear, batching, and slow deployment. The mature alternative is to accept that failures will happen and get extraordinarily good at recovering from them. MTTR measures exactly that capability — and it's the metric that lets a team ship boldly, because if a deploy goes wrong and you can revert it in minutes, the cost of failure is small and the risk of shipping is low.

This is also where MTTR protects the rest of the business. A failure that takes minutes to resolve barely touches your uptime budget; one that takes hours can blow it. The same failure, with a fast MTTR, is a blip customers barely register, and with a slow MTTR is an outage that spends real trust. Recovery speed is the difference between an incident and a crisis — and it's the safety net under your change failure rate, since the failures that do slip through cost far less when you bounce back quickly.
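To put rough numbers on the uptime budget, here is a small worked calculation. The 99.9% target and the 30-day period are assumptions chosen for illustration; substitute your own SLA terms.

```python
def downtime_budget_minutes(target_percent, period_days=30):
    """Minutes of downtime allowed in the period at the given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - target_percent) / 100


def budget_share(recovery_minutes, budget_minutes):
    """Fraction of the downtime budget that one incident consumes."""
    return recovery_minutes / budget_minutes


# Assumed target: 99.9% uptime over 30 days allows about 43.2 minutes of downtime.
budget = downtime_budget_minutes(99.9)

fast = budget_share(10, budget)   # a 10 minute recovery uses roughly a quarter of the budget
slow = budget_share(180, budget)  # a 3 hour recovery uses more than four times the budget
```

The same failure fits comfortably inside the month's budget at ten minutes and overruns it several times over at three hours.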

No team that ships at any pace prevents every failure. The mature move isn't to stop breaking things — it's to get extraordinarily good at recovering when you do.

Benchmarks

The DORA bands — and yes, lower is better here.

A shorter time is the good one, so these bands run best at the top, worst at the bottom. The thresholds follow the widely quoted DORA recovery tiers; the Healthy, Watch, and Critical labels are this guide's own names for the tiers below Elite. For a SaaS product customers depend on, restoring in under an hour is the target worth building toward.

Elite: Under 1 hour

A failure is detected and resolved before most customers notice. This is what makes bold, frequent shipping safe — the cost of any single failure is tiny because it's gone in minutes. The target for a product customers run their business on.

Healthy: Under 1 day

Same-day recovery — solid for most SMB SaaS. Incidents are felt but contained, and they don't drag into multi-day outages. A defensible standard, and a base to push toward the under-an-hour tier as detection and rollback improve.

Watch: Under 1 week

Recovery measured in days means a single incident can become a prolonged outage — enough to breach an SLA and spend real customer trust. Usually a sign of weak detection, missing runbooks, or no fast rollback. Invest in recovery tooling before the next big incident.

Critical: Over 1 week

Recovery taking more than a week turns every failure into a crisis and makes shipping genuinely dangerous. The team can't ship boldly because it can't recover, so it slows down — losing velocity to fear. This is a recovery-capability problem that needs fixing before anything else.
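The four bands above can be expressed as a small lookup. This is a sketch: `mttr_band` is a hypothetical helper, and treating a value that sits exactly on a threshold as belonging to the slower band is an assumption, since the bands only say "under" and "over".

```python
from datetime import timedelta

# Upper limits for each band, fastest first. Anything at or beyond
# the last limit falls through to "Critical".
BANDS = [
    (timedelta(hours=1), "Elite"),
    (timedelta(days=1), "Healthy"),
    (timedelta(weeks=1), "Watch"),
]


def mttr_band(mttr: timedelta) -> str:
    """Return the band name for a given mean (or median) time to restore."""
    for limit, name in BANDS:
        if mttr < limit:
            return name
    return "Critical"


band_fast = mttr_band(timedelta(minutes=40))
band_same_day = mttr_band(timedelta(hours=5))
band_days = mttr_band(timedelta(days=3))
band_slow = mttr_band(timedelta(days=10))
```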

When recovery is too slow

Three plays that actually move it.

MTTR breaks down into detect, diagnose, and recover — and the plays attack each. Faster recovery comes from seeing problems sooner, knowing what to do, and being able to undo quickly.
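If your incident records carry a timestamp for each stage, the detect, diagnose, and recover split can be measured directly. The sketch below assumes four hypothetical fields per incident; most incident trackers record some equivalent, but the names here are invented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    # All four timestamps are hypothetical fields for illustration.
    started: datetime    # when the failure began
    detected: datetime   # when an alert or a person noticed
    diagnosed: datetime  # when the cause was understood
    restored: datetime   # when service was back to normal

    def phase_minutes(self) -> dict[str, float]:
        """Minutes spent in each phase of the recovery."""
        def minutes(delta):
            return delta.total_seconds() / 60

        return {
            "detect": minutes(self.detected - self.started),
            "diagnose": minutes(self.diagnosed - self.detected),
            "recover": minutes(self.restored - self.diagnosed),
        }


incident = Incident(
    started=datetime(2024, 5, 6, 10, 0),
    detected=datetime(2024, 5, 6, 10, 25),
    diagnosed=datetime(2024, 5, 6, 10, 40),
    restored=datetime(2024, 5, 6, 10, 45),
)

phases = incident.phase_minutes()
slowest = max(phases, key=phases.get)  # the phase to attack first
```

In this invented example the revert itself took five minutes, while 25 of the 45 minutes went to noticing the problem. That would point to the detection play before the other two.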

— 01 Detect it faster

You can't fix what you don't know is broken.

A big chunk of MTTR is often just the time between something breaking and anyone realizing it, especially if customers are the ones telling you. Monitoring and alerting that catch failures the moment they happen collapse that gap. The faster you know, the sooner recovery work starts, and the less of your uptime budget a failure consumes. Detection lag is usually the cheapest time to win back.

— 02 Make recovery a one-click revert

The fastest fix is undoing the change that caused it.

Most failures trace to a recent deploy, so the fastest recovery is usually just rolling that deploy back — which is only possible if your pipeline makes rollback fast and reliable. This is the same capability that makes frequent, mid-day deployment safe: ship small, and if it breaks, revert in minutes. Investing in instant rollback is the single biggest lever on MTTR for most teams.
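A fast revert depends on knowing what to revert to. As a rough sketch, assuming a deploy history where each release carries a health flag, the rollback target is the most recent healthy release before the live one. `Deploy` and `rollback_target` are hypothetical names, not the API of any particular pipeline tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    version: str
    healthy: bool  # hypothetical flag: did this release run without incident?


def rollback_target(history: list[Deploy]) -> Optional[Deploy]:
    """Pick the release to revert to.

    `history` is ordered oldest first, and its last entry is the release
    that is live (and failing) right now. Returns the most recent earlier
    release marked healthy, or None if there is none.
    """
    for deploy in reversed(history[:-1]):
        if deploy.healthy:
            return deploy
    return None


history = [
    Deploy("v1", healthy=True),
    Deploy("v2", healthy=True),
    Deploy("v3", healthy=False),  # an earlier bad release, already replaced
    Deploy("v4", healthy=False),  # live and failing
]

target = rollback_target(history)
```

Having this answer precomputed, rather than worked out during the incident, is part of what makes a revert take minutes.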

— 03 Write the runbook before you need it

Diagnosis is faster when you're not improvising.

The diagnose step is where recovery stalls when no one knows what to do under pressure. Clear runbooks, defined on-call ownership, and blameless post-incident reviews turn chaotic firefighting into a practiced response. The team that has rehearsed "here's what we check, here's who owns it, here's how we revert" recovers in minutes; the team improvising at 2am does not. Prepare the response before the incident, not during it.

Common mistakes operators make with MTTR.

Over-investing in prevention, under-investing in recovery.

Trying to never break anything is an impossible goal that leads to fear and slow shipping. The mature posture is to accept failures will happen and get great at recovering. A team that breaks occasionally but recovers in minutes is healthier than one that breaks rarely but takes days — because the second can't ship boldly. Balance prevention with recovery capability; don't pour everything into avoiding failures you'll inevitably have.

Tracking mean and ignoring median.

One ugly multi-day outage can blow up your average and hide the fact that most incidents resolve quickly — or a wall of fast recoveries can mask a few that drag on. Track the median alongside the mean. The gap between them tells you whether you have a consistent recovery process or a couple of incident types that spiral.
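To see what the gap between mean and median is telling you, it can help to split incidents by type. The sketch below uses invented data and hypothetical type labels; the point is the comparison, not the numbers.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical incidents: (type, recovery time in minutes).
incidents = [
    ("deploy", 15), ("deploy", 20), ("deploy", 25),
    ("database", 40), ("database", 2880), ("database", 4320),
]


def summarize(minutes):
    """Mean and median recovery time for a list of durations."""
    return {"mean": mean(minutes), "median": median(minutes)}


# Across all incidents the median is about half an hour,
# while the mean is roughly 20 hours.
overall = summarize([m for _, m in incidents])

# Splitting by type shows where the gap comes from.
by_type = defaultdict(list)
for kind, minutes in incidents:
    by_type[kind].append(minutes)

per_type = {kind: summarize(values) for kind, values in by_type.items()}
```

Here the deploy incidents recover consistently in about 20 minutes, and the database incidents are the ones that spiral. The overall average alone would hide both facts.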

Letting customers be your monitoring.

If your first signal that something broke is a support ticket, a big share of your MTTR is just detection lag. Monitoring and alerting that catch failures instantly are some of the cheapest minutes you can win back. Waiting for customers to tell you not only lengthens recovery — it spends trust you didn't need to spend.

Treating incidents as a hunt for someone to blame.

A blame culture makes MTTR worse: frightened engineers hide problems, avoid bold fixes, and slow everything down. Blameless post-incident reviews that focus on the system — what monitoring missed, why rollback was slow, what the runbook lacked — produce faster recovery over time. The goal is a better response next time, not a scapegoat this time.

Read alongside

Fast recovery is what makes bold shipping safe.

MTTR is the safety net under your change failure rate and deployment frequency. When you can revert a bad deploy in minutes, the cost of a failure is small — which is exactly what lets a team ship often, mid-day, without fear.

How Upbeat helps

Recovery capability, visible before the next incident.

MTTR is easy to ignore between incidents — until a slow recovery becomes an SLA breach. Upbeat keeps it on the leadership scorecard next to uptime and change failure rate, so a climbing recovery time surfaces as a trend you can invest against before the big outage, not a number you only examine in the post-mortem.

Don't just break less. Recover faster.

Upbeat keeps mean time to restore next to uptime and change failure rate on your weekly scorecard — so recovery capability is visible before the next incident, not just examined after it.
