Dwell Time: The Metric Security Teams Track But Rarely Move

The industry dwell time metric — the elapsed time between an attacker's initial access and detection by the defending organization — has been measured annually by several incident response firms for more than a decade. The trend line has improved, but less dramatically than the security industry's technology investments would suggest. Median dwell time for organizations with active security programs has declined from 200+ days in the early 2010s to a current range of roughly 7 to 21 days for monitored environments, and 140+ days for organizations without continuous monitoring. That's progress. It's also not fast enough when the damage window for ransomware encryption or data exfiltration is measured in hours.

Understanding why dwell time is hard to move — and what actually moves it — requires separating detection speed from response speed. They're related but not the same problem, and conflating them leads to investments that improve the wrong metric.

Detection Speed vs. Response Speed: The Conceptual Gap

MTTD (mean time to detect) measures how long it takes the SOC to identify that an incident is occurring. MTTR (mean time to respond) measures how long it then takes for containment or remediation to begin. Dwell time is neither: it is the gap between initial access (when the attacker first achieved a foothold) and detection. Dwell time can be measured only in retrospect, by forensically determining when the attacker was first present in the environment and comparing that to when the incident was detected.

Improving MTTD doesn't necessarily reduce dwell time unless the detection fires close to the initial access event. Many detection rules are optimized for later-stage behaviors — command-and-control beaconing (T1071), credential dumping (T1003), lateral movement (T1021) — because those behaviors are more distinctive and generate fewer false positives than initial access techniques like phishing lure execution or valid account abuse (T1078). The consequence is that a SOC with excellent detection coverage of post-exploitation techniques can still have dwell times measured in days or weeks, because the attacker completed initial access and moved laterally before any of the well-tuned detection rules fired.

This is the detection coverage problem at its most consequential. Coverage of initial access and early-stage discovery techniques — T1566.001 (spearphishing attachment), T1566.002 (spearphishing link), T1189 (drive-by compromise), T1078.004 (valid cloud accounts) — is notoriously difficult to achieve without excessive false-positive rates, because early-stage behaviors overlap heavily with legitimate user activity. A user clicking a PDF attachment in email looks like phishing. A user authenticating from an unfamiliar location looks like valid account compromise. Behavioral baselines help, but baselines for individual user behavior require weeks to establish and are sensitive to change.

Where Dwell Time Actually Accumulates

Post-breach analysis consistently reveals that dwell time accumulates not at detection but at two other points: investigation latency and containment decision latency.

Investigation latency is the time between when an alert fires and when an analyst begins a meaningful investigation. This is a queue depth problem. An alert that accurately identifies initial access but sits in a 3,000-item queue for six days adds six days to dwell time, even though the detection rule fired immediately. This is why MTTD, when it is computed from alert timestamps rather than from the moment an analyst acts, can look excellent while dwell time remains high. The alert timestamp records when the SIEM fired. Dwell time covers the whole gap. An organization with a 4-hour MTTD but a 5-day investigation queue still has an effective dwell time measured in days.
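The queue arithmetic behind those figures can be sketched in a few lines. This is a rough illustration rather than a model of any particular SIEM: the function names are hypothetical, the 500-alerts-per-day rate is an assumed value chosen to match the six-day example, and real queues are rarely worked at a constant rate in strict order.

```python
from datetime import timedelta


def queue_wait(alerts_ahead: int, reviews_per_day: float) -> timedelta:
    """Estimate how long an alert waits before an analyst reaches it.

    Assumes (hypothetically) a constant review rate and strict ordering,
    so the wait is the backlog ahead of the alert divided by throughput.
    """
    if reviews_per_day <= 0:
        raise ValueError("reviews_per_day must be positive")
    return timedelta(days=alerts_ahead / reviews_per_day)


def effective_dwell(time_to_alert: timedelta, wait: timedelta) -> timedelta:
    """Initial access to first analyst action: detection delay plus queue wait."""
    return time_to_alert + wait


# A 3,000-item backlog worked at an assumed 500 alerts per day takes six days.
wait = queue_wait(alerts_ahead=3000, reviews_per_day=500)

# A 4-hour detection delay followed by a 5-day queue is still a multi-day exposure.
total = effective_dwell(timedelta(hours=4), timedelta(days=5))

print(wait, total)
```

The point of separating the two terms is that tuning detection rules only shrinks the first one; the second depends entirely on backlog and review rate.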

Containment decision latency is the time between when an investigator concludes a true positive is present and when containment action is authorized and executed. For organizations with explicit change management processes — particularly federal contractors and healthcare providers with compliance obligations — containment actions like network segmentation, account suspension, or endpoint isolation may require approvals that take hours. In a ransomware scenario where the encryption payload launches within minutes of privilege escalation, a 4-hour approval chain for endpoint isolation is not a tenable control.

A Concrete Scenario: Lateral Movement Goes Unnoticed

Take a plausible incident reconstruction: an enterprise software company running a Sentinel deployment detects an anomalous Azure Active Directory sign-in for a service account (T1078.004) at day 0 of attacker presence. The alert fires correctly. It enters a queue of 4,800 alerts. The SOC team, running a two-analyst L1 rotation across two shifts, reviews approximately 700 alerts per week. The service account alert is medium severity, and at its place in the queue it is reviewed on approximately day 7.

Between day 0 and day 7, the attacker uses the service account to access an Azure Blob storage container, stages a small data set, establishes persistence via a scheduled task in a Windows VM (T1053.005), and completes initial reconnaissance. None of these generate critical-severity alerts because the service account's activity isn't anomalous enough in absolute terms — only relative to its own baseline. The baseline deviation alerts are medium severity. They're in the same queue.

On day 7, the L1 analyst reviews the initial sign-in alert. The service account activity now looks suspicious in context. The investigation expands. Containment begins on day 8. Dwell time: 8 days. Time from alert to analyst review: 7 days. The detection system worked. The throughput constraint produced the actual breach window.
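Breaking the scenario into its component latencies makes the bottleneck explicit. The sketch below uses only the day offsets from the reconstruction; the calendar date is an arbitrary placeholder and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Day 0 is an arbitrary placeholder date; only the offsets come from the scenario.
day0 = datetime(2024, 1, 1)

timeline = {
    "initial_access": day0,
    "alert_fired": day0,                         # sign-in alert fires on day 0
    "analyst_review": day0 + timedelta(days=7),  # alert is reached in the queue
    "containment": day0 + timedelta(days=8),     # containment begins
}

alert_latency = timeline["alert_fired"] - timeline["initial_access"]
investigation_latency = timeline["analyst_review"] - timeline["alert_fired"]
containment_latency = timeline["containment"] - timeline["analyst_review"]
breach_window = timeline["containment"] - timeline["initial_access"]

for name, value in [
    ("alert latency", alert_latency),
    ("investigation latency", investigation_latency),
    ("containment latency", containment_latency),
    ("breach window", breach_window),
]:
    print(f"{name:<22} {value.days} days")
```

Seven of the eight days sit between the alert and the first analyst action, which is the interval that queue throughput controls.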

What Moves the Metric

Three interventions have consistent evidence of reducing dwell time:

Early-stage detection investment. Adding detection coverage for initial access and discovery techniques, even at higher false-positive rates, catches attacker activity closer to day 0. The tradeoff is explicit: more noise, earlier detection. Teams need to make this tradeoff consciously, with analyst capacity to absorb the additional alert load. Adding early-stage coverage without throughput capacity just moves noise from one part of the queue to another.

Queue throughput improvement. Any mechanism that increases the effective alert review rate — automated enrichment, automated triage of high-confidence false positives, priority routing by asset criticality — directly reduces investigation latency and therefore dwell time. For many organizations, throughput improvement has higher return on dwell time reduction than additional detection rule development, because the coverage gap is already acceptable and the queue is the binding constraint.
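Priority routing by asset criticality can be as simple as changing the order in which the queue is drained. The following is a minimal sketch, not a recommended scoring model: the severity weights, the 1-to-5 criticality scale, the alert names, and the multiplicative score are all hypothetical choices made for illustration.

```python
import heapq

# Hypothetical weights; a real deployment would tune these to its own environment.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def priority_score(severity: str, asset_criticality: int) -> int:
    """Combine alert severity with asset criticality (1 = low, 5 = highest)."""
    return SEVERITY_WEIGHT[severity] * asset_criticality


def review_order(alerts: list[dict]) -> list[str]:
    """Return alert ids in the order an analyst would pick them up."""
    heap = []
    for arrival, alert in enumerate(alerts):
        score = priority_score(alert["severity"], alert["asset_criticality"])
        # heapq is a min-heap, so negate the score; ties go to the older alert.
        heapq.heappush(heap, (-score, arrival, alert["id"]))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


alerts = [
    {"id": "workstation-malware", "severity": "high", "asset_criticality": 1},
    {"id": "service-account-signin", "severity": "medium", "asset_criticality": 5},
    {"id": "test-vm-portscan", "severity": "medium", "asset_criticality": 1},
]

print(review_order(alerts))
```

Under these assumed weights, a medium-severity alert on a highly critical service account is reviewed ahead of a high-severity alert on a low-value workstation, which severity-only ordering would not do.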

Automated initial containment for high-confidence scenarios. Endpoint isolation triggered automatically on confirmed ransomware execution indicators (T1486, data encrypted for impact, combined with T1490, inhibit system recovery) eliminates containment decision latency in the scenarios where it matters most. The risk of a false isolation (helpdesk ticket, analyst verification, 30-60 minute recovery) is explicitly accepted in exchange for preventing ransomware from spreading to additional hosts during a 4-hour containment approval window.
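The decision rule for this kind of automation can be kept deliberately narrow. The sketch below assumes indicators arrive as (technique ID, timestamp) pairs for a single host and requires both techniques to appear close together in time. The function name and the 10-minute window are hypothetical, and the actual isolation call is left out because it depends on the EDR product in use.

```python
from datetime import datetime, timedelta


def should_auto_isolate(
    indicators: list[tuple[str, datetime]],
    window: timedelta = timedelta(minutes=10),  # assumed correlation window
) -> bool:
    """Return True only if T1486 and T1490 were both seen on the host within `window`.

    `indicators` holds (technique_id, observed_at) pairs for one host.
    Either technique alone is left for human review.
    """
    encryption = [seen for tech, seen in indicators if tech == "T1486"]
    recovery_inhibited = [seen for tech, seen in indicators if tech == "T1490"]
    return any(
        abs(e - r) <= window
        for e in encryption
        for r in recovery_inhibited
    )


t0 = datetime(2024, 1, 1, 9, 0)

both_close = [("T1490", t0), ("T1486", t0 + timedelta(minutes=3))]
only_one = [("T1486", t0)]
far_apart = [("T1490", t0), ("T1486", t0 + timedelta(hours=6))]

print(should_auto_isolate(both_close), should_auto_isolate(only_one), should_auto_isolate(far_apart))
```

Requiring the combination rather than either indicator alone is what keeps the false-isolation rate low enough to accept without a human in the loop.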

We're not saying every organization should implement automated containment. The cost-benefit analysis is environment-specific, and the calibration required to deploy it safely is substantial. What the data shows is that teams willing to do that calibration work have reduced the dwell time impact of ransomware attacks more significantly than teams that continued relying on human-reviewed containment approvals.

Measuring Dwell Time Honestly

Most organizations can't measure their own dwell time in real time — they can only measure it in retrospect, after an incident, by tracing forensic artifacts back to initial access. This creates a reporting gap: organizations track MTTD (measurable from alert timestamps) but not dwell time (only measurable forensically), which makes dwell time invisible as a KPI until an incident reveals it.

Proxy metrics help. The gap between an alert's event timestamp and the first analyst action on that alert is a measurable component of dwell time. The queue backlog expressed in alert-hours (the number of alerts in the queue multiplied by their average age) gives a rough picture of investigation latency. Purple-team exercises that model attacker timelines against real queue depth provide the most accurate simulation of what dwell time would look like in a real incident.
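Both proxies can be computed from timestamps most ticketing or SIEM exports already contain. This is a minimal sketch under that assumption; the function names and sample data are hypothetical, and alert-hours is computed as the sum of the ages of all open alerts, which equals the alert count multiplied by the average age.

```python
from datetime import datetime, timedelta
from statistics import median


def backlog_alert_hours(open_alert_times: list[datetime], now: datetime) -> float:
    """Sum of the ages of all open alerts, in hours."""
    return sum((now - t).total_seconds() for t in open_alert_times) / 3600


def median_hours_to_first_action(handled: list[tuple[datetime, datetime]]) -> float:
    """Median gap, in hours, between an alert's event time and first analyst action."""
    gaps = [(action - event).total_seconds() / 3600 for event, action in handled]
    return median(gaps)


now = datetime(2024, 1, 8, 12, 0)

# Three open alerts that are 2, 10, and 24 hours old.
open_alerts = [now - timedelta(hours=h) for h in (2, 10, 24)]

# Three handled alerts that waited 1, 5, and 30 hours for a first analyst action.
handled = [(now - timedelta(hours=h), now) for h in (1, 5, 30)]

print(backlog_alert_hours(open_alerts, now))
print(median_hours_to_first_action(handled))
```

Tracked over time, a rising alert-hours figure is an early warning that investigation latency, and therefore effective dwell time, is growing even if detection coverage has not changed.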

Dwell time reduction is not a detection engineering problem in isolation. It's a system problem that involves detection quality, analyst throughput, and containment velocity simultaneously. Organizations that treat it as any one of those problems in isolation consistently see improvement in one metric while the others remain constrained. The binding constraint for most organizations right now is throughput — which is why dwell time hasn't moved as much as detection technology improvement alone would predict.