On February 28, 2017, an engineer at a major cloud provider executed a routine maintenance command with an incorrect input parameter. According to the provider's postmortem, the command removed more server capacity than intended from a critical storage subsystem in the US-EAST-1 region.[1] The storage service became unavailable, and over the course of approximately four hours, the outage cascaded across dozens of dependent services, affecting websites, APIs, and applications across a significant portion of the internet.

The root cause, as widely reported in the media, was a typo (the provider's own postmortem described it as an input "entered incorrectly"). Finding that root cause, amid cascading failures across interconnected services, took considerably longer than the error took to make. The storage service's own health dashboard depended on the storage service itself, so the dashboard reported that everything was fine while the system was down. The engineers debugging the outage were navigating a labyrinth where the map was lying to them.

This pattern recurs throughout production incidents in complex systems. The proximate cause may be straightforward. The system surrounding it rarely is.

The Labyrinth of the Incident

In Greek mythology, the Labyrinth of Crete was built to contain the Minotaur. The maze was so complex that even Daedalus, its architect, could barely navigate it; Theseus escaped only because Ariadne gave him a thread to trace his way back out. The danger wasn't just the monster at the center; it was the structure that made reaching the monster, and finding your way back, nearly impossible.

Production incidents follow the same structure. The Minotaur is the root cause: a misconfigured service, a depleted resource pool, a race condition, a bad deployment. The labyrinth is everything surrounding it: the hundreds of services that depend on each other, the cascading failures that obscure the origin, the metrics that spike in misleading directions, the logs that contain the answer buried in terabytes of noise.

[Image: cascading red warning signals propagating through interconnected labyrinth corridors, each failure triggering the next in a chain reaction.]
Cascading failures obscure the root cause. The labyrinth looks different from inside than it does in the postmortem.

Richard Cook, a researcher who studied failure in complex systems, articulated this dynamic in his influential 1998 paper "How Complex Systems Fail." Among his eighteen observations, several apply directly to incident response: catastrophe requires multiple failures, not just one; complex systems contain changing mixtures of latent failures within them; and hindsight biases post-accident assessments, making the path to failure look clearer than it was in real time.[2]

That last point is particularly relevant. After an incident, the root cause often seems obvious. Of course the typo caused the outage. But during the incident, the engineers didn't know it was a typo. They saw cascading failures across dozens of services, contradictory signals from monitoring systems, and a health dashboard that was itself broken. The labyrinth looked different from inside than it does in the postmortem.

Observability: Mapping the Maze

The software industry's traditional response to this complexity has been observability: the practice of instrumenting systems to produce logs, metrics, and traces that make internal behavior visible from the outside. The term, borrowed from control theory, refers to the degree to which a system's internal state can be inferred from its external outputs.[3]

Logs record what happened. Metrics record how much. Traces record the path a request took through the system. Together, they form a map of the labyrinth. But the map has a problem: in a system with hundreds of services, each potentially producing large volumes of log data per second, the observability data is itself a labyrinth. The answer to "what went wrong" is almost certainly in the data somewhere. Finding it is the challenge.
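To make the three pillars concrete, here is a minimal sketch of emitting all three around a single operation, using the OpenTelemetry Python API and the standard library logger. The service, route, and metric names are illustrative; without a configured SDK exporter, the OpenTelemetry calls are no-ops.

```python
# Minimal sketch: one operation emitting all three pillars. Names like
# "payment-api" and "/checkout" are illustrative, not from the incident.
import logging

from opentelemetry import metrics, trace

logger = logging.getLogger("payment-api")        # logs: what happened
meter = metrics.get_meter("payment-api")
tracer = trace.get_tracer("payment-api")

request_counter = meter.create_counter(
    "http.requests",                             # metrics: how much
    description="Handled HTTP requests",
)

def handle_checkout(order_id: str) -> None:
    # traces: the path the request took, one span per hop
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        request_counter.add(1, {"route": "/checkout"})
        logger.info("checkout started order_id=%s", order_id)
```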

Distributed tracing, standardized through projects like OpenTelemetry, represents the closest technical analog to Ariadne's thread.[4] A trace follows a single request as it propagates through service after service, recording the time spent at each step and the outcome of each call. When a request fails or slows down, the trace shows exactly where in the chain the problem occurred. It's a literal thread through the labyrinth: follow it, and you find the Minotaur.
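A hedged sketch of what that thread looks like in code, again using the OpenTelemetry Python API: a span wraps a downstream call, and if the call fails, the exception and error status land on exactly the span where the path broke. The inventory client and service names are invented for illustration.

```python
# Sketch: recording failure on the span where the chain broke down.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("api-gateway")

class InventoryClient:
    """Hypothetical downstream client, failing here to illustrate."""
    def get(self, sku: str) -> dict:
        raise TimeoutError(f"inventory service timed out for sku={sku}")

inventory_client = InventoryClient()

def fetch_inventory(sku: str) -> dict:
    with tracer.start_as_current_span("inventory.lookup") as span:
        span.set_attribute("sku", sku)
        try:
            return inventory_client.get(sku)
        except Exception as exc:
            # The knot in the thread: the failure is attached to this
            # span, so the trace shows exactly where the path ended.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```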

[Image: a glowing trace line following a single path through a complex maze of services, illuminating each node it passes through while the rest remains dim.]
Distributed tracing follows a request through the labyrinth. The thread shows exactly where the path breaks down.

But tracing has limits. It works well for request-level problems (this API call is slow, this service returned an error) but less well for systemic issues (the database is running out of connections, a network partition is causing intermittent failures, a memory leak is slowly degrading performance across all services). The thread traces one path through the labyrinth. Some Minotaurs don't live on any single path; they live in the walls.

AI as the Thread: Correlating Signals Across the Maze

This is where AI-assisted observability is beginning to change incident response. The core promise is correlation at scale: instead of an engineer manually scanning dashboards, checking logs, and mentally correlating signals across services, AI can process the full volume of observability data and surface the signals most likely to be related to the incident.

Several observability platforms now offer AI-assisted root cause analysis features that aim to correlate anomalies across services, identify patterns that match known failure modes, and suggest probable causes.[5] The approach typically combines statistical anomaly detection (identifying metrics that deviate from their baseline at the time of the incident) with pattern matching against historical incidents (this combination of symptoms has previously been associated with this root cause).
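As an illustration of the statistical half, here is a minimal sketch that ranks metrics by how far their incident-time value deviates from a trailing baseline. The z-score approach, window size, and data shapes are illustrative simplifications, not any vendor's actual algorithm.

```python
# Sketch of baseline-deviation scoring: rank metrics by how far their
# value at incident time sits from a trailing baseline window.
from statistics import mean, stdev

def anomaly_score(baseline: list[float], incident_value: float) -> float:
    """Z-score of the incident-time value against the baseline window."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return 0.0
    return abs(incident_value - mu) / sigma

def rank_metrics(series: dict[str, list[float]], window: int = 60) -> list[tuple[str, float]]:
    """series maps metric name -> samples; the last sample is incident time."""
    scores = {
        name: anomaly_score(samples[-window - 1:-1], samples[-1])
        for name, samples in series.items()
        if len(samples) > window
    }
    # Highest deviation first: these are the signals to look at first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```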

The value proposition is speed. In a system with hundreds of services, an experienced engineer might take hours to trace the causal chain from symptom to root cause, checking service after service, correlating timestamps, ruling out red herrings. AI-assisted tools aim to compress that process by surfacing the most relevant signals first, effectively providing a shorter thread through the labyrinth.

There's also a knowledge preservation dimension. Senior engineers carry mental maps of the system: they know which services are tightly coupled, which metrics are leading indicators, which failure modes produce which symptoms. When those engineers leave, the maps leave with them. AI-assisted observability can partially encode that institutional knowledge, learning from historical incidents which correlations matter and which are noise. The thread doesn't replace the experienced engineer's judgment, but it can preserve some of what that judgment was built on.
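One crude way to picture that encoding: represent each historical incident as a set of symptoms and match the current incident against them by set similarity. The symptom names, incident history, and Jaccard heuristic below are invented for illustration; production systems are considerably more sophisticated.

```python
# Sketch of pattern matching against historical incidents via Jaccard
# similarity over symptom sets. All incident data here is invented.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

history = {
    "INC-101: connection pool exhaustion": {"db.latency.high", "pool.waiters.up", "5xx.spike"},
    "INC-205: bad deploy, cache stampede": {"cache.hit_rate.down", "cpu.spike", "5xx.spike"},
}

def closest_precedents(symptoms: set[str], top_n: int = 3) -> list[tuple[str, float]]:
    ranked = sorted(history.items(), key=lambda kv: jaccard(symptoms, kv[1]), reverse=True)
    return [(name, round(jaccard(symptoms, s), 2)) for name, s in ranked[:top_n]]

# e.g. closest_precedents({"db.latency.high", "5xx.spike"})
# -> [("INC-101: connection pool exhaustion", 0.67), ...]
```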

Chaos Engineering: Exploring the Labyrinth Before the Minotaur Appears

One of the more counterintuitive approaches to incident response is to cause incidents deliberately. Chaos engineering, a discipline pioneered at Netflix around 2011 with the creation of Chaos Monkey, involves injecting controlled failures into production systems to test their resilience.[6] The premise is that if you're going to encounter the Minotaur eventually, it's better to meet it on your terms than to be surprised.

[Image: a controlled explosion in one corridor of a labyrinth while engineers observe from a safe vantage point.]
Chaos engineering is a controlled expedition into the labyrinth: meet the Minotaur on your terms.

Chaos Monkey randomly terminates virtual machine instances in production, forcing engineers to build systems that can tolerate individual component failures. The broader discipline has expanded to include network partitions, latency injection, resource exhaustion, and dependency failures. Each experiment is a controlled expedition into the labyrinth: you enter a specific corridor, observe what happens, and use the results to improve your map.
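The core loop is simple enough to sketch. The following is not Netflix's implementation; terminate() stands in for a real cloud API call, and the guardrail is an illustrative assumption.

```python
# Sketch of a chaos-monkey-style loop: pick one instance at random from
# a group that has opted in, kill it, and let monitoring judge the result.
import random

def terminate(instance_id: str) -> None:
    print(f"terminating {instance_id}")  # stand-in for a cloud provider call

def run_experiment(group: str, instances: list[str], min_healthy: int) -> str | None:
    # Guardrail: never reduce the group below its minimum healthy size.
    if len(instances) <= min_healthy:
        return None
    victim = random.choice(instances)
    terminate(victim)
    return victim  # then watch dashboards: did the system tolerate it?

# run_experiment("checkout-service", ["i-0a1", "i-0b2", "i-0c3"], min_healthy=2)
```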

The connection to AI is emerging in early-stage tools that aim to identify the most valuable chaos experiments to run based on system topology, dependency analysis, and historical failure data. The idea is that rather than randomly exploring corridors, AI could suggest which corridors are most likely to contain hidden Minotaurs, focusing experiments where they'll produce the most useful information. This area is still maturing, and the effectiveness of such approaches remains to be broadly validated.
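As a sketch of what topology-driven selection might look like, the following ranks services by how many others transitively depend on them, on the assumption that high-fan-in services make higher-value experiments. The dependency graph and the heuristic are invented for illustration.

```python
# Sketch of topology-driven experiment selection: score each service by
# its transitive dependents and propose experiments there first.
from collections import deque

deps = {  # service -> services that depend on it (reverse edges)
    "storage":  ["metadata", "uploads"],
    "metadata": ["api"],
    "uploads":  ["api"],
    "api":      ["web", "mobile"],
    "web": [], "mobile": [],
}

def transitive_dependents(service: str) -> int:
    seen, queue = set(), deque(deps.get(service, []))
    while queue:
        s = queue.popleft()
        if s not in seen:
            seen.add(s)
            queue.extend(deps.get(s, []))
    return len(seen)

# Rank corridors by potential blast radius: "storage" scores highest here.
ranking = sorted(deps, key=transitive_dependents, reverse=True)
```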

The Thread's Limits

AI-assisted incident response carries limitations that are worth stating plainly.

AI correlation works best for failure modes it has seen before. Novel failures, by definition, don't match historical patterns. The 2017 cloud storage outage was unusual partly because the health dashboard itself was affected, a failure mode that monitoring systems weren't designed to detect because they assumed the monitoring infrastructure would remain available. AI trained on historical incidents would not have anticipated a failure that broke the tool used to detect failures.

There's also the risk of automation bias, a well-documented cognitive tendency to favor suggestions from automated decision-making systems over contradictory information from non-automated sources.[7] If an AI tool suggests a probable root cause, engineers may anchor on that suggestion and stop investigating alternatives, even if the suggestion is wrong. In incident response, premature convergence on a wrong hypothesis can extend the outage significantly. The thread is useful, but following it blindly into a dead end is worse than having no thread at all.

The volume problem cuts both ways. AI can process more data than humans, but it can also surface more correlations than are meaningful. Distinguishing genuine causal relationships from coincidental correlations in high-dimensional observability data remains a hard problem. Not every metric that spiked during the incident is related to the incident. The AI thread can lead you to the Minotaur, but it can also lead you to a shadow on the wall.

Finally, AI-assisted observability is only as good as the instrumentation that feeds it. Systems with poor observability (sparse logging, missing traces, inconsistent metric naming) produce data that AI can't meaningfully analyze. The thread requires corridors to follow; in an uninstrumented labyrinth, even the most sophisticated thread is useless.

The Postmortem as Retrospective Thread

One practice that predates AI but aligns with the labyrinth metaphor is the blameless postmortem: a structured review conducted after an incident to understand what happened, why, and how to prevent recurrence. The postmortem is a retrospective thread-tracing exercise. You start at the outcome (the outage) and trace backward through the labyrinth to the root cause, documenting the corridors you traversed and the wrong turns you took along the way.

Effective postmortems don't just find the root cause. They map the labyrinth itself: which dependencies were hidden, which monitoring gaps existed, which assumptions proved wrong. Each postmortem adds to the organization's map of its own complexity. Over time, the accumulated maps make future navigation faster, whether that navigation is done by humans, AI, or both.

The labyrinth of a complex system will never be fully mapped. New corridors appear with every deployment. Old corridors shift when configurations change. The Minotaur moves. But the discipline of tracing the thread after each encounter, documenting the path, and sharing the map is what separates organizations that learn from incidents from those that merely survive them.

References

[1] Amazon Web Services, "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region," March 2, 2017. https://aws.amazon.com/message/41926/

[2] Richard I. Cook, MD, "How Complex Systems Fail," Cognitive Technologies Laboratory, University of Chicago, 1998 (revised 2000). https://www.researchgate.net/publication/228797158_How_complex_systems_fail

[3] The term "observability" in software engineering is adapted from control theory, where it was formalized by Rudolf Kálmán in 1960. For its application to software systems, see Charity Majors, Liz Fong-Jones, and George Miranda, Observability Engineering, O'Reilly Media, 2022.

[4] OpenTelemetry is a vendor-neutral observability framework maintained by the Cloud Native Computing Foundation (CNCF). https://opentelemetry.io/

[5] For a survey of AI-assisted approaches to incident response and root cause analysis, see "A Survey of AIOps for Failure Management in the Era of Large Language Models," arXiv, June 2024. https://arxiv.org/abs/2406.11213

[6] Netflix Technology Blog, "Netflix Chaos Monkey Upgraded," October 2016. https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d. See also Ars Technica, "Netflix attacks own network with 'Chaos Monkey'—and now you can too," July 2012. https://arstechnica.com/information-technology/2012/07/netflix-attacks-own-network-with-chaos-monkey-and-now-you-can-too/

[7] For a systematic review of automation bias, see Lyell, D. and Coiera, E., "Automation bias and verification complexity: a systematic review," Journal of the American Medical Informatics Association, Vol. 24, No. 2, 2017, pp. 423–431. https://pmc.ncbi.nlm.nih.gov/articles/PMC7651899/. See also Wikipedia, "Automation bias," for an accessible overview of the concept and its origins in human factors research.