How AI Is Reshaping Incident Investigation
Nils Bunge
Discover how AI agents are changing the way engineering teams investigate production incidents, and what observability platforms need to do to keep up
When something goes wrong in production, the question engineers ask is almost never "do we have enough data?" Modern systems are generous with telemetry. Logs accumulate, metrics stream, traces fan out across services. The harder question is always: where do I start?
That difficulty is growing. AI agents are now active participants in production environments, writing and deploying code, modifying configuration, triggering automated workflows around the clock. When something breaks, the investigation has to account for changes that no human directly made.
The nature of incidents hasn't changed, but the context has
Most incidents share a common origin: something changed. A deployment introduced a regression. A configuration update altered behaviour. Infrastructure shifted underneath a service. The investigation is, at its core, an exercise in connecting a symptom to a cause, and that cause is usually a change. What is different now is the volume and origin of those changes. AI-assisted workflows move faster and generate changes at a pace that traditional observability was not designed to absorb.
Observability should guide investigations, not just support them. When an alert fires, the platform already knows when the issue started, which services are behaving abnormally, and what has changed in the surrounding environment. That context can be used actively: highlighting the metric dimensions that explain a spike, connecting an anomaly to recent deployments, tracking what the investigation has already covered so handovers between teams do not mean starting from scratch.
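To make that concrete, here is a minimal sketch of the "connect an anomaly to recent changes" step. All names and the ranking heuristic are illustrative assumptions, not any platform's actual schema: it simply filters change events to a window before the anomaly and prefers changes to the affected services.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical change-event record; the fields are illustrative, not a vendor schema.
@dataclass
class ChangeEvent:
    timestamp: datetime
    kind: str      # e.g. "deploy", "config", "infra"
    service: str

def rank_candidate_causes(anomaly_start: datetime,
                          anomaly_services: set[str],
                          changes: list[ChangeEvent],
                          window: timedelta = timedelta(hours=2)) -> list[ChangeEvent]:
    """Return changes that landed shortly before the anomaly, most relevant first."""
    candidates = [c for c in changes
                  if timedelta(0) <= anomaly_start - c.timestamp <= window]
    # Prefer changes to the services that are misbehaving, then the most recent ones.
    return sorted(candidates,
                  key=lambda c: (c.service not in anomaly_services,
                                 anomaly_start - c.timestamp))
```

A real platform would weigh far more signal than recency and service overlap, but the shape of the task is the same: turn "something changed" from a hunch into a ranked list.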
AI as a colleague, not a replacement
AI agents can play a meaningful role in investigations. They can traverse signals across systems faster than any human and surface correlations that would take hours to find manually. But the most useful framing is not automation. It is augmentation. When a new colleague joins a team, they begin by observing, watching how investigations unfold, offering ideas. As they demonstrate good judgment, they earn greater autonomy. AI assistants in incident response should follow the same path, supporting the engineers who are ultimately accountable rather than replacing the judgment those engineers bring.
That also means the access AI agents have to sensitive telemetry needs to be scoped and governed: prompts, completions, infrastructure state, business context. The value of an agent in an investigation depends entirely on the trust the team can place in how it operates.
Any approach that embeds investigation agents inside a single observability platform runs into a natural ceiling. Real investigations require signals from many systems: deployment pipelines, configuration management, infrastructure state, sometimes source control. The answer is not to centralise everything. It is to build observability so that external agents can work with it effectively through structured interfaces, under the governance policies the team defines. The data stays where it belongs, inside the environment it describes, under the control of the people responsible for it.
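As a rough sketch of what "structured interfaces under team-defined governance" could mean in practice, the snippet below gates an agent's telemetry query through a policy the team owns. Every name here is hypothetical; the point is only that the policy lives with the data owners, not with the agent.

```python
from dataclasses import dataclass

# Illustrative governance sketch: an agent's telemetry query is checked against
# a team-defined policy before any data is returned. Names are hypothetical.
@dataclass(frozen=True)
class AgentQuery:
    agent_id: str
    dataset: str   # e.g. "logs", "traces", "llm_completions"
    purpose: str   # e.g. an incident identifier, for audit trails

# Which datasets each agent may read. Defined by the team, not by the agent.
POLICY: dict[str, set[str]] = {
    "triage-agent": {"logs", "traces"},
}

def authorise(query: AgentQuery) -> bool:
    """Allow the query only if policy grants this agent access to the dataset."""
    return query.dataset in POLICY.get(query.agent_id, set())
```

A production version would add audit logging, time-boxed grants, and redaction rules, but the principle is the same: the agent asks, the environment decides.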
The shape of production systems is changing. The tools used to understand them need to change with it. If you are rethinking how your team investigates incidents at AI scale, we would like to show you what Tsuga makes possible. Get in touch.