You just resolved a P1. On-call is exhausted. The post-mortem is due by Friday. So you open ChatGPT, paste the PagerDuty alert, 200 lines of logs, a Slack thread, and type: "Write a post-mortem with timeline, root cause, and action items."
Thirty minutes later you have a polished document. Headers. Timeline. Root cause. Action items. Leadership reads it, nods approvingly, files it in Confluence.
Learning: zero.
The output reads like an experienced SRE wrote it. The problem is that no SRE would have written it from the data you provided — because the data you provided is a keyhole view of what actually happened. You chose what to paste. That choice is the first point of failure, and everything downstream inherits it.
The Ritual
Here's the copy-paste workflow. You've done it. Your team has done it. Half the incident response community is doing it right now:
- Open ChatGPT or Claude.
- Paste the PagerDuty alert that fired.
- Paste 200 lines of logs you think are relevant.
- Paste a Slack thread where the team was debugging.
- Add "Write a post-mortem with timeline, root cause, and action items."
- Marvel at the output. Clean it up. Ship it.
Thirty minutes instead of four hours. The economics are irresistible. The output is also fiction.
| What You Pasted | What You Didn't | Why It Matters |
|---|---|---|
| Application logs from the failing service | Logs from the upstream service that caused the failure | Root cause misattributed |
| The alert that fired | The three alerts that didn't fire but should have | Detection gaps invisible |
| Slack thread from the war room | The DM where someone said "I deployed at 2:47 and something feels off" | Timeline incomplete |
| Grafana screenshot of error rate | The deployment dashboard showing a rollout 4 minutes before | Correlation missed |
| The 500 errors | The 200 responses silently returning stale cached data | Blast radius underestimated |
You are the bottleneck. Your memory is incomplete. Your selection is biased toward what you already suspect. You're feeding the LLM a narrative, then asking it to confirm that narrative. It happily obliges.
Confident, Plausible, Wrong
When the LLM encounters gaps in the pasted data, it doesn't say "I don't have enough information." It fills the gap. Confidently.
Here's the kind of thing the workflow produces — composite examples, not verbatim quotes, but every pattern below has shown up in real post-mortems I've reviewed:
- "The connection pool was likely exhausted due to the increased traffic from the marketing campaign launched that morning." — There was no marketing campaign. The LLM invented a plausible cause for a traffic spike it couldn't explain.
- "The rollback was initiated at 14:32 UTC." — The rollback happened at 14:47. Fifteen minutes off. The LLM interpolated from surrounding context.
- "The database failover triggered as expected." — There was no failover. The LLM assumed a standard HA pattern and stated it as fact.
Each reads like a reasonable statement. Each is wrong. Because the document sounds authoritative, nobody questions it. The post-mortem becomes organizational fiction everyone treats as truth.
This is worse than no post-mortem. A missing post-mortem is an acknowledged gap. A wrong post-mortem is a confident lie that actively misdirects future decisions.
The Streetlight Effect, Automated
There's a parable about a drunk searching for his keys under a streetlight — not because he dropped them there, but because that's where the light is.
Copy-paste post-mortems are the streetlight effect industrialized. The LLM operates within the light cone you provide. It cannot know what's in the dark. And here's the feedback loop that makes it dangerous:
- Engineer suspects root cause is X.
- Engineer gathers evidence supporting X.
- Engineer pastes evidence into LLM.
- LLM confirms X with a well-structured argument.
- Post-mortem documents X as root cause.
- Nobody challenges it because it reads so well.
You've automated confirmation bias and given it a professional sheen. The most dangerous post-mortem is the one that's wrong but convincing — a bad manual post-mortem at least invites scrutiny. A polished LLM-generated one sails through review.
What Changes With Live Access
Instead of pasting data into a prompt, give the LLM direct access to the systems:
- Metrics storage (Prometheus, VictoriaMetrics, ClickHouse, Datadog) — raw PromQL/MetricsQL/SQL against the actual time-series database. Not a screenshot of a Grafana panel. The real data, at full resolution, for the exact incident window. The LLM writes the query, reads the result, adjusts the time range, drills into a label dimension — the same workflow a senior SRE does, minus the 15 tabs.
- Visualization (Grafana) — dashboard discovery and annotation. Which dashboards exist for this service? What do the SLO panels look like? Were any annotations left by previous responders?
- Alert management (Alertmanager, Healthchecks.io, Checkly, Uptime Kuma) — not just "what fired" but the full configuration. What thresholds are set? What was silenced and by whom? What should have fired based on the metrics but didn't because the alert doesn't exist? An agent that can read alert config and cross-reference it with actual metric values can identify detection gaps no human thought to look for.
- Traces (Jaeger, Tempo) — follow requests across service boundaries. Find the slow span. See where the cascade started.
- Logs (Loki, Elasticsearch) — search with its own queries, not your pre-selected snippets. Iterate on LogQL/Lucene the way you would, but without the fatigue of doing it at 3 AM.
- Deployment history (ArgoCD, GitHub Actions) — what changed and when.
- Git (blame, diff, commit history) — who changed what code and why. The actual diff, not a summary.
MCP servers exist for most of these now. The protocol is designed exactly for this: giving LLMs structured access to external tools.
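The alert-config cross-referencing described above is mechanical enough to sketch. A hedged illustration, assuming a simplified rule shape rather than any real Alertmanager or Prometheus rule schema:

```python
# Hedged sketch of a detection-gap check: compare configured alert
# thresholds against what each metric actually did during the incident.
# The AlertRule shape is a simplified assumption, not a real
# Alertmanager or Prometheus rule schema.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold: float  # alert fires when the metric exceeds this value

def detection_gaps(rules: list[AlertRule], observed_max: dict[str, float]) -> dict:
    """Classify incident metrics: missed (spiked under threshold) or unmonitored."""
    covered = {r.metric: r for r in rules}
    report: dict = {"missed": [], "unmonitored": []}
    for metric, peak in observed_max.items():
        rule = covered.get(metric)
        if rule is None:
            report["unmonitored"].append(metric)  # no alert exists at all
        elif peak < rule.threshold:
            # metric misbehaved but stayed under the configured threshold
            report["missed"].append((metric, peak, rule.threshold))
    return report

# Illustrative values only
rules = [AlertRule("payment_gateway_latency_seconds", threshold=10.0)]
observed = {
    "payment_gateway_latency_seconds": 4.2,  # severe, but under the 10s threshold
    "payment_gateway_conn_timeouts": 350.0,  # no rule covers this metric
}
print(detection_gaps(rules, observed))
```

The point of the sketch: both failure classes (an alert that exists but can't fire, and a metric with no alert at all) fall out of one pass over config plus observed data, which is exactly the pass humans skip.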
The critical distinction: Grafana is a lens, not the data. Copy-pasting a Grafana screenshot gives the LLM a rendered image of one query at one zoom level chosen by a human. Querying Prometheus directly gives it the underlying time-series — it can re-aggregate, shift the window, compare labels, compute derivatives. The difference is handing someone a photo of a crime scene versus letting them walk through it.
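For concreteness, a minimal sketch of what "querying Prometheus directly" means, using the standard /api/v1/query_range HTTP API endpoint; the base URL and incident window below are assumptions:

```python
# Sketch: building a direct Prometheus range query instead of reading a
# rendered Grafana panel. /api/v1/query_range is the standard Prometheus
# HTTP API; the base URL and timestamps here are assumptions.
from urllib.parse import urlencode

PROM_URL = "http://localhost:9090"  # assumed local Prometheus

def range_query_url(promql: str, start: int, end: int, step: str = "30s") -> str:
    """Build a query_range URL for the exact incident window, at full resolution."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{PROM_URL}/api/v1/query_range?{params}"

# The agent can reissue the same query with a shifted window, a finer
# step, or a new label matcher -- operations a screenshot cannot support.
url = range_query_url(
    'rate(http_requests_total{service="checkout",code=~"5.."}[5m])',
    start=1_700_000_000,
    end=1_700_003_600,
)
```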
The Same Incident, Different Investigation
Take the P1 from earlier. With copy-paste, the engineer pasted application logs and the LLM blamed a connection pool.
With live access:
The LLM receives the PagerDuty incident. Queries the on-call platform for timeline, responders, and related alerts.
Queries Prometheus directly: rate(http_requests_total{service="checkout", code=~"5.."}[5m]). Not a screenshot — the raw time-series. Sees the error rate spike, then broadens: queries the same metric across upstream services. Finds elevated latency on payment-gateway starting 4 minutes earlier.
Pulls a 7-day baseline from VictoriaMetrics: http_request_duration_seconds{service="payment-gateway"}. Latency was stable at 120ms for a week. It jumped to 4.2s at 14:23 UTC. Not normal variance. Something changed.
Queries the deployment pipeline: Finds a config change deployed to payment-gateway at 14:21 UTC. Queries Git for the diff. One line: connection timeout reduced from 30s to 5s.
Queries Loki: {service="payment-gateway"} |= "context deadline exceeded". Hundreds of matches starting at 14:23. The timeout was too aggressive for the downstream database under normal load.
Queries the alerting layer. Pulls the full alert rule config. No alert for payment-gateway connection timeouts. A latency alert exists, but the threshold is 10s — the 4.2s spike flew under it. There's also a silence active on payment-gateway alerts, set by an engineer two weeks ago for a maintenance window that ended. Nobody removed the silence.
Root cause: Config change reduced connection timeout below what the downstream database required. Contributing factor: stale silence masked the degradation for the duration of the incident.
Not a traffic spike. Not connection pool exhaustion. A one-line config change plus an orphaned silence — neither of which anyone would have copy-pasted into a prompt because nobody knew to look there.
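The deploy-to-anomaly correlation step in that walkthrough reduces to a small mechanical check. A sketch with illustrative timestamps, not real incident data:

```python
# Sketch of deploy-to-anomaly correlation: flag changes that landed
# within a short window before a metric's behavior shifted.
# The event shape and timestamps are illustrative assumptions.
from datetime import datetime, timedelta

def suspects(deploys: list[tuple[str, datetime]], anomaly_start: datetime,
             window: timedelta = timedelta(minutes=10)) -> list[str]:
    """Return deploys that landed inside `window` before the anomaly began."""
    return [name for name, ts in deploys
            if timedelta(0) <= anomaly_start - ts <= window]

deploys = [
    ("checkout v2.14", datetime(2024, 5, 1, 13, 50)),
    ("payment-gateway config change", datetime(2024, 5, 1, 14, 21)),
]
anomaly = datetime(2024, 5, 1, 14, 23)  # latency jump begins
print(suspects(deploys, anomaly))  # → ['payment-gateway config change']
```

Trivial to compute, yet it requires having both the deployment history and the metric timeline in one place, which is the part the copy-paste workflow never provides.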
The LLM found it because three properties of its workflow were different — none of them magic, all of them mechanical:
- It read alert configuration, not just fired alerts. This is the move almost no human makes during a post-mortem. You look at what fired. You don't audit what should have fired. The agent does both because both are just API calls to it.
- It iterated. It didn't try to synthesize an answer from one batch of inputs. It pulled data, formed a hypothesis, queried for evidence, adjusted, queried again. The same loop a senior SRE runs in their head — but without the cognitive cost.
- It crossed system boundaries without context-switching. Metrics → deploys → git → logs → alerts, in one continuous investigation. Humans tend to stay in whichever tool they opened first because tab-switching costs attention.
What the Investigation Actually Surfaces
The interesting finding from incidents like the one above isn't the immediate root cause. It's the structural problems that the investigation reveals on the way to the root cause. Three things came out of that mock investigation that have nothing to do with payment-gateway specifically:
- An orphaned silence. Someone set it for a maintenance window two weeks ago. Nobody removed it. The system has no policy for silence expiration. This is a governance gap, not an incident — but you only discover it because the agent looked at silence state.
- A coverage gap. Connection timeouts on payment-gateway have no alert at all. Not a tuning problem — a missing monitor. Across the whole estate, how many other services have invisible failure modes? Nobody knows, because nobody audits.
- A signal-to-noise problem. The latency alert exists, but its threshold (10s) is high enough that a 4.2s spike — 35× the baseline, severe enough to break a downstream service — never paged anyone. The threshold was set to suppress noise. It also suppressed signal.
These aren't post-mortem findings. They're standing weaknesses in the alerting estate that the incident exposed by accident. A connected LLM can surface this kind of structural problem because it has the breadth to look across config, not just the depth to look at one incident.
This is the part that compounds. Every incident becomes an audit of the alerting layer, not just a forensic exercise on what broke. The post-mortem stops being a backward-looking artifact and becomes input to a forward-looking maturity process: which thresholds are wrong, which silences are stale, which services are unmonitored, which alerts fire constantly without anyone caring. That's where the actual improvement loop lives.
What the LLM Actually Brings
To be clear: the LLM isn't doing anything a skilled SRE couldn't do manually. The value isn't superhuman reasoning. It's:
Breadth without fatigue. It can check 15 data sources in 30 seconds. An exhausted post-incident engineer checks the 3 they remember.
No confirmation bias. It doesn't have a hypothesis when it starts pulling data. It follows anomalies wherever they lead — including into services the on-call engineer doesn't own.
Cross-system correlation. It naturally queries across metrics, logs, traces, deploys, and alert config in a single investigation, where a human investigator rarely leaves the first tool they open.
Detection gap discovery. This is the one humans almost never do. The LLM can read the alert config, compare it against actual metric behavior during the incident, and say: "This threshold would not have caught this. This silence should have expired. This service has no alerts at all." That's not a post-mortem finding you get from copy-paste.
I built an internal version of this workflow against a metrics platform handling several billion active time series. The agent runs an OODA loop on a fresh PagerDuty incident in under 180 seconds — parsing the alert URL, fanning out parallel queries to alert metadata, check definitions, service catalog, and metric ranges, then synthesizing an incident briefing with hypotheses, blast radius assessment, and recommended first actions. The breadth-without-fatigue claim isn't theoretical. The infrastructure problem isn't "can it be built" — it can. The infrastructure problem is what comes next.
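The fan-out stage of that loop is the mechanically simple part. A sketch, with placeholder fetchers standing in for the real integrations:

```python
# Sketch of the fan-out stage: query independent sources in parallel,
# then merge for a single synthesis pass. The fetch functions are
# placeholders standing in for real integrations.
from concurrent.futures import ThreadPoolExecutor

def fetch_alert_metadata(incident_id: str) -> dict:
    return {"alert": f"metadata for {incident_id}"}

def fetch_check_definitions(incident_id: str) -> dict:
    return {"checks": ["latency", "error-rate"]}

def fetch_service_catalog(incident_id: str) -> dict:
    return {"owner": "payments-team"}

def fetch_metric_ranges(incident_id: str) -> dict:
    return {"window": "incident +/- 30m"}

SOURCES = [fetch_alert_metadata, fetch_check_definitions,
           fetch_service_catalog, fetch_metric_ranges]

def gather(incident_id: str) -> dict:
    """Fan out to every source concurrently; breadth costs seconds, not attention."""
    merged: dict = {}
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        for result in pool.map(lambda fetch: fetch(incident_id), SOURCES):
            merged.update(result)
    return merged
```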
Governance Isn't Optional
Here's the part most "give the LLM access to your stack" articles skip.
Once your agent can query Prometheus, read alert config, pull Git diffs, and inspect deployment history, it has read access to a meaningful chunk of your production posture. That's already a security conversation most organizations haven't had. But read access is the easy case.
The moment you give the agent any write capability — silencing an alert during investigation, creating a Jira ticket from findings, posting to Slack, marking a deployment as suspect — you've turned a forensic tool into an actor. And actors need rules.
Concrete questions you need answered before this goes near production:
- Who approves what? Reading metrics is fine. Silencing an alert mid-incident isn't. Writing to PagerDuty isn't. Restarting a pod "to test a hypothesis" definitely isn't. The boundary between read and write needs a policy, and that policy needs an enforcement point that isn't the LLM itself.
- Who audits the agent? Every query the LLM made during the investigation should be logged with the same rigor as a human SRE running kubectl exec. If the post-mortem cites a metric query, you need to be able to replay that exact query later. Not "approximately what the LLM looked at" — the literal request and response. Without this, you've just moved the unreliable narrator from the human to the machine.
- What leaves the perimeter? A connected LLM is, by definition, sending production data to an inference endpoint. If that endpoint is a third-party API, your incident data is now somebody else's training data unless you have a contract that says otherwise. For regulated industries this is non-negotiable. For everyone else it's still embarrassing the first time it shows up in a breach disclosure.
- What about prompt injection? Logs contain user input. User input can contain instructions. If your investigation pulls logs into the LLM context window, a malicious user can — in principle — inject instructions that change the agent's behavior. This is not theoretical; it's the same class of attack that's already been demonstrated against agentic systems with tool access.
The right architecture puts a policy layer between the LLM and the systems it queries. Read operations flow through with logging. Write operations require an approval that doesn't originate from the LLM itself — out-of-band, signed, auditable. This isn't an abstract security concern. It's the difference between an agent that helps you investigate and an agent that becomes a new attack surface in your incident response process.
The governance layer for this doesn't ship in the box with most MCP servers today. That's the gap.
(Full disclosure: this is the gap I'm building for, in a side project called MCP Hangar. The arguments above stand on their own — I'm not pitching here, but you should know what I'm working on, because it shapes what I notice.)
Why Nobody's Doing This Yet
The MCP connectors for observability infrastructure are real and getting better — but the production-readiness varies wildly:
| System | MCP Server Status | Where It Falls Short |
|---|---|---|
| Grafana | Official, mature (40+ tools, actively developed) | Local stdio/SSE only; production deployment patterns still emerging |
| Grafana Cloud Traces | Built-in, public preview | Limited to Tempo/Cloud customers |
| Prometheus / VictoriaMetrics | Via Grafana datasource proxy or community wrappers | Direct community wrappers vary in quality; label cardinality discovery still weak |
| ClickHouse | Multiple community projects | Mostly raw SQL passthrough; schema-aware iteration uneven |
| Alertmanager | Community wrappers exist | Most read firing alerts only — silence and config introspection rare |
| Healthchecks / Uptime Kuma | Minimal | Effectively absent |
| PagerDuty / OpsGenie | Community wrappers | Incident read works; config and silence management partial |
| incident.io | Limited | API is solid; MCP layer thin |
| Loki / Elasticsearch | Community projects | Basic queries work; iterative log refinement clunky |
| Jaeger / Tempo (standalone) | Sparse | Cloud Traces has it; OSS Tempo less so |
| Datadog / New Relic | Vendor-side movement | Closed roadmaps; expect movement in 2026 |
The point isn't that the tooling doesn't exist — it does, more than even six months ago. The point is that assembling these into a coherent investigation workflow is still an integration project, not a product. You can wire Grafana's MCP server into Claude today and get real value. You can't yet hand a junior SRE a one-click "post-mortem agent" that understands your stack, respects your policies, and survives audit.
That gap is closing fast. It's the same trajectory as distributed tracing: messy early adoption, transformative once standardized, eventually table stakes.
Honest Inventory
Even with copy-paste, LLMs aren't useless for post-mortems. They structure raw notes into readable timelines. They flag missing sections. They draft action items from established patterns. These are writing assistance tasks — the LLM is a technical writer, not an investigator. That's fine, if you know the difference.
Where they're actively harmful: root cause analysis from incomplete data, blast radius estimation without access to customer-facing metrics, and counterfactual reasoning ("if the circuit breaker had been configured correctly..."). The LLM will do all three confidently. It will be wrong. The document will read well enough that nobody catches it.
Maturity progression for organizations adopting this:
1. No post-mortem. Nothing to learn from.
2. Manual post-mortem. Gold standard, expensive, inconsistent across teams.
3. Copy-paste LLM as formatting assistant. Faster drafts, but encodes selection bias. This is where most "AI-assisted" organizations actually are right now, calling it innovation.
4. Connected LLM as investigation assistant. Broader evidence, real correlation, structural problem discovery. Requires both MCP infrastructure and governance.
5. Connected LLM with policy layer and human review. LLM investigates and surfaces findings; humans verify and add context machines can't reach (organizational politics, customer impact judgment, prior knowledge of recurring patterns). This is the level worth aiming for.
Most organizations doing "AI-assisted post-mortems" are at the third level and calling it the fifth. The infrastructure to bridge that gap exists in pieces. The governance to make it safe doesn't, yet, in a packaged form.
What This Actually Solves
The connected post-mortem doesn't replace the SRE. It replaces the 3 AM data-gathering marathon that precedes the SRE's analysis.
Today, post-mortem quality is a function of how thorough the on-call engineer was at collecting evidence while exhausted. That's a terrible dependency. The engineer who resolved the incident at 4 AM is not the person you want curating evidence for organizational learning.
A connected LLM inverts this. Evidence gathering becomes systematic, exhaustive, and cheap. The human's job shifts from "find the data" to "verify the conclusions and add the context the agent can't reach" — which is where human judgment actually matters.
The infrastructure isn't ready today as a turnkey product. The MCP wrappers need to mature past the early-adopter phase. The governance layer needs to exist as something other than a custom integration. But the direction is clear: the post-mortem of the future isn't a human pasting logs into a chatbot. It's an agent walking through your entire observability stack, pulling threads, cross-referencing timelines, surfacing both the immediate root cause and the structural weaknesses the incident exposed — for a human to validate, prioritize, and act on.
Until then, if you're copy-pasting data into ChatGPT and calling the output a root cause analysis — at least have the honesty to label it what it is. It's a draft. It's partial. It encodes your blind spots. Treat it accordingly.
"A post-mortem written from data you selected is an autobiography, not an investigation. Autobiographies are unreliable narrators by definition."