
LLM Observability Maturity Scorecard

Diagnose where your team stands across 6 dimensions of LLM observability. Get a composite score, gap analysis, and concrete next steps.

The 5 Maturity Levels

Progression from no observability to closed-loop automated intelligence.

Level 0: Dark

Absence. No LLM-specific observability exists. The system is a black box. Failures are discovered by end users.

Level 1: Theater

Presence without utility. Instrumentation exists but does not inform decisions. Data is collected but not consumed. There is no operational difference between having these logs and not having them.

What changes at this boundary

The minimum for L1: at least one LLM endpoint has some form of logging or tracing. Even console.log counts. The bar is existence, not quality.
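A minimal sketch of what clears the L1 bar: one structured log line per LLM call. Everything here is illustrative (the field names, the `traced` decorator, and the stand-in `call_llm`); the point is that even this much counts as existence.

```python
import json
import time
import uuid
from functools import wraps

def traced(fn):
    """Minimal L1 instrumentation: emit one structured log line per LLM call."""
    @wraps(fn)
    def wrapper(prompt, **kwargs):
        start = time.time()
        response = fn(prompt, **kwargs)
        # A structured line beats console.log, but either clears the L1 bar.
        print(json.dumps({
            "trace_id": str(uuid.uuid4()),
            "latency_ms": round((time.time() - start) * 1000, 1),
            "prompt": prompt,
            "response": response,
        }))
        return response
    return wrapper

@traced
def call_llm(prompt):
    # Stand-in for a real provider call.
    return f"echo: {prompt}"

call_llm("hello")
```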

Level 2: Measured

Operational awareness. Infrastructure metrics are systematically collected and reviewed. The team can answer: "How fast? How often? How much?" But NOT: "How good?"

What changes at this boundary

This is the hardest boundary and where most organizations stall. L1 has data nobody uses. L2 has data that is reviewed on a defined cadence (even just weekly), with defined operational metrics, by defined people. The diagnostic test: can you name the person who reviewed LLM metrics in the last 7 days? If no → L1.
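What "how fast, how often, how much" looks like in practice can be sketched against assumed trace records (the field names and the p95 computation are illustrative; note that nothing here says anything about quality):

```python
# Assumed trace records -- the shape is illustrative, not a real schema.
traces = [
    {"latency_ms": 420,  "error": False, "total_tokens": 512},
    {"latency_ms": 1800, "error": False, "total_tokens": 2048},
    {"latency_ms": 650,  "error": True,  "total_tokens": 0},
    {"latency_ms": 510,  "error": False, "total_tokens": 730},
]

def operational_summary(traces):
    """Answers 'how fast, how often, how much' -- and nothing about 'how good'."""
    latencies = sorted(t["latency_ms"] for t in traces)
    p95_index = max(0, round(0.95 * len(latencies)) - 1)  # nearest-rank p95
    return {
        "p95_latency_ms": latencies[p95_index],
        "error_rate": sum(t["error"] for t in traces) / len(traces),
        "total_tokens": sum(t["total_tokens"] for t in traces),
    }

print(operational_summary(traces))
```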

Level 3: Governed

Quality awareness + ownership. Semantic quality is measured. Someone owns the outcomes. SLOs exist. The team can answer: "How good? Who's responsible? What's acceptable?"

What changes at this boundary

L2 measures infrastructure (latency, errors, tokens). L3 measures the quality of the output itself — relevance, correctness, safety, grounding. This is the level transition that is uniquely LLM-specific. In traditional observability, L2 is often sufficient. For LLMs, L2 is dangerously incomplete.
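A hedged sketch of what L3-style quality tracking against SLOs might look like. The evaluation records (e.g. from an LLM-as-judge or human review), the signal names, and the SLO targets are all assumed:

```python
# Hypothetical per-response quality evaluations.
evals = [
    {"relevance": 0.90, "grounded": True},
    {"relevance": 0.70, "grounded": False},
    {"relevance": 0.95, "grounded": True},
    {"relevance": 0.80, "grounded": True},
]

# Assumed SLO targets -- the numbers are illustrative, not recommendations.
SLO = {"grounded_rate": 0.90, "mean_relevance": 0.80}

def check_slo(evals, slo):
    """Compare observed quality against each SLO target: (observed, met?)."""
    observed = {
        "grounded_rate": sum(e["grounded"] for e in evals) / len(evals),
        "mean_relevance": sum(e["relevance"] for e in evals) / len(evals),
    }
    return {k: (observed[k], observed[k] >= slo[k]) for k in slo}

print(check_slo(evals, SLO))
```

The output makes "what's acceptable" explicit: a grounding rate of 0.75 against a 0.90 target is a quality breach that no latency dashboard would surface.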

Level 4: Predictive

Proactive + closed-loop. Anomalies detected before user impact. Observability data drives automated or semi-automated decisions about models, prompts, and routing.

What changes at this boundary

L3 has humans reviewing quality data and making decisions. L4 has systems that detect anomalies and either trigger automated responses or surface actionable alerts before users notice degradation.
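One minimal form of L4 detection is a baseline-deviation check on a quality metric. Real systems use more robust detectors (seasonality-aware, windowed, per-segment); this sketch only shows the shape, and the baseline numbers are invented:

```python
from statistics import mean, stdev

def is_anomalous(history, current, sigmas=3.0):
    """Flag a metric that drifts beyond `sigmas` standard deviations
    of its historical baseline. A deliberately simple stand-in."""
    mu, sd = mean(history), stdev(history)
    return abs(current - mu) > sigmas * sd

# Invented daily quality scores for a stable week.
baseline_quality = [0.91, 0.93, 0.92, 0.90, 0.94, 0.92, 0.91]

print(is_anomalous(baseline_quality, 0.92))  # normal variation
print(is_anomalous(baseline_quality, 0.78))  # degradation, before users complain
```

The L4 distinction is what happens next: the `True` case pages someone or triggers a rollback, rather than waiting for a human to notice a dashboard.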

The 6 Assessment Dimensions

Each dimension is independently scoreable. Being strong in one does not imply strength in another.

Trace Coverage

What fraction of your LLM interactions are captured with enough structure to be queryable?

Measures

The breadth and structural quality of your instrumentation. This is the foundational layer — without data, nothing else in this scorecard matters.

Does NOT measure

Whether anyone looks at the traces (that's Feedback Loop). Whether the traces include quality metrics (that's Quality Signals). This dimension is purely about capture.
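At its simplest, this dimension reduces to a fraction over an inventory of LLM call sites. A toy example (the endpoint names and the flat boolean are assumed; a real inventory would also grade structural quality per endpoint):

```python
# Assumed inventory: every LLM call site, and whether it emits a queryable trace.
endpoints = {
    "chat_answer": True,
    "summarize_ticket": True,
    "classify_intent": False,
    "batch_embeddings": False,
}

coverage = sum(endpoints.values()) / len(endpoints)
print(f"trace coverage: {coverage:.0%}")
```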

Quality Signals

Do you measure the quality of what your LLMs produce — not just whether they responded?

Measures

The breadth and sophistication of your quality measurement system. How many types of quality signals do you collect, and how systematically?

Does NOT measure

Hallucination detection specifically (that's Hallucination Awareness). Cost (that's Cost Visibility). Whether anyone acts on quality data (that's Feedback Loop).

Why this dimension is separate

Quality Signals is about the breadth of your quality measurement. Hallucination Awareness is about the depth of your capability on the single most dangerous LLM failure mode. You can be L3 on Quality Signals (you measure relevance, completeness, tone) and L1 on Hallucination Awareness (you don't detect fabrications). These are independent axes.

Hallucination Awareness

How do you detect, classify, and respond to your models fabricating information?

Measures

Your specific capability to handle the failure mode that is unique to LLMs and does not exist in traditional software. A service can be slow, broken, or expensive — only an LLM can confidently lie to your users with a smile.

Does NOT measure

Broad quality measurement (relevance, completeness, tone) — that's Quality Signals. This dimension is specifically about fabrication detection.

Why this dimension is separate

You can have broad quality measurement (relevance, completeness, tone) and still be completely blind to hallucinations. Conversely, you can have excellent hallucination detection and no broader quality framework. These capabilities are independently valuable and independently assessable. A concrete example: an org can be L3 on Quality Signals — measuring relevance, tone, and completeness with automated evaluators — and simultaneously L0 on Hallucination Awareness because they have zero grounding verification in their RAG pipeline. This is not an edge case. It is the typical state of teams that adopted LLM-as-judge evals through Arize or Langfuse but never built source-claim verification. A single LLM-as-judge evaluator does not cover both dimensions.
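A deliberately naive sketch of the missing capability, source-claim verification: flag response sentences whose content words mostly do not appear in the retrieved sources. Real pipelines use NLI models or claim-level verification; the overlap heuristic, threshold, and example strings here are purely illustrative.

```python
def unsupported_claims(response_sentences, source_text, min_overlap=0.5):
    """Flag sentences with low content-word overlap against retrieved sources.
    A toy grounding check -- production systems use NLI or claim extraction."""
    source_words = set(source_text.lower().split())
    flagged = []
    for sentence in response_sentences:
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in source_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

sources = "the invoice total was 1,240 euros and was paid on march 3"
print(unsupported_claims(
    ["invoice total 1,240 euros paid march", "refund issued within hours"],
    sources,
))
```

Even this crude check is a capability a relevance-and-tone evaluator does not provide, which is exactly why the two dimensions score independently.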

Cost Visibility

Do you understand what your LLM usage costs — at the granularity that enables decisions?

Measures

Your ability to attribute, forecast, and optimize LLM spending. This is the FinOps dimension. LLM-specific factors: token economics are asymmetric (input vs. output pricing); model selection is a cost lever (same task can differ 10x); context window utilization affects spend; caching requires instrumentation to measure impact; provider pricing tiers are context-dependent.

Does NOT measure

Quality of outputs (that's Quality Signals). Infrastructure performance (that's Trace Coverage). This dimension is specifically about cost attribution and optimization.
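The asymmetric token economics and the model-selection lever can be made concrete with per-million-token prices. The prices and model names below are invented for illustration; real provider pricing varies and changes.

```python
# Illustrative per-million-token prices (USD) -- NOT real provider pricing.
PRICES = {
    "big-model":   {"input": 3.00, "output": 15.00},
    "small-model": {"input": 0.15, "output": 0.60},
}

def call_cost(model, input_tokens, output_tokens):
    """Input and output tokens are priced asymmetrically."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Same workload, two models: model selection as a cost lever.
workload = (12_000, 1_500)  # (input_tokens, output_tokens) per request
print(call_cost("big-model", *workload))
print(call_cost("small-model", *workload))
```

On these illustrative numbers the same task differs by roughly 20x between models, which is why attribution at the per-request level, not the monthly invoice, is what enables decisions.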

Incident Ownership

When your model degrades or fails — is there a defined process, or is it chaos?

Measures

Organizational readiness to respond to LLM-specific incidents. This is deliberately a process/people dimension, not a technology dimension. LLM-specific incident types: quality degradation (model outputs get worse without any infrastructure signal); prompt regression (a prompt change degrades quality in untested cases); provider degradation (upstream model provider degrades quality); data contamination (retrieval pipeline surfaces incorrect data); cost explosion (code change causes a 10x cost spike).

Does NOT measure

The quality measurement system itself (that's Quality Signals). Technical tracing capability (that's Trace Coverage). This dimension is about the organizational response, not the detection tooling.

Why this dimension is separate

Observability without incident response is voyeurism. You can have the most sophisticated dashboards in the world — if nobody knows whose job it is to look at them when things go wrong, you have observability theater. This dimension is the bridge between "we can see the problem" and "we can fix the problem."

Feedback Loop

Does your observability data actually change what you do — or is it a dashboard nobody looks at?

Measures

The degree to which observability output is connected to engineering and product decisions about models, prompts, and system design. This is the "closed loop" dimension — where observability becomes engineering.

Does NOT measure

The quality of your evaluation (that's Quality Signals). The coverage of your tracing (that's Trace Coverage). This dimension measures whether the data you collect has a path back to the system that produced it.

Why this dimension is separate

Most organizations stop at "collect data" and "build dashboard." The leap to "data drives decisions" and then "data drives automated actions" is the maturity gap that separates orgs who improve from orgs who accumulate technical debt with better visibility. This is the capstone dimension — it only works when everything below it functions. It is intentionally ordered last to reflect that dependency, not to suggest it matters less.

How Scoring Works

Score yourself 0–4 on each of the 6 dimensions. The composite score is the arithmetic mean of all 6 dimension scores.

Overall maturity level = floor of the composite score. A composite of 2.8 → Level 2 (Measured).

Uneven profile warning fires when any dimension scores 0, or when a dimension is 2+ levels below your composite. A high composite with a critical gap dimension is still a broken system — the warning ensures you see it clearly.

Gap actions are sorted lowest-scoring dimensions first. Fix the foundation before decorating.
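The scoring rules above translate directly into code. The six dimension keys are this scorecard's own; the example scores are invented to show the uneven-profile case.

```python
import math

def score_report(dimensions):
    """dimensions: {name: score 0-4}. Implements the scoring rules above."""
    composite = sum(dimensions.values()) / len(dimensions)
    level = math.floor(composite)
    # Warning fires on any zero, or any dimension 2+ levels below composite.
    uneven = sorted(n for n, s in dimensions.items()
                    if s == 0 or composite - s >= 2)
    # Gap actions: lowest-scoring dimensions first.
    fix_first = sorted(dimensions, key=dimensions.get)
    return {"composite": composite, "level": level,
            "uneven_profile": uneven, "fix_first": fix_first}

report = score_report({
    "trace_coverage": 3, "quality_signals": 3, "hallucination_awareness": 0,
    "cost_visibility": 4, "incident_ownership": 3, "feedback_loop": 4,
})
print(report)
```

This example lands at a composite of about 2.83 (Level 2) with a zero on hallucination awareness, so the uneven-profile warning fires despite the otherwise healthy scores: exactly the "high composite, broken system" case the warning exists for.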