Executive Summary
If your primary method of detecting production outages is a scheduled query scanning terabytes of JSON for the string "ERROR", you do not have a monitoring system. You have a house of cards built on spelling, latency, and financial negligence.
Log-based alerting is the path of least resistance: no code changes, no redeploys, minimal thought. It also:
- Drives runaway observability spend.
- Causes silent failures when strings change.
- Adds alerting latency when you need speed most.
- Incentivizes misuse of logging platforms as ad‑hoc metrics stores.
This post draws a sharp line:
- Logs → Debugging (the “Why”).
- Metrics → Alerting (the “What”).
Misusing logs for “What” means paying a premium to be less reliable.
It covers:
- Fragility: Why a fixed typo can blind you to an outage.
- The Read-Time Tax: The economics of scanning text to count events.
- Logs-to-Metrics (L2M): Converting textual signals into durable metrics.
- Governance: A policy to ban lazy log alerts when code is owned.
- Implementation: Practical Vector / OTEL patterns.
- Cost Audit: How to identify expensive scheduled queries.
1. The Pathology: Textual Dependency
1.1 The “String Match” Gamble
A metric is a contract.
Example: `payment_gateway_errors_total` is an API between your application and your monitoring stack. Breaking it requires intentionally removing or altering instrumentation.
A log line is prose. It depends on mood, spelling, refactors, localization, and stylistic churn.
Classic “Silent Failure” sequence:
- Setup: You have a P1 alert matching `grep "Conection failed"`. (Yes, misspelled.)
- Change: A developer fixes the typo: `"Connection failed"`.
- Result: The underlying failure persists, but the alert is mute. Dashboards are falsely green.
You are not monitoring an error condition. You are monitoring a spelling choice. This is not theoretical—production incidents have dragged on for hours because string-based detection silently broke.
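The failure mode is easy to reproduce. A minimal sketch (the log lines and alert function are hypothetical, not any vendor's API) contrasting a string-match "alert" with a counter:

```python
import re

def string_match_alert(log_lines):
    """'Alert' fires if any line matches the hard-coded pattern."""
    pattern = re.compile(r"Conection failed")  # misspelled, like the real alert
    return any(pattern.search(line) for line in log_lines)

# Before the typo fix: the alert works by accident.
old_logs = ["2024-01-01T10:00:00Z ERROR Conection failed to db-primary"]
assert string_match_alert(old_logs) is True

# After a developer fixes the spelling, the same outage is invisible.
new_logs = ["2024-01-01T10:00:00Z ERROR Connection failed to db-primary"]
assert string_match_alert(new_logs) is False  # outage ongoing, alert silent

# A counter is a contract: it is incremented at the call site,
# independent of how the message happens to be worded.
errors_total = 0

def record_failure():
    global errors_total
    errors_total += 1

record_failure()
assert errors_total == 1
```

The counter survives every rewording of the message; the regex survives none of them.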
1.2 The Latency Gap
During cascading failures:
- Metrics: Lightweight, pre-aggregated, propagate in seconds.
- Logs: Heavy, serialized, compressed, shipped, ingested, indexed, then queried.
In high-load events, log ingestion pipelines often lag 5–15 minutes. If alerting depends on logs, you learn you are down only after customers have churned.
2. The Economic Insanity: The Read-Time Tax
Alerting on logs is orders of magnitude more expensive than alerting on metrics because of evaluation complexity.
2.1 O(1) vs O(N)
| Aspect | Metric Alert | Log Query Alert |
|---|---|---|
| Operation | Compare latest value to threshold | Scan documents, decompress, parse, filter, aggregate |
| Complexity | O(1) | O(N) |
| Latency | Milliseconds | Seconds to minutes (under load) |
| Cost Driver | Storage + minimal CPU | Repeated compute + storage + indexing |
| Failure Modes | Instrumentation removed | Typos, format changes, schema drift, ingestion lag |
2.2 The Bill Comes Due
Scheduled queries (every 1–5 minutes) re-scan the same data repeatedly. This creates a Read-Time Tax: paying compute over and over to re-derive counts you could have emitted as counters once.
Real-world audits show ~30–40% of a logging bill spent executing scheduled alert queries—not ingesting new data. This is “Technical Debt as a Service.”
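The arithmetic behind the Read-Time Tax is easy to check. A back-of-the-envelope sketch (the scan size, frequency, and $/TB rate are illustrative assumptions, not benchmarks):

```python
# Assumptions (illustrative): a scheduled alert query scans the last hour
# of logs every minute, and the platform bills per TB scanned.
scan_tb_per_run = 0.05           # 50 GB of logs per evaluation window
runs_per_day = 24 * 60           # evaluated every minute
price_per_tb_usd = 5.0           # typical scan-pricing order of magnitude

daily_read_tax = scan_tb_per_run * runs_per_day * price_per_tb_usd
print(f"Log-query alert: ${daily_read_tax:,.2f}/day")  # → $360.00/day

# The metric alternative: the count was aggregated once at write time,
# so each evaluation is a single threshold comparison, essentially free.
metric_eval_cost = 0.0
print(f"Metric alert:    ${metric_eval_cost:,.2f}/day")
```

Same signal, same data, three-figure daily difference: that is the tax on re-deriving a count you could have emitted once.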
3. The Architecture: Logs-to-Metrics (L2M)
If the only place a critical signal exists is inside a log (legacy closed source, appliance, vendor output), do NOT build recurring queries over storage.
3.1 Core Principle
Extract the signal at the edge: close to ingestion, then alert on a durable metric.
3.2 The L2M Pattern
- Intercept: Collector/agent reads log stream (Vector / Fluentd / OTEL).
- Match & Extract: Pattern or parse in memory.
- Increment: Update a local counter / histogram (`legacy_app_error_total`).
- Forward Text (Optional): Store the raw log for forensics, or drop it if low value.
- Scrape / Export: Metrics pipeline consumes lightweight series.
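The steps above can be sketched in a few lines of pure Python standing in for a real collector such as Vector or the OTEL Collector (the pattern, metric name, and sample lines follow the text; they are illustrative, not a production implementation):

```python
import re
from collections import Counter

# Local metric store: what a collector would expose to a scrape endpoint.
metrics = Counter()

ERROR_PATTERN = re.compile(r"ERROR\s+(?P<code>\w+)")

def process_line(line: str, keep_raw: bool = False):
    """Intercept -> Match & Extract -> Increment -> (optionally) Forward."""
    m = ERROR_PATTERN.search(line)       # match & extract in memory
    if m:
        # Increment a durable counter instead of storing text to re-scan later.
        metrics[("legacy_app_error_total", m.group("code"))] += 1
    return line if keep_raw else None    # forward raw text only if wanted

for line in [
    "2024-05-01T10:00:00Z ERROR TIMEOUT upstream gateway",
    "2024-05-01T10:00:01Z INFO request served",
    "2024-05-01T10:00:02Z ERROR TIMEOUT upstream gateway",
]:
    process_line(line)

assert metrics[("legacy_app_error_total", "TIMEOUT")] == 2
```

The expensive text is touched exactly once, at write time; everything downstream operates on a tiny counter.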
3.3 Why L2M Wins
- Cost: One-time parse at write-time vs perpetual scans at read-time.
- Stability: Metric names are contracts; log strings are not.
- Retention: Metrics feasibly kept for 12–18+ months; raw logs rarely retained that long.
- Latency: No ingestion lag penalty for alerting.
4. Implementation: Tools of the Trade
Open source tooling fully supports L2M—no need for expensive proprietary features.
4.1 Vector (Preferred)
Vector parses logs inline with VRL (Vector Remap Language) and converts the parsed events into metrics with its `log_to_metric` transform (a `remap` transform alone processes log events; it does not emit metrics).
Example (simplified):

```toml
# vector.toml
[transforms.parse_app_logs]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)
if .error_code == null { .error_code = "unknown" }
'''

# Keep only ERROR-level events for the counter
[transforms.errors_only]
type = "filter"
inputs = ["parse_app_logs"]
condition = '.level == "ERROR"'

[transforms.logs_to_metrics]
type = "log_to_metric"
inputs = ["errors_only"]

[[transforms.logs_to_metrics.metrics]]
type = "counter"
name = "app_errors_total"
field = "level"
tags.service = "{{ service }}"
tags.error_type = "{{ error_code }}"

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["logs_to_metrics"]
address = "0.0.0.0:9090"
```
4.2 OpenTelemetry Collector
Use processors/connectors (e.g., log parsers + metric generation) to convert log events into Prometheus metrics early.
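One concrete option is the collector's `count` connector, which turns matching log records into a counter metric at ingestion time. A sketch, assuming the `filelog` receiver and `count` connector are included in your collector distribution (the file path, condition, and metric name are illustrative):

```yaml
receivers:
  filelog:
    include: [/var/log/legacy_app/*.log]

connectors:
  count:
    logs:
      legacy_app_error_total:
        description: "Error lines emitted by the legacy app"
        conditions:
          - 'attributes["level"] == "ERROR"'

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [count]
    metrics:
      receivers: [count]
      exporters: [prometheus]
```

The connector sits between the logs pipeline and the metrics pipeline, so the raw text never needs to reach queryable storage for alerting to work.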
5. Governance: The “Access to Source” Policy
A lightweight decision tree prevents costly misuse:
5.1 Decision Tree
1. Do you own the code?
   - YES → Deny the log alert. Instrument a metric. Add a backlog ticket.
   - NO → Proceed to step 2.
2. Can we deploy an edge processor (Vector / OTEL)?
   - YES → Implement L2M. Alert on the metric.
   - NO → Grant an exception (label it “High Risk / High Cost”) and review quarterly.
5.2 Legitimate Exception: Security
Security analytics (intrusion detection, anomaly hunting) often requires scanning high-detail text for unknown patterns. This post addresses operational reliability, not SIEM/threat detection.
6. Closing: Move Up the Value Chain
Alerting from raw logs is a legacy convenience pattern. In modern cloud-native systems:
- Logs → Answer “Why?” (High cardinality, narrative, expensive).
- Metrics → Answer “What?” (Low cardinality, fast, cheap).
If you are writing regexes to wake an engineer, pause and instrument instead. Each log-based alert accrues cost, fragility, and latency debt.
Instrument your code. Adopt L2M where you cannot. Make log alerts the exception, not the norm.
Appendix A: Vector L2M Configuration
Production-ready example converting Nginx access logs into metrics (avoid storing millions of low-value lines while still alerting on 5xx rates):
```toml
# 1. Ingest Nginx logs
[sources.nginx_logs]
type = "file"
include = ["/var/log/nginx/access.log"]

# 2. Parse the combined log format
[transforms.parse_nginx]
type = "remap"
inputs = ["nginx_logs"]
source = '''
. = parse_nginx_log!(.message, format: "combined")
.status = to_int!(.status)
# Split "GET /user/42 HTTP/1.1" into method and path,
# normalizing dynamic segments to prevent cardinality explosions
parts = split(string!(.request), " ")
.method = parts[0]
.path = replace(string!(parts[1]), r'/\d+', "/:id")
'''

# 3a. Request counter (status is a label; alert on the rate of 5xx values)
[transforms.log_to_metric]
type = "log_to_metric"
inputs = ["parse_nginx"]

[[transforms.log_to_metric.metrics]]
type = "counter"
name = "http_requests_total"
field = "status"
tags.method = "{{ method }}"
tags.path = "{{ path }}"
tags.status = "{{ status }}"

# 3b. Latency histogram: the stock "combined" format has no $request_time,
# so this requires an extended Nginx log_format (and a matching parser)
# [[transforms.log_to_metric.metrics]]
# type = "histogram"
# name = "http_request_duration_seconds"
# field = "request_time"

# 4. Expose to Prometheus
[sinks.prometheus_exporter]
type = "prometheus_exporter"
inputs = ["log_to_metric"]
address = "0.0.0.0:9598"
default_namespace = "nginx_edge"

# 5. Optional: Drop raw logs (no other sink consumes nginx_logs, so only metrics persist)
```
Operational Notes
- Normalize paths (`/user/{id}` → `/user/:id`) before labeling.
- Guard histogram buckets; avoid overly granular latency distributions.
- Explicitly track transform failure counts for observability of the pipeline itself.
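Path normalization deserves care, since it is the difference between a handful of series and millions. A regex-based sketch in Python (the patterns and the `:uuid`/`:id` placeholders are illustrative; extend the list for your own ID formats):

```python
import re

# Order matters: match the most specific pattern first,
# or numeric segments inside a UUID would be mangled by the /\d+ rule.
NORMALIZERS = [
    (re.compile(
        r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    ), "/:uuid"),
    (re.compile(r"/\d+"), "/:id"),
]

def normalize_path(path: str) -> str:
    """Collapse dynamic URL segments so metric labels stay low-cardinality."""
    for pattern, replacement in NORMALIZERS:
        path = pattern.sub(replacement, path)
    return path

assert normalize_path("/user/42/orders/9001") == "/user/:id/orders/:id"
assert normalize_path(
    "/session/123e4567-e89b-12d3-a456-426614174000"
) == "/session/:uuid"
```

The same replacement rules can be expressed in VRL's `replace` function inside the parse transform above.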
Appendix B: The Cost Audit SQL
Identify scheduled log queries imposing high Read-Time Tax.
```sql
-- Conceptual schema; adapt to your platform's audit log fields
SELECT
  user_email AS alert_owner,
  query_pattern,
  COUNT(*) AS execution_count_24h,
  SUM(total_bytes_processed) / POW(1024, 4) AS tb_scanned_24h,
  (SUM(total_bytes_processed) / POW(1024, 4)) * 5.0 AS estimated_daily_cost_usd -- Adjust $/TB rate
FROM `audit.log_query_history`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND is_scheduled_alert = TRUE
GROUP BY 1, 2
ORDER BY estimated_daily_cost_usd DESC
LIMIT 10;
```
Self-Check
If one alert costs $10/day (~$300/month), ask: does it prevent ≥ $3,600/year in damage?
If not, delete or re-architect (instrument a metric, apply L2M).
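That break-even is just the daily cost annualized; a one-line sanity check:

```python
daily_cost = 10.0                # $/day for one scheduled log alert
annual_cost = daily_cost * 365   # ≈ $3,650/year
monthly_cost = annual_cost / 12  # ≈ $304/month

# Keep the alert only if the incidents it catches cost more than it does.
assert annual_cost > 3_600
print(f"${monthly_cost:,.0f}/month, ${annual_cost:,.0f}/year")
```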
Key Takeaways (TL;DR)
- String matching is not monitoring; it is syntax surveillance.
- Metrics deliver lower latency, lower cost, higher reliability.
- Convert log-derived signals at the edge (L2M) instead of scanning storage.
- Enforce a governance policy: if you own the code, instrument—don’t grep.
- Audit scheduled log queries; kill cost outliers.
- Reserve log-based alerting for unavoidable edge cases and security.
“Logs explain. Metrics detect.”
Instrument accordingly.