Executive Summary
"Parallel execution is faster than sequential." Congratulations—you've rediscovered physics. That's not a benchmark finding. That's a definition.
We built a benchmark suite for MCP Hangar: 6 scenarios, 4 baselines, 100 runs per configuration, 5,300+ individual measurements, zero errors. The headline number is a 19.6× speedup for parallel fan-out. The headline number is also boring.
What's not boring:
- MCP stdio is not serial. The protocol fully supports concurrent multiplexing over a single pipe. Most clients don't use it.
- Hangar adds zero measurable overhead. The entire framework—CQRS bus, provider lifecycle, health checks, domain model—costs -3.2% to +2.2% vs raw `asyncio.gather`. That's noise, not overhead.
- Cold start deduplication is a design win, not a parallelism win. Twenty simultaneous calls trigger one startup, not twenty. 18.5× speedup from a single-flight pattern.
Core thesis: The interesting performance story of an MCP framework isn't "parallel beats sequential." It's whether the abstraction costs you anything, whether the protocol you're using over stdio is actually as limited as you think, and whether your provider lifecycle handles the ugly edge cases. The benchmarks answer all three.
This post covers the non-obvious findings. For raw data, methodology, and reproducible code: mcp-hangar/benchmarks.
1. MCP stdio Is Not Serial
1.1 The Assumption Everyone Makes
MCP's default transport is stdio. stdin/stdout. Pipes. The mental model for most people—and most client implementations—is: send request, wait for response, send next request. A serial queue.
That mental model is wrong.
MCP uses JSON-RPC 2.0 with request IDs. The protocol is explicitly designed for multiplexing. You can fire 20 requests into stdin without waiting for a single response. The server processes them concurrently (assuming asyncio or equivalent), and responses come back tagged with their request IDs in whatever order they complete.
We proved this by bypassing Hangar entirely and testing raw MCP sessions:
| Mode | 10 calls via single stdio session | Result |
|---|---|---|
| Sequential (wait between calls) | 1,047ms | 10 × ~105ms |
| Parallel (fire all, collect all) | 107ms | All concurrent |
Same pipe. Same process. Same server. The difference is purely whether you wait between sends.
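The mechanics are easy to simulate without a real MCP server: tag each JSON-RPC request with an id, write all of them before reading anything, and match responses by id as they complete. A minimal in-process sketch (the queue-based `fake_server` stands in for the server end of the pipe; all names here are illustrative, not MCP SDK API):

```python
import asyncio
import time

async def fake_server(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Stand-in for an MCP server process: handles each request
    concurrently and replies in completion order, not arrival order."""
    async def handle(msg: dict) -> None:
        await asyncio.sleep(0.1)  # simulated ~100ms tool call
        await outbox.put({"jsonrpc": "2.0", "id": msg["id"], "result": "ok"})

    tasks = []
    while True:
        msg = await inbox.get()
        if msg is None:  # sentinel: no more requests
            break
        tasks.append(asyncio.create_task(handle(msg)))
    await asyncio.gather(*tasks)

async def multiplexed_calls(n: int) -> float:
    to_server: asyncio.Queue = asyncio.Queue()
    from_server: asyncio.Queue = asyncio.Queue()
    server = asyncio.create_task(fake_server(to_server, from_server))

    start = time.perf_counter()
    # Fire every request before reading a single response.
    for i in range(n):
        await to_server.put({"jsonrpc": "2.0", "id": i, "method": "tools/call"})
    await to_server.put(None)

    # Collect responses in whatever order they arrive, keyed by request id.
    results = {}
    for _ in range(n):
        resp = await from_server.get()
        results[resp["id"]] = resp["result"]
    await server
    return time.perf_counter() - start

elapsed = asyncio.run(multiplexed_calls(10))
print(f"10 calls in {elapsed * 1000:.0f}ms")
```

Ten 100ms calls complete in roughly 100ms of wall clock, because nothing waits for a response before sending the next request—the same effect the table above shows over a real stdio pipe.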
1.2 Why This Matters
The entire MCP ecosystem—Claude Desktop, Cursor, every client—communicates with providers over stdio. If your MCP client serializes tool calls because "it's stdio, so it must be serial," you're leaving a 10× speedup on the table for every fan-out operation.
The pipe isn't the bottleneck. Your request scheduling is.
This is not a theoretical concern. During the benchmarking process, we discovered that Hangar's own async facade was serializing calls through a ThreadPoolExecutor(max_workers=4)—not because of any protocol limitation, but because of an assumption baked into a single line of code. More on that in Section 4.
2. Zero-Overhead Abstraction (Actually)
2.1 The Standard Tradeoff
Every framework promises value. Every framework costs something. Connection pooling, request routing, lifecycle management, configuration parsing, error handling—each layer adds latency. The engineering question with any framework is never "is it useful?" It's: how much does it cost me?
If you're deciding between "roll my own MCP client management" and "use a framework," overhead is the deciding factor. Features don't matter if the framework adds 50ms to every call.
2.2 The Measurement
We measured Hangar's overhead by running identical workloads through Hangar and through direct MCP client calls. Same provider. Same tools. Same arguments. Same machine. The only variable: whether the call goes through Hangar's facade, provider model, and command bus—or directly to the MCP session.
| Scenario | Direct (ms) | Hangar (ms) | Overhead |
|---|---|---|---|
| S2 N=20, parallel | 111.6 | 108.0 | -3.2% |
| S2 N=5, parallel | 107.8 | 107.3 | -0.5% |
| S3 3 providers, parallel | 105.2 | 105.2 | 0.0% |
| S5 mixed latency, parallel | 505.1 | 505.0 | -0.0% |
| S1 50ms, sequential | 2,751.7 | 2,732.2 | -0.7% |
| S1 200ms, sequential | 10,255.9 | 10,223.6 | -0.3% |
| S2 N=10, parallel | 107.7 | 110.1 | +2.2% |
Range across all measurements: -3.2% to +2.2%. Mean: approximately -0.5%.
The negative numbers (Hangar faster than direct) aren't magic—they're connection reuse and warm provider caching. The positive outlier (+2.2%) is thread pool scheduling variance at the 2ms scale.
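The comparison behind this table is a paired A/B harness: run the same workload through both entry points, warm up first, and difference the means. A stripped-down sketch of that shape (the `direct_call` and `hangar_call` bodies are placeholders for the two real code paths, not Hangar's API):

```python
import asyncio
import statistics
import time

async def direct_call() -> None:
    await asyncio.sleep(0.005)  # placeholder for a raw MCP session call

async def hangar_call() -> None:
    await asyncio.sleep(0.005)  # placeholder for the same call via the facade

def bench(fn, runs: int = 30, warmup: int = 5) -> float:
    """Mean wall-clock time per run in ms, after warmup iterations."""
    for _ in range(warmup):
        asyncio.run(fn())
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        asyncio.run(fn())
        samples.append((time.perf_counter_ns() - t0) / 1e6)
    return statistics.fmean(samples)

direct = bench(direct_call)
framed = bench(hangar_call)
overhead_pct = (framed - direct) / direct * 100
print(f"direct={direct:.2f}ms hangar={framed:.2f}ms overhead={overhead_pct:+.1f}%")
```

With identical workloads on both sides, the reported overhead is pure framework cost plus measurement noise—which is why small negative values are as plausible as small positive ones.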
2.3 What This Means
Hangar's abstraction layers—CQRS command bus, provider lifecycle, health checks, rate limiting, the entire domain model—cost nothing measurable at the I/O boundary. N=100 runs, 95% confidence intervals of ±0.5ms. This is not "approximately zero." It is statistically indistinguishable from zero.
You get provider lifecycle management, hot/cold state transitions, health checks, parallel execution, and configuration-as-code. The overhead is your rounding error.
3. Cold Start Deduplication: A Design Win
3.1 The Problem
Twenty tool calls hit a cold provider simultaneously. The provider takes 500ms to start. What happens?
This is a real scenario. MCP providers are processes. Processes need to start. In agent workflows, the first burst of tool calls often arrives before any provider is warm.
3.2 The Naive Approach
Without Hangar (sequential): Each call checks if the provider is running, finds it cold, triggers startup, waits 500ms, then calls the tool. In practice, only the first call actually cold-starts—the rest find a warm provider. But sequential execution means 20 × 54ms per call = 1,079ms total.
3.3 The Single-Flight Pattern
With Hangar (parallel): All 20 calls hit invoke() concurrently. The first call triggers provider startup. The other 19 don't spawn additional instances—they enqueue behind a single-flight gate and wait for the same startup to complete. Then all 20 execute their tool calls in parallel.
Total: 58ms. That's 18.5× faster.
| Concurrent Cold Calls (N) | Sequential (ms) | Hangar Single-Flight (ms) | Speedup |
|---|---|---|---|
| 1 | 54 | 54 | 1.0× |
| 5 | 271 | 56 | 4.9× |
| 10 | 541 | 56 | 9.7× |
| 20 | 1,079 | 58 | 18.5× |
3.4 Why This Isn't About Parallelism
The 18.5× has nothing to do with parallel tool execution. It's about the startup not happening twenty times.
The single-flight pattern—borrowed from Go's concurrency primitives—coalesces concurrent requests for the same operation into a single execution. One in-flight operation, N waiters, all unblocked on completion.
Most MCP implementations don't handle this. You either get a race condition (20 provider processes spawning), a global lock (serializing everything), or a "just retry" approach that papers over the problem with latency. The single-flight pattern is the correct solution, and the benchmark proves the margin isn't marginal—it's an order of magnitude.
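In asyncio terms, the gate is a shared task: the first caller creates it, every concurrent caller awaits the same one, and it is cleared on completion so a later cold start can run again. A minimal sketch of the pattern (not Hangar's actual implementation):

```python
import asyncio

class SingleFlight:
    """One in-flight execution, N waiters, all unblocked on completion."""

    def __init__(self) -> None:
        self._inflight: asyncio.Task | None = None

    async def run(self, op):
        if self._inflight is None:
            self._inflight = asyncio.create_task(op())
            # Clear the gate when done so a future cold start re-executes.
            self._inflight.add_done_callback(
                lambda _: setattr(self, "_inflight", None)
            )
        # shield: one waiter being cancelled must not cancel the shared op.
        return await asyncio.shield(self._inflight)

starts = 0

async def cold_start() -> str:
    global starts
    starts += 1
    await asyncio.sleep(0.05)  # simulated provider startup, scaled down
    return "warm"

async def main() -> None:
    gate = SingleFlight()
    # Twenty simultaneous calls: one startup, twenty results.
    results = await asyncio.gather(*(gate.run(cold_start) for _ in range(20)))
    assert results == ["warm"] * 20

asyncio.run(main())
print(f"startups: {starts}")
```

Because asyncio is single-threaded, the check-and-create in `run` needs no lock; a thread-based variant would need one around the same two lines.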
4. The Bonus: A Hardcoded "4"
During benchmarking, we noticed that Hangar's parallel execution showed a suspicious pattern: wall-clock time scaled as ceil(N/4) waves of ~100ms each. Twenty parallel calls took 520ms (5 waves), not 100ms (one wave).
We traced through the full stack—batch executor, provider model lock patterns, StdioClient message correlation, the MCP protocol itself—eliminating each layer as the bottleneck before finding the root cause:
```python
# facade.py, line 333
self._executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="hangar-")
```
One line. A hardcoded 4 in the async facade's thread pool. Every `hangar.invoke()` call routes through `loop.run_in_executor(self._executor, ...)`. With 4 workers, only 4 calls execute concurrently. The rest queue, producing sequential waves.
The protocol was parallel. The StdioClient was parallel. The provider model was parallel. The facade was silently serializing everything into batches of 4.
Before fix (max_workers=4):
| N | Expected (ms) | Actual (ms) | Pattern |
|---|---|---|---|
| 5 | ~100 | 208 | ceil(5/4) = 2 waves |
| 10 | ~100 | 313 | ceil(10/4) = 3 waves |
| 20 | ~100 | 521 | ceil(20/4) = 5 waves |
After fix (max_workers=20):
| N | Actual (ms) | Overhead vs direct |
|---|---|---|
| 5 | 107 | -0.5% |
| 10 | 110 | +2.2% |
| 20 | 108 | -3.2% |
The fix was changing one number. The debugging took ten layers of abstraction to trace.
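The wave pattern is easy to reproduce in isolation: submit N sleeping tasks to a 4-worker pool and the wall clock comes out at roughly ceil(N/4) times the per-task latency. A standalone demonstration (sleep times scaled down; this is not Hangar code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tool_call() -> None:
    time.sleep(0.05)  # stand-in for a ~100ms MCP tool call, scaled down

def run_batch(n: int, workers: int) -> float:
    """Wall-clock seconds to run n tasks on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.perf_counter()
        futures = [pool.submit(tool_call) for _ in range(n)]
        for f in futures:
            f.result()
        return time.perf_counter() - start

capped = run_batch(20, workers=4)     # ceil(20/4) = 5 waves of ~50ms
uncapped = run_batch(20, workers=20)  # 1 wave of ~50ms
print(f"max_workers=4: {capped * 1000:.0f}ms, "
      f"max_workers=20: {uncapped * 1000:.0f}ms")
```

The capped run takes roughly five times the uncapped one—the same ceil(N/workers) staircase the before-fix table shows at 100ms per call.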
The benchmark suite caught this before any user reported it. That's the actual value of publishing reproducible performance data: it forces you to confront what your code does, not what you think it does.
Methodology
All benchmarks use a controlled-delay MCP provider that sleeps for a configurable duration per call. This isolates framework behavior from tool implementation variance—no network jitter, no database contention, no external dependencies.
| Parameter | Value |
|---|---|
| Runs per configuration | 100 |
| Warmup iterations | 5 |
| Timing | time.perf_counter_ns() |
| Statistics | Mean, median, P95, StdDev, 95% CI (t-distribution) |
| Outlier detection | 3σ |
| Total measurements | 5,300+ |
| Total errors | 0 |
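The statistics row of the table reduces to a few standard formulas. A sketch of that reduction step, using the stdlib only (one simplification: the normal-approximation z of 1.96 stands in for the t critical value, which is about 1.984 at n=100; the P95 here is nearest-rank):

```python
import statistics

def summarize(samples_ms: list[float]) -> dict:
    """Mean, median, P95, stdev, and 95% CI half-width after 3-sigma rejection."""
    mean = statistics.fmean(samples_ms)
    sd = statistics.stdev(samples_ms)
    # 3-sigma outlier rejection, as in the methodology table.
    kept = [x for x in samples_ms if abs(x - mean) <= 3 * sd] or samples_ms
    mean = statistics.fmean(kept)
    sd = statistics.stdev(kept)
    ordered = sorted(kept)
    p95 = ordered[min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))]
    # z=1.96 approximates the t critical value at large n (n=100 -> ~1.984).
    ci_half = 1.96 * sd / (len(kept) ** 0.5)
    return {
        "mean": mean,
        "median": statistics.median(kept),
        "p95": p95,
        "stdev": sd,
        "ci95": ci_half,
    }

stats = summarize([107.0, 108.2, 106.5, 109.1, 107.7, 108.0, 107.3, 110.4])
print({k: round(v, 2) for k, v in stats.items()})
```

At n=100 with millisecond-scale standard deviations, this CI half-width lands in the ±0.5ms range quoted in Section 2.3.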
Scenarios:
| ID | What | Key Variable |
|---|---|---|
| S1 | Per-call overhead | Delay: 0–200ms, 50 calls |
| S2 | Parallel fan-out | N: 1–20 concurrent calls |
| S3 | Cross-provider parallelism | 3 providers × 100ms |
| S4 | Cold start deduplication | N: 1–20 concurrent cold calls |
| S5 | Mixed latency / head-of-line | 5×10ms + 3×100ms + 1×500ms |
| S6 | Agent workflow (3-step pipeline) | Dependent steps with per-step parallelism |
Full results, raw JSON, and chart generation: mcp-hangar/benchmarks.
Key Takeaways (TL;DR)
- MCP stdio supports full request multiplexing. If your client serializes calls, the bottleneck is your scheduling, not the protocol.
- Hangar's framework overhead is statistically zero (-3.2% to +2.2%). The abstraction is free.
- Cold start single-flight deduplication delivers 18.5× speedup—a design pattern win, not a parallelism win.
- Benchmark your own code. We found a hardcoded `ThreadPoolExecutor(max_workers=4)` silently capping concurrency. The benchmark caught it before users did.
- N=100 runs with 95% CI. The data defends itself.
"A benchmark that only proves parallel beats sequential isn't a benchmark. It's a tautology with error bars."