Executive Summary
"Parallel execution is faster than sequential." Congratulations—you've rediscovered physics. That's not a benchmark finding. That's a definition.
We built a benchmark suite for MCP Hangar: 6 scenarios, 4 baselines, 100 runs per configuration, 5,300+ individual measurements, zero errors. The headline number is a 19.6× speedup for parallel fan-out. The headline number is also boring.
What's not boring:
- MCP stdio is not serial. The protocol fully supports concurrent multiplexing over a single pipe. Most clients don't use it.
- Hangar adds zero measurable overhead. The entire framework—CQRS bus, provider lifecycle, health checks, domain model—costs -3.2% to +2.2% vs raw `asyncio.gather`. That's noise, not overhead.
- Cold start deduplication is a design win, not a parallelism win. Twenty simultaneous calls trigger one startup, not twenty. 18.5× speedup from a single-flight pattern.
Core thesis: The interesting performance story of an MCP framework isn't "parallel beats sequential." It's whether the abstraction costs you anything, whether the protocol you're using over stdio is actually as limited as you think, and whether your provider lifecycle handles the ugly edge cases. The benchmarks answer all three.
This post covers the non-obvious findings. For raw data, methodology, and reproducible code: mcp-hangar/benchmarks.
1. MCP stdio Is Not Serial
1.1 The Assumption Everyone Makes
MCP's default transport is stdio. stdin/stdout. Pipes. The mental model for most people—and most client implementations—is: send request, wait for response, send next request. A serial queue.
That mental model is wrong.
MCP uses JSON-RPC 2.0 with request IDs. The protocol is explicitly designed for multiplexing. You can fire 20 requests into stdin without waiting for a single response. The server processes them concurrently (assuming asyncio or equivalent), and responses come back tagged with their request IDs in whatever order they complete.
We proved this by bypassing Hangar entirely and testing raw MCP sessions:
| Mode | 10 calls via single stdio session | Result |
|---|---|---|
| Sequential (wait between calls) | 1,047ms | 10 × ~105ms |
| Parallel (fire all, collect all) | 107ms | All concurrent |
Same pipe. Same process. Same server. The difference is purely whether you wait between sends.
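The mechanics are easy to simulate without a real MCP server: tag each JSON-RPC request with an id, write all of them before reading anything, and match responses by id as they complete. A minimal in-process sketch (the queue-based `fake_server` stands in for the server end of the pipe; all names here are illustrative, not MCP SDK API):

```python
import asyncio
import time

async def fake_server(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Stand-in for an MCP server process: handles each request
    concurrently and replies in completion order, not arrival order."""
    async def handle(msg: dict) -> None:
        await asyncio.sleep(0.1)  # simulated ~100ms tool call
        await outbox.put({"jsonrpc": "2.0", "id": msg["id"], "result": "ok"})

    tasks = []
    while True:
        msg = await inbox.get()
        if msg is None:  # sentinel: no more requests
            break
        tasks.append(asyncio.create_task(handle(msg)))
    await asyncio.gather(*tasks)

async def multiplexed_calls(n: int) -> float:
    to_server: asyncio.Queue = asyncio.Queue()
    from_server: asyncio.Queue = asyncio.Queue()
    server = asyncio.create_task(fake_server(to_server, from_server))

    start = time.perf_counter()
    # Fire every request before reading a single response.
    for i in range(n):
        await to_server.put({"jsonrpc": "2.0", "id": i, "method": "tools/call"})
    await to_server.put(None)

    # Collect responses in whatever order they arrive, keyed by request id.
    results = {}
    for _ in range(n):
        resp = await from_server.get()
        results[resp["id"]] = resp["result"]
    await server
    return time.perf_counter() - start

elapsed = asyncio.run(multiplexed_calls(10))
print(f"10 calls in {elapsed * 1000:.0f}ms")
```

Ten 100ms calls complete in roughly 100ms of wall clock, because nothing waits for a response before sending the next request—the same effect the table above shows over a real stdio pipe.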
1.2 Why This Matters
The entire MCP ecosystem—Claude Desktop, Cursor, every client—communicates with providers over stdio. If your MCP client serializes tool calls because "it's stdio, so it must be serial," you're leaving a 10× speedup on the table for every fan-out operation.
The pipe isn't the bottleneck. Your request scheduling is.
This is not a theoretical concern. During the benchmarking process, we discovered that Hangar's own async facade was serializing calls through a ThreadPoolExecutor(max_workers=4)—not because of any protocol limitation, but because of an assumption baked into a single line of code. More on that in Section 4.
2. Zero-Overhead Abstraction (Actually)
2.1 The Standard Tradeoff
Every framework promises value. Every framework costs something. Connection pooling, request routing, lifecycle management, configuration parsing, error handling—each layer adds latency. The engineering question with any framework is never "is it useful?" It's: how much does it cost me?
If you're deciding between "roll my own MCP client management" and "use a framework," overhead is the deciding factor. Features don't matter if the framework adds 50ms to every call.
2.2 The Measurement
We measured Hangar's overhead by running identical workloads through Hangar and through direct MCP client calls. Same provider. Same tools. Same arguments. Same machine. The only variable: whether the call goes through Hangar's facade, provider model, and command bus—or directly to the MCP session.
| Scenario | Direct (ms) | Hangar (ms) | Overhead |
|---|---|---|---|
| S2 N=20, parallel | 111.6 | 108.0 | -3.2% |
| S2 N=5, parallel | 107.8 | 107.3 | -0.5% |
| S3 3 providers, parallel | 105.2 | 105.2 | 0.0% |
| S5 mixed latency, parallel | 505.1 | 505.0 | -0.0% |
| S1 50ms, sequential | 2,751.7 | 2,732.2 | -0.7% |
| S1 200ms, sequential | 10,255.9 | 10,223.6 | -0.3% |
| S2 N=10, parallel | 107.7 | 110.1 | +2.2% |
Range across all measurements: -3.2% to +2.2%. Mean: approximately -0.5%.
The negative numbers (Hangar faster than direct) aren't magic—they're connection reuse and warm provider caching. The positive outlier (+2.2%) is thread pool scheduling variance at the 2ms scale.
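The comparison behind this table is a paired A/B harness: run the same workload through both entry points, warm up first, and difference the means. A stripped-down sketch of that shape (the `direct_call` and `hangar_call` bodies are placeholders for the two real code paths, not Hangar's API):

```python
import asyncio
import statistics
import time

async def direct_call() -> None:
    await asyncio.sleep(0.005)  # placeholder for a raw MCP session call

async def hangar_call() -> None:
    await asyncio.sleep(0.005)  # placeholder for the same call via the facade

def bench(fn, runs: int = 30, warmup: int = 5) -> float:
    """Mean wall-clock time per run in ms, after warmup iterations."""
    for _ in range(warmup):
        asyncio.run(fn())
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        asyncio.run(fn())
        samples.append((time.perf_counter_ns() - t0) / 1e6)
    return statistics.fmean(samples)

direct = bench(direct_call)
framed = bench(hangar_call)
overhead_pct = (framed - direct) / direct * 100
print(f"direct={direct:.2f}ms hangar={framed:.2f}ms overhead={overhead_pct:+.1f}%")
```

With identical workloads on both sides, the reported overhead is pure framework cost plus measurement noise—which is why small negative values are as plausible as small positive ones.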
2.3 What This Means
Hangar's abstraction layers—CQRS command bus, provider lifecycle, health checks, rate limiting, the entire domain model—cost nothing measurable at the I/O boundary. N=100 runs, 95% confidence intervals of ±0.5ms. This is not "approximately zero." It is statistically indistinguishable from zero.
You get provider lifecycle management, hot/cold state transitions, health checks, parallel execution, and configuration-as-code. The overhead is your rounding error.
3. Cold Start Deduplication: A Design Win
3.1 The Problem
Twenty tool calls hit a cold provider simultaneously. The provider takes 500ms to start. What happens?
This is a real scenario. MCP providers are processes. Processes need to start. In agent workflows, the first burst of tool calls often arrives before any provider is warm.
3.2 The Naive Approach
Without Hangar (sequential): Each call checks if the provider is running, finds it cold, triggers startup, waits 500ms, then calls the tool. In practice, only the first call actually cold-starts—the rest find a warm provider. But sequential execution means 20 × 54ms per call = 1,079ms total.
3.3 The Single-Flight Pattern
With Hangar (parallel): All 20 calls hit invoke() concurrently. The first call triggers provider startup. The other 19 don't spawn additional instances—they enqueue behind a single-flight gate and wait for the same startup to complete. Then all 20 execute their tool calls in parallel.
Total: 58ms. That's 18.5× faster.
| Concurrent Cold Calls (N) | Sequential (ms) | Hangar Single-Flight (ms) | Speedup |
|---|---|---|---|
| 1 | 54 | 54 | 1.0× |
| 5 | 271 | 56 | 4.9× |
| 10 | 541 | 56 | 9.7× |
| 20 | 1,079 | 58 | 18.5× |
3.4 Why This Isn't About Parallelism
The 18.5× has nothing to do with parallel tool execution. It's about the startup not happening twenty times.
The single-flight pattern—borrowed from Go's concurrency primitives—coalesces concurrent requests for the same operation into a single execution. One in-flight operation, N waiters, all unblocked on completion.
Most MCP implementations don't handle this. You either get a race condition (20 provider processes spawning), a global lock (serializing everything), or a "just retry" approach that papers over the problem with latency. The single-flight pattern is the correct solution, and the benchmark proves the margin isn't marginal—it's an order of magnitude.
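In asyncio terms, the gate is a shared task: the first caller creates it, every concurrent caller awaits the same one, and it is cleared on completion so a later cold start can run again. A minimal sketch of the pattern (not Hangar's actual implementation):

```python
import asyncio

class SingleFlight:
    """One in-flight execution, N waiters, all unblocked on completion."""

    def __init__(self) -> None:
        self._inflight: asyncio.Task | None = None

    async def run(self, op):
        if self._inflight is None:
            self._inflight = asyncio.create_task(op())
            # Clear the gate when done so a future cold start re-executes.
            self._inflight.add_done_callback(
                lambda _: setattr(self, "_inflight", None)
            )
        # shield: one waiter being cancelled must not cancel the shared op.
        return await asyncio.shield(self._inflight)

starts = 0

async def cold_start() -> str:
    global starts
    starts += 1
    await asyncio.sleep(0.05)  # simulated provider startup, scaled down
    return "warm"

async def main() -> None:
    gate = SingleFlight()
    # Twenty simultaneous calls: one startup, twenty results.
    results = await asyncio.gather(*(gate.run(cold_start) for _ in range(20)))
    assert results == ["warm"] * 20

asyncio.run(main())
print(f"startups: {starts}")
```

Because asyncio is single-threaded, the check-and-create in `run` needs no lock; a thread-based variant would need one around the same two lines.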
4. The Bonus: A Hardcoded "4"
During benchmarking, we noticed that Hangar's parallel execution showed a suspicious pattern: wall-clock time scaled as ceil(N/4) waves of ~100ms each. Twenty parallel calls took 520ms (5 waves), not 100ms (one wave).
We traced through the full stack—batch executor, provider model lock patterns, StdioClient message correlation, the MCP protocol itself—eliminating each layer as the bottleneck before finding the root cause:
```python
# facade.py, line 333
self._executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="hangar-")
```
One line. A hardcoded 4 in the async facade's thread pool. Every `hangar.invoke()` call routes through `loop.run_in_executor(self._executor, ...)`. With 4 workers, only 4 calls execute concurrently. The rest queue, producing sequential waves.
The protocol was parallel. The StdioClient was parallel. The provider model was parallel. The facade was silently serializing everything into batches of 4.
Before fix (max_workers=4):
| N | Expected (ms) | Actual (ms) | Pattern |
|---|---|---|---|
| 5 | ~100 | 208 | ceil(5/4) = 2 waves |
| 10 | ~100 | 313 | ceil(10/4) = 3 waves |
| 20 | ~100 | 521 | ceil(20/4) = 5 waves |
After fix (max_workers=20):
| N | Actual (ms) | Overhead vs direct |
|---|---|---|
| 5 | 107 | -0.5% |
| 10 | 110 | +2.2% |
| 20 | 108 | -3.2% |
The fix was changing one number. The debugging took ten layers of abstraction to trace.
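The wave pattern is easy to reproduce in isolation: submit N sleeping tasks to a 4-worker pool and the wall clock comes out at roughly ceil(N/4) times the per-task latency. A standalone demonstration (sleep times scaled down; this is not Hangar code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tool_call() -> None:
    time.sleep(0.05)  # stand-in for a ~100ms MCP tool call, scaled down

def run_batch(n: int, workers: int) -> float:
    """Wall-clock seconds to run n tasks on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.perf_counter()
        futures = [pool.submit(tool_call) for _ in range(n)]
        for f in futures:
            f.result()
        return time.perf_counter() - start

capped = run_batch(20, workers=4)     # ceil(20/4) = 5 waves of ~50ms
uncapped = run_batch(20, workers=20)  # 1 wave of ~50ms
print(f"max_workers=4: {capped * 1000:.0f}ms, "
      f"max_workers=20: {uncapped * 1000:.0f}ms")
```

The capped run takes roughly five times the uncapped one—the same ceil(N/workers) staircase the before-fix table shows at 100ms per call.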
The benchmark suite caught this before any user reported it. That's the actual value of publishing reproducible performance data: it forces you to confront what your code does, not what you think it does.
Methodology
All benchmarks use a controlled-delay MCP provider that sleeps for a configurable duration per call. This isolates framework behavior from tool implementation variance—no network jitter, no database contention, no external dependencies.
| Parameter | Value |
|---|---|
| Runs per configuration | 100 |
| Warmup iterations | 5 |
| Timing | time.perf_counter_ns() |
| Statistics | Mean, median, P95, StdDev, 95% CI (t-distribution) |
| Outlier detection | 3σ |
| Total measurements | 5,300+ |
| Total errors | 0 |
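The statistics row of the table reduces to a few standard formulas. A sketch of that reduction step, using the stdlib only (one simplification: the normal-approximation z of 1.96 stands in for the t critical value, which is about 1.984 at n=100; the P95 here is nearest-rank):

```python
import statistics

def summarize(samples_ms: list[float]) -> dict:
    """Mean, median, P95, stdev, and 95% CI half-width after 3-sigma rejection."""
    mean = statistics.fmean(samples_ms)
    sd = statistics.stdev(samples_ms)
    # 3-sigma outlier rejection, as in the methodology table.
    kept = [x for x in samples_ms if abs(x - mean) <= 3 * sd] or samples_ms
    mean = statistics.fmean(kept)
    sd = statistics.stdev(kept)
    ordered = sorted(kept)
    p95 = ordered[min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))]
    # z=1.96 approximates the t critical value at large n (n=100 -> ~1.984).
    ci_half = 1.96 * sd / (len(kept) ** 0.5)
    return {
        "mean": mean,
        "median": statistics.median(kept),
        "p95": p95,
        "stdev": sd,
        "ci95": ci_half,
    }

stats = summarize([107.0, 108.2, 106.5, 109.1, 107.7, 108.0, 107.3, 110.4])
print({k: round(v, 2) for k, v in stats.items()})
```

At n=100 with millisecond-scale standard deviations, this CI half-width lands in the ±0.5ms range quoted in Section 2.3.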
Scenarios:
| ID | What | Key Variable |
|---|---|---|
| S1 | Per-call overhead | Delay: 0–200ms, 50 calls |
| S2 | Parallel fan-out | N: 1–20 concurrent calls |
| S3 | Cross-provider parallelism | 3 providers × 100ms |
| S4 | Cold start deduplication | N: 1–20 concurrent cold calls |
| S5 | Mixed latency / head-of-line | 5×10ms + 3×100ms + 1×500ms |
| S6 | Agent workflow (3-step pipeline) | Dependent steps with per-step parallelism |
Full results, raw JSON, and chart generation: mcp-hangar/benchmarks.
Key Takeaways (TL;DR)
- MCP stdio supports full request multiplexing. If your client serializes calls, the bottleneck is your scheduling, not the protocol.
- Hangar's framework overhead is statistically zero (-3.2% to +2.2%). The abstraction is free.
- Cold start single-flight deduplication delivers 18.5× speedup—a design pattern win, not a parallelism win.
- Benchmark your own code. We found a hardcoded `ThreadPoolExecutor(max_workers=4)` silently capping concurrency. The benchmark caught it before users did.
- N=100 runs with 95% CI. The data defends itself.
"A benchmark that only proves parallel beats sequential isn't a benchmark. It's a tautology with error bars."