The Invoice: Episode 19
"mTLS, observability, traffic management, zero-code retries. You need a service mesh."
Splendid. Let us examine what one is actually paying for.
A service mesh moves cross-cutting concerns (mTLS, retries, timeouts, traffic shifting, observability) out of application code and into a proxy that sits beside each pod. Istio, the archetype, launched in 2017 as a joint project of Google, IBM, and Lyft. It graduated as a CNCF project in July 2023. In the 2024 CNCF Annual Survey, service-mesh adoption across respondents fell to 42 per cent, down from 50 per cent the year before. That is not a catastrophe. It is, however, the first full-year decline the category has ever posted. The industry is quietly reconsidering the deal.
The Complexity Invoice
Istio ships over a dozen primary custom resource definitions across three categories (traffic management, security, telemetry) and dozens more through its operator, telemetry plugins, Wasm extensions, and Gateway APIs. A minimally useful installation comprises:
- A control plane (istiod) responsible for configuration distribution, certificate issuance, and serving the xDS API to every sidecar
- A per-pod sidecar (Envoy) injected into every workload, running as a second container alongside the application
- An ingress gateway at the cluster edge, usually another Envoy in a standalone pod
- mTLS certificates rotated by istiod and distributed via SDS to each sidecar
- Policy resources: PeerAuthentication, RequestAuthentication, AuthorizationPolicy
- Telemetry bindings to send traces and metrics to external collectors
- A platform team that knows what each of those does, how they interact, and how to debug any given failure mode
The CNCF's own reports describe Istio as mature, powerful, and "operationally demanding". The second adjective is the one to watch. Installing Istio in a fresh cluster takes a senior SRE about two days. Operating it for six months takes roughly 0.5 to 1.0 FTE, scaling upwards with cluster size. Debugging it at three in the morning is a skill one acquires by losing two nights of sleep and one customer.
The Latency Invoice
Every inter-service HTTP or gRPC call now traverses two Envoy proxies: the caller's sidecar, then the callee's sidecar. Adding two proxies to every request path means adding latency. How much is now, happily for the debate, well-measured.
A 2025 peer-reviewed performance comparison from the DeepNess Lab (Performance Comparison of Service Mesh Frameworks: the mTLS Test Case) measured the overhead with mTLS enforced on otherwise identical workloads. The table below is, one regrets to say, unambiguous.

    Mesh (mode)         Added request latency, mTLS enforced
    Istio (sidecar)     +166 per cent
    Cilium              +99 per cent
    Linkerd             +33 per cent
    Istio (ambient)     +8 per cent

The headline number (plus 166 per cent for the Istio sidecar with mTLS) is surprising only to people who have never read the benchmark. Envoy is fast; two Envoys in the path plus TLS handshakes and certificate validation are not free. Linkerd's Rust-based linkerd2-proxy is measurably lighter because it was built for the job, not adapted to it. Ambient mode, introduced in Istio 1.23 in August 2024, replaces per-pod sidecars with a shared node-level ztunnel and produces dramatically less overhead. Ambient is, in polite summary, Istio's own public admission that the sidecar model had a problem it could not solve by optimisation alone.
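The percentages compound against whatever your baseline is. A small sketch makes the arithmetic concrete; the 2 ms baseline is an illustrative assumption, not a number from the benchmark, while the overhead percentages are the ones reported above.

```go
package main

import "fmt"

// withOverhead applies a percentage latency overhead to a baseline
// service-to-service call time, in milliseconds.
func withOverhead(baseMs, pct float64) float64 {
	return baseMs * (1 + pct/100)
}

func main() {
	// Assumed 2 ms baseline call (illustrative); overheads from the
	// 2025 DeepNess Lab mTLS benchmark quoted in the text.
	rows := []struct {
		mesh string
		pct  float64
	}{
		{"no mesh", 0},
		{"Istio (ambient)", 8},
		{"Linkerd", 33},
		{"Cilium", 99},
		{"Istio (sidecar)", 166},
	}
	for _, r := range rows {
		fmt.Printf("%-16s %.2f ms\n", r.mesh, withOverhead(2.0, r.pct))
	}
}
```

At a 2 ms baseline the sidecar path more than doubles the call; deep call graphs multiply the effect, because every hop pays the toll again.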
A sidecar also costs memory. The Istio 1.24 performance documentation reports approximately 60 MB of RAM and 0.20 vCPU per Envoy sidecar at 1,000 HTTP RPS with 1 KB payloads. A cluster with 1,000 pods is therefore paying roughly 60 GB of RAM and 200 vCPU for the mesh before a single byte of application code has executed. Ambient ztunnels are smaller (approximately 12 MB RAM, 0.06 vCPU each) but one now also pays for waypoint proxies where L7 features are enabled. Either way, the total is non-zero. "Free" is a marketing word.
The Debugging Invoice
When the mesh works, it is invisible. When it does not, the request path has doubled and so has the attack surface for bugs. A 500 that arrives at the client might originate in:
- The application code itself
- The caller's Envoy (wrong upstream cluster, circuit breaker tripped)
- The destination's Envoy (connection limits, bad cert rotation)
- A mis-parsed VirtualService or DestinationRule
- The mTLS trust chain (expired intermediate, wrong trust domain)
- istiod failing to push updated configuration within the retry window
- A Wasm plugin throwing an exception
- A Kubernetes NetworkPolicy quietly dropping the packet
The distributed tracing one installed to understand the mesh is now required to understand the mesh. Troubleshooting skills become mesh-specific skills, which means they do not transfer and do not scale with engineer headcount in the obvious way.
The Honest Case For
In the interests of not selling a one-sided story: service meshes solve a real problem for a real set of operators. If one:
- Runs more than roughly 100 microservices with cross-team ownership
- Has strict compliance that mandates mTLS between every internal service
- Operates across multiple clusters or multiple clouds with incompatible primitives
- Needs uniform observability across polyglot services that cannot all ship an OpenTelemetry library
then the tax starts to pay for itself. Everyone else, which is most readers, is adopting Google's architecture, and paying its invoice, to solve problems a single load balancer and a sensible VPC already solved.
The Alternative
Direct HTTP or gRPC calls between services, over a network one already trusts. This is how the internet worked for three decades before sidecars existed. It was, one should note, a perfectly functional three decades.
mTLS terminated at a single ingress gateway (HAProxy, nginx, Envoy on its own, or whatever load balancer is already in the stack), because the VPC was a trust boundary before sidecars were a marketing category. Internal traffic over plaintext inside the VPC is fine for the vast majority of workloads, and mTLS between services is a compliance requirement for a minority of them, not an architectural necessity for all of them.
Tracing and metrics via an OpenTelemetry library linked into each service. OTel is language-agnostic, vendor-neutral, and roughly five lines of initialisation in most runtimes. It sends traces and metrics via OTLP to any collector. No proxy required.
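What the library actually puts on the wire is small. Cross-service context rides in the W3C Trace Context traceparent header; a stdlib-only sketch of the injection the OTel SDK performs on every outgoing call (the real SDK also records spans and exports them over OTLP, which is elided here):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newTraceparent builds a W3C Trace Context header value:
// version-traceid-spanid-flags, with a 128-bit trace id and a
// 64-bit span id. "01" flags the trace as sampled.
func newTraceparent() string {
	traceID := make([]byte, 16)
	spanID := make([]byte, 8)
	rand.Read(traceID)
	rand.Read(spanID)
	return fmt.Sprintf("00-%s-%s-01",
		hex.EncodeToString(traceID), hex.EncodeToString(spanID))
}

func main() {
	// The URL is a placeholder for an internal service.
	req, _ := http.NewRequest("GET", "http://orders.internal/v1/orders", nil)
	req.Header.Set("traceparent", newTraceparent())
	fmt.Println(req.Header.Get("traceparent"))
}
```

The receiving service reads the same header, continues the trace, and no proxy ever needed to exist for the request to be observable end to end.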
Retries and timeouts in the client library. Go's http.Client, Rust's reqwest, Java's RestTemplate or OkHttp, Python's httpx, Node's undici: all of them ship configurable timeouts and connection pools out of the box, with retries and circuit breaking a few lines of code or a standard companion library away. The retry logic that a service mesh claims to provide "without code changes" is three lines of configuration in any mature client stack, and has been so since approximately 1995.
Authorisation at the application layer, because only the application knows what "this user may read this document" means. Delegating authorisation to a proxy is delegating it to a component that does not, on any reasonable reading, understand the data.
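The point in code: the ownership rule lives next to the data model. A sketch with a placeholder map standing in for a real datastore, and a hypothetical X-Authenticated-User header standing in for whatever the authentication layer provides.

```go
package main

import (
	"fmt"
	"net/http"
)

// documentOwner is a stand-in for the application's datastore.
var documentOwner = map[string]string{"doc-42": "alice"}

// canRead is the rule a proxy cannot express: "this user may read
// this document" requires knowing who owns the document, which only
// the application's data model records.
func canRead(user, docID string) bool {
	return documentOwner[docID] == user
}

// docHandler enforces the rule at the application layer.
func docHandler(w http.ResponseWriter, r *http.Request) {
	user := r.Header.Get("X-Authenticated-User") // set by your authn layer
	docID := r.URL.Query().Get("id")
	if !canRead(user, docID) {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	fmt.Fprintf(w, "contents of %s\n", docID)
}

func main() {
	fmt.Println("alice may read doc-42:", canRead("alice", "doc-42"))
	fmt.Println("bob may read doc-42:  ", canRead("bob", "doc-42"))
}
```

An AuthorizationPolicy can say "service A may call service B"; it cannot say "alice owns doc-42", which is the decision that actually matters.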
The Pattern
Service mesh is sold as "zero code changes". One gets that by paying:
- Two proxies of latency on every internal call, measurably more under mTLS
- A platform team of overhead to run istiod, gateways, policies, and upgrades
- A debugger's worth of new moving parts: VirtualService, DestinationRule, PeerAuthentication, Envoy configuration, trust chains, Wasm plugins
All to avoid writing retry logic that any mature HTTP client already provides in three lines of configuration.
The mesh was always, architecturally, a political solution to a technical problem. It existed because microservice teams did not trust each other's code, and a proxy in the middle was a way of enforcing cross-cutting concerns without convincing any one team to adopt them. The proxy became the architecture. The architecture became the operational cost centre. The cost centre produced Ambient Mode, which is the industry's second try at making sidecars not cost what sidecars cost.
Meanwhile, the original alternative (a library in each service, a trusted network below, and a single ingress gateway at the edge) has remained exactly what it has been since approximately 1995.
The direct call was always there. One simply decided it wasn't enterprise enough.
The receipt, itemised: Istio graduated CNCF in July 2023. CNCF 2024 Annual Survey: mesh adoption 42 per cent, down from 50 per cent. 2025 peer-reviewed benchmark: Istio sidecar +166 per cent mTLS latency, Cilium +99 per cent, Linkerd +33 per cent, Istio Ambient +8 per cent. 60 MB RAM per sidecar is 60 GB across 1,000 pods before a line of application code runs. Ambient is Istio's own admission that sidecars were the problem. The client library has shipped configurable retries since approximately 1995.