Your Python Service Doesn't Have a Memory Leak — Until It Does

How we spent a week proving ourselves wrong about memory, Kubernetes, and a third-party SDK

Jun 10, 2026

You thought you had it figured out.

Memory was climbing on your Python MCP server under load. You profiled it, read the docs, nodded knowingly, and explained to the team: “That’s just Python. It never returns memory to the OS. It’s expected behavior.”

You were half right. And being half right is sometimes worse than being wrong.

The Setup

You run a UMS MCP server — a Python service built on Starlette that acts as a bridge between an AI orchestration layer and the Messaging Service. It handles HTTP requests, transforms them into Messaging API calls, and returns results. Simple proxy work. Shouldn’t be memory-hungry.

Under 1500 concurrent tasks, it was scaling from 4 pods to 8. Memory was the trigger, not CPU. So you started digging.

Phase 1: The Fixes That Weren’t

First, you went for an obvious win: the server had a JWT cache that was configured but never actually used, dead code accumulating objects for nothing. Removed it. Merged it. Ran the load test.

No significant impact.

Then memray (a memory profiler for Python) showed SSL/TLS and HTTP/1.1 overhead dominating allocations: the cost of making HTTPS calls to upstream APIs on every request. Legitimate overhead. Not something you eliminate; you just accept it.
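To make the "which call sites dominate allocations" step concrete, here is a minimal sketch. It swaps in the standard library's tracemalloc rather than memray, so it runs without extra dependencies. The helper `top_allocators` and the workload `fake_request_batch` are hypothetical names invented for illustration, not code from the service. Note that tracemalloc only sees allocations made through Python's own allocators, so native TLS buffers would not show up here the way they can in a memray report.

```python
import tracemalloc


def top_allocators(workload, limit=5):
    """Run workload() under tracemalloc and return its largest allocation sites."""
    tracemalloc.start(10)  # record up to 10 stack frames per allocation
    try:
        result = workload()  # keep the result alive so its memory is still live
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Group live allocations by source line; the list is sorted largest first.
    stats = snapshot.statistics("lineno")[:limit]
    del result
    return stats


def fake_request_batch():
    # Hypothetical stand-in for request handling: build 1000 payloads of 1 KiB.
    return [bytes(1024) for _ in range(1000)]


if __name__ == "__main__":
    for stat in top_allocators(fake_request_batch):
        print(stat)
```

Each printed line names a file and line number with the total size and count of live allocations attributed to it, which is the same kind of ranking that led to the SSL/TLS finding above.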

Still scaling. Still memory-triggered.

Phase 2: The Explanation That Felt True

Here’s where you went wrong in a confident, well-reasoned way.

Python’s small-object allocator (pymalloc) carves memory out of large arenas. When objects are garbage collected, their memory returns to pymalloc’s internal pool, but an arena goes back to the OS only once every block in it is free. With a few long-lived objects scattered across arenas, that almost never happens. So the process RSS (what Kubernetes measures) stays near its peak long after load drops.

This is real. This is documented. This is expected Python behavior.
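The retention effect can be observed with a small experiment. The sketch below is an illustration under stated assumptions: it reads RSS from `/proc/self/status`, so it only reports numbers on Linux, and the function names `rss_kib` and `measure_retention` are made up for this example. It allocates many small objects, frees all but a sparse set of survivors, and records RSS before, at peak, and after. How much memory is returned varies with the Python version, the C library allocator, and the platform.

```python
import gc


def rss_kib():
    """Return the current resident set size in KiB, or None without /proc."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # the kernel reports this in kB
    except OSError:
        pass
    return None


def measure_retention(count=200_000, keep_every=100):
    """Allocate small objects, free most of them, and sample RSS at each step."""
    before = rss_kib()
    # 200-byte payloads are small enough to be served by pymalloc.
    objects = [bytes(200) for _ in range(count)]
    peak = rss_kib()
    # Keep every Nth object so the survivors stay scattered across arenas.
    survivors = objects[::keep_every]
    del objects
    gc.collect()
    after = rss_kib()
    return {"before": before, "peak": peak, "after": after,
            "survivors": len(survivors)}


if __name__ == "__main__":
    print(measure_retention())
```

If `after` stays close to `peak` even though roughly 99% of the objects are gone, that is the arena pinning described above, not a leak in the usual sense.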

[Figure: Python memory management in a Kubernetes pod]

So when memory stayed high after the load test finished, even hours later, you had your explanation. “Not a leak. Just Python holding onto pages.” The team relaxed. You moved on to tuning the HPA: make CPU, at a 60% threshold, the primary scaling signal; keep the memory threshold at 90% purely to prevent OOM kills; and bump memory requests from 512Mi to 800Mi.
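It helps to see why raising the memory request alone changes the autoscaler's decision. The sketch below applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), with its default 10% tolerance. The 600 MiB average usage per pod is a hypothetical figure chosen for illustration, not a measurement from the load test, and the sketch ignores min/max replica bounds and stabilization windows.

```python
import math


def utilization_pct(usage_mib, request_mib):
    """HPA utilization is average usage as a percentage of the pod's request."""
    return 100.0 * usage_mib / request_mib


def desired_replicas(current_replicas, current_pct, target_pct, tolerance=0.1):
    """Simplified HPA formula: ceil(current * currentMetric / targetMetric)."""
    ratio = current_pct / target_pct
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    usage_mib = 600  # hypothetical average memory usage per pod
    for request_mib in (512, 800):
        pct = utilization_pct(usage_mib, request_mib)
        replicas = desired_replicas(4, pct, target_pct=90)
        print(f"request={request_mib}Mi utilization={pct:.1f}% -> {replicas} replicas")
```

With a 90% target, a 512Mi request trips at about 461Mi of average usage per pod, while an 800Mi request trips at 720Mi. The same process, holding the same memory, looks very different to the autoscaler depending on the request it is measured against.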

Good diagnosis. Wrong conclusion.

Phase 3: The Evidence You Couldn’t Explain Away

After increasing memory requests, 4 pods could handle 1500 concurrent tasks without scaling. Progress. But then you noticed something that broke the “expected Python behavior” narrative entirely.

The Engineering Wisdom