
Ollama Was Killing My Discord

I put a local inference model on the same EC2 instance as my agent coordinator. Discord slash commands started timing out. At 100% CPU, the agent had nothing left to work with.

2 min read
Klaus, ai, infrastructure, self-hosted, ec2

For a while, the setup made sense. Klaus, my AI, runs on EC2. Ollama ran on the same instance, serving phi3:mini for ideation, classification, and short-form summaries. About 3.8 billion parameters, zero per-token cost. I’d moved a handful of recurring jobs over from Sonnet because why pay the API rate when the work fits a smaller model and the machine is already there?

The machine was not just sitting there.

Klaus started going quiet in May. Discord slash commands would time out. DMs would come back eventually, sometimes 4 minutes later. The WebSocket kept logging [discord] Gateway websocket closed: 1000, reconnecting, then dropping again every 5 minutes. From the outside it looked like a network problem. I spent time looking at the network.

It wasn’t the network.

top -bn1 -o %CPU told the real story: Ollama at 84.6%, gateway at 92.3%, system idle at 0%. The EC2 instance is a t3.large, two virtual CPUs. Two processes competing for two CPUs, neither winning.

Discord’s gateway connection only stays alive if the client sends its heartbeats on schedule and gets an ACK back for each one. Slash commands have a 3-second acknowledgment window. At 100% CPU, neither deadline was being met. Discord treated the missed heartbeats as a dead connection and dropped it. Slash commands never got their ACK. From Discord’s side, Klaus had vanished.
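To make the timing argument concrete, here is a minimal Python sketch of how scheduling delay eats into that 3-second window. It is not Klaus's code or any Discord library's API: `measure_loop_lag` and `would_miss_ack` are hypothetical helper names, and the only number taken from above is the 3-second acknowledgment budget. The idea is that on a starved CPU, an asyncio task wakes up later than it asked to, and that lateness counts against the deadline before the handler has done any work.

```python
import asyncio
import time

# The 3-second slash-command acknowledgment window described above.
INTERACTION_ACK_BUDGET_S = 3.0


async def measure_loop_lag(interval_s: float, samples: int) -> list[float]:
    """Sleep `interval_s` repeatedly and record how late each wake-up was.

    On an idle machine the values stay near zero. Under CPU starvation
    they grow, because the event loop is not scheduled on time.
    """
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval_s)
        overshoot = time.monotonic() - start - interval_s
        lags.append(max(0.0, overshoot))  # clamp small timer jitter to zero
    return lags


def would_miss_ack(lag_s: float, handler_s: float,
                   budget_s: float = INTERACTION_ACK_BUDGET_S) -> bool:
    """True if scheduling delay plus handler time exceeds the ACK window."""
    return lag_s + handler_s > budget_s
```

Running `asyncio.run(measure_loop_lag(1.0, 10))` inside the coordinator process while a model is generating would give a rough picture of how much of the budget is gone before any handler code runs. The same reasoning applies to heartbeats, just with a longer interval.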

I stopped Ollama. Four seconds of typing. Load average: 2.00 to 0.31. Free RAM: 190 megabytes to 4,800 megabytes. The WebSocket held. Slash commands started working in under a minute.
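Those load-average numbers map onto headroom with simple arithmetic. The sketch below is one rough way to check it, under assumptions: `cpu_headroom` and `current_headroom` are hypothetical names, and on Linux the load average also counts tasks blocked on I/O, so this is an approximation rather than a precise CPU measurement.

```python
import os


def cpu_headroom(load_1m: float, cpu_count: int) -> float:
    """Rough fraction of CPU capacity left, from the 1-minute load average.

    A load equal to the CPU count means every CPU is busy, so headroom is 0.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return max(0.0, 1.0 - load_1m / cpu_count)


def current_headroom() -> float:
    """Headroom for the machine this runs on (Unix only)."""
    load_1m, _, _ = os.getloadavg()
    return cpu_headroom(load_1m, os.cpu_count() or 1)


# With the numbers from this post, on 2 vCPUs:
#   cpu_headroom(2.00, 2) gives 0.0 (nothing left for the coordinator)
#   cpu_headroom(0.31, 2) gives about 0.845
```

A check like this in a scheduled job would have pointed at the host long before I went looking at the network.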

Don’t get me wrong, phi3:mini was doing fine work. The ideation jobs came back clean, classification was fast. The model itself wasn’t the problem.

The problem was the host. The diagnosis took 76 minutes, with Klaus unreachable the whole time. The fix took 4 seconds.


Here’s what I’d gotten backward: “free” in local inference means zero dollars, not zero resource impact. A t3.large has two vCPUs. Run a 3.8-billion-parameter model there and the coordinator gets whatever’s left. At inference time, that turns out to be nothing. The coordinator, which needs stable CPU headroom for heartbeats, message processing, and scheduled jobs, gets squeezed out by the model that was supposed to be helping.

That’s not a tradeoff. That’s a miscategorization of costs.

Ollama is off that instance now. If it goes back on, it goes back with CPU limits. Running both unconstrained on a 2-vCPU box does not work.
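One possible shape for those CPU limits, as a hedged Python sketch rather than what I actually run: pin the model server to its own vCPU and lower its priority, so the coordinator always has a CPU the model cannot touch. `model_cpu_set` and `launch_pinned` are hypothetical helpers. `os.sched_setaffinity` is Linux-only, and pinning restricts where the process runs without capping how hard it works there.

```python
import os
import subprocess


def model_cpu_set(total_cpus: int, reserved: int = 1) -> set[int]:
    """CPU indices the model may use, keeping the first `reserved` free."""
    if reserved < 1 or reserved >= total_cpus:
        raise ValueError("need at least one CPU on each side")
    return set(range(reserved, total_cpus))


def launch_pinned(cmd: list[str], cpus: set[int],
                  niceness: int = 10) -> subprocess.Popen:
    """Start `cmd` restricted to `cpus` and at lower scheduling priority."""
    def restrict() -> None:
        # Runs in the child before exec. Child processes that the
        # server spawns later inherit both settings.
        os.sched_setaffinity(0, cpus)
        os.nice(niceness)

    return subprocess.Popen(cmd, preexec_fn=restrict)


# Hypothetical usage on a 2-vCPU box: the model gets CPU 1 only.
# launch_pinned(["ollama", "serve"], model_cpu_set(2))
```

If Ollama runs as a system service instead of being launched by hand, the service manager's own CPU controls are probably the more natural place for the same limit.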


Tell yourself early: local models are good for work that tolerates delay. Don’t put one next to something that can’t tolerate delay at all. And “free” describes a billing relationship, not a resource budget. The cores still have to come from somewhere.