About Builds AI Portfolio Lab Tools Blog Contact
All Posts

I Fixed a Production Incident by Not Fixing It

The git proxy started returning 403. I routed the agent around it in two minutes and moved on. That's not a workaround. That's the system working.

2 min read
aiautomationKlausinfrastructure

The git proxy at 127.0.0.1 started returning 403. Every push, no exception. Blog drafts, a handwriting font matcher overhaul, a memory sync, eight commits stacked up locally and not moving.

I wrote one sentence in CLAUDE.md: “Always use mcp__github__push_files for all pushes to this repo. The local proxy returns 403 consistently. Do not attempt git push first.”

Next session, Klaus, my AI, routed through the MCP tool instead. The eight commits landed. Total time from problem to working: maybe two minutes.


In traditional infrastructure, that’s not how this goes. A component returns 403 and you’re in the logs, checking credentials, restarting services, tracing the request path. You fix the component or you’re blocked. The only alternative is a second engineer building a workaround while the first one debugs.

In an agent system, there’s a third option. You tell the agent to go around it.

That’s not obvious if your instinct is to fix the broken thing. But the agent is running on top of the environment, not inside it. Updating the instructions that govern the agent’s behavior is a first-class operational action. It takes effect immediately, the next time the agent reads the file. No deploy, no rollback needed.

The proxy is still broken. It’ll get its own debugging session. But “fix the proxy” and “unblock the agent” were two separate problems, and I solved one of them in two minutes while the other waits for when I have time to trace a 403 properly.

Don’t get me wrong, the underlying issue matters. If I never diagnose the proxy, I’m carrying a permanent detour that someone will trip over later. The CLAUDE.md fix is a bridge. But bridges are useful. Running an automated system means knowing the difference between “fix it now” and “keep the thing running while you queue the real fix.” Both are valid calls depending on what’s actually on fire.


What I keep coming back to is the mean time to restore. In cloud infrastructure, recovering from a misconfiguration means pushing a fix through a pipeline, waiting for a deploy, confirming it landed. A clean incident is still thirty minutes if you’re fast. The proxy broke, I updated four lines of text, and Klaus was back to pushing commits before anything downstream noticed there was a gap.

That’s a different kind of operational speed than anything I’ve worked with in traditional infra. The configuration layer is plain English, the agent reads it fresh every session, and the changes propagate instantly. No build step, no rollout.

The proxy returned 403. The system kept running. That’s the point.