My Idea Generator Was Working. The Ideas Were Gibberish.

There’s a cron job on my EC2 instance that runs every few hours and generates improvement ideas for Klaus. Not suggestions I have to prompt for. Actual GitHub issues, automatically filed, scored, and promoted into the work queue while I’m doing other things.

It started with Sonnet. The ideas came back clean, occasionally genuinely useful. But this was a job running six times a day whether I was paying attention or not, and every run cost tokens. I wanted to bring more of the system in-house anyway, away from paid APIs and toward things I owned and controlled.

phi3:mini was already on the instance for other tasks. About 3.8 billion parameters, free to run, no API call. I swapped it into ec2-ideate.sh and watched the output for a few days. The ideas got a little rougher around the edges, but the signal was still there. If anything it felt like validation that the local model could hold its own on something generative.

For a while, it worked fine.

Then issue #809 appeared in the research queue with the title “During antonia Shopr as if-1/Cross-Sublimatinglyrics…”. Issues #811 and #831 were similar. All three had been automatically promoted to high-priority research. All three were complete nonsense.

phi3:mini hadn’t generated a bad idea. It had emitted something that looked like a fragment from a corrupted training corpus, part sentence, part random token sequence, no coherent structure. And then my scoring system looked at those fragments and gave them fours.

Two layers of failure for the price of one.

phi3:mini is genuinely useful for classification and short-form summarization. Good enough that I stopped treating it as a toy. That was the mistake. Ask a small model to generate creative content with no validation gate and you find out where the capability cliff actually is, which turns out to be lower than you’d assumed.

The scoring system failing was the more interesting part. I’d built the rubric on the assumption that garbage input would score poorly. That assumption was wrong. The rubric checked for relevance signals and surface-level specificity. Whatever phi3:mini generated had just enough structure to trip those signals without meaning anything at all.

A second-pass check before scoring would have caught all three. A 200-character title limit. A basic coherence filter. A prompt that asks “is this a real sentence?” before any quality score runs. I’d skipped all of it because I was extending trust to a model that hadn’t earned it for this particular use case.
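Here is a minimal sketch of what that gate could look like, in Python. It is not what runs in ec2-ideate.sh. The 200-character limit comes from the list above; every other threshold, the function name title_problems, and the optional is_real_sentence hook (a stand-in for the "is this a real sentence?" prompt) are assumptions made for illustration. The heuristics are crude and will miss plenty of gibberish, so treat it as a first filter, not a guarantee.

```python
import re
from typing import Callable, List, Optional

MAX_TITLE_CHARS = 200   # the title limit mentioned above
MAX_TOKEN_CHARS = 24    # assumption: longer "words" are probably fused tokens
MIN_WORDS = 3           # assumption: a real idea title has at least a few words
MIN_PLAIN_RATIO = 0.5   # assumption: at least half the tokens should be plain words

# A plain word: letters, optional internal hyphen or apostrophe,
# and at most one trailing punctuation mark.
PLAIN_WORD = re.compile(r"^[A-Za-z]+(?:[-'’][A-Za-z]+)*[.,:;!?]?$")


def title_problems(
    title: str,
    is_real_sentence: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return the reasons to reject a title. An empty list means it passed."""
    text = title.strip()
    if not text:
        return ["empty title"]

    problems: List[str] = []
    if len(text) > MAX_TITLE_CHARS:
        problems.append(f"longer than {MAX_TITLE_CHARS} characters")

    tokens = text.split()
    if len(tokens) < MIN_WORDS:
        problems.append("too few words")

    overlong = [t for t in tokens if len(t) > MAX_TOKEN_CHARS]
    if overlong:
        problems.append(f"overlong token: {overlong[0]!r}")

    plain = sum(1 for t in tokens if PLAIN_WORD.match(t))
    if plain / len(tokens) < MIN_PLAIN_RATIO:
        problems.append("too few plain words")

    # Optional hook for a model-based check; hypothetical, supplied by the caller.
    if is_real_sentence is not None and not is_real_sentence(text):
        problems.append("failed sentence check")

    return problems
```

The idea is to call this before the scoring rubric ever sees a title, and to send anything with a non-empty problem list to manual review instead of the research queue. Run against the title of issue #809, the overlong-token rule alone is enough to reject it.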

The version of me that set this up would have said: small models fail sometimes, just review the queue manually. That’s a fine answer if you review the queue every day.

I don’t review the queue every day. I built the system so I wouldn’t have to.

If you’re giving any model autonomous write access to something you care about, validate the output before it lands. Not after three research tickets about nothing are already sitting in your backlog with a four-star rating.