My AI Tool Doesn't Need a Description. It Needs a Photo.

The first browser tool I built for the /lab page used text as input. Type a sentence, get a sentiment score. Type another, compare. You’re doing the work of reducing reality to language, and then the model classifies that language.

That’s the default pattern. Describe the thing, AI processes the description.

The Color Picker broke that pattern without me noticing at first. You point the camera at a color, you get a name back. No text input. Nobody types “kind of a dusty sage green.” They just point. The camera captures a 20×20 pixel patch from the center of the frame, averages the RGB values, and the model names what it sees. The user never has to find words for the thing they’re asking about.
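The patch-averaging step is small enough to sketch. What follows is a minimal TypeScript version, not the tool's actual code: the function name `averageCenterPatch` and the `Rgb` type are made up for illustration, and it assumes the frame has already been read into an RGBA byte buffer, the layout the canvas `ImageData.data` property uses.

```typescript
// Hypothetical helper (not from the original tool): average the RGB values of
// a square patch at the center of an RGBA pixel buffer.
interface Rgb {
  r: number;
  g: number;
  b: number;
}

function averageCenterPatch(
  data: Uint8ClampedArray,
  width: number,
  height: number,
  patchSize = 20,
): Rgb {
  // Clamp the patch so it never exceeds the frame.
  const w = Math.min(patchSize, width);
  const h = Math.min(patchSize, height);
  const n = w * h;
  if (n <= 0) {
    throw new Error("frame has no pixels");
  }

  // Top-left corner of the centered patch.
  const x0 = Math.floor((width - w) / 2);
  const y0 = Math.floor((height - h) / 2);

  let r = 0;
  let g = 0;
  let b = 0;
  for (let y = y0; y < y0 + h; y++) {
    for (let x = x0; x < x0 + w; x++) {
      const i = (y * width + x) * 4; // 4 bytes per pixel: R, G, B, A
      r += data[i];
      g += data[i + 1];
      b += data[i + 2];
    }
  }

  // Alpha is ignored; the result is the rounded per-channel mean.
  return {
    r: Math.round(r / n),
    g: Math.round(g / n),
    b: Math.round(b / n),
  };
}
```

In a browser, the buffer would typically come from drawing the current video frame onto a canvas and calling `getImageData` on its 2D context. The naming step that follows is not shown here.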

Simple. Obvious in retrospect. But it was the first tool I built where the interface skips the description step entirely.

The Handwriting Font Matcher was where that realization sharpened into something useful. The obvious approach to matching handwriting to fonts is to use text prompts: compare the user’s photo against embeddings of phrases like “neat cursive” or “loose print.” I tried that first. It was bad. Non-cursive handwriting still pulled cursive fonts. The model’s confidence looked convincing, but the matches were wrong. Something was getting lost in translation.

The fix was to stop translating.

Instead of describing fonts with words, I rendered each Google Font to a canvas, embedded those rendered images with CLIP’s vision encoder, and cached all 108 embeddings in localStorage: 18 fonts × 3 sample phrases × 2 render modes. When someone uploads a photo of their handwriting, the comparison is image-to-image. Not “does this look like neat cursive?” but “does this look like this specific rendering of Pacifico?”
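Here is roughly what that image-to-image comparison looks like as code. It's a sketch under assumptions, not the matcher itself. It assumes cosine similarity as the score, which is the usual choice for CLIP embeddings, and it assumes each font is ranked by its best-scoring rendering. The `FontEmbedding` shape and the names `cosineSimilarity` and `rankFonts` are hypothetical. The vectors are expected to have been computed already, so no CLIP calls appear.

```typescript
// Hypothetical cache entry: one embedding per (font, phrase, render mode).
interface FontEmbedding {
  font: string;
  phrase: string;
  mode: string;
  vector: number[];
}

interface FontScore {
  font: string;
  score: number;
}

// Cosine similarity between two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding lengths differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumption: a font's score is the best similarity across all its renderings.
function rankFonts(query: number[], cache: FontEmbedding[]): FontScore[] {
  const best: Record<string, number> = {};
  for (const entry of cache) {
    const score = cosineSimilarity(query, entry.vector);
    const previous = best[entry.font];
    if (previous === undefined || score > previous) {
      best[entry.font] = score;
    }
  }
  // Highest-scoring font first.
  return Object.keys(best)
    .map((font) => ({ font, score: best[font] }))
    .sort((x, y) => y.score - x.score);
}
```

Since localStorage only stores strings, a cache like this would have to be serialized, for example with `JSON.stringify` on write and `JSON.parse` on read. The query vector would be the vision-encoder embedding of the uploaded handwriting photo.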

Night and day. Fonts that were indistinguishable in text-prompt space looked completely different when comparing actual rendered samples. And they should. The phrase “neat cursive” covers a lot of ground. A canvas render of Pacifico does not.

The insight wasn’t really about CLIP. It was about what kind of input the problem actually needs. Font matching is a visual similarity problem. Visual similarity problems need visual representations. I’d been reaching for language because language is the comfortable default, the thing you can type into a prompt and feel like you’re communicating clearly.

Some things don’t compress well into words. The exact shade of a color doesn’t. The slant and loop pattern in someone’s handwriting doesn’t. When you force those things into language and then ask a model to reason about the language, you’ve added a lossy step that costs you precision. You’re describing a photo of a thing instead of handing the model the thing.

The camera is, among other things, a way to skip that step.

Start with the input, not the model. What does the user actually have in front of them? A written description, or the thing itself? If it’s the thing, find a way to hand it over directly. Don’t ask them to find words for something they can just point at.