How to Rob a Sentiment Classifier in the Browser

If you change enough words in a glowing product review, at what exact point does the computer realize you’re insulting it?

I’ve been asking myself this because I recently added a /lab section to my personal site. I wanted a place to host interactive experiments that fit the blueprint-industrial aesthetic I’m building for the site. But I also didn’t want a web app making round-trip API calls to Anthropic, draining my credits every time someone clicks a button.

The solution was WebAssembly. If you haven’t looked at client-side machine learning lately, it’s gotten much easier. For small models, you no longer need a Python backend or a GPU. They run locally, right in the browser’s memory.

One of the first toys I spun up is a quantized DistilBERT-SST2 sentiment analyzer. You type text, it tells you if you’re being positive or negative, and gives you a confidence score. It’s small enough to download quietly in the background when you load the page. Once cached, inference happens entirely on your machine. Zero server calls. Zero network latency.

Within about ten minutes of getting it working, I stopped trying to use it normally and started trying to break it.

I’d write a genuinely awful review of a movie or a product. The model would score it at 99% negative. Then, my goal was to flip the polarity to 90% positive by substituting or adding as few words as possible, without changing the human-readable insult.

I’m calling the game Sentiment Heist.
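
For the curious, the rules could be sketched as a scoring function. This is a rough sketch under my own assumptions, not the lab’s actual code: the helper names (tokenize, wordEditDistance, scoreHeist) are made up for illustration, "words changed" is measured as a word-level edit distance, and the 0.99 and 0.90 thresholds are just the targets described above.

```javascript
// Hypothetical scoring helpers for Sentiment Heist (illustrative names only).

// Split text into lowercase word tokens, ignoring punctuation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Word-level Levenshtein distance: the minimum number of word insertions,
// deletions, or substitutions needed to turn `a` into `b`.
function wordEditDistance(a, b) {
  const x = tokenize(a);
  const y = tokenize(b);
  let prev = Array.from({ length: y.length + 1 }, (_, j) => j);
  for (let i = 1; i <= x.length; i++) {
    const curr = [i];
    for (let j = 1; j <= y.length; j++) {
      const cost = x[i - 1] === y[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[y.length];
}

// `before` and `after` are { label, score } objects from the classifier.
// A heist succeeds if the original scored at least 0.99 NEGATIVE and the
// edited text scores at least 0.90 POSITIVE. Fewer words changed is better.
function scoreHeist(original, edited, before, after) {
  const success =
    before.label === 'NEGATIVE' && before.score >= 0.99 &&
    after.label === 'POSITIVE' && after.score >= 0.90;
  return { success, wordsChanged: wordEditDistance(original, edited) };
}
```

Measuring edits at the word level rather than the character level keeps the score aligned with the goal of the game: swapping “bad” for “extraordinary” should cost one move, not thirteen.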

Smaller sentiment models have massive blind spots. They anchor so heavily to specific adjectives that you can completely override a sentence just by dropping in a payload word.

Here’s a concrete example. If you write: “The plot was a complete disaster and the acting made me want to leave.” DistilBERT flags that as overwhelmingly negative, exactly as it should.

But if you change it to: “The plot was a complete disaster and the acting made me want to leave, which is an absolute masterpiece of cinematic irony.”

Suddenly the model wavers. The word “masterpiece” is so strongly associated with positive reviews in its training data that it can outweigh everything that came before it. A human reader immediately recognizes the sarcasm, but the statistical model just sees a strongly positive token and shifts the final score toward positive.

Here’s roughly how I handle classification in the browser with transformers.js:

import { pipeline } from '@xenova/transformers';

// Load the sentiment model directly in the browser via WebAssembly
const classifier = await pipeline(
  'sentiment-analysis',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
);

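// Run the heist
const result = await classifier("The movie was a disaster, a true masterpiece of garbage.");
console.log(result);
// Logs an array with one entry: [{ label: 'POSITIVE' or 'NEGATIVE', score: between 0 and 1 }].
// The output is deterministic; what's hard to guess in advance is which way
// "disaster", "masterpiece", and "garbage" tip the score between them.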
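
To find out which word is doing the heavy lifting, one option is a leave-one-out test: delete each word in turn, re-score the sentence, and see how far the result moves. The sketch below is an assumption about how that could look, not something already in the lab. The function names are hypothetical, and `classify` stands in for any async function that resolves to an array of { label, score } objects, which is the shape the pipeline call returns.

```javascript
// Convert a binary { label, score } result into the probability of POSITIVE.
// With two classes, the losing label's probability is simply 1 - score.
function positiveProbability({ label, score }) {
  return label === 'POSITIVE' ? score : 1 - score;
}

// Leave-one-out ablation: drop each word, re-score the remaining text, and
// rank words by how much the POSITIVE probability changes without them.
// A large positive `shift` means the word was pushing the score up.
async function rankWordsByInfluence(classify, text) {
  const words = text.split(/\s+/).filter(Boolean);
  const [baseline] = await classify(text);
  const base = positiveProbability(baseline);
  const ranked = [];
  for (let i = 0; i < words.length; i++) {
    const without = words.filter((_, j) => j !== i).join(' ');
    const [result] = await classify(without);
    ranked.push({ word: words[i], shift: base - positiveProbability(result) });
  }
  return ranked.sort((a, b) => Math.abs(b.shift) - Math.abs(a.shift));
}
```

Calling `await rankWordsByInfluence(classifier, "...")` with the classifier loaded above costs one inference per word, which is fine for a one-line review running locally but would add up on long text.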

Once I got bored of breaking the sentiment analyzer, I dropped a zero-shot classifier into the lab to see if I could automatically tag my woodworking shop notes. You hand it a piece of text and a list of candidate labels, and it scores how well each label fits. I fed it a paragraph about adjusting the fence on my table saw to prevent kickback.

The model categorized my shop notes under “finance” and “violence”.

Technically, it wasn’t wrong. “Kickback” is a financial crime, and getting hit in the ribs by a piece of maple going 120 miles per hour is absolutely violent. But it wasn’t exactly the tagging system I was hoping for.
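
For reference, a zero-shot tagger in transformers.js can be sketched roughly like this. Treat the details as assumptions: the model id Xenova/mobilebert-uncased-mnli is just one small NLI model published for the library, not necessarily the one in the lab, and tagNote and confidentTags are hypothetical helper names.

```javascript
// Score a note against a list of candidate labels.
// Assumption: the model id below is illustrative; swap in whichever
// zero-shot model you actually ship.
async function tagNote(note, candidateLabels) {
  const { pipeline } = await import('@xenova/transformers');
  const classifier = await pipeline(
    'zero-shot-classification',
    'Xenova/mobilebert-uncased-mnli'
  );
  // multi_label scores each label independently, so several tags can apply.
  return classifier(note, candidateLabels, { multi_label: true });
}

// The pipeline resolves to { sequence, labels, scores }, where `labels` and
// `scores` are parallel arrays. Keep only labels at or above the threshold.
function confidentTags(output, threshold = 0.5) {
  return output.labels.filter((_, i) => output.scores[i] >= threshold);
}

// Usage (inside an async context):
//   const output = await tagNote(note, ['woodworking', 'safety', 'finance', 'violence']);
//   console.log(confidentTags(output));
```

The model only ever picks among the labels it is given, so the choice of candidate labels matters as much as the model does.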

It’s easy to forget how literal these systems are when we spend all day talking to massive models like Sonnet or GPT-4. Those models have billions of parameters and enough training behind them to catch most of our sarcasm and work out context. DistilBERT has roughly 66 million.

Strip that away and run a tiny model locally, and you see the seams. It’s not actually understanding you; it’s doing math on your vocabulary. And if you know the math, you can make it say whatever you want.

I’m seriously considering spinning Sentiment Heist out into a standalone browser game. I just need to close the LinkedIn tab that’s eating a gigabyte of my RAM first, and maybe stop getting distracted by old Gregory Brothers videos on YouTube long enough to write the code.