AI detection became a multi-million dollar category roughly five minutes after ChatGPT launched. Schools needed it. Publishers needed it. Hiring managers wanted it. Within a year you had GPTZero, Originality.ai, Turnitin, Copyleaks, Winston, and a long tail of smaller tools all claiming high accuracy.

The thing nobody really emphasized at the time is that all of these tools, even the ones built by people who genuinely know what they're doing, are working with a fundamentally fuzzy signal. There is no chemical test for AI writing. There is no DNA. There is only a statistical fingerprint, and that fingerprint blurs every time a new model ships.

Let's get into how the detection actually works, because once you see it, the failure modes make a lot more sense.

The two main approaches

Almost every AI detector on the market today uses some combination of two techniques. The names get fancier in marketing but the core ideas are simple.

Approach one: statistical signals like perplexity and burstiness

This was the original GPTZero approach and it's still the bedrock of most detectors. The idea is that language models are trained to produce high-probability next words, which means model-generated text tends to be statistically "smoother" than human text.

Two specific measures get used.

Perplexity is roughly a measure of how surprising each next word is to a reference language model doing the scoring. AI text has low perplexity because, almost by definition, the model picked the words a model would pick. Human text has higher perplexity because humans make weirder choices.

Burstiness is a measure of how much the perplexity varies across the document. Humans write in spurts. A few long sentences, then a short one, then a fragment, then a normal one. Models tend to produce more uniform output. Sentence after sentence at roughly the same complexity.

A detector using this approach computes perplexity and burstiness for the document, compares it to a learned baseline, and outputs a probability that the text is AI-generated.
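
Here's a minimal sketch of the idea in code, assuming GPT-2 as the reference model. Real detectors use stronger scoring models, proper sentence splitting, and thresholds learned from labeled data; the function names here are illustrative, not anything a vendor ships.

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        # The loss returned by the model is the mean negative log-likelihood
        # per token; exponentiating it gives perplexity. Lower means the
        # reference model found the text less surprising.
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(enc.input_ids, labels=enc.input_ids)
        return math.exp(out.loss.item())

    def burstiness(text: str) -> float:
        # Crude burstiness proxy: the spread of per-sentence perplexity.
        # Human writing tends to swing more from sentence to sentence.
        sentences = [s.strip() for s in text.split(".") if len(s.split()) > 3]
        scores = [perplexity(s) for s in sentences]
        if len(scores) < 2:
            return 0.0
        mean = sum(scores) / len(scores)
        return (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5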

This is elegant. It's also brittle, because a careful human writer who happens to write smoothly, like an experienced technical writer or an academic, will produce low-perplexity, low-burstiness text, and the detector will flag them. This is the core source of the false positive problem.

Approach two: trained classifiers

The second approach is to train a neural network classifier on a giant corpus of human-written and AI-written text. The classifier learns to distinguish them based on whatever subtle features it can find, without anyone needing to specify perplexity or burstiness or any other hand-engineered feature.
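
A stripped-down version of that pipeline, sketched with scikit-learn under the assumption that you've already collected labeled corpora. Commercial detectors fine-tune transformer models rather than fitting n-gram features, but the shape of the training loop is the same.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline, make_pipeline

    def train_detector(human_texts: list[str], ai_texts: list[str]) -> Pipeline:
        # Label human text 0 and AI text 1, then let the model find whatever
        # separates them. Character n-grams pick up diction and punctuation
        # habits without any hand-engineered features.
        X = human_texts + ai_texts
        y = [0] * len(human_texts) + [1] * len(ai_texts)
        clf = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=50_000),
            LogisticRegression(max_iter=1000),
        )
        clf.fit(X, y)
        return clf

    # detector = train_detector(human_essays, model_samples)
    # detector.predict_proba(["some unseen document"])[:, 1]  # P(AI-generated)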

Originality.ai, Copyleaks, and most of the modern commercial detectors lean heavily on this approach. They train on output from many specific models (GPT-3.5, GPT-4, Claude, Gemini, Llama, etc.) and they update the classifier as new models ship.

This is more powerful than the statistical approach in steady state. The classifier picks up on patterns no human would think to encode. The catch is that classifiers degrade fast when the distribution shifts, which happens every time OpenAI or Anthropic releases a new model with slightly different stylistic priors. Detectors that were 95 percent accurate against GPT-3.5 routinely drop to 70 percent or less against the next generation, until they retrain.

The watermarking detour

For a while there was a lot of optimism about a third approach: cryptographic watermarking. The idea is that when a model generates text, it would subtly bias its word choices according to a secret key, in a way that's statistically detectable but invisible to readers. OpenAI publicly experimented with this. Google has shipped a version called SynthID.
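
To make the mechanism concrete, here's a toy version of the "green list" scheme from the academic watermarking literature. Everything below is illustrative; production systems like SynthID bias logits inside the sampler and use proper statistical tests, not this exact code.

    import hashlib

    def green_list(prev_token: str, vocab: list[str], key: str, fraction: float = 0.5) -> set[str]:
        # Deterministically rank the vocabulary using the secret key plus the
        # previous token, and call the top `fraction` of it "green". During
        # generation the sampler would nudge green tokens' probabilities up.
        seed = hashlib.sha256((key + prev_token).encode()).hexdigest()
        ranked = sorted(vocab, key=lambda w: hashlib.sha256((seed + w).encode()).hexdigest())
        return set(ranked[: int(len(ranked) * fraction)])

    def green_fraction(tokens: list[str], vocab: list[str], key: str) -> float:
        # Detection: check how often each token falls in the green list seeded
        # by its predecessor. Unwatermarked text lands near `fraction` by
        # chance; watermarked text scores noticeably higher.
        hits = sum(tokens[i] in green_list(tokens[i - 1], vocab, key)
                   for i in range(1, len(tokens)))
        return hits / max(len(tokens) - 1, 1)

Notice that both functions take the secret key. That detail drives the first structural problem below.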

Watermarking is real, and where it's been deployed, it works reasonably well on long enough samples. But it has three structural problems.

One, only the model provider can detect their own watermark. Anyone else has to ask politely. Two, watermarking gets destroyed by paraphrasing, translating, or running the text through another model. Three, no major provider has turned watermarking on universally across its public APIs, because doing so would hurt user experience and create a competitive disadvantage against non-watermarking rivals.

Watermarking is going to be part of the long-term answer, but in 2026 it's not a meaningful factor in any commercial detector you can buy.

Why false positives keep happening

Here is the core mathematical problem with AI detection. Even a detector with a stated 99 percent accuracy will produce a meaningful number of false positives in any real-world deployment.

Imagine you run a 99 percent accurate detector on a thousand human-written essays. By the stated accuracy, about ten of those essays will be falsely flagged as AI. If you're a student in that group, the detector's accuracy doesn't matter. You are wrongly accused of cheating, and the school's policy probably says the burden is on you to prove otherwise.
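
The base rate makes it worse. Here's the back-of-the-envelope Bayes calculation, assuming the 99 percent figure applies symmetrically to both error types, which vendors rarely specify:

    def p_ai_given_flag(base_rate: float, tpr: float = 0.99, fpr: float = 0.01) -> float:
        # Bayes' rule: P(text is actually AI | detector flagged it).
        flagged_ai = base_rate * tpr           # true positives
        flagged_human = (1 - base_rate) * fpr  # false positives
        return flagged_ai / (flagged_ai + flagged_human)

    # If only 5 percent of submissions are actually AI-written, roughly
    # 1 in 6 of the essays the detector flags belongs to an innocent human:
    print(p_ai_given_flag(0.05))  # ~0.84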

And 99 percent accuracy is the marketing claim. In real-world tests, the numbers are worse. Studies have repeatedly found commercial detectors with 5 to 15 percent false positive rates on human writing, especially when the writer is a non-native English speaker, writes in a more formal register, or is a younger student whose prose is naturally less varied.

None of this means the detector vendors are dishonest. It's a hard problem. The signal they are trying to extract genuinely is fuzzy, and the underlying distribution shifts every quarter.

The specific patterns detectors look for

If you want to understand why your writing might get flagged, here's the rough list of features that any modern detector is implicitly looking at (a toy sketch of a few of them follows the list):

  • Sentence length variance (lower in AI text)
  • Word rarity distribution (AI prefers safe, common words in expected positions)
  • Punctuation patterns, especially em dash density and bullet usage
  • Transition word frequency (AI overuses "however," "moreover," "in addition," etc.)
  • Symmetric paragraph length
  • Triplet rhythm in lists ("X, Y, and Z" structures)
  • Hedging frequency ("it's worth noting," "it's important to remember")
  • Specific lifted vocabulary (the famous "delve," "tapestry," "ever-evolving" cluster)
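
Here's the promised toy sketch of three of those features. These are naive versions; a commercial classifier learns far subtler variants of the same signals:

    import re
    import statistics

    TRANSITIONS = {"however", "moreover", "furthermore", "additionally", "consequently"}

    def surface_features(text: str) -> dict[str, float]:
        sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
        lengths = [len(s.split()) for s in sentences]
        words = re.findall(r"[a-z']+", text.lower())
        return {
            # AI text tends toward lower sentence-length variance
            "sentence_len_stdev": statistics.pstdev(lengths) if len(lengths) > 1 else 0.0,
            # em dashes per 1,000 characters ("\u2014" is the em dash)
            "em_dash_density": 1000 * text.count("\u2014") / max(len(text), 1),
            # transition words per 100 words
            "transition_rate": 100 * sum(w in TRANSITIONS for w in words) / max(len(words), 1),
        }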

If your writing exhibits a lot of these features, even if you wrote every word yourself, you're going to score high on most detectors. We wrote about which specific tells matter most in Em Dashes, "Delve", and the Other Dead Giveaways of AI Writing.

What this means in practice

If you're an educator or a publisher relying on AI detection as proof of anything, please don't. The detectors are useful as a soft signal. They are not useful as a verdict. A high AI score should trigger a conversation, not a punishment.

If you're a writer being flagged unfairly, the unfortunate practical answer is that you may need to humanize your style. Even if your writing is 100 percent yours. The detectors are pattern matchers and your patterns happen to overlap with what the model produces. This is unfair. It's also the world we live in.

The same humanizing techniques that fool detectors also produce better writing for human readers. Specific examples, varied rhythm, fewer hedges, less lifted diction. We laid out the full playbook in How To Humanize AI Writing in 2026.

The arms race nobody is winning

Every few months a detector vendor releases a new model that "solves" the latest GPT version. A few weeks later, the next GPT version ships and the accuracy drops again. People build humanizers. Detectors train against humanizers. Humanizers train against the new detectors.

This will continue until either watermarking gets adopted universally, which would require regulation, or until enough false positives accumulate that the detectors lose institutional credibility. Both of those are possible. Neither is imminent.

In the meantime, if you're shipping AI-assisted writing at any scale, the practical move is to write well. The detectors that are easiest to fool are the same ones that flag the most innocent humans. The detectors that work better are the ones that respond to actual writing quality. Either way, the work is the same. Write specifically, write with a voice, vary your rhythm, and don't lean on the model's defaults.

If you want a tool that benchmarks your text against the same patterns the major detectors use, that's what we built Cloak for. It scores you on the actual signal the detectors are reading, then rewrites with natural cadence. Not a magic bullet, but it'll get you the first 70 percent of the way there.