We Ran 5 AI Writing Detectors Against Each Other. Results.

Updated · May 8, 2026

We submitted 60 identical text samples to five of the most widely used AI writing detectors and logged every result. The most revealing number from our April 2026 benchmark isn’t a detection rate. It’s that the tools disagreed with each other on 29% of samples.

That figure alone should inform how much weight you put on any single detector’s verdict. The tools we tested: Originality.ai, GPTZero, Turnitin’s AI detector, Copyleaks, and Winston AI.

The test setup

We built a corpus of 60 samples, each between 400 and 600 words, split into three equal groups of 20:

  • Human-written: personal essays, feature journalism, and long-form blog posts published before 2022 and manually verified as pre-LLM
  • Raw AI output: generated by GPT-4o using standardized prompts, no editing applied
  • Humanized AI output: the same GPT-4o samples run through a commercial humanizer before submission

All five tools received every sample within the same 48-hour window. We recorded binary verdicts and confidence scores wherever the interface provided them. No prompts were tuned, no submissions were retried, and no results were selectively excluded. One caveat on Turnitin: it operates under institutional licensing, and we accessed it through a test account. Results in a fully configured academic deployment may differ slightly from ours.

Round 1 — can they tell human writing from machine output?

The false positive rate is the metric most people skip, and it’s the one that matters most in practice. A detector that routinely flags genuine human writing as AI-generated isn’t just inaccurate — it poisons the review process and erodes trust in the tool entirely.

In our April 2026 benchmark, GPTZero had the best result, misidentifying just 4% of human samples as AI. Copyleaks followed at 6%, Turnitin at 8%, Winston AI at 9%, and Originality.ai had the highest false positive rate in our dataset at 12%.

For scale: at 12%, a content manager reviewing 50 freelance submissions per week would see roughly 6 false accusations against genuine human writers. That’s a workflow problem, not a rounding error. The content that triggered the most false positives across all tools was technical and academic prose — structured, formal writing that apparently shares surface-level features with LLM output even when written by humans. Personal, conversational writing almost never got flagged.
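The arithmetic behind that estimate is easy to reproduce for any tool and workload. This is a minimal sketch that assumes all 50 weekly submissions are genuinely human-written and uses the Round 1 false positive rates; the function name is ours, chosen for illustration.

```python
# False positive rates from Round 1 of the April 2026 benchmark.
FALSE_POSITIVE_RATES = {
    "GPTZero": 0.04,
    "Copyleaks": 0.06,
    "Turnitin": 0.08,
    "Winston AI": 0.09,
    "Originality.ai": 0.12,
}


def expected_false_flags(fpr: float, human_submissions: int) -> float:
    """Expected number of human-written pieces wrongly flagged as AI."""
    return fpr * human_submissions


# Assumption: every one of the 50 weekly submissions is human-written.
weekly_flags = {
    tool: expected_false_flags(rate, 50)
    for tool, rate in FALSE_POSITIVE_RATES.items()
}

for tool, flags in weekly_flags.items():
    print(f"{tool}: about {flags:.1f} false flags per week")
```

If only part of your intake is human-written, pass that smaller count as `human_submissions` instead of the full 50.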

GPTZero’s 4% false positive rate was the best in our field — but that conservatism has a cost, as Round 2 shows.

Round 2 — who actually catches raw GPT-4o output?

Unedited LLM content has recognizable structural signatures: predictable pacing, formulaic transitions, even paragraph lengths. This is the soft pitch of AI detection, and most tools handle it reasonably well. Still, the variance in our results was meaningful.

Originality.ai led at 96% detection. Turnitin followed at 94%. GPTZero hit 91%, Winston AI 89%, and Copyleaks 88%. To illustrate what the lower-performing tools occasionally missed, here’s a sample that two of the five tools classified as human-written:

“The relationship between dietary fiber and gut microbiome diversity has become an active area of nutritional research. Studies consistently link higher fiber intake with greater bacterial species richness, which in turn correlates with improved immune function and reduced inflammation markers. Integrating varied plant-based foods is among the most evidence-backed strategies for supporting long-term digestive health.”

That’s unmodified GPT-4o. Three detectors flagged it immediately. Two passed it as human.

Round 3 — humanized AI content breaks every tool

This is the test that reflects the actual landscape in 2026. Humanization tools are fast, cheap, and widely used. When we ran our 20 AI samples through a commercial humanizer before submitting them, detection rates collapsed across every platform without exception.

Originality.ai fell from 96% to 61%. GPTZero dropped from 91% to 58%. Winston AI went from 89% to 54%. Turnitin dropped from 94% to 52%. Copyleaks, which had the lowest raw AI detection rate, fell to 49% — effectively a coin flip.

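| Tool | False positive rate | Raw AI detection | Humanized AI detection | Pricing |
| --- | --- | --- | --- | --- |
| Originality.ai | 12% | 96% | 61% | From $14.95/mo |
| GPTZero | 4% | 91% | 58% | Free / from $10/mo |
| Turnitin | 8% | 94% | 52% | Institution pricing |
| Copyleaks | 6% | 88% | 49% | From $9.99/mo |
| Winston AI | 9% | 89% | 54% | From $12/mo |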

Between 39% and 51% of humanized AI content slipped past each detector we tested. That’s not a criticism of any specific product; it’s the current ceiling of the technology.
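The size of the collapse is easier to compare when each tool's drop and miss rate are computed side by side. A short sketch using only the detection rates from the table; the `summarize` helper is a name we made up for this example.

```python
# (raw AI detection, humanized AI detection) in percent, from the table.
DETECTION = {
    "Originality.ai": (96, 61),
    "GPTZero": (91, 58),
    "Turnitin": (94, 52),
    "Copyleaks": (88, 49),
    "Winston AI": (89, 54),
}


def summarize(raw: int, humanized: int) -> dict:
    """Return the drop in percentage points and the share of humanized samples missed."""
    return {"drop_points": raw - humanized, "missed_pct": 100 - humanized}


summary = {tool: summarize(*rates) for tool, rates in DETECTION.items()}

for tool, s in summary.items():
    print(f"{tool}: down {s['drop_points']} points, misses {s['missed_pct']}% of humanized samples")
```

On these numbers, Turnitin shows the largest drop and GPTZero the smallest, while Originality.ai misses the fewest humanized samples.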

What surprised us

The inter-tool disagreement was the finding we hadn’t fully anticipated. Across all 60 samples, the five tools reached unanimous agreement on only 71% of cases. On 17 of 60 samples, at least one tool returned a different verdict than the others. In three cases from the humanized AI batch, the tools split three ways: two said AI, two said human, and one returned a borderline confidence score below 55%.
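For anyone who wants to run the same check on their own samples, here is a minimal sketch of how unanimity across tools can be counted. The verdict rows are made-up placeholders, not our benchmark data, and the function name is hypothetical.

```python
from typing import Sequence


def disagreement_rate(verdicts: Sequence[Sequence[str]]) -> float:
    """Share of samples on which the tools did not all return the same verdict.

    Each inner sequence holds one sample's verdicts, one entry per tool.
    """
    if not verdicts:
        raise ValueError("need at least one sample")
    split = sum(1 for sample in verdicts if len(set(sample)) > 1)
    return split / len(verdicts)


# Placeholder verdicts for four samples across five tools (illustrative only).
toy_verdicts = [
    ["ai", "ai", "ai", "ai", "ai"],                  # unanimous
    ["human", "human", "human", "human", "human"],   # unanimous
    ["ai", "ai", "human", "ai", "ai"],               # one dissenting tool
    ["ai", "human", "ai", "human", "ai"],            # mixed verdicts
]
toy_rate = disagreement_rate(toy_verdicts)

# The benchmark's own count: 17 of 60 samples had at least one dissenting tool.
benchmark_rate = 17 / 60
print(f"toy: {toy_rate:.0%}, benchmark: {benchmark_rate:.1%}")
```

Seventeen of 60 works out to a little over 28%, which lines up with the 71% agreement and 29% disagreement figures to within rounding.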

We also found that Originality.ai’s confidence percentages were better calibrated than most. When it returned 90%+ AI confidence, it was right in every case. When it returned 50-60%, the content was genuinely ambiguous. GPTZero’s scores clustered in a narrower band, which made its confidence numbers less actionable as a signal for when to escalate to human review.
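One way to check calibration like this on your own data is to bucket the confidence scores and compare how often each bucket turned out to be AI. The sketch below uses invented scores; the bucket edges and function name are our own assumptions and do not reflect any vendor's API.

```python
from collections import defaultdict


def accuracy_by_bucket(results, edges=(0.5, 0.7, 0.9)):
    """Group (ai_confidence, truly_ai) pairs by confidence bucket.

    Returns {bucket_lower_edge: fraction of samples in that bucket that were AI}.
    Scores below the first edge are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # lower edge -> [truly_ai count, total]
    for confidence, truly_ai in results:
        eligible = [e for e in edges if confidence >= e]
        if not eligible:
            continue
        bucket = max(eligible)  # highest edge the score reaches
        counts[bucket][0] += int(truly_ai)
        counts[bucket][1] += 1
    return {b: hits / total for b, (hits, total) in sorted(counts.items())}


# Invented example scores: (tool's AI confidence, whether the sample was AI).
toy_results = [
    (0.97, True), (0.93, True), (0.91, True),     # high confidence
    (0.78, True), (0.72, False),                  # middle band
    (0.58, True), (0.55, False), (0.52, False),   # borderline
    (0.30, False),                                # below the first edge, skipped
]
calibration = accuracy_by_bucket(toy_results)
for lower_edge, share in calibration.items():
    print(f"confidence >= {lower_edge:.0%}: {share:.0%} were AI")
```

A well-calibrated tool shows the share of true AI rising with the confidence bucket. A tool whose scores cluster in one narrow band leaves most buckets empty, which is what makes its numbers harder to act on.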

One unexpected result: Turnitin performed worse than Originality.ai on humanized AI (52% vs. 61%) while posting a lower false positive rate (8% vs. 12%). That’s a real tradeoff if your concern is sophisticated AI use rather than first-draft generation. A tool tuned to avoid wrongly accusing humans may also be more easily fooled by content that’s been processed to read more human, although GPTZero, with the lowest false positive rate, still caught 58% of humanized samples.

The raw verdict

For content agencies or publications worried about falsely accusing human writers: GPTZero’s 4% false positive rate makes it the defensible starting point. Originality.ai is the stronger call if your priority is catching unedited AI output and you have a human review or appeals layer for flagged cases.

For operations that can’t afford to miss AI-generated work but also can’t afford false accusations, running both tools in parallel and requiring agreement before acting on a result is more defensible than trusting either alone. The extra few minutes per piece is cheaper than one wrongly disputed writer relationship.
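To see what requiring agreement buys and what it costs, here is a back-of-the-envelope sketch using our benchmark rates for GPTZero and Originality.ai. It assumes the two tools' errors are statistically independent, which is probably too optimistic since detectors tend to trip on similar text, so treat the outputs as rough bounds and not as measured results.

```python
def both_flag(rate_a: float, rate_b: float) -> float:
    """Probability that both tools flag a sample, ASSUMING independent errors."""
    return rate_a * rate_b


# Rates from the benchmark: false positives, raw AI detection, humanized AI detection.
gptzero = {"fpr": 0.04, "raw": 0.91, "humanized": 0.58}
originality = {"fpr": 0.12, "raw": 0.96, "humanized": 0.61}

# "Act only when both tools agree the text is AI" under the independence assumption.
combined = {key: both_flag(gptzero[key], originality[key]) for key in gptzero}

for key, value in combined.items():
    print(f"{key}: {value:.1%}")
```

Under that assumption, the combined false positive rate falls below 1%, but detection of humanized content falls to roughly 35%. If the tools' errors are positively correlated, the real figures would sit between these products and the lower of the two standalone rates. Either way, the agreement rule reduces false accusations at the price of missing more AI content.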

No tool in our benchmark performed well enough on humanized content to function as a final judgment. The 49%-61% detection ceiling in Round 3 means a significant share of processed AI will clear every detector in this comparison. For high-stakes review — academic submissions, editorial fact-checks, legal content — that gap has to be covered by human judgment, not detector confidence scores.

Frequently asked questions

Which AI writing detector is most accurate overall?

It depends on what you’re optimizing for. Originality.ai led on raw AI detection at 96% but had the highest false positive rate at 12%. GPTZero balanced the two most effectively: 91% detection with only a 4% false positive rate. No single tool dominated across all three test rounds in our benchmark.

Can any AI detector reliably catch humanized AI content?

No, based on our April 2026 testing. The top performer caught 61% of humanized samples, meaning close to 4 in 10 evaded it entirely. This is a limitation of current detection technology, not a specific product failure — all five tools we tested showed significant drops between raw AI and humanized AI detection rates.

Is Turnitin’s AI detector worth it for academic institutions?

For unedited first-draft AI submissions, yes — 94% detection is strong, and the institutional workflow integration is a genuine advantage over standalone tools. For detecting students who’ve processed AI output through a humanizer, it drops to 52%, which is roughly in line with its competitors on that harder task.

The 29% inter-tool disagreement rate from our benchmark is the finding we’d most want reviewers to carry forward: no single detector should be the end of the conversation. Use these tools as a first filter, flag borderline cases for human review, and build your process around the assumption that any single score can be wrong.

This article contains affiliate links. If you subscribe through one, we may earn a commission at no extra cost to you. It never changes what we recommend — we only link to tools we actually use. Full disclosure.
