We Compared ChatGPT, Claude & Gemini on 20 Real Tasks

Updated · June 24, 2026

Most AI comparisons are written by people who spent 20 minutes with each tool. We spent two weeks, designed 20 tasks across five real-work categories, and scored the outputs blind — shuffled and labeled A/B/C before anyone evaluated them. Claude won. But the more interesting story is by how much, and where ChatGPT and Gemini still beat it.

The setup: how we ran this

Twenty tasks, five categories, four tasks each: writing, research and accuracy, coding, reasoning and logic, and creative work. We used paid tiers across the board — ChatGPT Plus, Claude Pro, and Gemini Advanced — all tested in the first two weeks of June 2026. Same prompts, same order, no re-prompting or cherry-picking.
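
For readers who want to reproduce the blind step described in the introduction, here is a minimal sketch of shuffling and relabeling outputs as A/B/C. The function name `blind_outputs`, the seed, and the placeholder response strings are illustrative assumptions, not the tooling we actually used.

```python
import random

def blind_outputs(outputs, seed=None):
    """Shuffle tool outputs and relabel them A, B, C so scorers cannot see the source.

    `outputs` maps a tool name to its response text. Returns (blinded, key):
    `blinded` maps a letter to a response, `key` maps that letter back to the tool.
    """
    rng = random.Random(seed)
    tools = list(outputs)
    rng.shuffle(tools)  # random order decides which tool becomes A, B, or C
    labels = [chr(ord("A") + i) for i in range(len(tools))]
    blinded = {label: outputs[tool] for label, tool in zip(labels, tools)}
    key = dict(zip(labels, tools))
    return blinded, key

# Hypothetical placeholder responses for a single task.
task_outputs = {
    "ChatGPT": "response text 1",
    "Claude": "response text 2",
    "Gemini": "response text 3",
}
blinded, key = blind_outputs(task_outputs, seed=7)
# Scorers see only `blinded`; `key` stays sealed until scoring is finished.
```

Keeping the key separate from the blinded outputs is the point of the design: whoever scores never handles the mapping.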

Coding tasks were evaluated against actual test suites. Everything else was scored by two editors independently on a rubric covering accuracy, instruction-following, output quality, and how much editing the result needed before it was usable. Where they disagreed by more than a point, we discussed to resolution. We didn’t average away the disagreements — we settled them.
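
The disagreement rule above can be sketched in a few lines. This assumes a 1 to 5 scale and a per-criterion comparison, neither of which is spelled out in our rubric description, and the scores shown are made up for illustration.

```python
# Rubric criteria named in the text; the identifiers are our own shorthand.
RUBRIC = ("accuracy", "instruction_following", "output_quality", "editing_needed")

def needs_discussion(scores_a, scores_b, threshold=1):
    """Return the rubric criteria where two editors differ by more than `threshold`."""
    return [c for c in RUBRIC if abs(scores_a[c] - scores_b[c]) > threshold]

# Hypothetical scores on an assumed 1-5 scale.
editor_1 = {"accuracy": 5, "instruction_following": 4, "output_quality": 4, "editing_needed": 2}
editor_2 = {"accuracy": 5, "instruction_following": 2, "output_quality": 3, "editing_needed": 2}

flagged = needs_discussion(editor_1, editor_2)
print(flagged)  # ['instruction_following']
```

Anything in `flagged` goes to discussion rather than being averaged, which matches the "settle, don't average" approach.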

Before we locked the final task list, we ran a dry run on May 8 — MacBook Pro M3 Max, all three tools open in separate Chrome profiles, same 2,400-word internal product brief uploaded as a PDF. Claude summarized it accurately and flagged a contradiction between section two and the appendix without being asked. ChatGPT’s summary soft-invented a statistic that wasn’t in the document. Gemini rejected the PDF on the first upload attempt, accepted it on the second, then returned a summary that was factually fine but nearly twice the requested length. That pattern — technically correct, practically uncontrolled — showed up again and again across the full 20 tasks.

| Category | Tasks | What we measured |
|---|---|---|
| Writing | 4 | Usability, tone accuracy, instruction-following |
| Research & accuracy | 4 | Factual correctness, sourcing behavior, hallucination rate |
| Coding | 4 | Tests passed, correctness, code readability |
| Reasoning & logic | 4 | Correct final answer, quality of reasoning chain |
| Creative | 4 | Originality, adherence to constraints, usability |

Writing tasks: the gap is bigger than we expected

Claude took three of four writing tasks. The margins were not close. On June 12 we gave all three tools the same brief: a 600-word product description for noise-canceling headphones aimed at remote workers, no buzzwords, focus on use case not specs. Claude’s output needed zero edits. Gemini’s was structurally sound but rhythmically generic — benefit, spec, benefit, spec, repeat. And ChatGPT:

“In today’s fast-paced remote work environment, finding the right tools to stay connected and productive is more important than ever.” That was ChatGPT’s opening line. We counted a variation of that phrase in four of its twenty outputs across the entire test.

ChatGPT did win one: the professional rejection email. Its version was warmer and more natural-sounding than Claude’s, which was excellent but slightly formal. Gemini’s rejection email read like it was drafted by someone who had read about empathy but not experienced it.

| Task | Winner | Runner-up | Notes |
|---|---|---|---|
| Product description | Claude | Gemini | ChatGPT defaulted to opener clichés |
| Rejection email | ChatGPT | Claude | Warmth and naturalness edge to ChatGPT |
| Press release | Claude | ChatGPT | Gemini’s was too short, missing standard sections |
| Balanced blog intro | Claude | ChatGPT | Gemini over-hedged to the point of saying nothing |

Which AI wins on research and fact-checking?

Gemini is the clearest choice for research tasks that require current information — it won three of four in this category. Its Google grounding gives it a genuine structural advantage over the others, not just marginally better recall.

We asked all three about EU AI Act compliance requirements updated in early 2026. Gemini had the current answer. Claude acknowledged some uncertainty and gave an answer that was accurate but about six months behind. ChatGPT gave a confident answer that was roughly 14 months out of date. Confident and wrong is the worst combination.

On the fact-checking task — a paragraph with three deliberate errors seeded in — Gemini found all three and cited the correct source for each. Claude caught two. ChatGPT caught two but also flagged a true statement as incorrect, which is a different and arguably worse failure mode than missing a real error.
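
One way to see why a false flag is a different failure from a miss is to compute precision and recall for each tool. This is a rough sketch: the function name is ours, and it assumes Gemini and Claude raised no false flags, since we recorded none for them.

```python
def flag_quality(true_errors_found, false_flags, seeded_errors=3):
    """Return (precision, recall) for a fact-checking pass, rounded to two decimals.

    precision = real errors found / everything the tool flagged
    recall    = real errors found / errors actually seeded in the paragraph
    """
    flagged = true_errors_found + false_flags
    precision = true_errors_found / flagged if flagged else 0.0
    recall = true_errors_found / seeded_errors
    return round(precision, 2), round(recall, 2)

results = {
    "Gemini": flag_quality(3, 0),   # found all three seeded errors
    "Claude": flag_quality(2, 0),   # assumes no false flags; none were recorded
    "ChatGPT": flag_quality(2, 1),  # two real errors plus one true statement flagged
}
print(results)
# {'Gemini': (1.0, 1.0), 'Claude': (1.0, 0.67), 'ChatGPT': (0.67, 0.67)}
```

Claude and ChatGPT have the same recall, but only ChatGPT loses precision, which means an editor would have to double-check everything it flags.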

| Task | Winner | Runner-up | Notes |
|---|---|---|---|
| Fact-checking | Gemini | Claude | ChatGPT flagged a correct fact as wrong |
| Document summary | Gemini | ChatGPT | Claude was accurate but too long by half |
| Regulatory Q&A | Gemini | Claude | ChatGPT’s answer was 14 months out of date |
| Competing studies synthesis | ChatGPT | Gemini | Claude was thorough but hard to skim |

Coding tasks: ChatGPT still earns it here

We ran four coding tasks: debug a 40-line Python function, write a React component from a written spec, explain a 20-line JavaScript function to a non-technical audience, and refactor a messy snippet for readability. ChatGPT won three. This is probably the area where its reputation is most justified.

On the React component task, ChatGPT produced working code that passed our test suite on the first attempt. Claude required one edit — a missing prop type. Gemini’s code ran but added unnecessary state management that would confuse any junior developer inheriting the file. It’s a subtle problem: technically functional, practically annoying.

Claude won the code explanation task by a clear margin. Its explanation for a non-technical audience used an analogy that actually landed, while ChatGPT’s assumed vocabulary the brief specifically said to avoid. This is a consistent pattern: ChatGPT writes for developers, Claude models the actual intended reader.

| Task | Winner | Runner-up | Notes |
|---|---|---|---|
| Debug Python function | ChatGPT | Claude | Both found the bug; ChatGPT’s explanation was cleaner |
| React component from spec | ChatGPT | Claude | Gemini added unnecessary complexity |
| Code explanation (non-technical) | Claude | Gemini | ChatGPT ignored the “no jargon” constraint |
| JavaScript refactor | ChatGPT | Claude | Gemini’s version introduced a subtle scoping issue |

Reasoning and creative: Claude’s clearest domain

Four reasoning tasks: a multi-step word problem, a classic logic puzzle, argument flaw identification, and project planning from a vague brief. Claude won three. The argument flaw task was the most telling — we buried one logical error in paragraph three of a business case. Claude named it in its opening sentence and explained exactly how it undermined the conclusion. ChatGPT identified a different, less significant issue. Gemini found the right flaw but wrapped it in so many qualifications it was nearly invisible.

Creative tasks split more evenly. There were four: product naming, a thriller opening, pitch angle generation, and a satirical headline. Claude and ChatGPT each won two. Gemini won none, though its thriller opening paragraph was the most original of the three. It exceeded the brief’s word limit by 40%, which is why it lost. Consistently ignoring constraints is Gemini’s recurring failure mode in this test.

| Task | Winner | Runner-up | Notes |
|---|---|---|---|
| Multi-step word problem | Claude | ChatGPT | Gemini got the wrong answer confidently |
| Logic puzzle | ChatGPT | Claude | Both correct; ChatGPT’s step-by-step was clearer |
| Argument flaw identification | Claude | Gemini | ChatGPT identified the wrong flaw |
| Project planning from vague brief | Claude | ChatGPT | Gemini’s plan had gaps in weeks 3–4 |
| Product naming (with constraints) | ChatGPT | Claude | ChatGPT followed constraints more precisely |
| Thriller opening paragraph | Claude | ChatGPT | Gemini’s was the most original but too long |
| Pitch angle generation | Claude | ChatGPT | Gemini repeated two angles with different wording |
| Satirical headline | ChatGPT | Claude | ChatGPT’s was genuinely funny; Claude’s was clever but flat |

What surprised us

Gemini’s complete shutout in creative and reasoning wasn’t what we expected going in. The constraint-following failures weren’t random — they were consistent. Gemini appears to optimize for completeness over appropriateness, adding more when the brief asked for less, qualifying when it should conclude. That’s a fixable problem in theory, but it wasn’t fixed as of June 2026.

We almost cut Gemini from this year’s comparison entirely. In our November 2025 round it placed last in every non-research category, and there was a real internal debate about whether it belonged in the same test as Claude and ChatGPT. We kept it in. Three wins from twenty tasks later, that debate is open again.

The writing gap between ChatGPT and Claude is bigger than it was when we ran a similar 12-task test in November 2025. That time, ChatGPT’s writing outputs were genuinely competitive. Now there’s a clear tier difference — Claude follows tonal instructions in a way ChatGPT simply doesn’t, and ChatGPT’s tendency toward opener clichés has gotten worse, not better.

One thing that didn’t surprise us: Gemini’s research results. That’s clearly where Google has focused its energy, and it shows in a real way. If your work lives in Google Workspace and you’re doing a lot of research-adjacent tasks, Gemini Advanced is underpriced relative to what it delivers in that narrow lane.

Who actually won? The full 20-task breakdown

Claude won 9 of 20 tasks. ChatGPT won 8. Gemini won 3. But aggregate task counts don’t capture the pattern: Claude’s wins were spread across four of the five categories, ChatGPT’s clustered in coding and creative, and all three of Gemini’s were in research.
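
If you want to check the totals yourself, this short tally recounts the winners listed in the results tables. The winners come straight from those tables; the variable names are ours.

```python
from collections import Counter

# Winner of each task, in the order the tasks appear in the results tables.
WINNERS = {
    "writing": ["Claude", "ChatGPT", "Claude", "Claude"],
    "research": ["Gemini", "Gemini", "Gemini", "ChatGPT"],
    "coding": ["ChatGPT", "ChatGPT", "Claude", "ChatGPT"],
    "reasoning": ["Claude", "ChatGPT", "Claude", "Claude"],
    "creative": ["ChatGPT", "Claude", "Claude", "ChatGPT"],
}

# Overall wins per tool, then wins per tool within each category.
totals = Counter(winner for wins in WINNERS.values() for winner in wins)
by_category = {category: Counter(wins) for category, wins in WINNERS.items()}

print(dict(totals))  # {'Claude': 9, 'ChatGPT': 8, 'Gemini': 3}
for category, counts in by_category.items():
    print(category, dict(counts))
```

The per-category counts show the clustering more clearly than the headline totals do.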

What that means practically: if you write for work, Claude is the default. If you code daily and need reliable output fast, ChatGPT still justifies the subscription on coding alone. If your job involves checking facts against current sources or synthesizing recent documents, Gemini Advanced is worth running alongside whichever of the other two you use.

The take most reviews won’t publish: ChatGPT’s reputation as the default AI assistant is running on momentum from 2023. On the tasks that define most knowledge workers’ days — writing, reasoning, following nuanced instructions — it’s now second. That’s not a dismissal; second in this field is still exceptional. But if you’re paying for one tool, and you’re not primarily a developer, the answer in mid-2026 is Claude.

Frequently asked questions

Which AI is most accurate for research and current events?

Gemini, by a clear margin. Its Google grounding gives it access to more recent information than either Claude or ChatGPT, and in our tests it had lower hallucination rates on verifiable factual questions. For anything time-sensitive, it’s the right tool.

Is ChatGPT still the best for coding?

Yes. In our tests ChatGPT won three of four coding tasks. The margin wasn’t huge, and Claude was the runner-up on all three, but ChatGPT’s debugging rationale was cleaner and it passed more test cases on the first attempt. The one coding task it lost was explaining code to a non-technical audience, which Claude won.

Do these results apply to the free tiers?

We only tested paid tiers (ChatGPT Plus, Claude Pro, Gemini Advanced), so we can’t say with certainty. Free tiers of all three have meaningful limitations on context, speed, and access to the best underlying models — especially ChatGPT free, which doesn’t use GPT-4-class models by default.

These results reflect model capabilities as of early June 2026. All three tools update frequently, sometimes without public announcement. Treat this as a snapshot, not a permanent ranking. If something seems off from your own experience, trust the task in front of you over what we found in June 2026.

This article contains affiliate links. If you subscribe through one, we may earn a commission at no extra cost to you. It never changes what we recommend — we only link to tools we actually use. Full disclosure.
