We Tested AI Transcription on a Noisy Podcast Recording

AI Audio · By Take The AI Editorial Team · May 5, 2026 · Updated May 19, 2026 · 6 min read

Most transcription tool reviews test clean audio. We didn’t. We took a real 47-minute podcast episode — one speaker recorded near a window with street traffic bleeding through at roughly 55–60dB, the other with a persistent AC hum and mechanical keyboard sounds throughout — and uploaded the same untreated MP3 to five different tools. Whisper came out ahead on raw accuracy, but the results table tells a more complicated story than a simple ranking.

The setup: what we tested and how

Our test file was a real podcast episode recorded remotely over a standard consumer condenser microphone setup. No noise gate, no noise reduction, no post-processing before upload. Speaker A was in a home office with intermittent traffic noise; Speaker B had a low-frequency hum from a window AC unit and typed throughout the conversation.

We exported the final stereo mix as a 320 kbps MP3 and uploaded the same file to each tool. To measure accuracy, we compared each tool’s output against a human-corrected reference transcript across three 2-minute segments: one from the noisiest section of Speaker A’s audio, one from a quieter stretch, and one containing three overlapping-speech instances where both speakers talked at once. Word error rate (WER), the share of reference words a tool substituted, dropped, or inserted, was calculated manually against the reference. We also tracked speaker attribution accuracy and processing time.
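For readers who want to reproduce the measurement, here is a minimal sketch of a WER calculation. The normalization (lowercasing, splitting on whitespace) and the sample sentences are assumptions for illustration. The article does not describe the exact normalization used in the manual count.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences: one of six reference words is missing.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.1%}")
```

Punctuation is left attached to words here, so a stricter or looser normalization would shift the numbers slightly.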

Tools tested: Otter.ai (Pro, $16.99/month), Fireflies.ai (Pro, $18/month), Descript (Creator, $24/month), Riverside (Standard, $15/month), and Whisper (large-v3, via OpenAI API at approximately $0.006 per minute).

Does noise actually break AI transcription accuracy?

Yes — but not equally across tools. In our testing, word error rate on the noisiest segment ranged from 9.4% (Whisper) to 22.4% (Fireflies), a 13-point spread on audio that many podcast editors would consider good enough to ship. Run the same tools on the quiet segment, and that gap compresses to 4.1 percentage points. Clean audio nearly erases the differences between them. Noisy audio exposes them.

Tool	WER (noisy)	WER (quiet)	WER (crosstalk)	Processing time	Price
Whisper large-v3	9.4%	3.1%	17.2%	4 min 10 sec	~$0.28 for 47 min
Descript	13.7%	4.8%	21.4%	6 min 32 sec	$24/mo
Riverside	16.2%	5.3%	24.8%	3 min 48 sec	$15/mo
Otter.ai	19.8%	6.1%	28.3%	2 min 55 sec	$16.99/mo
Fireflies.ai	22.4%	7.2%	31.6%	2 min 14 sec	$18/mo
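The spreads quoted in the text can be recomputed directly from the table. The sketch below copies the WER figures into a dictionary; the helper names `spread` and `gap_to_best` are illustrative, not part of any tool.

```python
# WER percentages copied from the results table.
wer = {
    "Whisper large-v3": {"noisy": 9.4, "quiet": 3.1, "crosstalk": 17.2},
    "Descript": {"noisy": 13.7, "quiet": 4.8, "crosstalk": 21.4},
    "Riverside": {"noisy": 16.2, "quiet": 5.3, "crosstalk": 24.8},
    "Otter.ai": {"noisy": 19.8, "quiet": 6.1, "crosstalk": 28.3},
    "Fireflies.ai": {"noisy": 22.4, "quiet": 7.2, "crosstalk": 31.6},
}


def spread(condition: str) -> float:
    """Gap in percentage points between the worst and best tool."""
    values = [row[condition] for row in wer.values()]
    return round(max(values) - min(values), 1)


def gap_to_best(tool: str, condition: str) -> float:
    """How many percentage points a tool trails the best result."""
    best = min(row[condition] for row in wer.values())
    return round(wer[tool][condition] - best, 1)


for condition in ("noisy", "quiet", "crosstalk"):
    print(condition, spread(condition))
```

This gives a 13.0-point spread on the noisy segment, 4.1 on the quiet one, and 14.4 on crosstalk. `gap_to_best("Descript", "noisy")` returns 4.3, the closest any tool came to Whisper on noisy audio.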

What that gap looks like in practice: the reference transcript read, “The problem isn’t the algorithm, it’s the training data they used — which is mostly English.”

Whisper output: “The problem isn’t the algorithm, it’s the training data they used — which is mostly English.”
Fireflies output: “The problem isn’t the algorithm, the training data they used, which is mostly in English.”

The Fireflies version isn’t unreadable. But errors like this compound across 47 minutes. By the end of the file, the transcript needs enough editing that the time savings from automated transcription start disappearing.
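To see exactly which words differ in the example, a word-level diff is one option. This sketch uses Python's standard `difflib`. The two strings are the article's sentences normalized by hand (lowercased, punctuation removed), which is an assumption made for this illustration. `difflib` produces a readable alignment, not a guaranteed minimal edit count.

```python
import difflib

# The article's example sentences, normalized by hand for this sketch:
# lowercased, punctuation removed, straight apostrophes.
reference = "the problem isn't the algorithm it's the training data they used which is mostly english"
fireflies = "the problem isn't the algorithm the training data they used which is mostly in english"


def word_diff(ref: str, hyp: str) -> list:
    """Return (operation, reference_words, hypothesis_words) for each mismatch."""
    ref_words, hyp_words = ref.split(), hyp.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], hyp_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for operation, ref_part, hyp_part in word_diff(reference, fireflies):
    print(operation, ref_part, hyp_part)
```

After normalization, the diff reports one dropped word ("it's") and one inserted word ("in") against a 15-word reference. The punctuation change is invisible to a word-level comparison, even though it also alters how the sentence reads.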

Which tool handled speaker separation best?

Descript, by a meaningful margin. Its approach processes audio tracks before attributing speakers, which gave it a consistent diarization advantage over tools that identify speakers from a single mixed file. In our test, Descript correctly attributed 94% of speaker turns — including two of the three crosstalk instances — even on the noisy track.

Whisper’s speaker separation told a different story. The model’s word-level accuracy was excellent, but it merged consecutive lines from Speaker A into long unbroken blocks whenever background noise spiked. It correctly attributed 79% of turns in noisy sections, compared to Descript’s 94%. For a transcript you intend to turn into published show notes or pull quotes, those attribution errors mean real cleanup work.

Otter.ai tagged Speaker A’s noisiest stretches as “UNKNOWN” three separate times and misattributed roughly 8% of that speaker’s lines to Speaker B. Anyone auto-generating episode summaries from the Otter output would need to catch these errors before publishing. Fireflies fared worst on diarization: its speaker tracking relies partly on meeting context and calendar metadata, which it doesn’t have when you upload a raw podcast file. Unlike Descript or Riverside, Fireflies offers no manual track separation before transcription.
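A turn-level attribution score like the percentages above can be computed as the share of speaker turns whose label matches the reference. The sketch below assumes the tool's turns have already been aligned one-to-one with the reference turns, which the article does not describe. The labels are hypothetical sample data, not results from the test.

```python
def attribution_accuracy(reference_speakers: list, predicted_speakers: list) -> float:
    """Fraction of turns whose predicted speaker matches the reference."""
    if len(reference_speakers) != len(predicted_speakers):
        raise ValueError("turn lists must be aligned and equal in length")
    if not reference_speakers:
        raise ValueError("no turns to score")
    correct = sum(
        ref == pred for ref, pred in zip(reference_speakers, predicted_speakers)
    )
    return correct / len(reference_speakers)


# Hypothetical labels: one turn given to the wrong speaker, one left unlabeled.
reference_turns = ["A", "B", "A", "B", "A", "A", "B", "A"]
predicted_turns = ["A", "B", "A", "A", "A", "UNKNOWN", "B", "A"]

print(f"{attribution_accuracy(reference_turns, predicted_turns):.0%}")
```

An "UNKNOWN" label counts as an error here, the same as a wrong speaker. A scoring scheme that treated the two differently would produce different percentages.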

What surprised us

Riverside’s value-to-accuracy ratio was better than we expected. At $15/month, it returned 16.2% WER on noisy audio and processed the file in under four minutes. More practically, the transcript syncs to the waveform editor — clicking any word jumps to that point in the audio. That integration doesn’t show up in WER tables, but it eliminates the back-and-forth between a text file and a timeline that slows down most editing sessions.

Whisper’s setup cost is real and shouldn’t be minimized. Running it via the OpenAI API requires, at minimum, some comfort with API keys and either command-line tools or a wrapper application. Our 47-minute file cost $0.28 to process, essentially nothing, but there’s no built-in UI, no speaker labels by default, and no integrated editor. Developers will find this trivial. Podcast editors who don’t write code will not.
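The per-file cost follows from the per-minute price. As a rough sketch, the snippet below also estimates how many minutes of audio a monthly subscription fee would buy at that rate. It uses the approximate $0.006 per minute quoted earlier and compares price only, ignoring editors, speaker labels, and other features.

```python
PRICE_PER_MINUTE = 0.006  # approximate API price quoted in the article, in USD


def api_cost(minutes: float) -> float:
    """Estimated API cost for a recording of the given length."""
    return minutes * PRICE_PER_MINUTE


def break_even_minutes(monthly_subscription: float) -> float:
    """Minutes of audio per month at which API spend equals a subscription fee."""
    return monthly_subscription / PRICE_PER_MINUTE


print(f"${api_cost(47):.2f}")
print(round(break_even_minutes(15)))
```

At this rate, a $15/month plan costs the same as roughly 2,500 minutes of API transcription, so the subscription tools are competing on workflow, not on price per minute.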

Pre-processing with noise cancellation before upload moved the needle significantly. We ran Speaker A’s noisiest 2-minute segment through Krisp and re-uploaded the cleaned audio to Fireflies. WER dropped from 22.4% to 11.3% — pushing Fireflies into the same accuracy tier as Descript. If you’re committed to a specific tool for workflow reasons but recording conditions are rough, noise preprocessing is the most practical lever available.
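The size of that improvement can be expressed in absolute and relative terms. This is plain arithmetic on the two figures above, and it covers only the single 2-minute segment that was re-tested.

```python
def relative_reduction(before: float, after: float) -> float:
    """Share of the original error rate that was removed."""
    return (before - after) / before


wer_before, wer_after = 22.4, 11.3  # Fireflies, noisy segment, without and with Krisp

absolute_drop = wer_before - wer_after
print(f"{absolute_drop:.1f} points absolute, "
      f"{relative_reduction(wer_before, wer_after):.1%} relative")
```

That is an 11.1-point drop, or about half of the original errors. The other four tools were not re-run on the cleaned audio, so this does not show how they would respond to the same preprocessing.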

One more thing we didn’t anticipate: Otter.ai’s real-time transcription and meeting integrations are genuinely good. The weaknesses we found are specific to post-hoc podcast file uploads. In a quiet conference room, over a clean Zoom call, Otter performs much closer to its marketing. The tool isn’t bad — it’s mismatched to this use case.

The raw verdict

For accuracy on difficult recordings: Whisper large-v3 is the clear winner. Nothing we tested came within 4 percentage points on noisy audio, and on crosstalk the gap holds against Descript (4.2 points) and widens against every other tool. If you’re comfortable with an API setup or a third-party interface that runs Whisper under the hood, this is where to start.

For podcast post-production as a complete workflow: Descript is the realistic best choice. Its 13.7% WER on noisy audio is second only to Whisper, and its 94% speaker attribution accuracy leads the field. The $24/month Creator plan is the most expensive subscription we tested, but the transcript-synced editor meaningfully cuts editing time. Unlike Whisper, it requires no setup.

For meeting transcription rather than podcasting: Otter.ai and Fireflies are better fits for their actual designed use case. Calendar integrations, real-time transcription, and Zoom and Meet compatibility are genuine strengths. The noise-handling weaknesses we measured matter less when you’re in a quiet office on a video call than when you’re editing a podcast recorded in imperfect conditions.

For podcasters watching the budget: Riverside at $15/month gives you the best combination of accuracy and integrated workflow without the Descript price. The WER gap versus Descript (16.2% vs 13.7% on noisy audio) is real but manageable, and the editor workflow integration partially compensates.

Frequently asked questions

Is word error rate the right way to measure transcription quality?

It’s the most objective accuracy measure available, but it doesn’t capture speaker attribution errors, punctuation quality, or handling of proper nouns — all of which matter in real workflows. We tracked speaker diarization separately for this reason, and it shifted the rankings on that dimension.

Will these results hold for non-English podcasts?

Possibly not. Whisper’s multilingual accuracy is generally strong across its 99 supported languages, but Otter.ai and Fireflies are primarily trained and optimized for English. For non-English podcast content, the accuracy rankings here could shift significantly.

Does pre-processing audio with noise cancellation actually help?

Yes, meaningfully. Running the noisiest segment through Krisp before uploading to Fireflies dropped WER from 22.4% to 11.3% in our re-test. That still trails Whisper’s 9.4% on the untreated file, but it’s the most practical fix if you’re using a mid-tier tool for workflow reasons.

For podcast audio that’s genuinely noisy, tool choice matters more than any review of clean-audio demos would suggest — Whisper if you need maximum accuracy and can handle a basic API setup, Descript if you want accuracy and workflow in one place, and Krisp first if you’re using anything else.
