We Ran a Blind Test on 5 AI Customer Support Bots

Updated · April 28, 2026

Every AI customer support vendor leads with deflection rates. What the dashboards don’t show is whether the deflections were correct. We set up a test e-commerce store, loaded identical knowledge-base content into five platforms, and sent 40 support tickets to each bot — same tickets, same phrasing, scored blind by two reviewers who didn’t know which tool produced which response. The results split into a clear top half and a clear bottom half faster than we expected.

How we set up the test

We created a fictional outdoor gear brand with a standardized return policy, four product lines, and a loyalty program with specific tier rules. Each bot received the same 10 help-center articles. No custom flows. No extra training beyond what each platform’s default onboarding wizard offered — because most teams don’t have a dedicated AI trainer.

The 40 tickets covered four categories: factual product questions (10 tickets), policy clarifications (10 tickets), ambiguous or multi-part requests (10 tickets), and emotionally charged complaints (10 tickets). Two reviewers scored each response on accuracy (correct / partially correct / incorrect), tone match, and whether escalation was triggered appropriately. We resolved scoring disputes by consensus.
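The double-scoring step described above can be sketched as a small script that flags the tickets the two reviewers need to discuss. This is a minimal illustration, not the tooling we used: the `Score` record, the `find_disputes` helper, and the choice to store tone match and escalation as yes/no values are all assumptions made for the example.

```python
from dataclasses import dataclass

# Accuracy labels taken from the rubric; everything else here is hypothetical.
ACCURACY_LEVELS = ("correct", "partial", "incorrect")


@dataclass(frozen=True)
class Score:
    ticket_id: int
    accuracy: str        # one of ACCURACY_LEVELS
    tone_match: bool     # assumption: tone scored as yes/no
    escalation_ok: bool  # assumption: escalation scored as yes/no

    def __post_init__(self):
        if self.accuracy not in ACCURACY_LEVELS:
            raise ValueError(f"unknown accuracy label: {self.accuracy!r}")


def find_disputes(reviewer_a, reviewer_b):
    """Return ticket ids where the two reviewers differ on any dimension."""
    b_by_ticket = {s.ticket_id: s for s in reviewer_b}
    disputes = []
    for a in reviewer_a:
        b = b_by_ticket[a.ticket_id]
        if (a.accuracy, a.tone_match, a.escalation_ok) != (
            b.accuracy, b.tone_match, b.escalation_ok
        ):
            disputes.append(a.ticket_id)
    return disputes


# Illustrative scores only, not data from the test.
reviewer_a = [Score(1, "correct", True, True), Score(2, "partial", False, True)]
reviewer_b = [Score(1, "correct", True, True), Score(2, "incorrect", False, True)]
print(find_disputes(reviewer_a, reviewer_b))
```

Only the flagged tickets go to a consensus discussion, so the reviewers can score independently first and compare afterwards.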

We tested five platforms: Intercom Fin and Zendesk AI Agent represent the enterprise tier; Tidio Lyro and Freshworks Freddy AI cover mid-market; and HubSpot Breeze was included as the free-tier baseline. All were configured on their current plans as of April 2026 — Intercom Fin at $0.99 per resolved conversation, Zendesk AI Agent Suite at around $85/agent/month, Tidio Lyro at $39/month, Freshworks Freddy AI bundled into the Growth plan at roughly $35/agent/month, and HubSpot Breeze on the free Service Hub tier.

According to Salesforce’s 2025 State of Service report, a fully automated support resolution costs under $1 versus $12–$15 for a human-handled ticket. The financial incentive to get this right is obvious. Whether any of these bots actually earn that saving under real-ticket conditions is what we set out to find.

Which AI support bot got the most answers right?

Intercom Fin scored highest, answering 18 of our 20 factual and policy questions correctly without inventing any information. Zendesk AI Agent came in second at 17 correct. HubSpot Breeze struggled most: 11 correct and three instances of hallucinated pricing and policy details, including a return window our help articles never mentioned.

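| Bot | Correct (of 20) | Avg. response time | Hallucinated info | Partial answers |
| --- | --- | --- | --- | --- |
| Intercom Fin | 18 / 20 | 2.8 s | 0 | 2 |
| Zendesk AI Agent | 17 / 20 | 3.4 s | 1 | 2 |
| Tidio Lyro | 16 / 20 | 2.1 s | 0 | 3 |
| Freshworks Freddy AI | 15 / 20 | 4.1 s | 2 | 3 |
| HubSpot Breeze | 11 / 20 | 2.0 s | 3 | 5 |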

Tidio Lyro’s zero-hallucination record stood out given its price point. Its mistakes were partial rather than fabricated — answering the core question correctly but missing a nuance, like failing to mention that the 30-day return window shrinks to 14 days on sale items. Not catastrophic, but the kind of gap that generates a follow-up ticket.

Freshworks Freddy AI’s two hallucinations followed the same pattern: when the knowledge base lacked a direct answer, Freddy extrapolated from adjacent information rather than admitting it didn’t know. Sending a customer confidently down the wrong path is worse than a straightforward “I don’t have that information.”
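For readers who want to compare the bots as rates instead of raw counts, the figures in the results table can be converted with a few lines of Python. The counts are copied from the table; the `rates` helper and the dictionary layout are just one hypothetical way to arrange them.

```python
# (correct, hallucinated, partial) out of 20 factual and policy tickets,
# copied from the results table.
RESULTS = {
    "Intercom Fin": (18, 0, 2),
    "Zendesk AI Agent": (17, 1, 2),
    "Tidio Lyro": (16, 0, 3),
    "Freshworks Freddy AI": (15, 2, 3),
    "HubSpot Breeze": (11, 3, 5),
}
TOTAL_TICKETS = 20


def rates(correct, hallucinated, partial, total=TOTAL_TICKETS):
    """Convert raw counts into fractions of the tickets scored."""
    return {
        "correct_rate": correct / total,
        "hallucination_rate": hallucinated / total,
        "partial_rate": partial / total,
    }


# Rank by correct answers, highest first.
ranked = sorted(RESULTS.items(), key=lambda item: item[1][0], reverse=True)
for bot, counts in ranked:
    r = rates(*counts)
    print(
        f"{bot:<22} correct {r['correct_rate']:.0%}  "
        f"hallucinated {r['hallucination_rate']:.0%}  "
        f"partial {r['partial_rate']:.0%}"
    )
```

With only 20 tickets per bot, each ticket moves a rate by five percentage points, so small gaps between adjacent bots should be read cautiously.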

How do these bots handle an angry or confused customer?

Intercom Fin and Zendesk AI Agent reliably detected frustration signals in emotionally charged tickets and adjusted tone before attempting resolution. Tidio Lyro underperformed its factual scores here, defaulting to canned templates that ignored obvious emotional context. Freshworks Freddy AI escalated too aggressively — routing 28% of emotional tickets to human agents without any attempt at resolution first.

The starkest contrast came from this ticket: “I’ve been waiting three weeks for my order. No one is responding to my emails. I want a refund right now.” Intercom Fin replied:

“Three weeks is too long to wait, and I understand why you’re frustrated. Let me pull up your order details right now — if you can share your order number, I’ll check the status immediately and make sure this gets resolved for you today.”

Tidio Lyro, on the same ticket, responded:

“Thank you for contacting us. Our return policy allows refunds within 30 days of purchase. Please provide your order ID to proceed.”

That second response isn’t wrong. It’s just completely disconnected from a customer who clearly stated a problem that had nothing to do with the return policy. In our scoring, that counted as a partial fail on tone — and a real customer would bounce to a competitor or immediately demand a human agent.

A 2025 Gartner CX study found that 64% of customers who receive an unhelpful or incorrect chatbot response don’t retry — they escalate or abandon. Tone failures are functionally the same as answer failures when it comes to churn.

What surprised us

We expected Intercom Fin to be expensive and accurate. That part tracked. What we didn’t expect was how badly HubSpot Breeze would perform on ambiguous tickets — not just wrong answers, but confidently wrong ones. For teams already in the HubSpot ecosystem, the free bot is tempting. Based on our testing, it’s not ready for unsupervised production use on anything beyond the simplest FAQs.

Tidio Lyro’s consistency was the other surprise. It ranked third overall, but it showed the lowest variance across ticket types — predictable performance from a $39/month flat-fee product. For a small e-commerce team that wants a reliable bot without Intercom pricing, that consistency matters. The ceiling is lower, but so is the floor.

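We also looked for a speed-accuracy tradeoff and found only a rough one. The two slowest bots, Freshworks at 4.1 seconds and Zendesk at 3.4 seconds, appeared to be doing more careful knowledge-base retrieval, but only Zendesk’s accuracy reflected it (17 of 20 correct, against 15 for Freshworks). Intercom was both reasonably fast at 2.8 seconds and the most accurate, which is harder to build than it sounds. HubSpot was the fastest at 2.0 seconds, and that speed came at the cost of precision.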

The raw verdict

For mid-market and enterprise teams where a wrong answer costs more than a resolution, Intercom Fin is the strongest performer we tested. It grounded responses tightly to the uploaded knowledge base, matched tone to emotional context, and didn’t fabricate information. The per-resolution pricing ($0.99/conversation) looks steep at first but becomes defensible once you calculate the actual cost of a human-handled ticket at scale.

Zendesk AI Agent is close behind and makes more sense for teams already running the Zendesk ecosystem who don’t want to bolt on a second vendor. Seat-based pricing is predictable and the accuracy was strong.

Tidio Lyro is the call for small teams and lean budgets. Flat monthly pricing, zero hallucinations on factual questions, and strong consistency make it the most predictable option at its tier. Just set expectations: complex or emotionally charged tickets need more careful flow design than Lyro provides out of the box.

Freshworks Freddy AI and HubSpot Breeze both require significant manual configuration and oversight before we’d recommend them for unsupervised deployment. Freddy’s tendency to extrapolate from incomplete context is a liability. Breeze’s hallucination rate on pricing questions is the kind of thing that generates refund disputes.

Frequently asked questions

Did you include ChatGPT or Claude as a baseline?

We didn’t test raw API-based builds. Those require significant engineering effort and don’t reflect what most support teams deploy. For teams considering a custom bot on Claude or GPT-4o, the accuracy ceiling is likely higher than anything we tested, but the implementation and maintenance cost is far higher than that of a turnkey platform.

How does per-resolution pricing compare to seat pricing at volume?

At 1,000 resolved conversations per month, Intercom Fin costs roughly $990. At 5,000, you’re at $4,950. Zendesk AI Agent Suite at $85/agent/month costs $850 for 10 agents regardless of conversation volume — better for high-volume teams, worse for those just starting to automate.
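The comparison above can be reproduced, and the break-even point found, with a short calculation. This is a sketch using only the list prices quoted in this article ($0.99 per resolution and roughly $85 per agent per month); the function names are ours, and it ignores anything else bundled into either plan.

```python
import math

FIN_PRICE_PER_RESOLUTION = 0.99  # USD, as quoted in this article
ZENDESK_PRICE_PER_AGENT = 85.0   # USD per agent per month, approximate


def fin_monthly_cost(resolved_conversations):
    """Per-resolution pricing: cost grows with volume."""
    return resolved_conversations * FIN_PRICE_PER_RESOLUTION


def zendesk_monthly_cost(agents):
    """Seat pricing: cost grows with team size, not volume."""
    return agents * ZENDESK_PRICE_PER_AGENT


def break_even_resolutions(agents):
    """Smallest monthly resolution count at which per-resolution
    pricing costs more than the seat-based plan."""
    seat_cost = zendesk_monthly_cost(agents)
    return math.floor(seat_cost / FIN_PRICE_PER_RESOLUTION) + 1


for volume in (1_000, 5_000):
    print(f"{volume} resolutions on Fin: ${fin_monthly_cost(volume):,.2f}")
print(f"10 Zendesk seats: ${zendesk_monthly_cost(10):,.2f}")
print(f"Break-even for 10 seats: {break_even_resolutions(10)} resolutions/month")
```

For a 10-agent team the crossover sits just under 860 resolved conversations a month. Below that, paying per resolution is cheaper; above it, the flat seat price wins on cost alone.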

Did the bots perform differently on multi-part questions?

Yes, significantly. On tickets that combined a factual question with a complaint — “Why is my order late and can I still get the free gift with purchase?” — Intercom Fin addressed both parts 9 times out of 10. Tidio Lyro answered the factual part and dropped the complaint. HubSpot Breeze addressed whichever part appeared first in the ticket and ignored the rest.

Deflection rate is a vendor metric. Correct answer rate is a customer experience metric. In our test, those two numbers diverged — and the gap explains why some support teams automate their way into more complaints, not fewer.

This article contains affiliate links. If you subscribe through one, we may earn a commission at no extra cost to you. It never changes what we recommend — we only link to tools we actually use. Full disclosure.