Agentic AI
Jun 10, 2025

Deepak Singla
In this article
Most AI agents today fail to deliver on the promise of customer support automation, with Salesforce’s own benchmark showing only 35% success in multi-turn conversations. This blog breaks down the critical failure modes of generic GenAI agents, including context loss, hallucinated actions, and lack of auditability. It then explains how Fini, a structured and secure AI support agent, overcomes these issues to achieve over 80% resolution accuracy at one-tenth the cost. For B2C CX leaders handling high ticket volumes and sensitive workflows, this is a blueprint for moving beyond chatbot hype into reliable, scalable AI operations.
In May 2025, Salesforce published a benchmark study showing that most AI support agents fall short in real-world customer service tasks. While 58% of single-turn questions were answered successfully, only 35% of multi-turn flows were resolved end-to-end.
These failures weren’t just anecdotal; they were systematic: context loss, slow responses, hallucinated actions, and no audit or confidentiality safeguards.
The implications are serious for consumer-facing brands. If your AI agent can’t resolve a refund request, can’t track an order, or can’t escalate correctly, you’re not just creating a bad support experience. You’re eroding trust, damaging LTV, and creating downstream operational messes. For B2C CX leaders in sectors like e-commerce, fintech, and digital subscriptions, the data offers a wake-up call: generic AI support agents are not yet enterprise-ready.
In this post, we unpack what the Salesforce benchmark really says, trace where AI agents break, and offer a framework to assess whether your current AI support strategy is truly ready for scale.
What Salesforce’s AI Study Reveals About Agent Accuracy
The Salesforce study evaluated large language model (LLM)-based agents across single-turn and multi-turn customer support scenarios. These included:
Refund and cancellation requests
Account updates
Order tracking
Payment failures
General product inquiries
The findings raised concerns around both technical capability and policy risk. Here’s what the Salesforce benchmark uncovered:
| Metric | Salesforce AI Agents | Source |
|---|---|---|
| Single-turn success rate | 58% | Table 2 |
| Multi-turn resolution rate | 35% | Table 3 |
| Confidentiality awareness | None | Section 4.3 |
| Cost per resolved ticket | $3–$4 USD | |
Even in controlled environments with well-structured intents, the agents failed most multi-turn tasks, especially when:
Information from earlier messages had to be remembered
Secure actions had to be taken (e.g. refunds, escalations)
Policies constrained what could legally or contractually be said
💡 Want this broken down for your team? Book a live walkthrough and we’ll give you a free analysis of what AI support could look like on your data.

Why This Should Concern CX Leaders at B2C Brands
Customer experience leaders at B2C companies face a unique mix of requirements:
Scale: Tens or hundreds of thousands of tickets per month
Speed: Response latency needs to stay under 2–3 seconds
Security: Sensitive user data (PII, payments) is regularly involved
Accuracy: Regulatory environments (e.g. GDPR, PCI, FinCEN) require traceability and compliance
In these settings, AI agents are not just “fancy chatbots.” They are replacing core operational workflows. If they drop context, respond incorrectly, or fabricate an answer, the cost isn’t just poor CSAT. It’s real customer churn, regulatory exposure, and brand damage. Yet most agents today rely on prompt-only architectures with no true memory, control layer, or audit trail.
This is especially true in use cases like:
AI refund automation: A refund issued incorrectly can trigger a chargeback or reputational loss
Subscription cancellation: Mishandled cancellations break compliance with regional consumer rights laws
Order-related escalations: Delayed or fabricated responses during high-stakes moments (e.g. delayed shipments) hurt brand loyalty more than no response at all
The Most Common Failure Modes in Generic AI Support Agents
3.1 Context Loss After 3–4 Turns
In the Salesforce benchmark, multi-turn accuracy dropped sharply because agents lacked persistent memory. For instance:
Customer: I’m checking on an order I placed last week.
Agent: Sure, can I have your order ID?
Customer: It’s 3489-JB21.
Agent (2 messages later): What’s your order ID?
Without persistent state, even well-trained LLMs revert to single-turn response logic, derailing flows.
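What separates a stateful agent from this failure mode is that facts extracted in earlier turns are persisted and reused. As a minimal sketch (the schema and field names below are hypothetical, not Fini’s actual format), the session state after the customer’s third message might look like:

```yaml
# Hypothetical session-state snapshot: illustrates slot memory between turns.
session:
  id: sess_48f2
  channel: chat
  slots:                     # facts captured from earlier turns
    intent: order_status
    order_id: "3489-JB21"    # provided in turn 3, reused in later turns
  history_summary: >
    Customer asked about an order placed last week and supplied
    the order ID; awaiting shipment lookup.
```

With the order ID pinned in a slot, a later turn can answer the status question directly instead of asking for the ID again.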
3.2 Delays and Latency
The study shows that average latency exceeded 15 seconds on several tasks. This was due to:
Sequential model calls without prefetching
Uncached vector store lookups
API retries due to malformed agent logic
For support teams aiming to match or exceed live agent SLAs, this is unacceptable.
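Hitting a 2–3 second SLA usually comes down to caching retrievals, prefetching context, and bounding retries. The configuration sketch below is purely illustrative (none of these keys belong to a real product); it only shows where those levers would live:

```yaml
# Hypothetical latency controls: keys invented for illustration.
latency:
  sla_ms: 2500               # target end-to-end response time
retrieval:
  vector_store:
    cache: enabled           # avoid repeated uncached lookups
    cache_ttl_seconds: 300
  prefetch: true             # fetch context in parallel with model calls
llm:
  timeout_ms: 2000
  max_retries: 1             # bound retries caused by malformed agent logic
```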
3.3 Hallucinated Actions
In action-oriented flows (e.g. refunds, plan upgrades), agents would fabricate claims of completed actions, like:
"I’ve just processed your refund to your Visa ending in 1220."
...despite having no programmatic connection to the payment system. In regulated markets (e.g. financial services), this is not just incorrect; it’s dangerous.
3.4 Confidentiality Blind Spots
One of the most alarming findings: agents lacked awareness of confidential or masked data. Some exposed:
Masked IBANs and emails
Internal database record IDs
Escalation notes meant for agents
This creates GDPR and SOC 2 violations in real-world production environments.
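Leaks like these are better treated as a policy-enforcement problem than a prompting problem: fields are masked or blocked before they ever reach the model or the customer. A hypothetical masking policy (illustrative schema, not a real product config) might look like:

```yaml
# Hypothetical PII masking policy: field names and actions are illustrative.
masking_policy:
  never_expose:
    - field: iban
      action: redact            # never shown, even in masked form
    - field: email
      action: partial_mask      # e.g. j***@example.com
    - field: internal_record_id
      action: redact
    - field: agent_notes
      action: block_from_model  # never enters the prompt at all
  audit_on_violation: true
```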
3.5 No Audit Trail
Salesforce’s study notes that agents offered no traceable justification or structured logs for responses. This makes quality assurance, incident review, and compliance audits nearly impossible.
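Hash-chained logging, which Fini’s architecture (see the comparison table below) uses for exactly this reason, fixes that: each entry commits to the hash of the previous one, so editing or deleting any record invalidates every hash after it. A sketch of what one entry could contain (illustrative fields, not Fini’s actual log format):

```yaml
# Illustrative hash-chained audit entry.
- id: evt_000142
  timestamp: "2025-06-10T14:02:31Z"
  actor: agent:fini
  action: refund.initiate
  api_call: POST /v1/refunds
  user: cust_88xq
  prev_hash: "9f2c...a71b"    # SHA-256 of the previous entry (truncated)
  hash: "4be0...d930"         # SHA-256 over this entry plus prev_hash
```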
What Should B2C CX Leaders Demand from AI Agents?
If you're leading CX at a B2C company, the bar is much higher than just “answering questions.” Especially if your company handles 50,000+ tickets per year, you should be asking:
Does the agent retain context across a full, multi-turn resolution lifecycle?
Can it execute secure, policy-aware actions via APIs (not just talk)?
Is all activity auditable, down to the API call, user, and timestamp?
Can you cap LLM inference costs and override fallback behaviors?
Can the system be deployed within your cloud/VPC for control?
The baseline isn’t chat quality anymore. It’s deterministic resolution, security, and governance.
And above all, it needs to do this at scale and for under $1 per resolution, not the $3–$4 averages seen in current benchmarks.
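Concretely, those demands translate into deployment-time controls rather than prompt wording. A hypothetical configuration sketch (every key below is invented for illustration) of what cost caps, fallback overrides, and VPC deployment could look like:

```yaml
# Hypothetical governance config: keys are illustrative, not a vendor schema.
deployment:
  mode: vpc                      # runs inside your cloud boundary
  region: us-east-1
cost_controls:
  max_usd_per_resolution: 1.00   # hard cap on LLM inference spend
  fallback_model: open-weight    # route to a cheaper model when capped
actions:
  require_api_execution: true    # no claimed refunds without a real API call
  escalate_on_low_confidence: true
audit:
  granularity: per_api_call      # traceable down to API, user, and time
```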
One Alternative: How Fini Addresses These Gaps
Over the last 12 months, Fini has quietly become the AI agent of choice for enterprise-grade B2C support, particularly in Salesforce and Zendesk environments.
Unlike traditional LLM bots, Fini was designed around secure execution, auditability, and memory.
| Failure Mode | Generic AI Agents | Fini’s Architecture |
|---|---|---|
| Context loss | Stateless | Memory across multi-turn conversations |
| Action execution | Prompt-based | API-typed Action Graph with deterministic paths |
| Model cost | Fixed premium LLMs | Dynamic routing with open-weight fallback |
| PII safety | Unchecked output | PCI-compliant tokenization (Stripe, VGS) |
| Auditing | Flat logs or none | SHA-256 hash-chained logs |
Fini doesn’t replace your agents. It integrates into your existing Salesforce or Zendesk workflows, running securely with scoped permissions, cost controls, and compliance settings.
A Real-World Case: 600,000+ Tickets Resolved in Fintech
A leading US fintech brand used Fini to automate refund, plan change, and KYC flows.
| Metric | Before Fini | After Fini |
|---|---|---|
| Multi-turn resolution rate | 34.8% | 91.2% |
| Cost per resolved ticket | $2.35 | $0.70 |
| CSAT (1–5 scale) | 3.1 | 4.4 |
| Payback period | — | <90 days |
| Agent headcount | — | Unchanged (50% cost savings) |
Because Fini works with structured logic and secure APIs, escalations were reduced, resolution time dropped, and agent load decreased.
What Fini’s Execution Flow Looks Like
To illustrate the difference, here’s what a secure refund looks like with Fini’s Action Graph:
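The sketch below is illustrative only; the node names and schema are hypothetical, but they capture the shape of a typed, deterministic graph in which only designated nodes can touch external systems:

```yaml
# Illustrative Action Graph for a refund: typed nodes, deterministic routing.
graph: refund_flow
nodes:
  - id: verify_identity
    type: api                    # calls an identity endpoint, never free text
    on_success: check_refund_policy
    on_fail: escalate_to_human
  - id: check_refund_policy
    type: policy                 # evaluates eligibility against written rules
    on_success: execute_refund
    on_fail: explain_denial
  - id: execute_refund
    type: api                    # the only node allowed to touch payments
    on_success: confirm_to_customer
```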
No generative guesswork. No prompts triggering random flows. Just actions you can control, log, and review. This avoids hallucinations, ensures policy conformance, and enables full auditability.
A Copyable YAML Flow: Cancel + Refund Subscription
Fini provides ready-to-deploy YAML templates like this one:
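Fini’s actual templates aren’t reproduced here; the version below is a hypothetical approximation (step names, connector IDs, and the templating syntax are all invented) to show the shape of such a flow:

```yaml
# Hypothetical cancel + refund flow: schema invented for illustration.
flow: cancel_and_refund_subscription
trigger:
  intent: cancel_subscription
steps:
  - id: verify_customer
    connector: salesforce
    action: contacts.lookup
    on_fail: escalate
  - id: tokenize_payment_reference
    connector: vgs
    action: tokens.resolve        # card data stays tokenized end to end
  - id: cancel_plan
    connector: internal_api
    action: subscriptions.cancel
    inputs: { subscription_id: "{{ verify_customer.subscription_id }}" }
  - id: issue_refund
    connector: stripe
    action: refunds.create
    guard: policy.refund_eligible # policy check must pass before execution
  - id: confirm
    connector: zendesk
    action: tickets.reply
    template: refund_confirmation
audit: hash_chain                 # every step logged per the audit model above
```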
This YAML executes deterministically across systems like Salesforce, Stripe, VGS, and internal APIs.
A 5-second refund turnaround is more than efficient; it's a signal that you care. Post-Fini sentiment data showed a 2× increase in CSAT phrases like “felt cared for” and “genuinely impressed.” In A/B retargeting emails, users who interacted with Fini agents clicked through 17% more often on their next offer.
Want to review this on your brand? Request a data tour.
Ready to See What Reliable AI Support Looks Like?
If you're evaluating support automation in 2025, it's no longer just about chat.
The Salesforce paper confirmed what many already suspected: generic AI agents can’t handle real B2C workflows. But structured, audit-compliant AI agents can, and they are already delivering results.
Fini runs live inside your Salesforce, Zendesk, or Intercom environment in under 10 days. No re-platforming. No hallucinations. Just actions done right.
If you’d like to see what this looks like in practice, Fini offers a live demo across:
Salesforce + Stripe workflows
Zendesk refund + escalation paths
GDPR/PII masking enforcement
API-based cancellation flows
Post-deployment audit logging
Book a 15-minute walkthrough - real flows, real data, and real resolutions.
Co-founder
