Agentic AI
Jun 10, 2025

Deepak Singla
In this article
Most AI agents today fail to deliver on the promise of customer support automation, with Salesforce’s own benchmark showing only 35% success in multi-turn conversations. This blog breaks down the critical failure modes of generic GenAI agents, including context loss, hallucinated actions, and lack of auditability. It then explains how Fini, a structured and secure AI support agent, overcomes these issues to achieve over 80% resolution accuracy at one-tenth the cost. For B2C CX leaders handling high ticket volumes and sensitive workflows, this is a blueprint for moving beyond chatbot hype into reliable, scalable AI operations.
In May 2025, Salesforce published a benchmark study showing that most AI support agents fall short in real-world customer service tasks. While 58% of single-turn questions were answered successfully, only 35% of multi-turn flows were resolved end-to-end.
These failures weren’t just anecdotal; they were systematic: context loss, slow responses, hallucinated actions, and no audit or confidentiality safeguards.
The implications are serious for consumer-facing brands. If your AI agent can’t resolve a refund request, can’t track an order, or can’t escalate correctly, you’re not just creating a bad support experience. You’re eroding trust, damaging LTV, and creating downstream operational messes. For B2C CX leaders in sectors like e-commerce, fintech, and digital subscriptions, the data offers a wake-up call: generic AI support agents are not yet enterprise-ready.
In this post, we unpack what the Salesforce benchmark really says, trace where AI agents break, and offer a framework to assess whether your current AI support strategy is truly ready for scale.
What Salesforce’s AI Study Reveals About Agent Accuracy
The Salesforce study evaluated large language model (LLM)-based agents across single-turn and multi-turn customer support scenarios. These included:
Refund and cancellation requests
Account updates
Order tracking
Payment failures
General product inquiries
The findings raised concerns around both technical capability and policy risk. Here’s what the Salesforce benchmark uncovered:
| Metric | Salesforce AI Agents | Source |
|---|---|---|
| Single-turn success rate | 58% | Table 2 |
| Multi-turn resolution rate | 35% | Table 3 |
| Confidentiality awareness | None | Section 4.3 |
| Cost per resolved ticket | $3–$4 USD | |
Even in controlled environments with well-structured intents, the agents failed most multi-turn tasks, especially when:
Information from earlier messages had to be remembered
Secure actions had to be taken (e.g. refunds, escalations)
Policies constrained what could legally or contractually be said
💡 Want this broken down for your team? Book a live walkthrough and we’ll give you a free analysis of what AI support could look like on your data.

Why This Should Concern CX Leaders at B2C Brands
Customer experience leaders at B2C companies face a unique mix of requirements:
Scale: Tens or hundreds of thousands of tickets per month
Speed: Response latency needs to stay under 2–3 seconds
Security: Sensitive user data (PII, payments) is regularly involved
Accuracy: Regulatory environments (e.g. GDPR, PCI, FinCEN) require traceability and compliance
In these settings, AI agents are not just “fancy chatbots.” They are replacing core operational workflows. If they drop context, respond incorrectly, or fabricate an answer, the cost isn’t just poor CSAT. It’s real customer churn, regulatory exposure, and brand damage. Yet most agents today rely on prompt-only architectures with no true memory, control layer, or audit trail.
This is especially true in use cases like:
AI refund automation: A refund issued incorrectly can trigger a chargeback or reputational loss
Subscription cancellation: Mishandled cancellations break compliance with regional consumer rights laws
Order-related escalations: Delayed or fabricated responses during high-stakes moments (e.g. delayed shipments) hurt brand loyalty more than no response at all
The Most Common Failure Modes in Generic AI Support Agents
3.1 Context Loss After 3–4 Turns
In the Salesforce benchmark, multi-turn accuracy dropped sharply because agents lacked persistent memory. For instance:
Customer: I’m checking on an order I placed last week.
Agent: Sure, can I have your order ID?
Customer: It’s 3489-JB21.
Agent (2 messages later): What’s your order ID?
Without persistent state, even well-trained LLMs revert to single-turn response logic, derailing flows.
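What separates a stateful agent from this failure mode is that facts extracted in earlier turns are persisted and reused. As a minimal sketch (the schema and field names below are hypothetical, not Fini’s actual format), the session state after the customer’s third message might look like:

```yaml
# Hypothetical session-state snapshot: illustrates slot memory between turns.
session:
  id: sess_48f2
  channel: chat
  slots:                     # facts captured from earlier turns
    intent: order_status
    order_id: "3489-JB21"    # provided in turn 3, reused in later turns
  history_summary: >
    Customer asked about an order placed last week and supplied
    the order ID; awaiting shipment lookup.
```

With the order ID pinned in a slot, a later turn can answer the status question directly instead of asking for the ID again.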
3.2 Delays and Latency
The study shows that average latency exceeded 15 seconds on several tasks. This was due to:
Sequential model calls without prefetching
Uncached vector store lookups
API retries due to malformed agent logic
For support teams aiming to match or exceed live agent SLAs, this is unacceptable.
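Hitting a 2–3 second SLA usually comes down to caching retrievals, prefetching context, and bounding retries. The configuration sketch below is purely illustrative (none of these keys belong to a real product); it only shows where those levers would live:

```yaml
# Hypothetical latency controls: keys invented for illustration.
latency:
  sla_ms: 2500               # target end-to-end response time
retrieval:
  vector_store:
    cache: enabled           # avoid repeated uncached lookups
    cache_ttl_seconds: 300
  prefetch: true             # fetch context in parallel with model calls
llm:
  timeout_ms: 2000
  max_retries: 1             # bound retries caused by malformed agent logic
```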
3.3 Hallucinated Actions
In action-oriented flows (e.g. refunds, plan upgrades), agents would fabricate claims of completed actions, like:
"I’ve just processed your refund to your Visa ending in 1220."
...despite having no programmatic connection to the payment system. In regulated markets (e.g. financial services), this is not just incorrect; it’s dangerous.
3.4 Confidentiality Blind Spots
One of the most alarming findings: agents lacked awareness of confidential or masked data. Some exposed:
Masked IBANs and emails
Internal database record IDs
Escalation notes meant for agents
This creates GDPR and SOC 2 violations in real-world production environments.
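Leaks like these are better treated as a policy-enforcement problem than a prompting problem: fields are masked or blocked before they ever reach the model or the customer. A hypothetical masking policy (illustrative schema, not a real product config) might look like:

```yaml
# Hypothetical PII masking policy: field names and actions are illustrative.
masking_policy:
  never_expose:
    - field: iban
      action: redact            # never shown, even in masked form
    - field: email
      action: partial_mask      # e.g. j***@example.com
    - field: internal_record_id
      action: redact
    - field: agent_notes
      action: block_from_model  # never enters the prompt at all
  audit_on_violation: true
```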
3.5 No Audit Trail
Salesforce’s study notes that agents offered no traceable justification or structured logs for responses. This makes quality assurance, incident review, and compliance audits nearly impossible.
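Hash-chained logging, which Fini’s architecture (see the comparison table below) uses for exactly this reason, fixes that: each entry commits to the hash of the previous one, so editing or deleting any record invalidates every hash after it. A sketch of what one entry could contain (illustrative fields, not Fini’s actual log format):

```yaml
# Illustrative hash-chained audit entry.
- id: evt_000142
  timestamp: "2025-06-10T14:02:31Z"
  actor: agent:fini
  action: refund.initiate
  api_call: POST /v1/refunds
  user: cust_88xq
  prev_hash: "9f2c...a71b"    # SHA-256 of the previous entry (truncated)
  hash: "4be0...d930"         # SHA-256 over this entry plus prev_hash
```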
What Should B2C CX Leaders Demand from AI Agents?
If you're leading CX at a B2C company, the bar is much higher than just “answering questions.” Especially if your company handles 50,000+ tickets per year, you should be asking:
Does the agent retain context across a full, multi-turn resolution lifecycle?
Can it execute secure, policy-aware actions via APIs (not just talk)?
Is all activity auditable, down to the API call, user, and timestamp?
Can you cap LLM inference costs and override fallback behaviors?
Can the system be deployed within your cloud/VPC for control?
The baseline isn’t chat quality anymore. It’s deterministic resolution, security, and governance.
And above all, it needs to do this at scale and for under $1 per resolution, not the $3–$4 averages seen in current benchmarks.
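Concretely, those demands translate into deployment-time controls rather than prompt wording. A hypothetical configuration sketch (every key below is invented for illustration) of what cost caps, fallback overrides, and VPC deployment could look like:

```yaml
# Hypothetical governance config: keys are illustrative, not a vendor schema.
deployment:
  mode: vpc                      # runs inside your cloud boundary
  region: us-east-1
cost_controls:
  max_usd_per_resolution: 1.00   # hard cap on LLM inference spend
  fallback_model: open-weight    # route to a cheaper model when capped
actions:
  require_api_execution: true    # no claimed refunds without a real API call
  escalate_on_low_confidence: true
audit:
  granularity: per_api_call      # traceable down to API, user, and time
```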
One Alternative: How Fini Addresses These Gaps
Over the last 12 months, Fini has quietly become the AI agent of choice for enterprise-grade B2C support, particularly in Salesforce and Zendesk environments.
Unlike traditional LLM bots, Fini was designed around secure execution, auditability, and memory.
| Failure Mode | Generic AI Agents | Fini’s Architecture |
|---|---|---|
| Context loss | Stateless | Memory across multi-turn conversations |
| Action execution | Prompt-based | API-typed Action Graph with deterministic paths |
| Model cost | Fixed premium LLMs | Dynamic routing with open-weight fallback |
| PII safety | Unchecked output | PCI-compliant tokenization (Stripe, VGS) |
| Auditing | Flat logs or none | SHA-256 hash-chained logs |
Fini doesn’t replace your agents. It integrates into your existing Salesforce or Zendesk workflows, running securely with scoped permissions, cost controls, and compliance settings.
A Real-World Case: 600,000+ Tickets Resolved in Fintech
A leading US fintech brand used Fini to automate refund, plan change, and KYC flows.
| Metric | Before Fini | After Fini |
|---|---|---|
| Multi-turn resolution rate | 34.8% | 91.2% |
| Cost per resolved ticket | $2.35 | $0.70 |
| CSAT (1–5 scale) | 3.1 | 4.4 |
| Payback period | — | <90 days |
| Agent headcount | — | Unchanged (50% cost savings) |
Because Fini works with structured logic and secure APIs, escalations were reduced, resolution time dropped, and agent load decreased.
What Fini’s Execution Flow Looks Like
To illustrate the difference, here’s what a secure refund looks like with Fini’s Action Graph:
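The sketch below is illustrative only; the node names and schema are hypothetical, but they capture the shape of a typed, deterministic graph in which only designated nodes can touch external systems:

```yaml
# Illustrative Action Graph for a refund: typed nodes, deterministic routing.
graph: refund_flow
nodes:
  - id: verify_identity
    type: api                    # calls an identity endpoint, never free text
    on_success: check_refund_policy
    on_fail: escalate_to_human
  - id: check_refund_policy
    type: policy                 # evaluates eligibility against written rules
    on_success: execute_refund
    on_fail: explain_denial
  - id: execute_refund
    type: api                    # the only node allowed to touch payments
    on_success: confirm_to_customer
```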
No generative guesswork. No prompts triggering random flows. Just actions you can control, log, and review. This avoids hallucinations, ensures policy conformance, and enables full auditability.
A Copyable YAML Flow: Cancel + Refund Subscription
Fini provides ready-to-deploy YAML templates like this one:
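Fini’s actual templates aren’t reproduced here; the version below is a hypothetical approximation (step names, connector IDs, and the templating syntax are all invented) to show the shape of such a flow:

```yaml
# Hypothetical cancel + refund flow: schema invented for illustration.
flow: cancel_and_refund_subscription
trigger:
  intent: cancel_subscription
steps:
  - id: verify_customer
    connector: salesforce
    action: contacts.lookup
    on_fail: escalate
  - id: tokenize_payment_reference
    connector: vgs
    action: tokens.resolve        # card data stays tokenized end to end
  - id: cancel_plan
    connector: internal_api
    action: subscriptions.cancel
    inputs: { subscription_id: "{{ verify_customer.subscription_id }}" }
  - id: issue_refund
    connector: stripe
    action: refunds.create
    guard: policy.refund_eligible # policy check must pass before execution
  - id: confirm
    connector: zendesk
    action: tickets.reply
    template: refund_confirmation
audit: hash_chain                 # every step logged per the audit model above
```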
This YAML executes deterministically across systems like Salesforce, Stripe, VGS, and internal APIs.
A 5-second refund turnaround is more than efficient; it's a signal that you care. Post-Fini sentiment data showed a 2× increase in CSAT phrases like “felt cared for” and “genuinely impressed.” In A/B retargeting emails, users who interacted with Fini agents clicked through 17% more often on their next offer.
Want to review this on your brand? Request a data tour.
Ready to See What Reliable AI Support Looks Like?
If you're evaluating support automation in 2025, it's no longer just about chat.
The Salesforce paper confirmed what many already suspected: generic AI agents can’t handle real B2C workflows. But structured, audit-compliant AI agents can, and they are already delivering results.
Fini runs live inside your Salesforce, Zendesk, or Intercom environment in under 10 days. No re-platforming. No hallucinations. Just actions done right.
If you’d like to see what this looks like in practice, Fini offers a live demo across:
Salesforce + Stripe workflows
Zendesk refund + escalation paths
GDPR/PII masking enforcement
API-based cancellation flows
Post-deployment audit logging
Book a 15-minute walkthrough - real flows, real data, and real resolutions.
Co-founder
