
When Unexpected Behavior Isn’t a Bug: Rethinking AI Performance

In traditional software, performance ≈ speed. Fix the code, add capacity, ship deterministic results. With AI, accuracy (quality) is not guaranteed — it’s a distribution. That changes how leaders must manage performance: set a quality floor, then tune latency and cost around it.
Figure: the AI Performance Trilemma. Quality, latency, and cost are competing dimensions of AI performance.

Why this matters

In early pilots, LLMs look affordable. A few cents per thousand tokens feels trivial compared to enterprise software costs.

But once real integrations begin — with larger volumes, longer outputs, and constant retries — those “cheap” tokens compound into meaningful infrastructure bills. What seemed like negligible spend becomes material, and cost optimization naturally becomes the first performance conversation.

Yet cost is only one piece of the equation:

  • Quality: outputs vary; the same input may produce different answers.
  • Latency: two parts matter — time to first token and time to complete.
  • Cost: every token (the unit AI models process) adds up — longer prompts, retries, and context windows quickly compound into real spend (see the back-of-envelope sketch below).
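
To make the cost point concrete, here is a back-of-envelope sketch. The prices, token counts, and daily volume are hypothetical placeholders, not quotes from any provider; the point is how longer context, longer outputs, and retries multiply spend.

```python
# Back-of-envelope token cost model. All numbers are hypothetical
# placeholders -- substitute your provider's real prices and your own
# measured token counts.

PRICE_PER_1K_INPUT = 0.0005   # EUR per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.0015  # EUR per 1,000 output tokens (assumed)

def cost_per_request(input_tokens: int, output_tokens: int, retries: int = 0) -> float:
    """Cost of one logical request, including failed attempts that are retried."""
    attempts = 1 + retries
    return attempts * (
        input_tokens / 1000 * PRICE_PER_1K_INPUT
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )

# A "cheap" pilot call: short prompt, short answer, no retries.
pilot = cost_per_request(input_tokens=800, output_tokens=300)

# The same task in production: long context, long output, one retry on average.
production = cost_per_request(input_tokens=6000, output_tokens=1200, retries=1)

daily_volume = 50_000  # requests per day (assumed)
print(f"Pilot:      {pilot * daily_volume:,.0f} EUR/day")
print(f"Production: {production * daily_volume:,.0f} EUR/day")
```

With these assumed numbers, the production profile lands at roughly ten times the daily spend of the pilot profile, even though the per-token prices never changed.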

Unlike traditional software, where a wrong answer is a bug to fix once, AI accuracy is probabilistic. Ignoring that reality leads to overspending for marginal gains, or worse — shipping fast, cheap answers that quietly miss the mark.

The mindset gap

Old instincts fail: AI accuracy is probabilistic, and latency is felt from the first token.

This is where the mindset gap shows. In classic systems:

  • If a feature produced the wrong output, it was a bug. Fix it once, and the problem was gone.
  • Latency was rarely defined up front. Most “non-functional requirements” I’ve seen in enterprise specs left load and response time blank. Teams shipped, tested throughput later, and optimized only when real traffic forced the issue.

With AI, those defaults don’t work:

  • Accuracy isn’t binary. The same input can succeed once and fail the next. There is no “fix it once.” The only way forward is to measure quality continuously and set acceptable thresholds.
  • Latency matters from day one. Answers stream token by token, so users feel both the time to first response and the time to complete (see the timing sketch after this list). A system that feels instant in a demo can frustrate at scale.
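
A minimal sketch of what those two latencies look like when you actually measure them, assuming a streaming call that yields tokens as they arrive; `fake_stream` is a stand-in for a real provider client and only simulates per-token delay.

```python
# Measuring the two latencies users feel: time to first token (how long
# before anything appears) and time to complete (how long until the full
# answer is done). `fake_stream` is a hypothetical stand-in for whatever
# streaming call your provider or gateway exposes.
import time
from typing import Iterable, Tuple

def measure_latency(token_stream: Iterable[str]) -> Tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    for _token in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # the user sees the answer start here
    done_at = time.perf_counter()
    ttft = (first_token_at or done_at) - start
    total = done_at - start
    return ttft, total

def fake_stream(n_tokens: int = 50, delay: float = 0.02) -> Iterable[str]:
    """Simulated stream: constant per-token delay instead of a real model."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

ttft, total = measure_latency(fake_stream())
print(f"time to first token: {ttft:.2f}s, time to complete: {total:.2f}s")
```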

Together, these shifts mean leaders can’t rely on old instincts. Performance must be defined, measured, and budgeted before rollout — not left to discovery later.

Quality is probabilistic — treat it as a managed SLO, not a hope.

What leaders must decide

  1. Set the quality floor per use case (non-negotiable).
    Define minimum acceptable accuracy (e.g., task success or factuality). This is the guardrail. You do not trade below it.

  2. Choose the primary optimization after the floor.

    • Latency-first: live chat / agent assist.
    • Cost-first: batch summarization / internal reporting.
    • Quality-first: contract review / compliance checks.

    You can favor one, but not all three equally (the sketch after this list shows one way to record these choices per use case).

  3. Right-size the model and serving path to meet (not exceed) the targets.
    Start with mid-size or distilled models; adopt serving tactics (caching, batching, pragmatic retries) and only scale up when evidence demands it.
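
One way to make decisions 1–3 operational is to write the targets down per use case before any tuning starts, as in the sketch below. The numeric targets are hypothetical examples, not recommendations; the structure (a non-negotiable floor plus exactly one primary optimization) is the point.

```python
# Per-use-case performance targets, written down before rollout.
# All numeric targets below are hypothetical examples.
from dataclasses import dataclass
from enum import Enum

class Optimize(Enum):
    LATENCY = "latency"
    COST = "cost"
    QUALITY = "quality"

@dataclass(frozen=True)
class PerformanceTargets:
    use_case: str
    quality_floor: float      # minimum acceptable accuracy / task success rate
    optimize_first: Optimize  # the one dimension you favor after the floor
    max_ttft_s: float         # latency budget: time to first token
    max_total_s: float        # latency budget: time to complete
    max_cost_eur: float       # cost budget per request

TARGETS = [
    PerformanceTargets("live agent assist",   0.90, Optimize.LATENCY, 1.0,   8.0, 0.02),
    PerformanceTargets("batch summarization", 0.85, Optimize.COST,   30.0, 300.0, 0.005),
    PerformanceTargets("contract review",     0.98, Optimize.QUALITY, 5.0,  60.0, 0.25),
]

def meets_floor(measured_quality: float, targets: PerformanceTargets) -> bool:
    """The floor is non-negotiable: never trade below it."""
    return measured_quality >= targets.quality_floor
```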

The performance checklist (governance guardrails)

Define what “good enough” means and enforce it before scaling.

  1. Define targets up front

    • Quality: % correct or factual answers on a representative set.
    • Latency: time to first token (start) and time to complete (finish).
    • Cost: € per request/batch, with token budgets per task.
  2. Bake evals into the process

    • Run pre-prod evals before any rollout.
    • Treat any change to the model, the prompt, or the AI settings as a formal deployment: test it, measure quality and cost, and have a rollback plan before rollout.
    • Re-check quality after changes; keep history for trend tracking.
  3. Release with guardrails

    • Canary rollout (release to a small test group first); compare quality, latency, cost vs. baseline.
    • Error budgets: define the allowable dip in quality (or spike in cost/latency). Breach → auto-rollback or throttle; a minimal gate sketch follows this checklist.
  4. Optimize to thresholds, not vanity

    • Prefer mid-size / distilled models that meet the quality floor.
    • Use retrieval (RAG) and tooling to raise quality without jumping to the largest model.
    • Cache repeated results; batch non-urgent jobs; cap output length where possible.
    • Standardize decoding settings (the “randomness knobs” like temperature or nucleus sampling) to avoid silent quality drift.
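
A minimal sketch of how the canary and error-budget guardrails (items 2–3) could be wired together, assuming you already collect quality, latency, and cost metrics for baseline and canary traffic. The allowed deltas are illustrative assumptions, not recommendations.

```python
# Error-budget gate for a canary release: compare canary vs. baseline and
# decide whether to promote, throttle, or roll back. The allowed deltas are
# illustrative assumptions -- set them per use case.
from dataclasses import dataclass

@dataclass
class Metrics:
    quality: float          # e.g. fraction of eval cases passed
    p95_latency_s: float
    cost_per_request: float

@dataclass
class ErrorBudget:
    max_quality_drop: float = 0.02      # allowed absolute dip vs. baseline
    max_latency_increase: float = 0.20  # allowed relative slowdown (20%)
    max_cost_increase: float = 0.15     # allowed relative cost spike (15%)

def canary_decision(baseline: Metrics, canary: Metrics, budget: ErrorBudget) -> str:
    if canary.quality < baseline.quality - budget.max_quality_drop:
        return "rollback"  # a quality breach is never acceptable
    if canary.p95_latency_s > baseline.p95_latency_s * (1 + budget.max_latency_increase):
        return "throttle"
    if canary.cost_per_request > baseline.cost_per_request * (1 + budget.max_cost_increase):
        return "throttle"
    return "promote"

# Example: a canary that is slightly cheaper but slower than the budget allows.
decision = canary_decision(
    baseline=Metrics(quality=0.93, p95_latency_s=2.0, cost_per_request=0.012),
    canary=Metrics(quality=0.92, p95_latency_s=2.6, cost_per_request=0.010),
    budget=ErrorBudget(),
)
print(decision)  # "throttle": quality within budget, latency breach
```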

Practical tactics (efficiency levers)

Day-to-day steps to cut cost/latency without lowering quality.

  • Distilled models first: often 80–95% of capability at a fraction of cost/latency.
  • Shorten prompts/outputs: every token costs time and money; tighter specs improve both.
  • Selective retries: retry only on low-confidence signals, not every failure (see the sketch after this list).
  • Task decomposition: break one big ask into smaller, easier sub-tasks that meet the quality floor more reliably.
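
Selective retries deserve a sketch of their own, because blanket retries double cost and latency on exactly the requests that are already struggling. A minimal version, assuming your pipeline exposes some confidence signal; `fake_call_model` and the confidence score are hypothetical stand-ins for your own client and scoring.

```python
# Retry only while the model signals low confidence, not on every failure.
# `call_model` returns (answer, confidence); both the callable and the
# confidence score are hypothetical stand-ins for your own pipeline.
import random
from typing import Callable, Tuple

def generate_with_selective_retry(
    call_model: Callable[[str], Tuple[str, float]],
    prompt: str,
    confidence_floor: float = 0.6,
    max_retries: int = 1,
) -> str:
    answer, confidence = call_model(prompt)
    retries = 0
    # A confident answer -- right or wrong -- is never retried here; confident
    # failures should surface through evals or review, not burn more tokens.
    while retries < max_retries and confidence < confidence_floor:
        answer, confidence = call_model(prompt)
        retries += 1
    return answer

# Stand-in model call that returns a random confidence score.
def fake_call_model(prompt: str) -> Tuple[str, float]:
    return f"draft answer for: {prompt}", random.uniform(0.3, 0.9)

print(generate_with_selective_retry(fake_call_model, "Summarize the Q3 report"))
```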

Common mistakes → Do this instead

  • Newest = best. → Use evidence. Start mid-size; scale only if evals demand it.
  • Optimizing speed/cost without measuring accuracy. → Track a quality KPI and enforce the floor.
  • Frequent prompt/setting tweaks in prod. → Change via PRs with canaries and rollback.
  • Single SLO for all workloads. → Define per-use-case targets (chat ≠ contract review).
  • Endless accuracy chasing. → Stop at the floor when users are satisfied; don’t buy marginal gains you don’t need.

Executive takeaways

  • Quality is probabilistic — treat it as a managed SLO, not a hope.
  • Set the floor, then optimize latency or cost around it.
  • Mid-size/distilled + discipline beats “largest by default.”
  • Treat every change (model, prompt, decoding) as a controlled deployment with evals, canaries, and error budgets.

Call to action (five questions for your next review)

  1. What is the quality floor for each use case, and how do we measure it?
  2. After that floor, which dimension do we optimize — latency or cost — and why?
  3. Are we starting with mid-size/distilled models and proving the need to scale up?
  4. Do we run evals before/after changes, with canaries and error budgets?
  5. Where can we cut tokens (prompt, output length, retries) without hurting quality?

Quick summary

Accuracy is no longer a binary “bug/no bug.” It’s a distribution you must govern. Set a quality floor per use case, measure it, and only then tune latency and cost. That’s how you avoid paying for marginal gains — and avoid shipping answers that miss the mark.

Performance isn’t speed anymore. It’s a three-axis negotiation — and leaders who govern it with discipline turn AI from a demo into dependable scale.


Author
Walter Olivito
Exploring how AI and integration intersect in the enterprise. Builds tools, demos, and structured briefs to help leaders think beyond answers and ask the right questions.