I've audited 14 "AI agent" deployments in the last six months. Twelve of them had no measurement layer beyond "the founder thinks the responses look pretty good." The other two had dashboards full of token counts, latency percentiles, and other engineering metrics that did not answer the only question that matters: is this thing making us money or losing it?
Here is the smallest measurement layer I'd accept before I let an LLM touch a customer.
Of all customer interactions the agent handles, what percentage end without a human escalation? This is the leverage number. If resolution rate is below 60%, the agent is mostly creating ticket volume for your humans, not removing it. Above 85%, you have a real automation. Between, you have a hybrid that may or may not pay for itself depending on the next number.
Total monthly spend (model API + retrieval infrastructure + monitoring + the human supervisor's prorated salary) divided by total interactions resolved without escalation. Compare this to the fully-loaded cost of a human handling the same interaction. If the human is cheaper, you do not have an automation; you have a fashion accessory.
A weekly sample (50–100 interactions, randomly drawn) graded by a human against a published rubric. The rubric should be specific to your use case: tone, factual accuracy, brand voice, completeness, escalation correctness. Quality drift is the slope of this score over time. A flat slope is healthy. A negative slope is the agent regressing — usually because the model was updated, or because the input distribution shifted and the prompts haven't kept up.
Of customers who interact with the agent, what percentage explicitly complain, escalate angrily, or churn citing the interaction? This is the lagging indicator that overrides the other three. If this number is climbing, the agent is producing measurement-friendly outputs and customer-hostile experiences simultaneously — which happens more often than vendors will admit.
Daily operations. Resolution rate, escalation queue depth, error rate (last 24h vs. trailing 7d), cost per resolved interaction. Read by the operator (me, in my engagements) every weekday morning. Sub-2-minute review.
Monthly review. Quality drift trend, customer-reported error rate, contribution margin of the agent (cost saved on humans minus cost of agent infrastructure minus revenue impact of any error). Read by the founder. The month's one decision lives in this dashboard.
One toggle. Owned by the client, not by me. Tested on the first Monday of every month — actually flipped, briefly, to verify the manual fallback works. If you do not test the kill switch, you do not have one. You have a hope.
The number of agents I have seen with no kill switch — or with a kill switch that nobody could find when an incident happened — is the most consistent failure mode of the category. It is also the cheapest one to fix, and the one that determines whether the founder sleeps through Saturday night.
If your AI vendor cannot show you these four numbers and these two dashboards, on real production traffic, computed by a system you own — you do not have an agent in production. You have a demo paying rent on a subdomain. The difference is whether anyone is going to be on call when it breaks.